Port of https://github.com/emqx/emqx/pull/10154 for `release-50`
Fixes https://emqx.atlassian.net/browse/EMQX-9099
Originally, the `resume_interval`, which is what defines how often a
buffer worker will attempt to retry its inflight window, was set to
the same as the `health_check_interval`. This had the problem that,
with default values, `health_check_interval = request_timeout`. This
meant that, if a buffer worker with those configs were ever blocked,
all requests would have timed out by the time it retried them.
Here we change the default `resume_interval` to a reasonable value
dependent on `health_check_interval` and `request_timeout`, and also
expose that as a hidden parameter for fine tuning if necessary.
Fixes https://emqx.atlassian.net/browse/EMQX-9130
Since buffer workers always support async calls ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.
For bridges that currently only allow sync query modes, we should
allow them to be configured with async. That means basically all
bridge types except Kafka Producer.
This commit makes it possible to use the ${var} syntax to refer to
variables in the payload of the message in the collection field.
This makes it possible to select which collection to insert into
dynamically.
Fixes:
https://emqx.atlassian.net/browse/EMQX-9246
Fixes https://emqx.atlassian.net/browse/EMQX-9392
The returned information does not allow to diagnose the issue (i.e.: a
connection issue due to the wrong host and port, the wrong password
failing authn). However, such information is printed to the logs.
This changes the returned error to the API so that the user is hinted
at looking at the logs for further investigation of the error.
To make the configuration more intuitive and avoid exposing more
parameters to the user, we should:
1) Remove reset_by_subscriber as an enum constructor for
`offset_reset_policy`, as that might make the consumer hang
indefinitely without manual action.
2) Set the `begin_offset` `brod_consumer` parameter to `earliest` or
`latest` depending on the value of `offset_reset_policy`, as that’s
probably the user’s intention.
Fixes https://emqx.atlassian.net/browse/EMQX-9129
Currently, if an user configures a bridge with query mode sync, then
all calls to the underlying driver/connector ("inner calls") will
always be synchronous, regardless of its support for async calls.
Since buffer workers always support async queries ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.
There are two situations when we want to force synchronous calls to
the underlying connector even if it supports async:
1) When using `simple_sync_query`, since we are bypassing the buffer
workers;
2) When retrying the inflight window, to avoid overwhelming the
driver.
Those tests in the `flaky` test are really flaky and require lots of
CI retries.
Apparently, the flakiness comes from race conditions from restarting
bridges with the same name too fast between test cases. Previously,
all test cases were sharing the same bridge name (the module name).
Also removes the previously added alarm for request timeout.
There are situations where having a short request timeout and a long
health check interval make sense, so we don't want to alarm the user
for those situations. Instead, we automatically attempt to set a
reasonable `resume_interval` value.
Metrics should only be exposed via the /bridges/:id/metrics endpoint,
and not in other operations such as getting the list of all bridges, or
in the response when a bridge has been created. This commit removes all
traces of metrics for the non-dedicated API endpoints.
Also hexencode non-utf8 binaries. This is essentially an heuristic.
We don't know column types in runtime, and there's no simple way
to find them out. Since we're already doing full binary scan during
escaping it should be cheap to bail out on non-utf8 strings and
hexencode them instead.
Also introduce separate function to highlight that this escaping
is MySQL-specific.
So that mysql client won't attempt to prepare them automatically, thus
trashing the server's prepared statements table and making interaction
overall heavier.
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar to
the Clickhouse bridge in the 4.4, but adds the possibility to use
different formats (such as JSON) for values to be inserted.
This will inevitably fail: it's not generally possible to update
different keys through the same cluster connection, one or more
update will fail with `MOVED` status. This testcase should serve
as a regression test later.
some telemetry events from wolff are discarded:
* dropped:
this is double counted in wolff,
we now only subscribe to the dropped_queue_full event
* retried_failed:
it has different meanings in wolff,
in wolff, it means it's the 2nd (or onward) produce attempt
in EMQX, it means it's eventually failed after some retries
* retried_success
since we are going to handle the success counters in callbac
this having this reported from wolff will only make things
harder to understand
* failed
wolff never fails (unelss drop which is a different counter)
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters:
- `dropped.expired` :: happens when a request expires before being
sent downstream
- `late_reply` :: when a response is receive from downstream, but the
caller is no longer for a reply because the request has expired, and
the caller might even have retried it.
Related: https://emqx.atlassian.net/browse/EMQX-8692
This should also correctly account for `retried.*` metrics for sync
requests.
Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.
Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.