The tests in the `flaky` group are really flaky and require lots of
CI retries.
Apparently, the flakiness comes from race conditions caused by
restarting bridges with the same name too quickly between test cases.
Previously, all test cases shared the same bridge name (the module
name).
Also removes the previously added alarm for request timeout.
There are situations where having a short request timeout and a long
health check interval makes sense, so we don't want to alarm the user
in those situations. Instead, we automatically attempt to set a
reasonable `resume_interval` value.
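A minimal sketch of one such heuristic, with illustrative names (the
real derivation in the code may differ): pick a resume interval that
fits a few retry attempts into the request timeout, but never retries
less often than the health check.

```erlang
-module(resume_interval_sketch).
-export([default_resume_interval/2]).

%% Derive a resume interval (ms) from the request timeout and the
%% health check interval, so a blocked buffer worker gets a few retry
%% attempts in before in-flight requests expire.
-spec default_resume_interval(timeout(), pos_integer()) -> pos_integer().
default_resume_interval(infinity, HealthCheckInterval) ->
    %% No request deadline: just follow the health check cadence.
    HealthCheckInterval;
default_resume_interval(RequestTimeout, HealthCheckInterval) ->
    %% Retry at least as often as the health check, and leave room for
    %% roughly three attempts within one request timeout.
    max(1, min(HealthCheckInterval, RequestTimeout div 3)).
```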
This commit makes sure the inflight window setting is present for the
Clickhouse bridge. It also changes `emqx_resource_schema`, which
previously removed the inflight window setting from resources with
query mode `always_sync`. We don't need to do that, because all
bridges that use the buffer worker queue get async call handling even
if the bridge does not support the async callback.
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Fixes https://emqx.atlassian.net/browse/EMQX-9099
The default value for `request_timeout` is 15 seconds, and the default
resume interval is also 15 seconds (the health check interval, if
`resume_interval` is not explicitly given). This means that, in
practice, if a buffer worker ever gets into the blocked state, almost
all requests will time out.
Proposed improvement (sketched below):
- `request_timeout` should default to twice the
  `health_check_interval`.
- Emit an alarm if `request_timeout` is not greater than
  `health_check_interval`.
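A rough sketch of the proposed behaviour; the helper and alarm names
below are made up for illustration:

```erlang
-module(request_timeout_check_sketch).
-export([default_request_timeout/1, check_request_timeout/2]).

%% Proposed default: twice the health check interval (both in ms).
default_request_timeout(HealthCheckInterval) ->
    2 * HealthCheckInterval.

%% Proposed sanity check: raise an alarm when the configured request
%% timeout does not exceed the health check interval, since a blocked
%% buffer worker would then time out most requests before resuming.
check_request_timeout(RequestTimeout, HealthCheckInterval) when
    is_integer(RequestTimeout), RequestTimeout =< HealthCheckInterval
->
    {alarm, request_timeout_not_greater_than_health_check_interval};
check_request_timeout(_RequestTimeout, _HealthCheckInterval) ->
    ok.
```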
Kafka Producer and Consumer bridges rely on this prefix to detect a
dry run and avoid leaking atoms. At some point, this prefix was
changed, effectively disabling the check in Kafka Producer.
The resource manager may be busy at times, so this change ensures that
getting the resource instance state does not block. Currently, no
users of `emqx_resource:get_instance/1` seem to rely on the
"as up-to-date as possible" guarantee it was providing.
The current schema allows `infinity` for `request_timeout`, so we have
to take that into account. It's not currently possible to set
`batch_time = infinity`, so there's no need to handle that case.
To avoid message loss due to misconfiguration, we adjust `batch_time`
based on `request_timeout`. If `batch_time` > `request_timeout`, all
requests will time out before being sent when the message rate is low,
and it gets even worse if `pool_size` is high. As a rule of thumb, we
cap `batch_time` at `request_timeout div 2`.
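A minimal sketch of the adjustment, assuming millisecond values and an
illustrative function name:

```erlang
-module(batch_time_sketch).
-export([adjust_batch_time/2]).

%% Cap batch_time at half the request timeout so that, even at a low
%% message rate, a request still has time to be flushed and answered
%% before its deadline.  An `infinity` request timeout imposes no cap.
-spec adjust_batch_time(pos_integer(), timeout()) -> pos_integer().
adjust_batch_time(BatchTime, infinity) ->
    BatchTime;
adjust_batch_time(BatchTime, RequestTimeout) ->
    min(BatchTime, RequestTimeout div 2).
```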
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar
to the Clickhouse bridge in EMQX 4.4, but adds the possibility to use
different formats (such as JSON) for the values to be inserted.
This is a new issue introduced in the previous fix commits: after
handling partial expiry correctly, the `IsFullBefore` check no longer
reflects the state before the reply is received, but the state after a
partially-expired batch has been shrunk.
The fix is simple: move the check to the entry point of the async
reply callback, then send an async 'flush' notification regardless of
the handling result.
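In sketch form, with illustrative names (the real buffer worker code
differs):

```erlang
-module(async_reply_sketch).
-export([handle_async_reply/3]).

%% Capture the "was the inflight table full?" flag at the entry point
%% of the async reply callback, before any expired requests are
%% dropped from the batch, and notify the buffer worker to flush
%% afterwards regardless of how the reply was handled.
handle_async_reply(WorkerPid, InflightTable, ReplyContext) ->
    %% Checked *before* the batch may be shrunk by partial expiry.
    IsFullBefore = inflight_is_full(InflightTable),
    _ = do_handle_async_reply(InflightTable, ReplyContext),
    ok = maybe_flush(WorkerPid, IsFullBefore).

%% Only a previously-full (i.e. blocked) worker needs the nudge.
maybe_flush(WorkerPid, true) ->
    WorkerPid ! flush,
    ok;
maybe_flush(_WorkerPid, false) ->
    ok.

%% Placeholders for the real inflight bookkeeping.
inflight_is_full(_InflightTable) -> false.
do_handle_async_reply(_InflightTable, _ReplyContext) -> ok.
```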
Prior to this fix there were two metrics issues:
1. If all requests in a batch had expired by the time the reply was
   received, 'late_reply' was only bumped by 1 instead of by the batch
   size.
2. When a batch was partially delivered (or expired), the dropped
   requests were not decremented from the inflight size gauge.
This test case should verify that the buffer will retry all inflight
queries that failed with recoverable errors and flush all outstanding
queries.
Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>
The request body can potentially be very large. The reply context is
sent to the async call handler and kept in its memory until the async
reply is received from the bridge target service.
This commit tries to minimize the size of the reply context by
replacing the request body with `[]`.
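Sketched with a hypothetical reply-context shape:

```erlang
-module(reply_context_sketch).
-export([minimize/1]).

%% The reply context travels with the async request and stays in the
%% connector's memory until the reply arrives, so keep it small: the
%% (potentially huge) request body is not needed to handle the reply.
minimize(#{request := Request} = ReplyContext) ->
    ReplyContext#{request := drop_body(Request)};
minimize(ReplyContext) ->
    ReplyContext.

%% Replace the body with a placeholder, keep everything else.
drop_body({send_message, #{body := _} = Msg}) ->
    {send_message, Msg#{body := []}};
drop_body(Request) ->
    Request.
```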
This reduces the log level from error to warning in places that are
connected to the InfluxDB bridge. Transient errors for external
resources should not produce an error-level log.
Some telemetry events from wolff are discarded, as sketched below:
* dropped:
  this is double-counted in wolff;
  we now only subscribe to the dropped_queue_full event
* retried_failed:
  it has a different meaning in wolff:
  in wolff, it means the 2nd (or later) produce attempt;
  in EMQX, it means the request eventually failed after some retries
* retried_success:
  since we are going to handle the success counters in callbacks,
  having this reported from wolff would only make things harder to
  understand
* failed:
  wolff never fails (unless it drops, which is a different counter)
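A sketch of the subscription, assuming `telemetry` event names of the
form `[wolff, EventName]` and a `counter_inc` measurement; both are
assumptions here, not the exact names wolff emits:

```erlang
-module(wolff_telemetry_sketch).
-export([attach/1, handle_event/4]).

%% Subscribe only to the wolff events we translate into EMQX bridge
%% counters; `dropped`, `retried_failed`, `retried_success` and
%% `failed` are intentionally left out for the reasons above.
attach(HandlerId) ->
    Events = [
        [wolff, dropped_queue_full],
        [wolff, queuing],
        [wolff, inflight],
        [wolff, success]
    ],
    telemetry:attach_many(HandlerId, Events, fun ?MODULE:handle_event/4, #{}).

handle_event([wolff, dropped_queue_full], Measurements, _Metadata, _Config) ->
    %% Would bump the EMQX 'dropped.queue_full' counter here.
    Inc = maps:get(counter_inc, Measurements, 1),
    logger:info("kafka producer dropped ~p message(s): queue full", [Inc]);
handle_event(_EventName, _Measurements, _Metadata, _Config) ->
    ok.
```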
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters (sketched below):
- `dropped.expired` :: bumped when a request expires before being sent
  downstream.
- `late_reply` :: bumped when a response is received from downstream,
  but the caller is no longer waiting for a reply because the request
  has expired, and might even have retried it.
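A sketch of where the two counters fire, using illustrative names and
explicit deadlines:

```erlang
-module(expiry_counters_sketch).
-export([maybe_send/3, on_reply/3]).

%% Before sending a request downstream: if its deadline has already
%% passed, drop it and bump 'dropped.expired' instead of doing work
%% for a caller that has stopped waiting.
maybe_send(Request, Deadline, Now) when Now >= Deadline ->
    {dropped_expired, Request};
maybe_send(Request, _Deadline, _Now) ->
    {send, Request}.

%% When a reply comes back from downstream: if the request had already
%% expired, the caller is gone (and may have retried), so count it as
%% 'late_reply' rather than trying to deliver the result.
on_reply(Result, Deadline, Now) when Now >= Deadline ->
    {late_reply, Result};
on_reply(Result, _Deadline, _Now) ->
    {reply, Result}.
```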
Some requests should not be retried during the blocked state. For
example, if some async requests are just taking some time to process,
we should avoid retrying them periodically, lest we risk overloading
the downstream further.
Related: https://emqx.atlassian.net/browse/EMQX-8692
This should also correctly account for `retried.*` metrics for sync
requests.
Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.
Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
https://emqx.atlassian.net/browse/EMQX-8700
Fixes a few errors in the usage of `replayq` queues (usage sketched
below).
- Close `replayq` when `emqx_resource_worker` terminates.
- Do not keep old references to `replayq` after any `pop`s.
- Clear `replayq`'s data directories when removing a resource.
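A sketch of the intended usage pattern against the `replayq` API
(`open/1`, `pop/2`, `ack/2`, `close/1`); directory and option values
are illustrative:

```erlang
-module(replayq_usage_sketch).
-export([open/1, flush_some/1, terminate/1]).

%% Open a disk-backed queue; its data directory has to be cleaned up
%% when the owning resource is removed.
open(Dir) ->
    replayq:open(#{dir => Dir, seg_bytes => 10 * 1024 * 1024}).

%% Always continue with the queue reference returned by pop/2; holding
%% on to the old reference is exactly the stale-reference bug fixed
%% here.
flush_some(Q0) ->
    {Q1, AckRef, Items} = replayq:pop(Q0, #{count_limit => 100}),
    ok = handle_items(Items),
    ok = replayq:ack(Q1, AckRef),
    Q1.

%% Close the queue when the buffer worker terminates.
terminate(Q) ->
    replayq:close(Q).

handle_items(_Items) -> ok.
```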
Both emqx_resource_manager and emqx_resource_worker processes leaked
atoms, as they created unique atoms to use as registered names. This
is fixed by removing the need to register the names.
Fixes: https://emqx.atlassian.net/browse/EMQX-8583
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.
Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enabled; ii) batch_size
> 1 <=> batch_enabled. The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`. The
`batching` metric is too ephemeral, especially considering the default
batch time of 20 ms, and is not shown in the dashboard, so it was
removed.
To avoid confusion for the users as to what persistence guarantees we
offer when buffering bridges/resources, we will always enable offload
mode for `replayq`. With this, when the buffer size is above the max
segment size, it'll flush the queue to disk, but on recovery after a
restart it'll clean the existing segments rather than resuming from
them.
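Sketched open options (the segment size here is just an illustrative
value):

```erlang
-module(replayq_offload_sketch).
-export([open/1]).

%% Buffer in memory first; only when the buffered bytes exceed
%% seg_bytes does replayq start writing segments to disk.  After a
%% restart, segments left on disk are cleaned up rather than replayed.
open(Dir) ->
    replayq:open(#{
        dir => Dir,
        offload => true,
        %% max segment size; illustrative value
        seg_bytes => 10 * 1024 * 1024
    }).
```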