Also removes the previously added alarm for request timeout.
There are situations where having a short request timeout and a long
health check interval makes sense, so we don't want to alarm the user
in those situations. Instead, we automatically attempt to set a
reasonable `resume_interval` value.
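Illustratively, the derived value might look like the sketch below;
the formula and function name are assumptions for illustration, not
necessarily what ships:

```erlang
%% Hedged sketch: derive a resume interval that gives a blocked buffer
%% worker a few resume attempts within one request timeout.
default_resume_interval(infinity, HealthCheckInterval) ->
    HealthCheckInterval;
default_resume_interval(RequestTimeout, HealthCheckInterval) ->
    max(1, min(HealthCheckInterval, RequestTimeout div 3)).
```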
This commit makes sure the inflight window setting is present for the
Clickhouse bridge. It also changes `emqx_resource_schema`, which
previously removed the inflight window setting from resources with
query mode `always_sync`. We don't need to do that, because all
bridges that use the buffer worker queue get async call handling even
if the bridge doesn't support the async callback.
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Fixes https://emqx.atlassian.net/browse/EMQX-9099
The default value for `request_timeout` is 15 seconds, and the default
resume interval is also 15 seconds (the health check interval, if
`resume_interval` is not explicitly given). This means that, in
practice, if a buffer worker ever gets into the blocked state, then
almost all requests will time out.
Proposed improvement:
- `request_timeout` should by default be twice the
  `health_check_interval`.
- Emit an alarm if `request_timeout` is not greater than
  `health_check_interval`.
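A minimal sketch of the proposal, with hypothetical function names;
note that `request_timeout` may be `infinity`, which is trivially
greater than any interval:

```erlang
default_request_timeout(HealthCheckInterval) ->
    2 * HealthCheckInterval.

%% Returns 'alarm' when the configuration is considered unsafe; the
%% caller would raise the actual alarm.
maybe_alarm_request_timeout(RequestTimeout, HealthCheckInterval)
  when is_integer(RequestTimeout),
       RequestTimeout =< HealthCheckInterval ->
    alarm;
maybe_alarm_request_timeout(_RequestTimeout, _HealthCheckInterval) ->
    ok.
```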
The Kafka Producer and Consumer bridges rely on this prefix to detect
a dry run and avoid leaking atoms. At some point, this prefix was
changed, effectively disabling the check in the Kafka Producer.
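The check amounts to a prefix match on the resource ID, roughly as
sketched below; the `<<"probe-">>` prefix is illustrative only, the
real constant lives in the resource code:

```erlang
%% Hedged sketch: a dry run is recognized purely by its resource ID
%% prefix, so no new atoms need to be created.
is_dry_run(ResourceId) when is_binary(ResourceId) ->
    string:prefix(ResourceId, <<"probe-">>) =/= nomatch.
```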
The resource manager may be busy at times, so this change ensures that
getting the resource instance state will not block. Currently, no
users of `emqx_resource:get_instance/1` seem to rely on the
"as-fresh-as-possible" state guarantee it was providing.
The current schema allows `infinity` for `request_timeout`, so we have
to take that into account. It's not currently possible to set
`batch_time = infinity`, so there's no need to treat that case.
To avoid message loss due to misconfiguration, we adjust `batch_time`
based on `request_timeout`. If `batch_time` > `request_timeout`, all
requests will time out before being sent when the message rate is low;
it's even worse if `pool_size` is high. We cap `batch_time` at
`request_timeout div 2` as a rule of thumb.
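The adjustment boils down to the sketch below (function name
hypothetical); `infinity` passes through unchanged since it imposes no
deadline:

```erlang
%% Hedged sketch: cap batch_time so a batch always has a chance to be
%% flushed well within the request timeout.
adjust_batch_time(BatchTime, infinity) ->
    BatchTime;
adjust_batch_time(BatchTime, RequestTimeout) ->
    min(BatchTime, RequestTimeout div 2).
```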
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar
to the Clickhouse bridge in EMQX 4.4, but adds the ability to use
different formats (such as JSON) for the values to be inserted.
This is a new issue introduced in the previous fix commits:
after handling partial expiry correctly, the
`IsFullBefore` check no longer reflects the state before the reply
is received, but the state after a partially expired batch
has been shrunk.
The fix is simple: move the check to the entry point
where the async reply callback enters, then send an async
'flush' notification regardless of the handling result.
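In outline, the entry point looks like the sketch below; the function
names are hypothetical:

```erlang
%% Hedged sketch: capture fullness on entry, before the inflight table
%% can shrink, and notify the buffer worker no matter how the reply
%% itself was handled.
handle_async_reply(ReplyContext, Result) ->
    IsFullBefore = is_inflight_full(ReplyContext),
    _ = do_handle_async_reply(ReplyContext, Result),
    IsFullBefore andalso flush_worker(ReplyContext),
    ok.
```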
Prior to this fix there were two metrics issues:
1. If all requests in a batch had expired by the time a reply was
   received, 'late_reply' was bumped by 1 instead of by the batch size.
2. When a batch was partially delivered (or expired), the dropped
   requests were not decremented from the inflight size gauge.
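The corrected accounting, sketched with hypothetical helper names and
assumed metric arities:

```erlang
%% Hedged sketch: bump 'late_reply' by the number of expired requests
%% in the batch, and shrink the inflight gauge by the same amount.
account_expired(Id, WorkerRef, Batch, InflightCount) ->
    NumExpired = length([Req || Req <- Batch, is_expired(Req)]),
    ok = emqx_resource_metrics:late_reply_inc(Id, NumExpired),
    ok = emqx_resource_metrics:inflight_set(
        Id, WorkerRef, InflightCount - NumExpired
    ).
```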
This testcase should verify that the buffer will retry all inflight
queries that failed with recoverable errors, and flush all outstanding
queries.
Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>
The request body can potentially be very large.
The reply context is sent to the async call handler and kept
in its memory until the async reply is received from the bridge's
target service.
This commit minimizes the size of the reply context
by replacing the request body with `[]`.
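Shape of the change, with the context layout as an assumption:

```erlang
%% Hedged sketch: keep everything the reply path needs, but swap the
%% potentially large body out for [].
minimize_reply_context(#{request := Request} = Context) ->
    Context#{request := minimize_request(Request)}.

minimize_request(Request) when is_map(Request) ->
    Request#{body => []};
minimize_request(Request) ->
    Request.
```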
This reduces the log level from error to warning in places that are
connected to the InfluxDB bridge. Transient errors for external
resources should not produce error-level logs.
Some telemetry events from wolff are discarded:
* dropped:
  this is double counted in wolff;
  we now only subscribe to the dropped_queue_full event
* retried_failed:
  it has a different meaning in wolff:
  in wolff, it means the 2nd (or later) produce attempt;
  in EMQX, it means the request eventually failed after some retries
* retried_success:
  since we are going to handle the success counters in the callbacks,
  having this reported from wolff would only make things
  harder to understand
* failed:
  wolff never fails (unless it drops, which is a different counter)
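The resulting subscription might look like the sketch below; the
handler id and the exact event list are illustrative:

```erlang
%% Hedged sketch: attach only to the wolff telemetry events we still
%% consume, so the discarded ones never reach our handler.
subscribe_wolff_telemetry() ->
    ok = telemetry:attach_many(
        <<"emqx-bridge-kafka-producer">>,
        [
            [wolff, dropped_queue_full],
            [wolff, queuing],
            [wolff, inflight]
        ],
        fun ?MODULE:handle_telemetry_event/4,
        undefined
    ).
```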
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters:
- `dropped.expired` :: happens when a request expires before being
sent downstream
- `late_reply` :: when a response is received from downstream, but the
  caller is no longer waiting for a reply because the request has
  expired, and the caller might even have retried it.
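Roughly where each counter is bumped, with hypothetical helper names
and assumed metric arities:

```erlang
%% Hedged sketch: expired requests are dropped before the downstream
%% call; replies that arrive after expiry are counted but never
%% delivered to the (gone) caller.
maybe_send(Id, Query) ->
    case is_expired(Query) of
        true ->
            emqx_resource_metrics:dropped_expired_inc(Id, 1),
            expired;
        false ->
            send_downstream(Query)
    end.

handle_reply(Id, Query, Result) ->
    case is_expired(Query) of
        true ->
            emqx_resource_metrics:late_reply_inc(Id, 1),
            ok;
        false ->
            reply_caller(Query, Result)
    end.
```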