Fixes https://emqx.atlassian.net/browse/EMQX-10074
Otherwise, requests from those async workers, now retriable, might not
be retried until the buffer worker blocks for other reasons, which
might take a long time.
Fixes https://emqx.atlassian.net/browse/EMQX-9905
Since calling `telemetry` is costly in a hot path, we instead collect
metrics inside the buffer workers state and periodically flush them,
rather than immediately as events happen.
The previous commit uncovered another bug that was hidden by it:
`maybe_flush_after_async_reply` was sending a message to the wrong
PID. It was sending a message to `self()` meaning to target a buffer
worker, but `self()` in that context is never the buffer worker, it's
the connector's worker.
This change also revealed a race condition where the buffer workers
could stop flushing messages. So we piggy-backed on the atomic update
of the table size count to check if the buffer worker should be poked
to continue flushing. This allows us to get rid of
`maybe_flush_after_async_reply` altogether.
Fixes https://emqx.atlassian.net/browse/EMQX-9902
When the buffer worker inflight window is full, we don’t need to set a
timer to flush the messages again because there’s no more room, and
one of the inflight windows will flush the buffer worker by calling
`flush_worker`.
Currently, we do set the timer on such situation, and this fact
combined with the default batch time of 0 yields a busy loop situation
where the CPU spins a lot while inflight messages do not return.
(Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637
During the course of performance tests comparing the performance of
e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we
observed that the throughput in e5.0.3 (sync) was much lower than in
e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively.
Analyzing `observer_cli` output, we noticed that a lot of the time
both buffer workers and ehttpc processes was spent in `ets:info/2`.
That function was called to check the size of the inflight table when
updating metrics and checking if the inflight table was full. Other
uses of `ets:info/2` were contained inside the arguments to some
`?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60).
By using a specific record to track the size of the table, we managed
to improve the bridge performance to ~ 45 k msgs / s in sync mode.
Fixes https://emqx.atlassian.net/browse/EMQX-9635
During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow. In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`. However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
Fixes https://emqx.atlassian.net/browse/EMQX-9635
During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow. In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`. However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
Fixes https://emqx.atlassian.net/browse/EMQX-9367
For better user experience and performance for the average bridge, we
should change the default queue mode to `memory_only`, as was the
behavior of most bridges in e4.x. This leads to better performance
when message rate is high enough and the remote resource is not
keeping up with EMQX.
Also, we set the default segment size to equal max queue bytes.
Fixes https://emqx.atlassian.net/browse/EMQX-9129
Currently, if an user configures a bridge with query mode sync, then
all calls to the underlying driver/connector ("inner calls") will
always be synchronous, regardless of its support for async calls.
Since buffer workers always support async queries ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.
There are two situations when we want to force synchronous calls to
the underlying connector even if it supports async:
1) When using `simple_sync_query`, since we are bypassing the buffer
workers;
2) When retrying the inflight window, to avoid overwhelming the
driver.
Also removes the previously added alarm for request timeout.
There are situations where having a short request timeout and a long
health check interval make sense, so we don't want to alarm the user
for those situations. Instead, we automatically attempt to set a
reasonable `resume_interval` value.
Fixes https://emqx.atlassian.net/browse/EMQX-9099
The default value for `request_timeout` is 15 seconds, and the default
resume interval is also 15 seconds (the health check timeout, if
`resume_interval` is not explicitly given). This means that, in
practice, if a buffer worker ever gets into the blocked state, then
almost all requests will timeout.
Proposed improvement:
- `request_timeout` should by default be twice as much as
health_check_interval.
- Emit a alarm if `request_timeout` is not greater than
`health_check_interval`.
The current schema allows `infinity` for `request_timeout`, so we have
to take that into account. It's not currently possible to set
`batch_time = infinity`, so there's no need to treat that case.
To avoid message loss due to misconfigurations, we adjust `batch_time`
based on `request_timeout`. If `batch_time` > `request_timeout`, all
requests will timeout before being sent if the message rate is low.
Even worse if `pool_size` is high. We cap `batch_time` at
`request_timeout div 2` as a rule of thumb.
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar to
the Clickhouse bridge in the 4.4, but adds the possibility to use
different formats (such as JSON) for values to be inserted.
This is a new issue introduced in the previous fix commits
after handling the partial expiry correctly, the
IsFullBefore check is no longer the state before the reply
is received but the state after a partially-expired batch
is shrinked.
The fix is simple, move the check to the entry-point of
where async reply callback enters, then send an async
'flush' notification regardless of the handling result.