Fixes https://emqx.atlassian.net/browse/EMQX-10074
Otherwise, requests from those async workers, now retriable, might not
be retried until the buffer worker blocks for other reasons, which
might take a long time.
Using `simple_one_for_one` has a potential race condition issue where
we read the PID of the resource manager before trying to remove a
resource, and then that PID changes because it was either dead at
first, or it crashed and changed, and later we use this stale PID to
try to remove it from the supervisor. Under such circumstances, the
restarting child might linger in the supervisor, leaking resources.
By using the resource ID itself as a child ID (and using `one_for_one`
restart strategy), we ensure the child is truly removed.
Depending on timing, `t_write_timeout` was getting stuck while
checking the resource health, and the previous request timeout options
were making a response to never be sent if that process took too long.
Fixes https://emqx.atlassian.net/browse/EMQX-9905
Since calling `telemetry` is costly in a hot path, we instead collect
metrics inside the buffer workers state and periodically flush them,
rather than immediately as events happen.
Otherwise, depending on the test execution order, tests might
sometimes fail.
Moreover, ensure that applications describe their dependecies
correctly and avoid starting irrelevant apps in tests.
Update is implemented as remove + create.
If a dleete call is made while the create is in progress
the remove call is likely to timeout too.
This causes the follwing creation to falsely succeed,
because there is alreay a running child under the supervisor.
As a result, the resource is permanently removed after
resource_manager eventually handles the remove call.
the timeout shutdown in child spec may
significantly slow down the deletion of a resource
this commit chagnes the shutdown to brutal kill
also, the pool worker removal code has been delete
because it's not necessary since the entier pool is
going to be force-delete later anyway
Fixes https://emqx.atlassian.net/browse/EMQX-9864
Setting a very large interval can cause `erlang:start_timer` to crash.
Also, setting auto_restart_interval or health_check_interval to "0s"
causes the state machine to be in loop as time 0 is handled separately:
| state_timeout() = timeout() | integer()
| (...)
| If Time is relative and 0 no timer is actually started, instead the the
| time-out event is enqueued to ensure that it gets processed before any
| not yet received external event.
from "https://www.erlang.org/doc/man/gen_statem.html#type-state_timeout"
Therefore, both fields are now validated against the range [1ms, 1h],
which doesn't cause above issues.
The previous commit uncovered another bug that was hidden by it:
`maybe_flush_after_async_reply` was sending a message to the wrong
PID. It was sending a message to `self()` meaning to target a buffer
worker, but `self()` in that context is never the buffer worker, it's
the connector's worker.
This change also revealed a race condition where the buffer workers
could stop flushing messages. So we piggy-backed on the atomic update
of the table size count to check if the buffer worker should be poked
to continue flushing. This allows us to get rid of
`maybe_flush_after_async_reply` altogether.
Fixes https://emqx.atlassian.net/browse/EMQX-9902
When the buffer worker inflight window is full, we don’t need to set a
timer to flush the messages again because there’s no more room, and
one of the inflight windows will flush the buffer worker by calling
`flush_worker`.
Currently, we do set the timer on such situation, and this fact
combined with the default batch time of 0 yields a busy loop situation
where the CPU spins a lot while inflight messages do not return.
Some performance tests indicate that calling `telemetry` is costly in
hot paths. Since increasing a counter by 0 is a no-op, we should
avoid calling `telemetry` if the amount to increase is 0.
(Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637
During the course of performance tests comparing the performance of
e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we
observed that the throughput in e5.0.3 (sync) was much lower than in
e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively.
Analyzing `observer_cli` output, we noticed that a lot of the time
both buffer workers and ehttpc processes was spent in `ets:info/2`.
That function was called to check the size of the inflight table when
updating metrics and checking if the inflight table was full. Other
uses of `ets:info/2` were contained inside the arguments to some
`?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60).
By using a specific record to track the size of the table, we managed
to improve the bridge performance to ~ 45 k msgs / s in sync mode.
Also use it instead of a custom ETS table for simplicity and
better consistency. This has drawbacks though: expect slightly
increased load on gproc gen_server due to how `gproc:set_value/2`
works.
Before this change, a separate `manager_id` / `instance_id` was used
as resource manager id, which made connector interface somewhat
inconsistent: part of function calls to connector implementation
used instance id as first argument while the rest used resource id
itself.
Fixes https://emqx.atlassian.net/browse/EMQX-9635
During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow. In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`. However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
Fixes https://emqx.atlassian.net/browse/EMQX-9635
During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow. In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`. However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
Fixes https://emqx.atlassian.net/browse/EMQX-9367
For better user experience and performance for the average bridge, we
should change the default queue mode to `memory_only`, as was the
behavior of most bridges in e4.x. This leads to better performance
when message rate is high enough and the remote resource is not
keeping up with EMQX.
Also, we set the default segment size to equal max queue bytes.
Port of https://github.com/emqx/emqx/pull/10154 for `release-50`
Fixes https://emqx.atlassian.net/browse/EMQX-9099
Originally, the `resume_interval`, which is what defines how often a
buffer worker will attempt to retry its inflight window, was set to
the same as the `health_check_interval`. This had the problem that,
with default values, `health_check_interval = request_timeout`. This
meant that, if a buffer worker with those configs were ever blocked,
all requests would have timed out by the time it retried them.
Here we change the default `resume_interval` to a reasonable value
dependent on `health_check_interval` and `request_timeout`, and also
expose that as a hidden parameter for fine tuning if necessary.
Fixes https://emqx.atlassian.net/browse/EMQX-9130
Since buffer workers always support async calls ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.
For bridges that currently only allow sync query modes, we should
allow them to be configured with async. That means basically all
bridge types except Kafka Producer.
Fixes https://emqx.atlassian.net/browse/EMQX-9392
The returned information does not allow to diagnose the issue (i.e.: a
connection issue due to the wrong host and port, the wrong password
failing authn). However, such information is printed to the logs.
This changes the returned error to the API so that the user is hinted
at looking at the logs for further investigation of the error.
Fixes https://emqx.atlassian.net/browse/EMQX-9129
Currently, if an user configures a bridge with query mode sync, then
all calls to the underlying driver/connector ("inner calls") will
always be synchronous, regardless of its support for async calls.
Since buffer workers always support async queries ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.
There are two situations when we want to force synchronous calls to
the underlying connector even if it supports async:
1) When using `simple_sync_query`, since we are bypassing the buffer
workers;
2) When retrying the inflight window, to avoid overwhelming the
driver.
Those tests in the `flaky` test are really flaky and require lots of
CI retries.
Apparently, the flakiness comes from race conditions from restarting
bridges with the same name too fast between test cases. Previously,
all test cases were sharing the same bridge name (the module name).
Also removes the previously added alarm for request timeout.
There are situations where having a short request timeout and a long
health check interval make sense, so we don't want to alarm the user
for those situations. Instead, we automatically attempt to set a
reasonable `resume_interval` value.
This commit makes sure the inflight window setting is present for the
clickhouse bridge. It also changes emqx_resource_schema that previously
removed the inflight window setting from resources with query mode
`always_sync`. We don't need to do that because all bridges that uses
the buffer worker queue will get async call handling even if the bridge
don't support the async callback.
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Fixes https://emqx.atlassian.net/browse/EMQX-9099
The default value for `request_timeout` is 15 seconds, and the default
resume interval is also 15 seconds (the health check timeout, if
`resume_interval` is not explicitly given). This means that, in
practice, if a buffer worker ever gets into the blocked state, then
almost all requests will timeout.
Proposed improvement:
- `request_timeout` should by default be twice as much as
health_check_interval.
- Emit a alarm if `request_timeout` is not greater than
`health_check_interval`.
Kafka Producer and Consumer bridges rely on this prefix for detecting
a dry run and avoid leaking atoms. At some point, this prefix was
changed, effectively disabling the check in Kafka Producer.
The resource manager may be busy at times, so this change ensures that
getting resource instance state will not block. Currently, no users of
`emqx_resource:get_instance/1` do seem to be relying on state being
"as-actual-as-possible" guarantee it was providing.
The current schema allows `infinity` for `request_timeout`, so we have
to take that into account. It's not currently possible to set
`batch_time = infinity`, so there's no need to treat that case.
To avoid message loss due to misconfigurations, we adjust `batch_time`
based on `request_timeout`. If `batch_time` > `request_timeout`, all
requests will timeout before being sent if the message rate is low.
Even worse if `pool_size` is high. We cap `batch_time` at
`request_timeout div 2` as a rule of thumb.
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar to
the Clickhouse bridge in the 4.4, but adds the possibility to use
different formats (such as JSON) for values to be inserted.
This is a new issue introduced in the previous fix commits
after handling the partial expiry correctly, the
IsFullBefore check is no longer the state before the reply
is received but the state after a partially-expired batch
is shrinked.
The fix is simple, move the check to the entry-point of
where async reply callback enters, then send an async
'flush' notification regardless of the handling result.
Prior to this fix there were two metrics issues
1. if a batch is all requests expired when receiving a reply
it only bumped 1 instead of the batch size for 'late_reply'
2. when a batch is partially delivered (or expired), the
dropped requests were not decremented from the inflight size gauge
This testcase should verify that the buffer will retry all inflight
queries failed with recoverable errors + flush all outstanding queries.
Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>