Commit Graph

107 Commits

Author SHA1 Message Date
Zhongwen Deng 4f396a36a9 Merge remote-tracking branch 'upstream/master' into release-50 2023-05-08 14:58:03 +08:00
Thales Macedo Garitezi 8aa7c014e7 perf(buffer_worker): avoid calling `ets:info/2`
(Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637

During the course of performance tests comparing the performance of
e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we
observed that the throughput in e5.0.3 (sync) was much lower than in
e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively.

Analyzing `observer_cli` output, we noticed that a lot of the time
both buffer workers and ehttpc processes was spent in `ets:info/2`.
That function was called to check the size of the inflight table when
updating metrics and checking if the inflight table was full.  Other
uses of `ets:info/2` were contained inside the arguments to some
`?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60).

By using a specific record to track the size of the table, we managed
to improve the bridge performance to ~ 45 k msgs / s in sync mode.
2023-05-02 17:05:32 -03:00
Andrew Mayorov 670709f746
feat(resource): ensure uniqueness through `gproc`
Also use it instead of a custom ETS table for simplicity and
better consistency. This has drawbacks though: expect slightly
increased load on gproc gen_server due to how `gproc:set_value/2`
works.
2023-05-02 17:29:22 +03:00
Thales Macedo Garitezi c53741a08c fix(buffer_worker): avoid sending late reply messages to callers
Fixes https://emqx.atlassian.net/browse/EMQX-9635

During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow.  In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`.  However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
2023-04-26 13:18:28 -03:00
Thales Macedo Garitezi d78312e10e test(resource): fix flaky test 2023-04-26 09:25:33 -03:00
Serge Tupchii b5eda9f0d1 perf(emqx_resource): don't reactivate alarms on reoccurring errors
Avoid unnecessary calls to activate an alarm if it has been already activated.

Fixes: EMQX-9529/#10357
2023-04-20 16:37:33 +03:00
Thales Macedo Garitezi cb995e2033 fix(buffer_worker): avoid sending late reply messages to callers
Fixes https://emqx.atlassian.net/browse/EMQX-9635

During a sync call from process `A` to a buffer worker `B`, its call
to the underlying resource `C` can be very slow.  In those cases, `A`
will receive a timeout response and expect no more messages from `B`
nor `C`.  However, prior to this fix, if `B` is stuck in a long sync
call to `C` and then gets its response after `A` timed out, `B` would
still send the late response to `A`, polluting its mailbox.
2023-04-19 18:27:10 -03:00
Thales Macedo Garitezi e073bc90bc refactor(buffer_worker): rename `s/queue/buffer/g` 2023-04-14 11:37:19 -03:00
Thales Macedo Garitezi 14ed4a7ada feat(buffer_worker): set default queue mode to `memory_only`
Fixes https://emqx.atlassian.net/browse/EMQX-9367

For better user experience and performance for the average bridge, we
should change the default queue mode to `memory_only`, as was the
behavior of most bridges in e4.x.  This leads to better performance
when message rate is high enough and the remote resource is not
keeping up with EMQX.

Also, we set the default segment size to equal max queue bytes.
2023-04-14 11:37:19 -03:00
Andrew Mayorov e70deae1c3
feat(resource): ask for metrics only when needed 2023-04-11 12:00:19 +03:00
Kjell Winblad 8e0d315b7b
Merge pull request #10197 from kjellwinblad/0321-fix-inflight-window-hand-over-to-kjell
fix: add inflight window setting to the clickhouse bridge
2023-03-29 09:38:24 +02:00
Thales Macedo Garitezi f8d5d53908 feat(buffer_worker): decouple query mode from underlying connector call mode
Fixes https://emqx.atlassian.net/browse/EMQX-9129

Currently, if an user configures a bridge with query mode sync, then
all calls to the underlying driver/connector ("inner calls") will
always be synchronous, regardless of its support for async calls.

Since buffer workers always support async queries ("outer calls"), we
should decouple those two call modes (inner and outer), and avoid
exposing the inner call configuration to user to avoid complexity.

There are two situations when we want to force synchronous calls to
the underlying connector even if it supports async:

1) When using `simple_sync_query`, since we are bypassing the buffer
workers;
2) When retrying the inflight window, to avoid overwhelming the
driver.
2023-03-23 13:40:31 -03:00
Kjell Winblad 35474578ca refactor: rename async_inflight_window to inflight_window everywhere 2023-03-23 14:21:57 +01:00
Thales Macedo Garitezi 91a57faa95
Merge pull request #10128 from thalesmg/ocsp-v50-mkII
feat: add ocsp stapling support to mqtt ssl listener (5.0)
2023-03-16 13:10:48 -03:00
Thales Macedo Garitezi 164440fe83 test(resource): fix flaky test
Sometimes this test might retry more times, so we check the prefix
of the trace only.
2023-03-15 14:25:55 -03:00
Andrew Mayorov 29907875bf
test(bufworker): set `batch_time` for batch-related testcases
By default it's `0` since e9d3fc51. This made a couple of tests prone
to flapping.
2023-03-15 19:17:30 +03:00
Andrew Mayorov e411c5d5f8
refactor(resman): work with state cache atomically
Also ensure that cache entries are always consistent with `Data`,
so that most of the code could rely on reading the cached entry
most of the time.
2023-03-15 19:17:30 +03:00
Thales Macedo Garitezi 422597a441 test: fix flaky tests 2023-03-14 16:08:47 -03:00
Andrew Mayorov c883e4b36a
test: drop custom `loop_wait` in favor of snabkaffe's `?retry` 2023-02-24 18:16:35 +03:00
Zaiming (Stone) Shi c97d17cc91 test: refactor to loop wait for counters 2023-02-24 09:02:03 +01:00
Zaiming (Stone) Shi 7a6465e2cf fix(buffer_worker): ensure flush timer reset in blocked state 2023-02-23 21:06:38 +01:00
Zaiming (Stone) Shi 3a6dbbdd05 refactor(buffer_worker): ensure flsh message is never missed 2023-02-23 20:11:00 +01:00
Zaiming (Stone) Shi dbfdeec5e9 fix(buffer_worker): log unknown async replies 2023-02-23 12:55:49 +01:00
Zaiming (Stone) Shi 036f69cd6e test: ensure batch size > 1 is covered in expiration test 2023-02-22 23:26:04 +01:00
Zaiming (Stone) Shi bf8becd521 test: make sure gauge return to 0 in test cases 2023-02-22 23:07:12 +01:00
Zaiming (Stone) Shi fc614e16e5 fix(bridge): update inflight items after partial expiry 2023-02-22 22:05:56 +01:00
Erik Timan 2442a4dea7 test(emqx_resource): add regression test for recursive flushing 2023-02-16 14:17:16 +01:00
Andrew Mayorov d8d06a260f
test(buffer): add test on inflight overflow w/ async queries
This testcase should verify that the buffer will retry all inflight
queries failed with recoverable errors + flush all outstanding queries.

Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>
2023-02-08 14:08:04 +03:00
Zaiming (Stone) Shi 13ef30c46c
Merge pull request #9884 from savonarola/resource-fixes
fix(resources): fix resource lifecycle
2023-02-02 12:02:34 +01:00
Ilya Averyanov 14f528cc86 fix(resources): fix resource lifecycle
* do not resume all buffer workers on successful healthcheck
* do not pass undefined state to resource healthcheck callback
2023-02-01 18:26:13 +02:00
Andrew Mayorov 5fd7f65a1f
test(bufworker): make testcase simpler to follow
The confusion was due to the fact that subsequent query was missing
`async_reply_fun` and thus, was not accumulating in the results.
2023-02-01 16:52:47 +03:00
Andrew Mayorov ff473e0f1b
test(bufworker): fix testcase flapping due to data races 2023-02-01 12:57:46 +03:00
Zaiming (Stone) Shi b3ad9e97d2
Merge pull request #9870 from keynslug/fix/mqtt-connection-loss-feedback
feat(mqtt-bridge): avoid middleman process
2023-01-31 19:12:18 +01:00
Andrew Mayorov c76311c9c3
fix(buffer): count inflight batches properly 2023-01-31 18:30:42 +03:00
Zaiming (Stone) Shi d47941601d refactor(buffer_worker): rename trace points 2023-01-28 11:52:11 +01:00
Zaiming (Stone) Shi fc38ea9571 refactor(buffer_worker): do not keep request body in reply context
the request body can be potentially very large
the reply context is sent to the async call handler and kept
in its memory until the async reply is received from bridge
target service.

this commit tries to minimize the size of the reply context
by replacing the request body with `[]`.
2023-01-27 17:12:55 +01:00
Stefan Strigler 2d62de5188 test: fix expected result from timeout error 2023-01-27 11:43:48 +01:00
Zaiming (Stone) Shi 1f799dfd59 fix: reply with {error, buffer_overflow} when discarded 2023-01-26 17:15:36 +01:00
Thales Macedo Garitezi 6fa6c679bb feat(buffer_worker): add expiration time to requests
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.

Also introduces two new counters:

- `dropped.expired` :: happens when a request expires before being
  sent downstream
- `late_reply` :: when a response is receive from downstream, but the
  caller is no longer for a reply because the request has expired, and
  the caller might even have retried it.
2023-01-20 11:36:52 -03:00
Thales Macedo Garitezi 47f796dd12 refactor: rename `emqx_resource_worker` -> `emqx_resource_buffer_worker`
To make it more clear that it's purpose is serve as a buffering layer.
2023-01-18 16:15:34 -03:00
Thales Macedo Garitezi 5c2ac0ac81 chore: don't cancel inflight items upon worker death; retry them 2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi fa01deb3eb chore: retry as much as possible, don't reply to caller too soon 2023-01-17 16:49:15 -03:00
Thales Macedo Garitezi b5aaef084c refactor: enter running state directly
now that we don't have the possibility of dirty disk queues (we always
use volatile replayq), we will never resume old work.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 006b4bda97 feat(buffer_worker): monitor async workers and cancel their inflight requests upon death 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 731ac6567a fix(buffer_worker): don't retry all kinds of inflight requests
Some requests should not be retried during the blocked state.  For
example, if some async requests are just taking some time to process,
we should avoid retrying them periodically, lest risk overloading the
downstream further.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 5dd24a64c3 refactor(buffer_worker): check if inflight is full before flushing 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 81fc561ed5 fix(buffer_worker): check for overflow after enqueuing new requests 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 32a9e60313 feat(buffer_worker): also use the inflight table for sync requests
Related: https://emqx.atlassian.net/browse/EMQX-8692

This should also correctly account for `retried.*` metrics for sync
requests.

Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.

Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi c383558467 fix(buffer): fix `replayq` usages in buffer workers (5.0)
https://emqx.atlassian.net/browse/EMQX-8700

Fixes a few errors in the usage of `replayq` queues.

- Close `replayq` when `emqx_resource_worker` terminates.
- Do not keep old references to `replayq` after any `pop`s.
- Clear `replayq`'s data directories when removing a resource.
2023-01-17 16:48:48 -03:00
Kjell Winblad 734e6b9c96 chore: fix flaky test cases, log labels and review comments
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
2023-01-13 11:05:02 +01:00