yuanbiao/emqx - emqx

Commit Graph

Author	SHA1	Message	Date
Zhongwen Deng	4f396a36a9	Merge remote-tracking branch 'upstream/master' into release-50	2023-05-08 14:58:03 +08:00
Thales Macedo Garitezi	8aa7c014e7	perf(buffer_worker): avoid calling `ets:info/2` (Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637 During the course of performance tests comparing the performance of e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we observed that the throughput in e5.0.3 (sync) was much lower than in e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively. Analyzing `observer_cli` output, we noticed that a lot of the time both buffer workers and ehttpc processes was spent in `ets:info/2`. That function was called to check the size of the inflight table when updating metrics and checking if the inflight table was full. Other uses of `ets:info/2` were contained inside the arguments to some `?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60). By using a specific record to track the size of the table, we managed to improve the bridge performance to ~ 45 k msgs / s in sync mode.	2023-05-02 17:05:32 -03:00
Andrew Mayorov	670709f746	feat(resource): ensure uniqueness through `gproc` Also use it instead of a custom ETS table for simplicity and better consistency. This has drawbacks though: expect slightly increased load on gproc gen_server due to how `gproc:set_value/2` works.	2023-05-02 17:29:22 +03:00
Thales Macedo Garitezi	c53741a08c	fix(buffer_worker): avoid sending late reply messages to callers Fixes https://emqx.atlassian.net/browse/EMQX-9635 During a sync call from process `A` to a buffer worker `B`, its call to the underlying resource `C` can be very slow. In those cases, `A` will receive a timeout response and expect no more messages from `B` nor `C`. However, prior to this fix, if `B` is stuck in a long sync call to `C` and then gets its response after `A` timed out, `B` would still send the late response to `A`, polluting its mailbox.	2023-04-26 13:18:28 -03:00
Thales Macedo Garitezi	d78312e10e	test(resource): fix flaky test	2023-04-26 09:25:33 -03:00
Serge Tupchii	b5eda9f0d1	perf(emqx_resource): don't reactivate alarms on reoccurring errors Avoid unnecessary calls to activate an alarm if it has been already activated. Fixes: EMQX-9529/#10357	2023-04-20 16:37:33 +03:00
Thales Macedo Garitezi	cb995e2033	fix(buffer_worker): avoid sending late reply messages to callers Fixes https://emqx.atlassian.net/browse/EMQX-9635 During a sync call from process `A` to a buffer worker `B`, its call to the underlying resource `C` can be very slow. In those cases, `A` will receive a timeout response and expect no more messages from `B` nor `C`. However, prior to this fix, if `B` is stuck in a long sync call to `C` and then gets its response after `A` timed out, `B` would still send the late response to `A`, polluting its mailbox.	2023-04-19 18:27:10 -03:00
Thales Macedo Garitezi	e073bc90bc	refactor(buffer_worker): rename `s/queue/buffer/g`	2023-04-14 11:37:19 -03:00
Thales Macedo Garitezi	14ed4a7ada	feat(buffer_worker): set default queue mode to `memory_only` Fixes https://emqx.atlassian.net/browse/EMQX-9367 For better user experience and performance for the average bridge, we should change the default queue mode to `memory_only`, as was the behavior of most bridges in e4.x. This leads to better performance when message rate is high enough and the remote resource is not keeping up with EMQX. Also, we set the default segment size to equal max queue bytes.	2023-04-14 11:37:19 -03:00
Andrew Mayorov	e70deae1c3	feat(resource): ask for metrics only when needed	2023-04-11 12:00:19 +03:00
Kjell Winblad	8e0d315b7b	Merge pull request #10197 from kjellwinblad/0321-fix-inflight-window-hand-over-to-kjell fix: add inflight window setting to the clickhouse bridge	2023-03-29 09:38:24 +02:00
Thales Macedo Garitezi	f8d5d53908	feat(buffer_worker): decouple query mode from underlying connector call mode Fixes https://emqx.atlassian.net/browse/EMQX-9129 Currently, if an user configures a bridge with query mode sync, then all calls to the underlying driver/connector ("inner calls") will always be synchronous, regardless of its support for async calls. Since buffer workers always support async queries ("outer calls"), we should decouple those two call modes (inner and outer), and avoid exposing the inner call configuration to user to avoid complexity. There are two situations when we want to force synchronous calls to the underlying connector even if it supports async: 1) When using `simple_sync_query`, since we are bypassing the buffer workers; 2) When retrying the inflight window, to avoid overwhelming the driver.	2023-03-23 13:40:31 -03:00
Kjell Winblad	35474578ca	refactor: rename async_inflight_window to inflight_window everywhere	2023-03-23 14:21:57 +01:00
Thales Macedo Garitezi	91a57faa95	Merge pull request #10128 from thalesmg/ocsp-v50-mkII feat: add ocsp stapling support to mqtt ssl listener (5.0)	2023-03-16 13:10:48 -03:00
Thales Macedo Garitezi	164440fe83	test(resource): fix flaky test Sometimes this test might retry more times, so we check the prefix of the trace only.	2023-03-15 14:25:55 -03:00
Andrew Mayorov	29907875bf	test(bufworker): set `batch_time` for batch-related testcases By default it's `0` since `e9d3fc51`. This made a couple of tests prone to flapping.	2023-03-15 19:17:30 +03:00
Andrew Mayorov	e411c5d5f8	refactor(resman): work with state cache atomically Also ensure that cache entries are always consistent with `Data`, so that most of the code could rely on reading the cached entry most of the time.	2023-03-15 19:17:30 +03:00
Thales Macedo Garitezi	422597a441	test: fix flaky tests	2023-03-14 16:08:47 -03:00
Andrew Mayorov	c883e4b36a	test: drop custom `loop_wait` in favor of snabkaffe's `?retry`	2023-02-24 18:16:35 +03:00
Zaiming (Stone) Shi	c97d17cc91	test: refactor to loop wait for counters	2023-02-24 09:02:03 +01:00
Zaiming (Stone) Shi	7a6465e2cf	fix(buffer_worker): ensure flush timer reset in blocked state	2023-02-23 21:06:38 +01:00
Zaiming (Stone) Shi	3a6dbbdd05	refactor(buffer_worker): ensure flsh message is never missed	2023-02-23 20:11:00 +01:00
Zaiming (Stone) Shi	dbfdeec5e9	fix(buffer_worker): log unknown async replies	2023-02-23 12:55:49 +01:00
Zaiming (Stone) Shi	036f69cd6e	test: ensure batch size > 1 is covered in expiration test	2023-02-22 23:26:04 +01:00
Zaiming (Stone) Shi	bf8becd521	test: make sure gauge return to 0 in test cases	2023-02-22 23:07:12 +01:00
Zaiming (Stone) Shi	fc614e16e5	fix(bridge): update inflight items after partial expiry	2023-02-22 22:05:56 +01:00
Erik Timan	2442a4dea7	test(emqx_resource): add regression test for recursive flushing	2023-02-16 14:17:16 +01:00
Andrew Mayorov	d8d06a260f	test(buffer): add test on inflight overflow w/ async queries This testcase should verify that the buffer will retry all inflight queries failed with recoverable errors + flush all outstanding queries. Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>	2023-02-08 14:08:04 +03:00
Zaiming (Stone) Shi	13ef30c46c	Merge pull request #9884 from savonarola/resource-fixes fix(resources): fix resource lifecycle	2023-02-02 12:02:34 +01:00
Ilya Averyanov	14f528cc86	fix(resources): fix resource lifecycle * do not resume all buffer workers on successful healthcheck * do not pass undefined state to resource healthcheck callback	2023-02-01 18:26:13 +02:00
Andrew Mayorov	5fd7f65a1f	test(bufworker): make testcase simpler to follow The confusion was due to the fact that subsequent query was missing `async_reply_fun` and thus, was not accumulating in the results.	2023-02-01 16:52:47 +03:00
Andrew Mayorov	ff473e0f1b	test(bufworker): fix testcase flapping due to data races	2023-02-01 12:57:46 +03:00
Zaiming (Stone) Shi	b3ad9e97d2	Merge pull request #9870 from keynslug/fix/mqtt-connection-loss-feedback feat(mqtt-bridge): avoid middleman process	2023-01-31 19:12:18 +01:00
Andrew Mayorov	c76311c9c3	fix(buffer): count inflight batches properly	2023-01-31 18:30:42 +03:00
Zaiming (Stone) Shi	d47941601d	refactor(buffer_worker): rename trace points	2023-01-28 11:52:11 +01:00
Zaiming (Stone) Shi	fc38ea9571	refactor(buffer_worker): do not keep request body in reply context the request body can be potentially very large the reply context is sent to the async call handler and kept in its memory until the async reply is received from bridge target service. this commit tries to minimize the size of the reply context by replacing the request body with `[]`.	2023-01-27 17:12:55 +01:00
Stefan Strigler	2d62de5188	test: fix expected result from timeout error	2023-01-27 11:43:48 +01:00
Zaiming (Stone) Shi	1f799dfd59	fix: reply with {error, buffer_overflow} when discarded	2023-01-26 17:15:36 +01:00
Thales Macedo Garitezi	6fa6c679bb	feat(buffer_worker): add expiration time to requests With this, we avoid performing work or replying to callers that are no longer waiting on a result. Also introduces two new counters: - `dropped.expired` :: happens when a request expires before being sent downstream - `late_reply` :: when a response is receive from downstream, but the caller is no longer for a reply because the request has expired, and the caller might even have retried it.	2023-01-20 11:36:52 -03:00
Thales Macedo Garitezi	47f796dd12	refactor: rename `emqx_resource_worker` -> `emqx_resource_buffer_worker` To make it more clear that it's purpose is serve as a buffering layer.	2023-01-18 16:15:34 -03:00
Thales Macedo Garitezi	5c2ac0ac81	chore: don't cancel inflight items upon worker death; retry them	2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi	fa01deb3eb	chore: retry as much as possible, don't reply to caller too soon	2023-01-17 16:49:15 -03:00
Thales Macedo Garitezi	b5aaef084c	refactor: enter running state directly now that we don't have the possibility of dirty disk queues (we always use volatile replayq), we will never resume old work.	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	006b4bda97	feat(buffer_worker): monitor async workers and cancel their inflight requests upon death	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	731ac6567a	fix(buffer_worker): don't retry all kinds of inflight requests Some requests should not be retried during the blocked state. For example, if some async requests are just taking some time to process, we should avoid retrying them periodically, lest risk overloading the downstream further.	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	5dd24a64c3	refactor(buffer_worker): check if inflight is full before flushing	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	81fc561ed5	fix(buffer_worker): check for overflow after enqueuing new requests	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	32a9e60313	feat(buffer_worker): also use the inflight table for sync requests Related: https://emqx.atlassian.net/browse/EMQX-8692 This should also correctly account for `retried.*` metrics for sync requests. Also fixes cases where race conditions for retrying async requests could potentially lead to inconsistent metrics. Fixes more cases where a stale reference to `replayq` was being held accidentally after a `pop`.	2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi	c383558467	fix(buffer): fix `replayq` usages in buffer workers (5.0) https://emqx.atlassian.net/browse/EMQX-8700 Fixes a few errors in the usage of `replayq` queues. - Close `replayq` when `emqx_resource_worker` terminates. - Do not keep old references to `replayq` after any `pop`s. - Clear `replayq`'s data directories when removing a resource.	2023-01-17 16:48:48 -03:00
Kjell Winblad	734e6b9c96	chore: fix flaky test cases, log labels and review comments Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>	2023-01-13 11:05:02 +01:00

1 2 3

107 Commits