yuanbiao/emqx - emqx

Commit Graph

Author	SHA1	Message	Date
Thales Macedo Garitezi	dc5e3b939c	refactor(resource_manager): use macros and better differentiate status from state Internally in `emqx_resource_manager`, there seems to be many points where the `gen_statem` states are conflated with resource status, since their names coincide. While that works for now, introducing a new `gen_statem` state, an internal state, shouldn't necessarily imply a new, externally facing resource status. Here we also introduce the usage of some macros to avoid the pitfalls of making a typo in a state/status name.	2023-12-01 18:23:05 -03:00
Ivan Dyachkov	ec10c51073	Merge remote-tracking branch 'upstream/release-53' into 1129-sync-r53	2023-11-30 19:51:12 +01:00
Zaiming (Stone) Shi	14644988e0	chore: change triple-quotes to single-quotes	2023-11-29 16:15:18 +01:00
JianBo He	c8b5c51bbc	chore: fix failed test cases	2023-11-28 09:53:46 +08:00
Kjell Winblad	9dc3a169b3	feat: split bridges into a connector part and a bridge part Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com> Co-authored-by: Stefan Strigler <stefan.strigler@emqx.io> Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com> Several bridges should be able to share a connector pool defined by a single connector. The connectors should be possible to enable and disable similar to how one can disable and enable bridges. There should also be an API for checking the status of a connector and for add/edit/delete connectors similar to the current bridge API. Issues: https://emqx.atlassian.net/browse/EMQX-10805	2023-10-30 14:48:47 +01:00
zhongwencool	dd687d9582	fix: dialyzer warning	2023-09-20 22:41:26 +08:00
zhongwencool	c26a18e949	fix: always return ok when remove local resource	2023-09-20 18:02:42 +08:00
Paulo Zulato	dfcede8794	fix: increment matched counter when bridge is unhealthy Fixes https://emqx.atlassian.net/browse/EMQX-10767	2023-08-30 10:52:53 -03:00
Paulo Zulato	cc3ba18734	fix: increment dropped message counter when bridge is unhealthy Fixes https://emqx.atlassian.net/browse/EMQX-10767	2023-08-28 19:47:11 -03:00
Paulo Zulato	42877e282d	fix: flatten error message on resource validator Fixes https://emqx.atlassian.net/browse/EMQX-10864	2023-08-25 13:53:52 -03:00
Andrew Mayorov	86d787eced	chore: bump hocon to 0.39.10 Which comes with a fix for slightly more user-friendly validation error messages.	2023-06-21 21:25:43 +02:00
Thales Macedo Garitezi	1d791d7a8c	fix(resource): validate maximum worker pool size Fixes https://emqx.atlassian.net/browse/EMQX-10297	2023-06-20 14:26:42 -03:00
Thales Macedo Garitezi	13746c2cdf	fix(resource): check status when (re)starting a resource Fixes https://emqx.atlassian.net/browse/EMQX-10290	2023-06-19 18:01:02 -03:00
Thales Macedo Garitezi	cc8631223e	fix(schema): avoid `function_clause` error when compacting errors Fixes https://emqx.atlassian.net/browse/EMQX-10168 The bridge probe API displayed the typecheck errors for the new timeout duration types correctly, but when an user tried to create the bridge anyway a `function_clause` error was raised when trying to compact hocon errors: ``` 09:47:19.045 [warning] [exception: :error, path: '/bridges', reason: {:case_clause, {:error, {:config_update_crashed, :function_clause}}}, stacktrace: [{:emqx_bridge_api, :create_or_update_bridge, 4, [file: '/home/thales/dev/emqx/emqx/apps/emqx_bridge/src/emqx_bridge_api.erl', line: 602]}, {:minirest_handler, :apply_callback, 3, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 111]}, {:minirest_handler, :handle, 2, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 44]}, {:minirest_handler, :init, 2, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 27]}, {:cowboy_handler, :execute, 2, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_handler.erl', line: 41]}, {:cowboy_stream_h, :execute, 3, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_stream_h.erl', line: 318]}, {:cowboy_stream_h, :request_process, 3, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_stream_h.erl', line: 302]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]] ``` This fixes the issue so that both APIs return more friendly error messages.	2023-06-07 10:34:58 -03:00
Thales Macedo Garitezi	46393343e2	chore: use `timeout_duration` types for timer fields Fixes https://emqx.atlassian.net/browse/EMQX-10020	2023-06-05 11:46:38 -03:00
Thales Macedo Garitezi	99796224d8	refactor(resource): rename `request_timeout` -> `request_ttl` See https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options	2023-06-01 13:01:53 -03:00
Thales Macedo Garitezi	10425eb925	feat(resource): deprecate `auto_restart_interval` in favor of `health_check_interval` See: https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options Current problem: In 5.0.x, we have two timer options that control the state changing of buffer worker resources: auto_restart_interval and health_check_interval. - auto_restart_interval controls how often the resource attempts to transition from disconnected to connected. - health_check_interval controls how often the resource is checked and potentially moved from connected to disconnected or connecting. The existence of two independent timers for very similar purposes is confusing to users, QA and even developers. Also, an intimately related configuration is request_timeout, which can interact badly with auto_restart_interval if the latter is poorly configured: requests may always expire if request_timeout < auto_restart_interval and if the resource enters the disconnected state. For health_check_interval, we attempt to derive a sane default that gives requests a chance to retry (if request timeout is finite, then the resource retries requests with a period of min(health_check_interval, request_timeout / 3). Another problem with the separate auto_restart_interval is that its default value (60 s) is too high when compared to the default request timeout and health check, leading to the problems described above if not tuned. Proposed solution: We propose to drop auto_restart_interval in favor of health_check_interval, which will be used for both disconnected -> connected and connected -> {disconnected, connecting} transition checks. With that, the resource will attempt to reconnect at the same interval as the health check, which currently is 15 s. Also, as two smaller changes to accompany this one: - Increase the default request_timeout from 15 s to 45 s. - Rename request_timeout to request_ttl.	2023-06-01 11:20:06 -03:00
Thales Macedo Garitezi	6be8ff378e	fix(buffer_worker): make buffer worker enter `blocked` state when async worker dies Fixes https://emqx.atlassian.net/browse/EMQX-10074 Otherwise, requests from those async workers, now retriable, might not be retried until the buffer worker blocks for other reasons, which might take a long time.	2023-05-30 15:34:22 -03:00
Thales Macedo Garitezi	db60dcbada	test(buffer_worker): add assertion for inflight count after batch expiration Fixes https://emqx.atlassian.net/browse/EMQX-9829	2023-05-25 16:11:37 -03:00
Thales Macedo Garitezi	7d798c10e9	perf(buffer_worker): flush metrics periodically inside buffer worker process Fixes https://emqx.atlassian.net/browse/EMQX-9905 Since calling `telemetry` is costly in a hot path, we instead collect metrics inside the buffer workers state and periodically flush them, rather than immediately as events happen.	2023-05-22 09:11:23 -03:00
Thales Macedo Garitezi	85089a3210	fix(buffer_worker): correctly flush the buffer workers when inflight table room is made The previous commit uncovered another bug that was hidden by it: `maybe_flush_after_async_reply` was sending a message to the wrong PID. It was sending a message to `self()` meaning to target a buffer worker, but `self()` in that context is never the buffer worker, it's the connector's worker. This change also revealed a race condition where the buffer workers could stop flushing messages. So we piggy-backed on the atomic update of the table size count to check if the buffer worker should be poked to continue flushing. This allows us to get rid of `maybe_flush_after_async_reply` altogether.	2023-05-16 17:15:42 -03:00
Zhongwen Deng	4f396a36a9	Merge remote-tracking branch 'upstream/master' into release-50	2023-05-08 14:58:03 +08:00
Thales Macedo Garitezi	8aa7c014e7	perf(buffer_worker): avoid calling `ets:info/2` (Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637 During the course of performance tests comparing the performance of e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we observed that the throughput in e5.0.3 (sync) was much lower than in e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively. Analyzing `observer_cli` output, we noticed that a lot of the time both buffer workers and ehttpc processes was spent in `ets:info/2`. That function was called to check the size of the inflight table when updating metrics and checking if the inflight table was full. Other uses of `ets:info/2` were contained inside the arguments to some `?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60). By using a specific record to track the size of the table, we managed to improve the bridge performance to ~ 45 k msgs / s in sync mode.	2023-05-02 17:05:32 -03:00
Andrew Mayorov	670709f746	feat(resource): ensure uniqueness through `gproc` Also use it instead of a custom ETS table for simplicity and better consistency. This has drawbacks though: expect slightly increased load on gproc gen_server due to how `gproc:set_value/2` works.	2023-05-02 17:29:22 +03:00
Thales Macedo Garitezi	c53741a08c	fix(buffer_worker): avoid sending late reply messages to callers Fixes https://emqx.atlassian.net/browse/EMQX-9635 During a sync call from process `A` to a buffer worker `B`, its call to the underlying resource `C` can be very slow. In those cases, `A` will receive a timeout response and expect no more messages from `B` nor `C`. However, prior to this fix, if `B` is stuck in a long sync call to `C` and then gets its response after `A` timed out, `B` would still send the late response to `A`, polluting its mailbox.	2023-04-26 13:18:28 -03:00
Thales Macedo Garitezi	d78312e10e	test(resource): fix flaky test	2023-04-26 09:25:33 -03:00
Serge Tupchii	b5eda9f0d1	perf(emqx_resource): don't reactivate alarms on reoccurring errors Avoid unnecessary calls to activate an alarm if it has been already activated. Fixes: EMQX-9529/#10357	2023-04-20 16:37:33 +03:00
Thales Macedo Garitezi	cb995e2033	fix(buffer_worker): avoid sending late reply messages to callers Fixes https://emqx.atlassian.net/browse/EMQX-9635 During a sync call from process `A` to a buffer worker `B`, its call to the underlying resource `C` can be very slow. In those cases, `A` will receive a timeout response and expect no more messages from `B` nor `C`. However, prior to this fix, if `B` is stuck in a long sync call to `C` and then gets its response after `A` timed out, `B` would still send the late response to `A`, polluting its mailbox.	2023-04-19 18:27:10 -03:00
Thales Macedo Garitezi	e073bc90bc	refactor(buffer_worker): rename `s/queue/buffer/g`	2023-04-14 11:37:19 -03:00
Thales Macedo Garitezi	14ed4a7ada	feat(buffer_worker): set default queue mode to `memory_only` Fixes https://emqx.atlassian.net/browse/EMQX-9367 For better user experience and performance for the average bridge, we should change the default queue mode to `memory_only`, as was the behavior of most bridges in e4.x. This leads to better performance when message rate is high enough and the remote resource is not keeping up with EMQX. Also, we set the default segment size to equal max queue bytes.	2023-04-14 11:37:19 -03:00
Andrew Mayorov	e70deae1c3	feat(resource): ask for metrics only when needed	2023-04-11 12:00:19 +03:00
Kjell Winblad	8e0d315b7b	Merge pull request #10197 from kjellwinblad/0321-fix-inflight-window-hand-over-to-kjell fix: add inflight window setting to the clickhouse bridge	2023-03-29 09:38:24 +02:00
Thales Macedo Garitezi	f8d5d53908	feat(buffer_worker): decouple query mode from underlying connector call mode Fixes https://emqx.atlassian.net/browse/EMQX-9129 Currently, if an user configures a bridge with query mode sync, then all calls to the underlying driver/connector ("inner calls") will always be synchronous, regardless of its support for async calls. Since buffer workers always support async queries ("outer calls"), we should decouple those two call modes (inner and outer), and avoid exposing the inner call configuration to user to avoid complexity. There are two situations when we want to force synchronous calls to the underlying connector even if it supports async: 1) When using `simple_sync_query`, since we are bypassing the buffer workers; 2) When retrying the inflight window, to avoid overwhelming the driver.	2023-03-23 13:40:31 -03:00
Kjell Winblad	35474578ca	refactor: rename async_inflight_window to inflight_window everywhere	2023-03-23 14:21:57 +01:00
Thales Macedo Garitezi	91a57faa95	Merge pull request #10128 from thalesmg/ocsp-v50-mkII feat: add ocsp stapling support to mqtt ssl listener (5.0)	2023-03-16 13:10:48 -03:00
Thales Macedo Garitezi	164440fe83	test(resource): fix flaky test Sometimes this test might retry more times, so we check the prefix of the trace only.	2023-03-15 14:25:55 -03:00
Andrew Mayorov	29907875bf	test(bufworker): set `batch_time` for batch-related testcases By default it's `0` since `e9d3fc51`. This made a couple of tests prone to flapping.	2023-03-15 19:17:30 +03:00
Andrew Mayorov	e411c5d5f8	refactor(resman): work with state cache atomically Also ensure that cache entries are always consistent with `Data`, so that most of the code could rely on reading the cached entry most of the time.	2023-03-15 19:17:30 +03:00
Thales Macedo Garitezi	422597a441	test: fix flaky tests	2023-03-14 16:08:47 -03:00
Andrew Mayorov	c883e4b36a	test: drop custom `loop_wait` in favor of snabkaffe's `?retry`	2023-02-24 18:16:35 +03:00
Zaiming (Stone) Shi	c97d17cc91	test: refactor to loop wait for counters	2023-02-24 09:02:03 +01:00
Zaiming (Stone) Shi	7a6465e2cf	fix(buffer_worker): ensure flush timer reset in blocked state	2023-02-23 21:06:38 +01:00
Zaiming (Stone) Shi	3a6dbbdd05	refactor(buffer_worker): ensure flsh message is never missed	2023-02-23 20:11:00 +01:00
Zaiming (Stone) Shi	dbfdeec5e9	fix(buffer_worker): log unknown async replies	2023-02-23 12:55:49 +01:00
Zaiming (Stone) Shi	036f69cd6e	test: ensure batch size > 1 is covered in expiration test	2023-02-22 23:26:04 +01:00
Zaiming (Stone) Shi	bf8becd521	test: make sure gauge return to 0 in test cases	2023-02-22 23:07:12 +01:00
Zaiming (Stone) Shi	fc614e16e5	fix(bridge): update inflight items after partial expiry	2023-02-22 22:05:56 +01:00
Erik Timan	2442a4dea7	test(emqx_resource): add regression test for recursive flushing	2023-02-16 14:17:16 +01:00
Andrew Mayorov	d8d06a260f	test(buffer): add test on inflight overflow w/ async queries This testcase should verify that the buffer will retry all inflight queries failed with recoverable errors + flush all outstanding queries. Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>	2023-02-08 14:08:04 +03:00
Zaiming (Stone) Shi	13ef30c46c	Merge pull request #9884 from savonarola/resource-fixes fix(resources): fix resource lifecycle	2023-02-02 12:02:34 +01:00

1 2 3

128 Commits