yuanbiao/emqx - emqx

Commit Graph

Author	SHA1	Message	Date
Thales Macedo Garitezi	a7f4f81c38	Merge pull request #10887 from thalesmg/fix-async-worker-down-buffer-worker-20230530-v50 fix: block buffer workers so they may retry requests	2023-05-30 17:39:18 -03:00
Andrew Mayorov	a2688325e5	Merge pull request #10754 from fix/EMQX-10056/mqtt feat(mqttconn): employ ecpool instead of single worker	2023-05-30 23:28:10 +03:00
Thales Macedo Garitezi	8c565abc84	test(cassandra): fix flaky test	2023-05-30 15:42:53 -03:00
Thales Macedo Garitezi	6be8ff378e	fix(buffer_worker): make buffer worker enter `blocked` state when async worker dies Fixes https://emqx.atlassian.net/browse/EMQX-10074 Otherwise, requests from those async workers, now retriable, might not be retried until the buffer worker blocks for other reasons, which might take a long time.	2023-05-30 15:34:22 -03:00
Andrew Mayorov	a5fc26736d	refactor(mqttconn): split ingress/egress into 2 separate pools Each with a more refined set of responsibilities, at the cost of slight code duplication. Also provide two different config fields for each pool size.	2023-05-30 17:21:44 +03:00
Thales Macedo Garitezi	75fcac9711	Merge pull request #10826 from thalesmg/test-partial-batch-expired-inflight-v50 test(buffer_worker): add assertion for inflight count after batch expiration	2023-05-30 09:05:59 -03:00
Zaiming (Stone) Shi	91cdc69976	Merge pull request #10867 from zmstone/0530-merge-release-50-to-master 0530 merge release 50 to master	2023-05-30 09:54:57 +02:00
Zaiming (Stone) Shi	9529919046	chore: bump app versions	2023-05-30 08:08:29 +02:00
Thales Macedo Garitezi	67e182e0c9	Merge pull request #10813 from thalesmg/refactor-kafka-on-stop-v50 feat(kafka): ensure allocated resources are removed on failures	2023-05-29 16:49:29 -03:00
Zaiming (Stone) Shi	36e268c933	chore: bump app versions	2023-05-26 16:05:37 +02:00
Zaiming (Stone) Shi	cc5b4d3748	Merge remote-tracking branch 'origin/release-50' into 0526-ci-delete-otp-24-from-standalone-app-test	2023-05-26 15:58:16 +02:00
Thales Macedo Garitezi	32e6213ce3	fix(resource_manager_sup): use `one_for_one` instead of `simple_one_for_one` Using `simple_one_for_one` has a potential race condition issue where we read the PID of the resource manager before trying to remove a resource, and then that PID changes because it was either dead at first, or it crashed and changed, and later we use this stale PID to try to remove it from the supervisor. Under such circumstances, the restarting child might linger in the supervisor, leaking resources. By using the resource ID itself as a child ID (and using `one_for_one` restart strategy), we ensure the child is truly removed.	2023-05-25 18:07:43 -03:00
Thales Macedo Garitezi	42b37690c7	refactor(pulsar): use macros for allocatable resources	2023-05-25 16:38:09 -03:00
Thales Macedo Garitezi	db60dcbada	test(buffer_worker): add assertion for inflight count after batch expiration Fixes https://emqx.atlassian.net/browse/EMQX-9829	2023-05-25 16:11:37 -03:00
Thales Macedo Garitezi	18d57ba3eb	Merge pull request #10812 from thalesmg/test-flakiness-20230524 test: attempts to reduce flakiness (pgsql, cassandra)	2023-05-25 09:29:13 -03:00
JianBo He	de7f1c8aec	test: add tests for auto_restart_interval	2023-05-25 17:15:19 +08:00
JianBo He	71b636e321	fix: fix auto_restart_interval checker	2023-05-25 12:04:23 +08:00
Paulo Zulato	122ebcac24	fix: add user-friendly message when interval is out of range	2023-05-24 15:46:00 -03:00
Thales Macedo Garitezi	7f88521836	test(pgsql): reduce flakiness Depending on timing, `t_write_timeout` was getting stuck while checking the resource health, and the previous request timeout options were making a response to never be sent if that process took too long.	2023-05-24 15:41:25 -03:00
Thales Macedo Garitezi	fd2940cd77	feat(pulsar): ensure allocated resources are removed on failures (v5.0) Fixes https://emqx.atlassian.net/browse/EMQX-9937	2023-05-24 12:29:00 -03:00
Zaiming (Stone) Shi	732a7be187	Merge remote-tracking branch 'origin/release-50'	2023-05-22 17:46:54 +02:00
Thales Macedo Garitezi	0559d6f639	refactor(buffer_worker): use static fn for bumping counters	2023-05-22 09:12:08 -03:00
Thales Macedo Garitezi	c74c93388e	refactor: rename some variables and sum type constructors for clarity	2023-05-22 09:11:23 -03:00
Thales Macedo Garitezi	7d798c10e9	perf(buffer_worker): flush metrics periodically inside buffer worker process Fixes https://emqx.atlassian.net/browse/EMQX-9905 Since calling `telemetry` is costly in a hot path, we instead collect metrics inside the buffer workers state and periodically flush them, rather than immediately as events happen.	2023-05-22 09:11:23 -03:00
Andrew Mayorov	ba6b208df2	fix(clickhouse): start app in tests Otherwise, depending on the test execution order, tests might sometimes fail. Moreover, ensure that applications describe their dependencies correctly and avoid starting irrelevant apps in tests.	2023-05-19 23:08:40 +03:00
Zaiming (Stone) Shi	cb76e5a241	docs: add changelog for 10755	2023-05-19 20:41:26 +02:00
Zaiming (Stone) Shi	0d8ffc0d59	fix(resource-manager): ensure no false creation Update is implemented as remove + create. If a delete call is made while the create is in progress, the remove call is likely to time out too. This causes the following creation to falsely succeed, because there is already a running child under the supervisor. As a result, the resource is permanently removed after resource_manager eventually handles the remove call.	2023-05-19 18:55:16 +02:00
Zaiming (Stone) Shi	f5e5c59763	refactor(resource-manager-sup): do not force kill resource manager the shutdown timeout is now set to infinity so it will never force kill a resource manager, otherwise there will be resource leaks	2023-05-19 18:55:16 +02:00
Zaiming (Stone) Shi	21de0f8274	fix(buffer-worker-sup): fast stop The timeout shutdown in the child spec may significantly slow down the deletion of a resource. This commit changes the shutdown to brutal kill. Also, the pool worker removal code has been deleted because it's not necessary, since the entire pool is going to be force-deleted later anyway.	2023-05-19 18:55:16 +02:00
firest	baeb96a6e4	chore: update changes	2023-05-19 15:36:18 +08:00
firest	0eea8438bf	fix(resource): make some logging of the resource manager more secure	2023-05-19 15:28:19 +08:00
Paulo Zulato	5d289ade56	fix: validate range for some bridge options Fixes https://emqx.atlassian.net/browse/EMQX-9864 Setting a very large interval can cause `erlang:start_timer` to crash. Also, setting auto_restart_interval or health_check_interval to "0s" causes the state machine to be in loop as time 0 is handled separately: | state_timeout() = timeout() | integer() | (...) | If Time is relative and 0 no timer is actually started, instead the the | time-out event is enqueued to ensure that it gets processed before any | not yet received external event. from "https://www.erlang.org/doc/man/gen_statem.html#type-state_timeout" Therefore, both fields are now validated against the range [1ms, 1h], which doesn't cause above issues.	2023-05-18 10:10:58 -03:00
Thales Macedo Garitezi	447b76464b	Merge branch 'release-50' into merge-r50-into-v50-a	2023-05-17 14:50:18 -03:00
Thales Macedo Garitezi	85089a3210	fix(buffer_worker): correctly flush the buffer workers when inflight table room is made The previous commit uncovered another bug that was hidden by it: `maybe_flush_after_async_reply` was sending a message to the wrong PID. It was sending a message to `self()` meaning to target a buffer worker, but `self()` in that context is never the buffer worker, it's the connector's worker. This change also revealed a race condition where the buffer workers could stop flushing messages. So we piggy-backed on the atomic update of the table size count to check if the buffer worker should be poked to continue flushing. This allows us to get rid of `maybe_flush_after_async_reply` altogether.	2023-05-16 17:15:42 -03:00
Thales Macedo Garitezi	657df05ad9	fix(buffer_worker): avoid setting flush timer when inflight is full Fixes https://emqx.atlassian.net/browse/EMQX-9902 When the buffer worker inflight window is full, we don’t need to set a timer to flush the messages again because there’s no more room, and one of the inflight windows will flush the buffer worker by calling `flush_worker`. Currently, we do set the timer in such a situation, and this fact combined with the default batch time of 0 yields a busy loop situation where the CPU spins a lot while inflight messages do not return.	2023-05-16 11:28:58 -03:00
zhongwencool	a953b951fe	Merge branch 'master' into sync-release-50-to-master	2023-05-12 18:01:58 +08:00
Thales Macedo Garitezi	64dc9ed46a	perf(metrics): avoid increasing counters by 0 Some performance tests indicate that calling `telemetry` is costly in hot paths. Since increasing a counter by 0 is a no-op, we should avoid calling `telemetry` if the amount to increase is 0.	2023-05-11 15:13:37 -03:00
Kjell Winblad	70cf1533db	feat: add RabbitMQ bridge	2023-05-09 14:32:26 +02:00
Zaiming (Stone) Shi	13dcb5732f	Merge remote-tracking branch 'origin/release-50' into 0508-prepare-for-e5.0.4	2023-05-08 21:29:35 +02:00
Thales Macedo Garitezi	eba627b365	fix(buffer_worker): fix inflight count when updating inflight item	2023-05-08 09:27:51 -03:00
Zhongwen Deng	4f396a36a9	Merge remote-tracking branch 'upstream/master' into release-50	2023-05-08 14:58:03 +08:00
Thales Macedo Garitezi	8aa7c014e7	perf(buffer_worker): avoid calling `ets:info/2` (Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637 During the course of performance tests comparing the performance of e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we observed that the throughput in e5.0.3 (sync) was much lower than in e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively. Analyzing `observer_cli` output, we noticed that a lot of the time in both buffer workers and ehttpc processes was spent in `ets:info/2`. That function was called to check the size of the inflight table when updating metrics and checking if the inflight table was full. Other uses of `ets:info/2` were contained inside the arguments to some `?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60). By using a specific record to track the size of the table, we managed to improve the bridge performance to ~ 45 k msgs / s in sync mode.	2023-05-02 17:05:32 -03:00
Andrew Mayorov	670709f746	feat(resource): ensure uniqueness through `gproc` Also use it instead of a custom ETS table for simplicity and better consistency. This has drawbacks though: expect slightly increased load on gproc gen_server due to how `gproc:set_value/2` works.	2023-05-02 17:29:22 +03:00
Andrew Mayorov	4575167607	feat(resource): drop `manager_id()` type	2023-05-02 17:29:20 +03:00
Andrew Mayorov	aaef95b1da	feat(resman): stop adding uniqueness to manager ids Before this change, a separate `manager_id` / `instance_id` was used as resource manager id, which made connector interface somewhat inconsistent: part of function calls to connector implementation used instance id as first argument while the rest used resource id itself.	2023-05-02 17:28:26 +03:00
Thales Macedo Garitezi	7853a4c36e	chore: bump app vsns	2023-04-27 11:58:28 -03:00
Thales Macedo Garitezi	567413389c	Merge pull request #10519 from thalesmg/fix-flaky-res-test-v50 test(resource): fix flaky test	2023-04-27 09:33:40 -03:00
Thales Macedo Garitezi	c53741a08c	fix(buffer_worker): avoid sending late reply messages to callers Fixes https://emqx.atlassian.net/browse/EMQX-9635 During a sync call from process `A` to a buffer worker `B`, its call to the underlying resource `C` can be very slow. In those cases, `A` will receive a timeout response and expect no more messages from `B` nor `C`. However, prior to this fix, if `B` is stuck in a long sync call to `C` and then gets its response after `A` timed out, `B` would still send the late response to `A`, polluting its mailbox.	2023-04-26 13:18:28 -03:00
Thales Macedo Garitezi	d78312e10e	test(resource): fix flaky test	2023-04-26 09:25:33 -03:00
zhongwencool	9d893b49eb	Merge branch 'master' into sync-release-50-to-master	2023-04-26 10:54:46 +08:00

457 Commits