The request body can potentially be very large. The reply context
is sent to the async call handler and kept in its memory until the
async reply is received from the bridge target service. This commit
minimizes the size of the reply context by replacing the request
body with `[]`.
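A minimal sketch of the idea (module and field names here are
illustrative, not the actual EMQX code):

    -module(ctx_minimize_sketch).
    -export([minimize/1]).

    %% Replace the potentially large request body with [] before the
    %% reply context is retained by the async call handler.
    minimize({send_message, Msg}) when is_map(Msg) ->
        {send_message, Msg#{body => []}}.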
This reduces the log level from error to warning in places
connected to the InfluxDB bridge. Transient errors for external
resources should not produce error logs.
Some telemetry events from wolff are discarded (see the
subscription sketch after this list):
* dropped:
  this is double counted in wolff;
  we now only subscribe to the dropped_queue_full event
* retried_failed:
  it has a different meaning in wolff:
  in wolff, it means it's the 2nd (or later) produce attempt;
  in EMQX, it means the request eventually failed after some retries
* retried_success:
  since we are going to handle the success counters in the callback,
  having this reported from wolff would only make things
  harder to understand
* failed:
  wolff never fails (unless it drops, which is a different counter)
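A minimal sketch of the subscription, assuming the event path
`[wolff, dropped_queue_full]` and a `counter_inc` measurement key
(both taken from this commit's context, not verified against
wolff's source):

    -module(wolff_drop_sub_sketch).
    -export([attach/0, handle_event/4]).

    %% Attach only to the queue-full drop event; subscribing to the
    %% generic 'dropped' event as well would double count.
    attach() ->
        telemetry:attach_many(
          <<"wolff-drop-sketch">>,
          [[wolff, dropped_queue_full]],
          fun ?MODULE:handle_event/4,
          #{}).

    handle_event([wolff, dropped_queue_full], Measurements, _Meta, _Cfg) ->
        Inc = maps:get(counter_inc, Measurements, 1),
        bump_dropped_queue_full(Inc).

    bump_dropped_queue_full(_Inc) -> ok.  %% stand-in for the real metrics call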
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters (see the sketch after this list):
- `dropped.expired` :: happens when a request expires before being
  sent downstream
- `late_reply` :: when a response is received from downstream, but
  the caller is no longer waiting for a reply because the request
  has expired, and may even have retried it.
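A sketch of both checks, with illustrative names (`deadline` is
assumed to be an absolute monotonic time in milliseconds):

    -module(expiry_sketch).
    -export([maybe_send/2, on_reply/2]).

    %% Drop requests that expired before being sent downstream.
    maybe_send(#{deadline := Deadline} = Req, SendFun) ->
        case erlang:monotonic_time(millisecond) > Deadline of
            true ->
                bump(dropped_expired),  %% 'dropped.expired'
                expired;
            false ->
                SendFun(Req)
        end.

    %% Discard responses whose caller has already given up.
    on_reply(#{deadline := Deadline}, Result) ->
        case erlang:monotonic_time(millisecond) > Deadline of
            true ->
                bump(late_reply),  %% caller may even have retried
                discard;
            false ->
                {reply, Result}
        end.

    bump(_Counter) -> ok.  %% stand-in for the real metrics call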
Some requests should not be retried during the blocked state. For
example, if some async requests are just taking some time to process,
we should avoid retrying them periodically, lest we risk
overloading the downstream further.
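For illustration, one way to express that filter (all names
hypothetical):

    -module(blocked_retry_sketch).
    -export([retriable/2]).

    %% When retrying in the 'blocked' state, skip requests that are
    %% still in flight; re-sending them would only add load downstream.
    retriable(Requests, InflightIds) ->
        [Req || #{id := Id} = Req <- Requests,
                not lists:member(Id, InflightIds)].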
Related: https://emqx.atlassian.net/browse/EMQX-8692
This should also correctly account for `retried.*` metrics for sync
requests.
Also fixes cases where race conditions when retrying async requests
could lead to inconsistent metrics.
Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
https://emqx.atlassian.net/browse/EMQX-8700
Fixes a few errors in the usage of `replayq` queues (see the sketch
after this list).
- Close `replayq` when `emqx_resource_worker` terminates.
- Do not keep old references to `replayq` after any `pop`.
- Clear `replayq`'s data directories when removing a resource.
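A sketch of the intended `replayq` usage pattern (batch size and
helper names are illustrative):

    -module(replayq_usage_sketch).
    -export([flush/1, terminate/2]).

    %% Always carry forward the queue handle returned by pop/append;
    %% the previous handle is stale and must not be reused.
    flush(Q0) ->
        {Q1, AckRef, Items} = replayq:pop(Q0, #{count_limit => 10}),
        ok = handle_batch(Items),
        ok = replayq:ack(Q1, AckRef),
        Q1.  %% return the new reference, never Q0

    %% Close the queue when the worker terminates.
    terminate(_Reason, Q) ->
        replayq:close(Q).

    handle_batch(_Items) -> ok.  %% stand-in for the real send logic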
Both `emqx_resource_manager` and `emqx_resource_worker` processes
leaked atoms, as they created a unique atom to use as a registered
name. This is fixed by removing the need to register the names.
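A sketch of the fix (the real modules differ; this only shows
starting a gen_server unregistered so that no per-resource atom is
created):

    -module(no_register_sketch).
    -behaviour(gen_server).
    -export([start_link/1]).
    -export([init/1, handle_call/3, handle_cast/2]).

    %% Previously the equivalent of
    %%   gen_server:start_link({local, binary_to_atom(Id, utf8)}, ...)
    %% created one atom per resource. Starting unregistered and
    %% keeping the pid (e.g. under a supervisor) avoids the dynamic
    %% atoms entirely.
    start_link(ResourceId) ->
        gen_server:start_link(?MODULE, ResourceId, []).

    init(ResourceId) -> {ok, #{resource_id => ResourceId}}.
    handle_call(_Req, _From, State) -> {reply, ok, State}.
    handle_cast(_Msg, State) -> {noreply, State}.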
Fixes: https://emqx.atlassian.net/browse/EMQX-8583
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers
and to rely on `replayq`'s capabilities of offloading to disk and
detecting overflow.
Also, this deprecates the `enable_batch` and `enable_queue`
resource creation options, as: i) queuing is now always enabled;
ii) batching is enabled if and only if `batch_size > 1`. The
corresponding metric `dropped.queue_not_enabled` is dropped, along
with `batching`. The `batching` gauge is too ephemeral, especially
given the default batch time of 20 ms, and is not shown in the
dashboard, so it was removed.
To avoid confusion for the users as to what persistence guarantees we
offer when buffering bridges/resources, we will always enable offload
mode for `replayq`. With this, when the buffer size is above the max
segment size, it'll flush the queue to disk, but on recovery after a
restart it'll clean the existing segments rather than resuming from
them.
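A sketch of opening such a queue (directory and segment size are
illustrative; the real configuration has more options):

    -module(buffer_queue_sketch).
    -export([open/1]).

    %% Offload mode: keep the queue in memory and spill to disk
    %% segments only on overflow; existing segments are not resumed
    %% after a restart.
    open(Dir) ->
        replayq:open(#{dir => Dir,
                       seg_bytes => 10 * 1024 * 1024,
                       offload => true}).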
https://emqx.atlassian.net/browse/EMQX-8548
Currently, we face several issues trying to keep resource metrics
reasonable. For example, when a resource is re-created its metrics
are reset, but its durable queue then resumes its previous work,
which leads to strange (often negative) metrics.
Instead of using `counters` shared by more than one worker to
manage gauges, we introduce an ETS table whose key is scoped not
only by the Resource ID, as before, but also by the worker ID.
This way, when a worker starts or terminates, it sets its own
gauges to the correct values (often 0, or `replayq:count/1` when
resuming from a queue). With this scoping and initialization
procedure, we'll hopefully avoid those strange metrics scenarios
and have better control over the gauges.
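A sketch of the table layout (names illustrative, Resource IDs
assumed to be binaries): each worker owns its own row, and the
reported gauge is the sum across workers.

    -module(gauge_sketch).
    -export([create_table/0, set_gauge/4, total/2]).

    create_table() ->
        ets:new(resource_gauges,
                [named_table, set, public, {write_concurrency, true}]).

    %% A worker sets its own row on start/terminate (0, or
    %% replayq:count/1 when resuming from a queue).
    set_gauge(ResId, WorkerId, Gauge, Val) ->
        ets:insert(resource_gauges, {{ResId, WorkerId, Gauge}, Val}).

    %% Reported value: sum over all workers of the resource.
    total(ResId, Gauge) ->
        MS = [{{{ResId, '_', Gauge}, '$1'}, [], ['$1']}],
        lists:sum(ets:select(resource_gauges, MS)).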
This commit adds support for counters and gauges to the Kafka
bridge. The Kafka bridge uses
[Wolff](https://github.com/kafka4beam/wolff) for the Kafka
connection. Wolff does its own batching and does not use the
batching functionality in `emqx_resource_worker` that other bridge
types use. Therefore, the counter events have to be generated by
Wolff. We have added
[telemetry](https://github.com/beam-telemetry/telemetry) events to
Wolff that we hook into to update counters and gauges for the Kafka
bridge. The `matched` counter does not depend on bridge-specific
functionality, so updating it is moved higher up in the call chain
than before, so that it also gets updated for Kafka bridges.
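A sketch of the handler side, mapping Wolff telemetry events onto
bridge metrics (the event paths and measurement keys here are
assumptions for illustration, not Wolff's documented payload):

    -module(kafka_bridge_metrics_sketch).
    -export([handle_event/4]).

    handle_event([wolff, queuing], #{gauge_set := N},
                 #{bridge_id := Id}, _Cfg) ->
        set_gauge(Id, queuing, N);
    handle_event([wolff, success], #{counter_inc := N},
                 #{bridge_id := Id}, _Cfg) ->
        inc_counter(Id, success, N);
    handle_event(_Event, _Measurements, _Meta, _Cfg) ->
        ok.

    set_gauge(_Id, _Gauge, _N) -> ok.     %% stand-ins for the real
    inc_counter(_Id, _Counter, _N) -> ok. %% metrics helpers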