370 Commits

Author SHA1 Message Date
Thales Macedo Garitezi 5c2ac0ac81 chore: don't cancel inflight items upon worker death; retry them 2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi 087b667263 fix(buffer_worker): allow signalling unrecoverable errors 2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi 4ed7bff33f chore: fix dialyzer warnings 2023-01-17 16:49:16 -03:00
Thales Macedo Garitezi fa01deb3eb chore: retry as much as possible, don't reply to caller too soon 2023-01-17 16:49:15 -03:00
Thales Macedo Garitezi b82009bc29 refactor: use monotonic times as refs and store initial times when creating ets
with this, we may measure latencies in the future.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 3ba65c4377 feat: poke the buffer workers when inflight is no longer full
if max inflight = 1, then we only make progress based on the state
timer, since the callbacks were not poking the buffer workers.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi b5aaef084c refactor: enter running state directly
now that we don't have the possibility of dirty disk queues (we always
use volatile replayq), we will never resume old work.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi bd0e2a74ba refactor: rename inflight_name field to inflight_tid 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 006b4bda97 feat(buffer_worker): monitor async workers and cancel their inflight requests upon death 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 731ac6567a fix(buffer_worker): don't retry all kinds of inflight requests
Some requests should not be retried during the blocked state.  For
example, if some async requests are just taking some time to process,
we should avoid retrying them periodically, lest we risk overloading the
downstream further.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 5425f3d88e refactor: rm unused fn 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 5dd24a64c3 refactor(buffer_worker): check if inflight is full before flushing 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 344eeebe63 fix: always ack async replies
The caller should decide if it should retry in that case, to avoid
overwhelming the resource with retries.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi bd95a95409 refactor: remove redundant `BlockWorker` arg, change boolean to ack/nack
`BlockWorker` was always false (ack).  Also, changed the return to
something more semantic than a boolean to avoid [boolean
blindness](https://runtimeverification.com/blog/code-smell-boolean-blindness/)
2023-01-17 16:48:48 -03:00
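As an illustration of the boolean-to-ack/nack change described above, here is a minimal Python sketch (the project itself is Erlang). The names `Reply` and `next_state` are hypothetical, and the mapping of nack to a blocked worker is an assumption inferred from the old `BlockWorker` flag, not the actual implementation.

```python
from enum import Enum


class Reply(Enum):
    """Outcome reported back to the buffer worker (hypothetical names)."""
    ACK = "ack"    # request is settled; the worker may keep going
    NACK = "nack"  # request is not settled; the worker should block


def next_state(reply: Reply) -> str:
    # Assumption: ack keeps the worker running and nack blocks it,
    # mirroring the old `BlockWorker` flag where false meant ack.
    return "running" if reply is Reply.ACK else "blocked"
```

A call site then reads `next_state(Reply.ACK)` rather than `next_state(False)`, so the intent is visible without looking up what the boolean meant.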
Thales Macedo Garitezi 30a227bd38 refactor: rename `resume` state timeout to `unblock` 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 7401d6f0ce refactor: rename ack fn 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 196bf1c5ba feat: mass collect calls from mailbox also when blocked 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi d4724d6ce9 refactor: remove redundant function
`retry_queue` does basically what the running state does, now that we
refactored the buffer workers to always use the queue.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi d6a9d0aa48 fix: set queuing to 0 after buffer worker termination 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 81fc561ed5 fix(buffer_worker): check for overflow after enqueuing new requests 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 4cb83d0c9a fix: fix some expressions after refactoring 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi fecdbac9a8 refactor: rename a few functions 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi cdd8de11b0 chore: fix a typo in function name 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi 618b97870b refactor: call local function queue_count everywhere 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi 249c4c1c79 refactor: use 'bufs' for resource worker replayq dir 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi af6807e863 refactor: cancel flush timer sooner
Avoids the cancellation being delayed.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 477c55d8ef fix: sanitize replayq dir filepath
Colons (`:`) are not allowed in Windows.
2023-01-17 16:48:48 -03:00
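The idea behind this fix can be sketched in a few lines of Python (the project itself is Erlang). The function name `sanitize_dir_name` and the `-` replacement character are assumptions for illustration; the commit does not say which character replaces the colon.

```python
def sanitize_dir_name(name: str, replacement: str = "-") -> str:
    """Return `name` with every colon replaced, so it can be used as a
    directory name on Windows. The replacement character is hypothetical."""
    return name.replace(":", replacement)
```

For example, `sanitize_dir_name("bridge:mqtt:demo")` yields a name without colons, while names that contain none are returned unchanged.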
Thales Macedo Garitezi 4c04a01370 refactor(buffer_worker): remove `?Q_ITEM` wrapping and use lightweight size estimate 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 32a9e60313 feat(buffer_worker): also use the inflight table for sync requests
Related: https://emqx.atlassian.net/browse/EMQX-8692

This should also correctly account for `retried.*` metrics for sync
requests.

Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.

Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi ff23d25e8b chore(replayq): update replayq -> 0.3.6 and use `clean_start` for buffer workers
So we can truly avoid resuming work after a node restart.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi c383558467 fix(buffer): fix `replayq` usages in buffer workers (5.0)
https://emqx.atlassian.net/browse/EMQX-8700

Fixes a few errors in the usage of `replayq` queues.

- Close `replayq` when `emqx_resource_worker` terminates.
- Do not keep old references to `replayq` after any `pop`s.
- Clear `replayq`'s data directories when removing a resource.
2023-01-17 16:48:48 -03:00
Stefan Strigler e54f2f83b3 test: use same default timeout as elsewhere 2023-01-17 15:29:19 +01:00
Ivan Dyachkov 676f017ec0 fix: ensure no colon in filenames 2023-01-16 21:27:01 +01:00
Stefan Strigler e08c1d2229 Merge remote-tracking branch 'olcai/refactor-bridges-api' into dev/api-refactor 2023-01-13 15:49:52 +01:00
Stefan Strigler 1690a6dcfc Merge branch 'master' into dev/api-refactor 2023-01-13 15:34:13 +01:00
Erik Timan 61e98900be chore: bump app vsn of emqx_resource 2023-01-13 15:13:35 +01:00
Kjell Winblad 1ac03ab208 Merge pull request #9730 from kjellwinblad/kjell/fix/resource_atom_leak/EMQX-8583
fix: remove atom leaks
2023-01-13 14:38:28 +01:00
Kjell Winblad 734e6b9c96 chore: fix flaky test cases, log labels and review comments
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
2023-01-13 11:05:02 +01:00
Ivan Dyachkov b5d3e9d8b8 fix: remove time unit from duration fields description 2023-01-12 14:18:55 +01:00
Kjell Winblad 8c482e03d1 fix: remove atom leaks
Both emqx_resource_managers and emqx_resource_workers leaked atoms as they
created unique atoms to use as registered names. This is fixed by
removing the need to register the names.

Fixes: https://emqx.atlassian.net/browse/EMQX-8583
2023-01-11 17:03:28 +01:00
Stefan Strigler 8ad8288195 feat: report error in create_dry_run 2023-01-11 14:22:37 +01:00
Zaiming (Stone) Shi 85a8eff90b fix(emqx_resource_manager): do not start when disabled 2023-01-11 08:33:48 +01:00
Thales Macedo Garitezi 70eb5ffb58 refactor: remove unused function 2023-01-05 10:16:01 -03:00
Thales Macedo Garitezi 56437228dc docs: improve descriptions
Thanks to @qzhuyan for the corrections.
2023-01-05 10:16:01 -03:00
Thales Macedo Garitezi fd360ac6c0 feat(buffer_worker): refactor buffer/resource workers to always use queue
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.

Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enabled; ii) batch_size
> 1 <=> batch_enabled.  The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`.  The
batching is too ephemeral, especially considering a default batch time
of 20 ms, and is not shown in the dashboard, so it was removed.
2023-01-05 10:15:09 -03:00
Thales Macedo Garitezi bf3983e7c4 feat(buffer_worker): use offload mode for `replayq`
To avoid confusion for the users as to what persistence guarantees we
offer when buffering bridges/resources, we will always enable offload
mode for `replayq`.  With this, when the buffer size is above the max
segment size, it'll flush the queue to disk, but on recovery after a
restart it'll clean the existing segments rather than resuming from
them.
2023-01-05 10:11:59 -03:00
Erik Timan b9d012e072 refactor(emqx_resource): ingress bridge counter
Unify code paths for resource metrics by removing
emqx_resource:inc_received/1 and adding
emqx_resource_metrics:received_inc/1 & friends.
2023-01-02 15:11:52 +01:00
Thales Macedo Garitezi 7e02eac3bc Merge pull request #9619 from thalesmg/refactor-gauges-v50
refactor(metrics): use absolute gauge values rather than deltas (v5.0)
2023-01-02 10:56:47 -03:00
Zaiming (Stone) Shi dbc10c2eed chore: update copyright year 2023 2023-01-02 09:22:27 +01:00
Thales Macedo Garitezi 305ed68916 chore: bump app vsns 2022-12-30 16:51:24 -03:00
Thales Macedo Garitezi 8b060a75f1 refactor(metrics): use absolute gauge values rather than deltas
https://emqx.atlassian.net/browse/EMQX-8548

Currently, we face several issues trying to keep resource metrics
reasonable.  For example, when a resource is re-created and has its
metrics reset, but then its durable queue resumes its previous work
and leads to strange (often negative) metrics.

Instead of using `counters` that are shared by more than one worker to
manage gauges, we introduce an ETS table whose key is not only scoped
by the Resource ID as before, but also by the worker ID.  This way,
when a worker starts/terminates, they should set their own gauges to
their values (often 0 or `replayq:count` when resuming off a queue).
With this scoping and initialization procedure, we'll hopefully avoid
hitting those strange metrics scenarios and have better control over
the gauges.
2022-12-30 16:51:24 -03:00
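The scoping described above can be sketched in Python as follows (the project itself is Erlang and uses an ETS table). This is an in-memory stand-in under stated assumptions: the class `GaugeTable`, its method names, and the choice to sum per-worker values when reading are all hypothetical, not the project's API.

```python
class GaugeTable:
    """In-memory stand-in for a gauge table (hypothetical names).

    Each row is keyed by (resource_id, worker_id, metric) and holds an
    absolute value, so one worker can never corrupt another's gauge.
    """

    def __init__(self):
        self._rows = {}

    def set(self, resource_id, worker_id, metric, value):
        # Overwrite with an absolute value instead of applying a delta.
        self._rows[(resource_id, worker_id, metric)] = value

    def get(self, resource_id, metric):
        # Assumption: the resource-level gauge is the sum over its workers.
        return sum(
            value
            for (rid, _wid, name), value in self._rows.items()
            if rid == resource_id and name == metric
        )


table = GaugeTable()
table.set("bridge:a", 1, "queuing", 5)
table.set("bridge:a", 2, "queuing", 3)
total_before = table.get("bridge:a", "queuing")

# Worker 1 terminates and sets its own gauge to 0.
table.set("bridge:a", 1, "queuing", 0)
total_after = table.get("bridge:a", "queuing")
```

Because each worker only ever writes its own rows, resetting a gauge when a worker starts or terminates cannot push the resource-level total below zero, which is the failure mode that delta-based counters allowed.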
Zaiming (Stone) Shi f93c22045d fix: non-empty field should not be undefined 2022-12-24 11:41:45 +01:00
Zaiming (Stone) Shi 479e191dcf refactor: refine worker pool config and doc
worker pool is a buffer pool
the description hinted connection pool which is wrong.
2022-12-20 09:02:51 +01:00
Zaiming (Stone) Shi f611cbab45 chore: cap replayq seg size under total size 2022-12-19 23:16:05 +01:00
Andrew Mayorov 8a0ca38a77 fix: drop no longer supported dialyzer option 2022-12-16 13:45:05 +03:00
Zaiming (Stone) Shi 9e3da5b661 chore: bump app versions 2022-12-14 20:07:41 +01:00
Thales Macedo Garitezi 1cd91a24e9 feat(gcp_pubsub): implement GCP PubSub bridge (ee5.0) 2022-12-12 17:18:19 -03:00
Thales Macedo Garitezi 34e9056779 refactor: fix typo in variable name
Might confuse people to think it's related to `replayq`.
2022-12-12 17:17:51 -03:00
Thales Macedo Garitezi 62eeb4b8e8 feat(resource): reset metrics when stopping a resource 2022-10-18 09:32:35 -03:00
Thales Macedo Garitezi 2d01726b22 fix: account calls when resource is not connected as matched 2022-10-13 15:32:04 -03:00
Thales Macedo Garitezi 1b2b629cdd feat: emit telemetry events for all resource worker metrics 2022-10-13 15:32:04 -03:00
Thales Macedo Garitezi f0ff32c031 test: fix tests after counter changes 2022-10-11 17:45:48 -03:00
Thales Macedo Garitezi 357e5919ce chore: add copyright disclaimer 2022-10-11 09:51:16 -03:00
Kjell Winblad 57270fb8fc feat: add support for counters and gauges to the Kafka Bridge
This commit adds support for counters and gauges to the Kafka Bridge.
The Kafka bridge uses [Wolff](https://github.com/kafka4beam/wolff) for
the Kafka connection. Wolff does its own batching and does not use the
batching functionality in `emqx_resource_worker` that is used by other
bridge types. Therefore, the counter events have to be generated by
Wolff. We have added
[telemetry](https://github.com/beam-telemetry/telemetry) events to Wolff
that we hook into to change counters and gauges for the Kafka bridge. The
counter called `matched` does not depend on specific functionality of
any bridge type so the update of this counter is moved higher up in the
call chain than previously so that it also gets updated for Kafka
bridges.
2022-10-10 14:40:57 -03:00
Zaiming (Stone) Shi f6ac4c3a76 Merge pull request #8798 from zmstone/0815-feat-add-kafka-connector
feat: Add Kafka connector
2022-09-24 22:57:50 +02:00
Shawn b325633390 refactor(resource): resume from queue/inflight-window with async-sending and batching 2022-09-21 22:58:47 +08:00
Shawn 9aa7e826cb refactor(resource): fast resume resource worker if inflight msgs are ACKed 2022-09-17 00:34:30 +08:00
Shawn 8307f04c2e refactor(resource): save inflight size into the ETS table 2022-09-16 16:52:08 +08:00
Shawn d5d3972ff5 chore: add test cases for MQTT Bridge reconnecting 2022-09-15 10:19:33 +08:00
Shawn 4e211c12d3 fix(mqtt_bridge): return value of sending messages was discarded 2022-09-15 08:57:01 +08:00
Shawn 1c03c236f5 fix(mqtt_bridge): handle send_to_remote in idle state 2022-09-14 15:19:30 +08:00
Shawn f41adb0997 refactor: change some default values of resource_opts 2022-09-14 15:18:07 +08:00
Zaiming (Stone) Shi 0c1595be02 feat: Add Kafka connector 2022-09-13 19:46:56 +02:00
Shawn b9ae4ea276 refactor: rename some metrics for emqx_resource 2022-09-13 14:04:25 +08:00
Shawn 2b33ca6d49 fix: no error log print if insert bool values into mysql 2022-09-07 16:00:09 +08:00
Shawn 26234d38b9 fix: mark the async msg 'queuing' not 'sent.inflight' on recoverable_error 2022-09-02 18:41:43 +08:00
Shawn 83f21b4c65 refactor(resource): remove metrics 'sent.exception' 2022-09-02 12:46:53 +08:00
Shawn b45f3de8db refactor(resource): rename metrics batched,queued -> batching,queuing 2022-09-02 12:41:14 +08:00
Shawn 33c9c7d497 fix: incorrect message order when batch is enabled 2022-09-01 14:51:13 +08:00
Shawn 0ef0b68de4 refactor: change '{recoverable_error,Reason}' to '{error,{recoverable_error,Reason}}' 2022-08-31 18:25:00 +08:00
Shawn 73e19d84ee feat: use the new metrics to bridge APIs 2022-08-30 23:47:58 +08:00
Shawn 9e50866cd0 fix: rename queue_max_bytes -> max_queue_bytes 2022-08-30 17:18:54 +08:00
Shawn c4106c0d77 fix: resume the resource worker on health check success 2022-08-30 12:28:43 +08:00
Shawn 6fde37791c refactor: new metrics for resources 2022-08-30 10:14:10 +08:00
Shawn 1625b8eaeb fix(mysql_bridge): export the query_mode option to the APIs 2022-08-26 17:11:24 +08:00
Shawn 6b0ccfbc43 refactor: rename the error return resource_down -> recoverable_error 2022-08-26 17:11:12 +08:00
Shawn a896aa8b27 fix: incorrect replayq dir for the emqx_resource 2022-08-25 16:06:18 +08:00
Shawn 86577365e4 fix: use gen_statem:cast/3 for async query 2022-08-23 22:41:45 +08:00
JimMoen f0c2b53868 fix(bpapi): make bpapi static_checks happy 2022-08-22 10:51:44 +08:00
JimMoen 62ecf6f545 fix(resource): keep `auto_retry` in `disconnected` state
Automatic retries should be maintained even in `disconnected` state without any state transition.
2022-08-22 02:52:06 +08:00
JimMoen 7c4ea38c06 fix(resource): make some resource opts internal
Resource options `start_after_created` and `start_timeout` are internal opts.
Not provided to users anymore.
2022-08-22 02:22:57 +08:00
JimMoen 06363e63d9 fix(influxdb): connector use a fallback `pool_size` for influxdb client 2022-08-19 15:54:19 +08:00
Shawn 9e35032d78 fix: make resume_interval defaults to health_check_interval 2022-08-16 10:09:02 +08:00
Shawn de3a325953 fix: revert the changes in connector mysql 2022-08-16 09:06:13 +08:00
Xinyu Liu 2898966439 Merge branch 'dev/ee5.0' into resource_opts 2022-08-15 21:43:22 +08:00
Shawn 19d85d485b refactor(resource): add resource_opts level into config structure 2022-08-15 21:40:10 +08:00
Shawn d1de262f31 fix: inc 'actions.failed' if bridge query failed 2022-08-15 17:21:14 +08:00
Shawn 665ef4142d fix: unify the health check interval 2022-08-15 17:21:14 +08:00
JimMoen 68946f1f6c feat: influxdb support `async`/`batch_async` query 2022-08-15 14:02:17 +08:00
JimMoen b01ae8ece6 chore: refine influxdb bridge/connector i18n 2022-08-15 14:00:14 +08:00
JimMoen 594d071c05 feat(influxdb): add async callback 2022-08-12 18:26:47 +08:00
JimMoen fa5e8f1422 chore: refine i18n label 2022-08-12 16:39:03 +08:00
JimMoen 3678673124 fix: schema default value using raw type before convert 2022-08-12 16:38:46 +08:00
Shawn 0cdf4b47f1 feat: add more resource creation opts 2022-08-12 13:47:45 +08:00
Shawn c3c4ed02b4 fix: bump emqx_dashboard to 5.0.4 2022-08-12 00:24:58 +08:00
JimMoen 3a76a50382 fix: syntax error and compile error 2022-08-11 20:58:43 +08:00
Shawn 2872f0b668 fix(bridges): support create resources with options 2022-08-11 19:11:44 +08:00
JimMoen 0f6c371760 feat(influxdb): influxdb connector add `on_batch_query/3` callback 2022-08-11 18:12:41 +08:00
JimMoen 22a4ca311c feat(resource): resource batch/async/queue config schema 2022-08-11 16:59:18 +08:00
Shawn 6203a01320 feat: add inflight window to emqx_resource 2022-08-11 08:36:35 +08:00
Shawn 82550a585a fix: add test cases for query async 2022-08-10 00:45:34 +08:00
Shawn efd6c56dd9 fix: test cases for batch query sync 2022-08-10 00:45:34 +08:00
Shawn 145ff66a9a fix: issues found by dialyzer and elvis 2022-08-10 00:45:26 +08:00
Shawn 35fe70b887 feat: support async callback to connector modules 2022-08-10 00:34:35 +08:00
Shawn f1419d52f1 fix(resource): remove resource at the end of each test 2022-08-10 00:34:35 +08:00
Shawn a2afdeeb48 feat: add test cases for batching query 2022-08-10 00:34:35 +08:00
Shawn 75adba0781 fix: increase resource metrics using the resource id 2022-08-10 00:34:35 +08:00
Shawn d3950b9534 fix(resource): make option 'queue_enabled' disabled by default 2022-08-10 00:34:35 +08:00
Shawn 0377d3cf61 fix: update existing testcases for new emqx_resource 2022-08-10 00:34:35 +08:00
Shawn 2fb42e4d37 refactor: create emqx_resource_worker_sup for resource workers 2022-08-10 00:34:35 +08:00
Shawn 0087b7c960 fix: remove the extra file replay.erl 2022-08-10 00:34:35 +08:00
Shawn d8d8d674e4 feat(resource): start emqx_resource_worker in pools 2022-08-10 00:34:35 +08:00
Shawn 12904d797f feat(resource): first commit for batching/async/caching mechanism 2022-08-10 00:34:35 +08:00
DDDHuang 98b36c4681 fix: hstream db connector, TODO: start apps 2022-07-27 11:38:45 +08:00
JianBo He a78a389206 chore: using standard log format 2022-07-01 12:06:35 +08:00
Shawn d6ef2f7502 refactor: graceful recreate resources 2022-06-17 05:29:18 +08:00
Shawn cc25f92273 feat: add start_after_created option to resource:create/4 2022-06-16 23:34:52 +08:00
Zaiming (Stone) Shi 2065be569e fix(emqx_cluster_rpc): fail fast on stale state
Due to:

* Cluster RPC MFA is not idempotent!
* There is a lack of rollback for callback's side-effects

For instance, when two nodes try to add a cluster-singleton
concurrently, one of them will have to wait for the table lock
then try to catch-up, then try to apply MFA.
The catch-up will have the singleton created, but the initiated
multicall apply will fail causing the commit to roll back,
but not to 'undo' the singleton creation.
Later, the retries will fail indefinitely.
2022-06-12 20:18:48 +02:00
Shawn b7f27157e5 fix: also alarm resource down when start resource failed 2022-06-01 15:41:55 +08:00
Shawn 88ca25c60c fix(resource): fast return when starting an unavailable resource 2022-06-01 08:24:53 +08:00
Shawn 9f69e3cad6 fix(resource): discard dry_run resource down alarm 2022-06-01 08:24:53 +08:00
Shawn d37a66e9b8 fix(test): update test cases for emqx_resource:health_check/1 2022-05-31 10:14:37 +08:00
Shawn 1054c364ad refactor(resource): improve health check and alarm it if resource down 2022-05-31 01:40:40 +08:00
EMQ-YangM 574a40b327 fix: wait for test_resource stop 2022-05-16 17:00:42 +08:00
EMQ-YangM b5addf7e05 fix: log all ignore events 2022-05-16 15:08:03 +08:00
EMQ-YangM bbbfea1b5b fix: ignore all other events 2022-05-16 15:08:03 +08:00
EMQ-YangM 1a1c82932a fix: when connecting health check failed, update status. 2022-05-16 10:47:20 +08:00
Chris 93799e3ac6 refactor: delete now unused emqx_resource modules 2022-05-16 09:54:26 +08:00
Xinyu Liu c4fd31ae25 Merge pull request #7916 from emqx/EMQX-4204-auto-timer-based-retry-when-in-disconnected-state
feat: add auto_retry for disconnected state in resource manager
2022-05-16 09:34:08 +08:00
JianBo He 3f59650e4b Merge pull request #7944 from EMQ-YangM/fix_bridge_status
fix: restart resource should not clear metrics
2022-05-16 09:16:12 +08:00
Zaiming (Stone) Shi c355c40ea8 refactor: call emqx_alarm:ensure_deactivated everywhere 2022-05-13 16:02:55 +02:00
Chris 6574c33797 feat: add auto_retry for disconnected state in resource manager 2022-05-13 11:19:39 +02:00
EMQ-YangM d5c416736b fix: restart resource should not clear metrics 2022-05-13 16:05:41 +08:00
JimMoen a5ddc5390f refactor(resource): add resource recreate fun with empty opts 2022-05-12 14:19:56 +08:00
Chris 0b3e30e813 feat: isolate resource manager processes 2022-05-09 13:24:34 +02:00
EMQ-YangM c52b464b3c fix: check process alive before health check 2022-04-29 17:34:26 +08:00
EMQ-YangM 1bf33f75cc fix: set resource status disconnected 2022-04-29 17:05:12 +08:00
Zaiming (Stone) Shi 4e65322667 refactor: move emqx_plugin_libs_metrics to emqx app
because it can not depend on other apps
2022-04-29 12:41:36 +08:00
DDDHuang 132b37813c refactor: code format emqx_connector emqx_resource 2022-04-28 15:32:47 +08:00
DDDHuang 667da90e52 refactor: resource instance do_create_dry_run 2022-04-28 15:32:41 +08:00