Commit Graph

370 Commits

Author SHA1 Message Date
Zaiming (Stone) Shi c0d478bd41 fix(buffer_worker): type spec 2023-02-02 14:11:12 +01:00
zhongwencool ee852d8204
Merge pull request #9886 from zhongwencool/mongo-connection-default-async
fix: remove async mode from mongodb/redis/mysql/pgsql bridge
2023-02-02 21:08:01 +08:00
Zaiming (Stone) Shi d5c482b0b0 docs: remove timer unit from description
the user input has time unit. e.g. "5s" for 5 seconds etc.
2023-02-02 13:49:20 +01:00
Zaiming (Stone) Shi 9864587389 fix: send to buffer-supported connector even when disconnected 2023-02-02 12:04:17 +01:00
Zaiming (Stone) Shi 13ef30c46c
Merge pull request #9884 from savonarola/resource-fixes
fix(resources): fix resource lifecycle
2023-02-02 12:02:34 +01:00
Zhongwen Deng 1c9035d24c test: remove async from redis ct 2023-02-02 17:37:18 +08:00
Zhongwen Deng 22cc1cc745 fix: make spell_check happy 2023-02-02 17:37:18 +08:00
Zhongwen Deng f8936013b7 chore: replace async with sync 2023-02-02 17:37:18 +08:00
Zhongwen Deng 22c3f50020 fix: add query_mode_sync_only for mysql pgsql redis mongodb bridge 2023-02-02 17:37:18 +08:00
Andrew Mayorov ca5c192f4b
Merge pull request #9882 from fwup/fix/no-mqtt-bridge-middleman
refactor(mqtt-worker): avoid unnecessary abstraction
2023-02-02 13:11:31 +04:00
Zhongwen Deng 1a90c1654c chore: bad typo 2023-02-02 11:43:04 +08:00
Ilya Averyanov 14f528cc86 fix(resources): fix resource lifecycle
* do not resume all buffer workers on successful healthcheck
* do not pass undefined state to resource healthcheck callback
2023-02-01 18:26:13 +02:00
Andrew Mayorov 5fd7f65a1f
test(bufworker): make testcase simpler to follow
The confusion was due to the fact that subsequent query was missing
`async_reply_fun` and thus, was not accumulating in the results.
2023-02-01 16:52:47 +03:00
Andrew Mayorov ff473e0f1b
test(bufworker): fix testcase flapping due to data races 2023-02-01 12:57:46 +03:00
Zaiming (Stone) Shi b3ad9e97d2
Merge pull request #9870 from keynslug/fix/mqtt-connection-loss-feedback
feat(mqtt-bridge): avoid middleman process
2023-01-31 19:12:18 +01:00
Andrew Mayorov c76311c9c3
fix(buffer): count inflight batches properly 2023-01-31 18:30:42 +03:00
Zaiming (Stone) Shi b3e486041b
Merge pull request #9853 from zmstone/0127-refactor-buffer-worker-no-need-to-keep-request-for-reply-callback
0127 refactor buffer worker no need to keep request for reply callback
2023-01-31 08:44:01 +01:00
Stefan Strigler 27881064dc fix: increase dropped.queue_full by number of messages 2023-01-30 11:37:35 +01:00
Zaiming (Stone) Shi d47941601d refactor(buffer_worker): rename trace points 2023-01-28 11:52:11 +01:00
Zaiming (Stone) Shi 7f66c6a9e2
Merge pull request #9840 from olcai/redact-influxdb-tokens
fix: redact influxdb tokens in logs and reduce log level
2023-01-28 11:47:36 +01:00
Zaiming (Stone) Shi fc38ea9571 refactor(buffer_worker): do not keep request body in reply context
the request body can be potentially very large
the reply context is sent to the async call handler and kept
in its memory until the async reply is received from bridge
target service.

this commit tries to minimize the size of the reply context
by replacing the request body with `[]`.
2023-01-27 17:12:55 +01:00
Zaiming (Stone) Shi 578271ea3d refactor: use lists:map instead of lc for safty 2023-01-27 15:15:46 +01:00
Zaiming (Stone) Shi f793807bc1 refactor(buffer_worker): rename function
batch_reply_after_query to handle_async_batch_reply
2023-01-27 15:04:28 +01:00
Zaiming (Stone) Shi 262c3a2869 refactor(buffer_worker): rename function
from reply_after_query to handle_async_reply
2023-01-27 15:03:18 +01:00
Zaiming (Stone) Shi 52b75ada04
Merge pull request #9832 from sstrigler/EMQX-8774-failure-to-handle-timeout-error-in-resource-worker
EMQX 8774 failure to handle timeout error in resource worker
2023-01-27 14:36:44 +01:00
Zaiming (Stone) Shi 514609bcf7
Merge pull request #9850 from zmstone/0127-fix-influxdb-bridge-atom-leak
0127 fix influxdb bridge atom leak
2023-01-27 14:30:20 +01:00
Zaiming (Stone) Shi d53106145f fix: stop resource when resource manager terminates 2023-01-27 12:39:05 +01:00
Stefan Strigler 2d62de5188 test: fix expected result from timeout error 2023-01-27 11:43:48 +01:00
Stefan Strigler a180bd9aa5 fix: catch error, not exit 2023-01-27 11:40:06 +01:00
Stefan Strigler b7e3f9d5a6 fix: try-case-of rather than try-of
try-of catches only what happens within but not after
2023-01-27 11:40:06 +01:00
Zaiming (Stone) Shi db2f631a8a refactor(buffer_worker): simplify caller reply 2023-01-27 11:33:45 +01:00
Zaiming (Stone) Shi d4fab92b72 refactor(buffer_worker): no need to keep request for REPLY macro 2023-01-27 10:41:30 +01:00
Zaiming (Stone) Shi 1f799dfd59 fix: reply with {error, buffer_overflow} when discarded 2023-01-26 17:15:36 +01:00
Zaiming (Stone) Shi ed28789164 refactor(buffer_worker): no need to return after collect into buf queue 2023-01-26 14:50:40 +01:00
Zaiming (Stone) Shi 25b4821adc refactor: move the the per-message overflow log from error to info level 2023-01-26 14:48:43 +01:00
Zaiming (Stone) Shi bb26632c8a fix(buffer_worker): fix a wrong assertion
the assertion is to ensure queue items are not binary
but should not assert the queue itself
2023-01-26 14:33:16 +01:00
Erik Timan 805d08e823 fix: reduce log level from error to warning in several places
This reduces the log level from error to warning in places that are
connected to the influxdb bridge. Transient errors for external
resources should not render an error log.
2023-01-25 14:49:50 +01:00
Zaiming (Stone) Shi 5fdf7fd24c fix(kafka): use async callback to bump success counters
some telemetry events from wolff are discarded:

* dropped:
    this is double counted in wolff,
    we now only subscribe to the dropped_queue_full event
* retried_failed:
    it has different meanings in wolff,
    in wolff, it means it's the 2nd (or onward) produce attempt
    in EMQX, it means it's eventually failed after some retries

* retried_success
    since we are going to handle the success counters in callbac
    this having this reported from wolff will only make things
    harder to understand

* failed
    wolff never fails (unelss drop which is a different counter)
2023-01-24 21:12:36 +01:00
Erik Timan 9d20431257 fix(emqx_resource): fix crash while flushing queue
We used next_event for flushing the queue in emqx_resource, but this
leads to a crash. We now call flush_worker/1 instead.
2023-01-24 14:13:35 +01:00
Erik Timan 28718edbfd chore: bump application VSNs 2023-01-24 14:12:34 +01:00
Zaiming (Stone) Shi 8fde169abb
Merge pull request #9821 from thalesmg/buffer-worker-expiry-v50
feat(buffer_worker): add expiration time to requests
2023-01-24 13:54:04 +01:00
Thales Macedo Garitezi ca4a262b75 refactor: re-organize dealing with unrecoverable errors 2023-01-20 12:00:17 -03:00
Thales Macedo Garitezi 6fa6c679bb feat(buffer_worker): add expiration time to requests
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.

Also introduces two new counters:

- `dropped.expired` :: happens when a request expires before being
  sent downstream
- `late_reply` :: when a response is receive from downstream, but the
  caller is no longer for a reply because the request has expired, and
  the caller might even have retried it.
2023-01-20 11:36:52 -03:00
Zaiming (Stone) Shi 1c3e055b13
Merge pull request #9822 from JimMoen/fix-schema-typo
chore: i18n typo fix
2023-01-20 11:11:18 +01:00
JimMoen 16f45a60fd
chore: i18n typo fix 2023-01-20 11:50:01 +08:00
Thales Macedo Garitezi 47f796dd12 refactor: rename `emqx_resource_worker` -> `emqx_resource_buffer_worker`
To make it more clear that it's purpose is serve as a buffering layer.
2023-01-18 16:15:34 -03:00
Ilya Averyanov 44a6e5ed15 chore(resources): add missing parameters to emqx_resource schema 2023-01-18 14:33:45 +02:00
Zaiming (Stone) Shi d4f3b4c8c2 Merge remote-tracking branch 'origin/master' into fix-buffer-clear-replayq-on-delete-v50 2023-01-18 11:39:47 +01:00
Ivan Dyachkov 430b0a03d4
Merge pull request #9780 from id/fix-ensure-no-colon-in-filenames
fix: ensure no colon in filenames
2023-01-18 09:36:16 +01:00
Zaiming (Stone) Shi faf5916ed6 test: relax recoverable/unrecoverable error check
for now, treat all other errors unrecoverable
2023-01-18 07:52:28 +01:00
Thales Macedo Garitezi 5c2ac0ac81 chore: don't cancel inflight items upon worker death; retry them 2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi 087b667263 fix(buffer_worker): allow signalling unrecoverable errors 2023-01-17 19:50:30 -03:00
Thales Macedo Garitezi 4ed7bff33f chore: fix dialyzer warnings 2023-01-17 16:49:16 -03:00
Thales Macedo Garitezi fa01deb3eb chore: retry as much as possible, don't reply to caller too soon 2023-01-17 16:49:15 -03:00
Thales Macedo Garitezi b82009bc29 refactor: use monotonic times as refs and store initial times when creating ets
with this, we may measure latencies in the future.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 3ba65c4377 feat: poke the buffer workers when inflight is no longer full
if max inflight = 1, then we only make progress based on the state
timer, since the callbacks were not poking the buffer workers.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi b5aaef084c refactor: enter running state directly
now that we don't have the possibility of dirty disk queues (we always
use volatile replayq), we will never resume old work.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi bd0e2a74ba refactor: rename inflight_name field to inflight_tid 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 006b4bda97 feat(buffer_worker): monitor async workers and cancel their inflight requests upon death 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 731ac6567a fix(buffer_worker): don't retry all kinds of inflight requests
Some requests should not be retried during the blocked state.  For
example, if some async requests are just taking some time to process,
we should avoid retrying them periodically, lest risk overloading the
downstream further.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 5425f3d88e refactor: rm unused fn 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 5dd24a64c3 refactor(buffer_worker): check if inflight is full before flushing 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 344eeebe63 fix: always ack async replies
The caller should decide if it should retry in that case, to avoid
overwhelming the resource with retries.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi bd95a95409 refactor: remove redundant `BlockWorker` arg, change boolean to ack/nack
`BlockWorker` was always false (ack).  Also, changed the return to
something more semantic than a boolean to avoid [boolean
blindness](https://runtimeverification.com/blog/code-smell-boolean-blindness/)
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 30a227bd38 refactor: rename `resume` state timeout to `unblock` 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 7401d6f0ce refactor: rename ack fn 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 196bf1c5ba feat: mass collect calls from mailbox also when blocked 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi d4724d6ce9 refactor: remove redundant function
`retry_queue` does basically what the running state does, now that we
refactored the buffer workers to always use the queue.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi d6a9d0aa48 fix: set queuing to 0 after buffer worker termination 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 81fc561ed5 fix(buffer_worker): check for overflow after enqueuing new requests 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 4cb83d0c9a fix: fix some expressions after refactoring 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi fecdbac9a8 refactor: rename a few functions 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi cdd8de11b0 chore: fix a typo in function name 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi 618b97870b refactor: call local function queue_count everywhere 2023-01-17 16:48:48 -03:00
Zaiming (Stone) Shi 249c4c1c79 refactor: use 'bufs' for resource worker replayq dir 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi af6807e863 refactor: cancel flush timer sooner
Avoids the cancellation being delayed.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 477c55d8ef fix: sanitizy replayq dir filepath
Colons (`:`) are not allowed in Windows.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 4c04a01370 refactor(buffer_worker): remove `?Q_ITEM` wrapping and use lightweight size estimate 2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi 32a9e60313 feat(buffer_worker): also use the inflight table for sync requests
Related: https://emqx.atlassian.net/browse/EMQX-8692

This should also correctly account for `retried.*` metrics for sync
requests.

Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.

Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi ff23d25e8b chore(replayq): update replayq -> 0.3.6 and use `clean_start` for buffer workers
So we can truly avoid resuming work after a node restart.
2023-01-17 16:48:48 -03:00
Thales Macedo Garitezi c383558467 fix(buffer): fix `replayq` usages in buffer workers (5.0)
https://emqx.atlassian.net/browse/EMQX-8700

Fixes a few errors in the usage of `replayq` queues.

- Close `replayq` when `emqx_resource_worker` terminates.
- Do not keep old references to `replayq` after any `pop`s.
- Clear `replayq`'s data directories when removing a resource.
2023-01-17 16:48:48 -03:00
Stefan Strigler e54f2f83b3 test: use same default timeout as elsewhere 2023-01-17 15:29:19 +01:00
Ivan Dyachkov 676f017ec0 fix: ensure no colon in filenames 2023-01-16 21:27:01 +01:00
Stefan Strigler e08c1d2229 Merge remote-tracking branch 'olcai/refactor-bridges-api' into dev/api-refactor 2023-01-13 15:49:52 +01:00
Stefan Strigler 1690a6dcfc
Merge branch 'master' into dev/api-refactor 2023-01-13 15:34:13 +01:00
Erik Timan 61e98900be chore: bump app vsn of emqx_resource 2023-01-13 15:13:35 +01:00
Kjell Winblad 1ac03ab208
Merge pull request #9730 from kjellwinblad/kjell/fix/resource_atom_leak/EMQX-8583
fix: remove atom leaks
2023-01-13 14:38:28 +01:00
Kjell Winblad 734e6b9c96 chore: fix flaky test cases, log labels and review comments
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
2023-01-13 11:05:02 +01:00
Ivan Dyachkov b5d3e9d8b8 fix: remove time unit from duration fields description 2023-01-12 14:18:55 +01:00
Kjell Winblad 8c482e03d1 fix: remove atom leaks
Both emqx_resource_managers and emqx_resource_workers leaked atoms as they
created an unique atoms to use as registered names. This is fixed by
removing the need to register the names.

Fixes: https://emqx.atlassian.net/browse/EMQX-8583
2023-01-11 17:03:28 +01:00
Stefan Strigler 8ad8288195 feat: report error in create_dry_run 2023-01-11 14:22:37 +01:00
Zaiming (Stone) Shi 85a8eff90b fix(emqx_resource_manager): do not start when disabled 2023-01-11 08:33:48 +01:00
Thales Macedo Garitezi 70eb5ffb58 refactor: remove unused function 2023-01-05 10:16:01 -03:00
Thales Macedo Garitezi 56437228dc docs: improve descriptions
Thanks to @qzhuyan for the corrections.
2023-01-05 10:16:01 -03:00
Thales Macedo Garitezi fd360ac6c0 feat(buffer_worker): refactor buffer/resource workers to always use queue
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.

Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enables; ii) batch_size
> 1 <=> batch_enabled.  The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`.  The
batching is too ephemeral, especially considering a default batch time
of 20 ms, and is not shown in the dashboard, so it was removed.
2023-01-05 10:15:09 -03:00
Thales Macedo Garitezi bf3983e7c4 feat(buffer_worker): use offload mode for `replayq`
To avoid confusion for the users as to what persistence guarantees we
offer when buffering bridges/resources, we will always enable offload
mode for `replayq`.  With this, when the buffer size is above the max
segment size, it'll flush the queue to disk, but on recovery after a
restart it'll clean the existing segments rather than resuming from
them.
2023-01-05 10:11:59 -03:00
Erik Timan b9d012e072 refactor(emqx_resource): ingress bridge counter
Unify code paths for resource metrics by removing
emqx_resource:inc_received/1 and adding
emqx_resource_metrics:received_inc/1 & friends.
2023-01-02 15:11:52 +01:00
Thales Macedo Garitezi 7e02eac3bc
Merge pull request #9619 from thalesmg/refactor-gauges-v50
refactor(metrics): use absolute gauge values rather than deltas (v5.0)
2023-01-02 10:56:47 -03:00
Zaiming (Stone) Shi dbc10c2eed chore: update copyright year 2023 2023-01-02 09:22:27 +01:00
Thales Macedo Garitezi 305ed68916 chore: bump app vsns 2022-12-30 16:51:24 -03:00