Commit Graph

422 Commits

Author SHA1 Message Date
Thales Macedo Garitezi 8844b22c80
docs: improve descriptions
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
2023-03-22 15:32:09 -03:00
Thales Macedo Garitezi 127a075b66 test(dynamo): attempt to fix dynamo tests
Those tests in the `flaky` test are really flaky and require lots of
CI retries.

Apparently, the flakiness comes from race conditions from restarting
bridges with the same name too fast between test cases.  Previously,
all test cases were sharing the same bridge name (the module name).
2023-03-22 14:34:37 -03:00
Thales Macedo Garitezi 61cb03b45a fix(buffer_worker): change the default `resume_interval` value and expose it as hidden config
Also removes the previously added alarm for request timeout.

There are situations where having a short request timeout and a long
health check interval make sense, so we don't want to alarm the user
for those situations.  Instead, we automatically attempt to set a
reasonable `resume_interval` value.
2023-03-22 11:47:36 -03:00
Kjell Winblad 27b8445337 fix: add inflight window setting to the clickhouse bridge
This commit makes sure the inflight window setting is present for the
clickhouse bridge. It also changes emqx_resource_schema that previously
removed the inflight window setting from resources with query mode
`always_sync`. We don't need to do that because all bridges that uses
the buffer worker queue will get async call handling even if the bridge
don't support the async callback.

Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
2023-03-21 17:14:03 +01:00
Stefan Strigler c1384b6e6e feat(emqx_resource): include error with alarm for resource_down 2023-03-21 15:02:29 +01:00
Stefan Strigler 53825b9aba fix(emqx_bridge): propagate connection error to resource status 2023-03-21 15:02:29 +01:00
Thales Macedo Garitezi 20414d7373 fix(buffer_worker): check request timeout and health check interval
Fixes https://emqx.atlassian.net/browse/EMQX-9099

The default value for `request_timeout` is 15 seconds, and the default
resume interval is also 15 seconds (the health check timeout, if
`resume_interval` is not explicitly given).  This means that, in
practice, if a buffer worker ever gets into the blocked state, then
almost all requests will timeout.

Proposed improvement:

- `request_timeout` should by default be twice as much as
health_check_interval.
- Emit a alarm if `request_timeout` is not greater than
`health_check_interval`.
2023-03-16 13:46:45 -03:00
Thales Macedo Garitezi d464e2aad5 refactor: rename test resource prefix
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
2023-03-16 13:43:01 -03:00
Thales Macedo Garitezi 03342923b9 fix(bridge): use the same dry run prefix
Kafka Producer and Consumer bridges rely on this prefix for detecting
a dry run and avoid leaking atoms.  At some point, this prefix was
changed, effectively disabling the check in Kafka Producer.
2023-03-16 13:43:01 -03:00
Thales Macedo Garitezi 91a57faa95
Merge pull request #10128 from thalesmg/ocsp-v50-mkII
feat: add ocsp stapling support to mqtt ssl listener (5.0)
2023-03-16 13:10:48 -03:00
Thales Macedo Garitezi 164440fe83 test(resource): fix flaky test
Sometimes this test might retry more times, so we check the prefix
of the trace only.
2023-03-15 14:25:55 -03:00
Andrew Mayorov a9bc8a4464
refactor(resman): rename `ets_lookup` → `lookup_cached`
That way we hide the impementation details + the interface becomes
cleaner and more obvious.
2023-03-15 19:17:30 +03:00
Andrew Mayorov 29907875bf
test(bufworker): set `batch_time` for batch-related testcases
By default it's `0` since e9d3fc51. This made a couple of tests prone
to flapping.
2023-03-15 19:17:30 +03:00
Andrew Mayorov e411c5d5f8
refactor(resman): work with state cache atomically
Also ensure that cache entries are always consistent with `Data`,
so that most of the code could rely on reading the cached entry
most of the time.
2023-03-15 19:17:30 +03:00
Andrew Mayorov cad6492c99
perf(bridge-api): ask bridge listings in parallel
Also rename response formatting functions to better clarify their
purpose.
2023-03-15 19:17:29 +03:00
Thales Macedo Garitezi 422597a441 test: fix flaky tests 2023-03-14 16:08:47 -03:00
Andrew Mayorov 686bf8255b
fix(bridge): reply `emqx_resource:get_instance/1` from cache
The resource manager may be busy at times, so this change ensures that
getting resource instance state will not block. Currently, no users of
`emqx_resource:get_instance/1` do seem to be relying on state being
"as-actual-as-possible" guarantee it was providing.
2023-03-13 14:35:08 +03:00
Andrew Mayorov a86d06f043
chore: bump app versions following last merge-back 2023-03-10 16:44:15 +03:00
Zaiming (Stone) Shi fe27604010 Merge remote-tracking branch 'origin/release-50' into 0308-merge-release-50-back-to-master 2023-03-08 16:46:45 +01:00
Thales Macedo Garitezi eef65fba60 fix(buffer_worker): handle `request_timeout = infinity` case
The current schema allows `infinity` for `request_timeout`, so we have
to take that into account.  It's not currently possible to set
`batch_time = infinity`, so there's no need to treat that case.
2023-03-06 15:31:28 -03:00
Thales Macedo Garitezi 18ab7ed197 chore: bump app vsns 2023-03-06 15:31:28 -03:00
Thales Macedo Garitezi 0e707e837f docs(buffer_worker): improve description of `request_timeout` 2023-03-06 15:31:28 -03:00
Thales Macedo Garitezi e9d3fc511f chore(buffer_worker): change default `batch_time` to 0 and improve docs 2023-03-06 15:31:28 -03:00
Thales Macedo Garitezi f95a30ae89 fix(webhook): convert `request_timeout`s in root and resource_opts 2023-03-06 10:12:38 -03:00
Thales Macedo Garitezi 167b7a212f refactor(buffer_worker): avoid starting 0-time timers 2023-03-06 10:12:38 -03:00
Thales Macedo Garitezi e9ffabf936 fix(buffer_worker): add batch time automatic adjustment
To avoid message loss due to misconfigurations, we adjust `batch_time`
based on `request_timeout`.  If `batch_time` > `request_timeout`, all
requests will timeout before being sent if the message rate is low.
Even worse if `pool_size` is high.  We cap `batch_time` at
`request_timeout div 2` as a rule of thumb.
2023-03-06 10:12:38 -03:00
Kjell Winblad 67acdf0888 feat: add clickhouse database bridge
This commit adds a Clickhouse bridge to EMQX 5. The bridge is similar to
the Clickhouse bridge in the 4.4, but adds the possibility to use
different formats (such as JSON) for values to be inserted.
2023-03-02 12:22:11 +01:00
Andrew Mayorov c883e4b36a
test: drop custom `loop_wait` in favor of snabkaffe's `?retry` 2023-02-24 18:16:35 +03:00
Andrew Mayorov 2b4e49e7df
fix(bufworker): handle replies of simple async queries
Before that change, simple queries were treated as "retries"
essentially, thus skipping all the reply processing there is.
2023-02-24 15:06:49 +03:00
Zaiming (Stone) Shi c97d17cc91 test: refactor to loop wait for counters 2023-02-24 09:02:03 +01:00
Zaiming (Stone) Shi a10dbba084 refactor(buffer_worker): less defensive on inflight counter decrement 2023-02-23 21:23:10 +01:00
Zaiming (Stone) Shi 7a6465e2cf fix(buffer_worker): ensure flush timer reset in blocked state 2023-02-23 21:06:38 +01:00
Zaiming (Stone) Shi 3a6dbbdd05 refactor(buffer_worker): ensure flsh message is never missed 2023-02-23 20:11:00 +01:00
Zaiming (Stone) Shi dbfdeec5e9 fix(buffer_worker): log unknown async replies 2023-02-23 12:55:49 +01:00
Zaiming (Stone) Shi 356a94af30 fix(buffer_worker): ensure async flush message is sent
This is a new issue introduced in the previous fix commits
after handling the partial expiry correctly, the
IsFullBefore check is no longer the state before the reply
is received but the state after a partially-expired batch
is shrinked.
The fix is simple, move the check to the entry-point of
where async reply callback enters, then send an async
'flush' notification regardless of the handling result.
2023-02-23 09:47:34 +01:00
Zaiming (Stone) Shi 713220f88b refactor(buffer_worker): more generic process for all_expired 2023-02-23 00:04:20 +01:00
Zaiming (Stone) Shi 036f69cd6e test: ensure batch size > 1 is covered in expiration test 2023-02-22 23:26:04 +01:00
Zaiming (Stone) Shi bf8becd521 test: make sure gauge return to 0 in test cases 2023-02-22 23:07:12 +01:00
Zaiming (Stone) Shi fc614e16e5 fix(bridge): update inflight items after partial expiry 2023-02-22 22:05:56 +01:00
Zaiming (Stone) Shi bb13d0708f fix(bridge): fix dropped counter and inflight gauge
Prior to this fix there were two metrics issues
1. if a batch is all requests expired when receiving a reply
   it only bumped 1 instead of the batch size for 'late_reply'
2. when a batch is partially delivered (or expired), the
   dropped requests were not decremented from the inflight size gauge
2023-02-22 13:20:58 +01:00
Erik Timan 056bc71af2 chore: bump VSN version 2023-02-16 15:05:38 +01:00
Erik Timan 2442a4dea7 test(emqx_resource): add regression test for recursive flushing 2023-02-16 14:17:16 +01:00
Erik Timan dcf70e0e68 refactor(emqx_resource): add more trace points for flushing 2023-02-16 14:17:16 +01:00
Zaiming (Stone) Shi fb61c2b266 perf: avoid getting metrics (gen_server:call) for each resource lookup 2023-02-10 19:40:37 +01:00
Zaiming (Stone) Shi 42dfaf3ef2
Merge pull request #9910 from sstrigler/EMQX-8861-improve-bridge-restart-button-behaviour
EMQX 8861 improve bridge restart button behaviour
2023-02-09 18:00:48 +01:00
Andrew Mayorov 81b1bab11e
chore: bump `emqx_resource` version to 0.1.7
Also add the changelog entry.
2023-02-08 14:21:30 +03:00
Andrew Mayorov c6fc0ec8cd
fix(bufworker): do not avoid retry if inflight table is full
Otherwise there's no other piece of code that would retry the inflight
queries in that case.
2023-02-08 14:08:04 +03:00
Andrew Mayorov d8d06a260f
test(buffer): add test on inflight overflow w/ async queries
This testcase should verify that the buffer will retry all inflight
queries failed with recoverable errors + flush all outstanding queries.

Co-authored-by: ieQu1 <99872536+ieQu1@users.noreply.github.com>
2023-02-08 14:08:04 +03:00
Stefan Strigler 86f3f5787f feat: allow to manually re-connect disconected bridge 2023-02-07 11:58:30 +01:00
Zaiming (Stone) Shi 7ea140599a
Merge pull request #9894 from id/ci-always-run-static-checks
ci: always run static_checks
2023-02-02 16:33:19 +01:00
Zaiming (Stone) Shi feca4cc0a5
Merge pull request #9892 from zmstone/0202-docs-cosmetic
0202 docs cosmetic
2023-02-02 15:43:58 +01:00
Zaiming (Stone) Shi 58627b7958 chore(emqx_resource_manager): ignore unused return value for dialyzer 2023-02-02 14:11:12 +01:00
Zaiming (Stone) Shi c0d478bd41 fix(buffer_worker): type spec 2023-02-02 14:11:12 +01:00
zhongwencool ee852d8204
Merge pull request #9886 from zhongwencool/mongo-connection-default-async
fix: remove async mode from mongodb/redis/mysql/pgsql bridge
2023-02-02 21:08:01 +08:00
Zaiming (Stone) Shi d5c482b0b0 docs: remove timer unit from description
the user input has time unit. e.g. "5s" for 5 seconds etc.
2023-02-02 13:49:20 +01:00
Zaiming (Stone) Shi 9864587389 fix: send to buffer-supported connector even when disconnected 2023-02-02 12:04:17 +01:00
Zaiming (Stone) Shi 13ef30c46c
Merge pull request #9884 from savonarola/resource-fixes
fix(resources): fix resource lifecycle
2023-02-02 12:02:34 +01:00
Zhongwen Deng 1c9035d24c test: remove async from redis ct 2023-02-02 17:37:18 +08:00
Zhongwen Deng 22cc1cc745 fix: make spell_check happy 2023-02-02 17:37:18 +08:00
Zhongwen Deng f8936013b7 chore: replace async with sync 2023-02-02 17:37:18 +08:00
Zhongwen Deng 22c3f50020 fix: add query_mode_sync_only for mysql pgsql redis mongodb bridge 2023-02-02 17:37:18 +08:00
Andrew Mayorov ca5c192f4b
Merge pull request #9882 from fwup/fix/no-mqtt-bridge-middleman
refactor(mqtt-worker): avoid unnecessary abstraction
2023-02-02 13:11:31 +04:00
Zhongwen Deng 1a90c1654c chore: bad typo 2023-02-02 11:43:04 +08:00
Ilya Averyanov 14f528cc86 fix(resources): fix resource lifecycle
* do not resume all buffer workers on successful healthcheck
* do not pass undefined state to resource healthcheck callback
2023-02-01 18:26:13 +02:00
Andrew Mayorov 5fd7f65a1f
test(bufworker): make testcase simpler to follow
The confusion was due to the fact that subsequent query was missing
`async_reply_fun` and thus, was not accumulating in the results.
2023-02-01 16:52:47 +03:00
Andrew Mayorov ff473e0f1b
test(bufworker): fix testcase flapping due to data races 2023-02-01 12:57:46 +03:00
Zaiming (Stone) Shi b3ad9e97d2
Merge pull request #9870 from keynslug/fix/mqtt-connection-loss-feedback
feat(mqtt-bridge): avoid middleman process
2023-01-31 19:12:18 +01:00
Andrew Mayorov c76311c9c3
fix(buffer): count inflight batches properly 2023-01-31 18:30:42 +03:00
Zaiming (Stone) Shi b3e486041b
Merge pull request #9853 from zmstone/0127-refactor-buffer-worker-no-need-to-keep-request-for-reply-callback
0127 refactor buffer worker no need to keep request for reply callback
2023-01-31 08:44:01 +01:00
Stefan Strigler 27881064dc fix: increase dropped.queue_full by number of messages 2023-01-30 11:37:35 +01:00
Zaiming (Stone) Shi d47941601d refactor(buffer_worker): rename trace points 2023-01-28 11:52:11 +01:00
Zaiming (Stone) Shi 7f66c6a9e2
Merge pull request #9840 from olcai/redact-influxdb-tokens
fix: redact influxdb tokens in logs and reduce log level
2023-01-28 11:47:36 +01:00
Zaiming (Stone) Shi fc38ea9571 refactor(buffer_worker): do not keep request body in reply context
the request body can be potentially very large
the reply context is sent to the async call handler and kept
in its memory until the async reply is received from bridge
target service.

this commit tries to minimize the size of the reply context
by replacing the request body with `[]`.
2023-01-27 17:12:55 +01:00
Zaiming (Stone) Shi 578271ea3d refactor: use lists:map instead of lc for safty 2023-01-27 15:15:46 +01:00
Zaiming (Stone) Shi f793807bc1 refactor(buffer_worker): rename function
batch_reply_after_query to handle_async_batch_reply
2023-01-27 15:04:28 +01:00
Zaiming (Stone) Shi 262c3a2869 refactor(buffer_worker): rename function
from reply_after_query to handle_async_reply
2023-01-27 15:03:18 +01:00
Zaiming (Stone) Shi 52b75ada04
Merge pull request #9832 from sstrigler/EMQX-8774-failure-to-handle-timeout-error-in-resource-worker
EMQX 8774 failure to handle timeout error in resource worker
2023-01-27 14:36:44 +01:00
Zaiming (Stone) Shi 514609bcf7
Merge pull request #9850 from zmstone/0127-fix-influxdb-bridge-atom-leak
0127 fix influxdb bridge atom leak
2023-01-27 14:30:20 +01:00
Zaiming (Stone) Shi d53106145f fix: stop resource when resource manager terminates 2023-01-27 12:39:05 +01:00
Stefan Strigler 2d62de5188 test: fix expected result from timeout error 2023-01-27 11:43:48 +01:00
Stefan Strigler a180bd9aa5 fix: catch error, not exit 2023-01-27 11:40:06 +01:00
Stefan Strigler b7e3f9d5a6 fix: try-case-of rather than try-of
try-of catches only what happens within but not after
2023-01-27 11:40:06 +01:00
Zaiming (Stone) Shi db2f631a8a refactor(buffer_worker): simplify caller reply 2023-01-27 11:33:45 +01:00
Zaiming (Stone) Shi d4fab92b72 refactor(buffer_worker): no need to keep request for REPLY macro 2023-01-27 10:41:30 +01:00
Zaiming (Stone) Shi 1f799dfd59 fix: reply with {error, buffer_overflow} when discarded 2023-01-26 17:15:36 +01:00
Zaiming (Stone) Shi ed28789164 refactor(buffer_worker): no need to return after collect into buf queue 2023-01-26 14:50:40 +01:00
Zaiming (Stone) Shi 25b4821adc refactor: move the the per-message overflow log from error to info level 2023-01-26 14:48:43 +01:00
Zaiming (Stone) Shi bb26632c8a fix(buffer_worker): fix a wrong assertion
the assertion is to ensure queue items are not binary
but should not assert the queue itself
2023-01-26 14:33:16 +01:00
Erik Timan 805d08e823 fix: reduce log level from error to warning in several places
This reduces the log level from error to warning in places that are
connected to the influxdb bridge. Transient errors for external
resources should not render an error log.
2023-01-25 14:49:50 +01:00
Zaiming (Stone) Shi 5fdf7fd24c fix(kafka): use async callback to bump success counters
some telemetry events from wolff are discarded:

* dropped:
    this is double counted in wolff,
    we now only subscribe to the dropped_queue_full event
* retried_failed:
    it has different meanings in wolff,
    in wolff, it means it's the 2nd (or onward) produce attempt
    in EMQX, it means it's eventually failed after some retries

* retried_success
    since we are going to handle the success counters in callbac
    this having this reported from wolff will only make things
    harder to understand

* failed
    wolff never fails (unelss drop which is a different counter)
2023-01-24 21:12:36 +01:00
Erik Timan 9d20431257 fix(emqx_resource): fix crash while flushing queue
We used next_event for flushing the queue in emqx_resource, but this
leads to a crash. We now call flush_worker/1 instead.
2023-01-24 14:13:35 +01:00
Erik Timan 28718edbfd chore: bump application VSNs 2023-01-24 14:12:34 +01:00
Zaiming (Stone) Shi 8fde169abb
Merge pull request #9821 from thalesmg/buffer-worker-expiry-v50
feat(buffer_worker): add expiration time to requests
2023-01-24 13:54:04 +01:00
Thales Macedo Garitezi ca4a262b75 refactor: re-organize dealing with unrecoverable errors 2023-01-20 12:00:17 -03:00
Thales Macedo Garitezi 6fa6c679bb feat(buffer_worker): add expiration time to requests
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.

Also introduces two new counters:

- `dropped.expired` :: happens when a request expires before being
  sent downstream
- `late_reply` :: when a response is receive from downstream, but the
  caller is no longer for a reply because the request has expired, and
  the caller might even have retried it.
2023-01-20 11:36:52 -03:00
Zaiming (Stone) Shi 1c3e055b13
Merge pull request #9822 from JimMoen/fix-schema-typo
chore: i18n typo fix
2023-01-20 11:11:18 +01:00
JimMoen 16f45a60fd
chore: i18n typo fix 2023-01-20 11:50:01 +08:00
Thales Macedo Garitezi 47f796dd12 refactor: rename `emqx_resource_worker` -> `emqx_resource_buffer_worker`
To make it more clear that it's purpose is serve as a buffering layer.
2023-01-18 16:15:34 -03:00
Ilya Averyanov 44a6e5ed15 chore(resources): add missing parameters to emqx_resource schema 2023-01-18 14:33:45 +02:00
Zaiming (Stone) Shi d4f3b4c8c2 Merge remote-tracking branch 'origin/master' into fix-buffer-clear-replayq-on-delete-v50 2023-01-18 11:39:47 +01:00