Commit Graph

16 Commits

Author SHA1 Message Date
Thales Macedo Garitezi ca435975de fix(webhook): treat http status code 429 as recoverable 2023-06-30 09:46:03 -03:00
Thales Macedo Garitezi 59b109eb5c fix(webhook): treat 404 and other error replies as errors in async requests
Fixes https://emqx.atlassian.net/browse/EMQX-10405

The problem here was that, for async requests, ehttpc responses of the form `{ok, 4__, _,
_}` and similar were being treated as successes.
2023-06-29 15:45:23 -03:00
zhongwencool 093cdab838 chore: to_integer to make sure integer is converted 2023-06-20 08:39:23 +08:00
zhongwencool 07172e42f0 test: integer CI check failed 2023-06-20 08:39:23 +08:00
Stefan Strigler 0d6d441f4c test(emqx_connector): start/stop test for webhook bridge 2023-06-14 09:56:50 +02:00
Stefan Strigler b2a5065641 fix(emqx_connector): report errors in on_start handler 2023-06-13 16:57:08 +02:00
Thales Macedo Garitezi 99796224d8 refactor(resource): rename `request_timeout` -> `request_ttl`
See
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
2023-06-01 13:01:53 -03:00
Thales Macedo Garitezi 10425eb925 feat(resource): deprecate `auto_restart_interval` in favor of `health_check_interval`
See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options

Current problem:

In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.

- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.

- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.

The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers.  Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state.  For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).

Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.

Proposed solution:

We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks.  With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.

Also, as two smaller changes to accompany this one:

- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
2023-06-01 11:20:06 -03:00
Thales Macedo Garitezi a7b41e1cdf perf(webhook): add retry attempts for async
This is a performance improvement for webhook bridge.

Since this bridge is called using `async` callback mode, and `ehttpc`
frequently returns errors of the form `normal` and `{shutdown,
normal}` that are retried "for free" by `ehttpc`, we add this behavior
to async requests as well.  Other errors are retried too, but they are
not "free": 3 attempts are made at a maximum.

This is important because, when using buffer workers, we should avoid
making them enter the `blocked` state, since that halts all progress
and makes throughput plummet.
2023-05-17 09:20:50 -03:00
Thales Macedo Garitezi e073bc90bc refactor(buffer_worker): rename `s/queue/buffer/g` 2023-04-14 11:37:19 -03:00
Kjell Winblad 35474578ca refactor: rename async_inflight_window to inflight_window everywhere 2023-03-23 14:21:57 +01:00
Thales Macedo Garitezi 66eb4ef069 test: fix inter-suite flakiness 2023-03-16 13:43:01 -03:00
Thales Macedo Garitezi 03b95073fc test: fix inter-suite flakiness 2023-03-15 14:25:41 -03:00
Zaiming (Stone) Shi 26b29185b2 test(emqx_bridge_webhook_SUITE): fix flakyness in test web server 2023-03-07 20:57:38 +01:00
Kjell Winblad 163b33ab28 test: remove unnecessary dependencies of ee apps 2023-03-07 20:57:38 +01:00
Kjell Winblad ca947e3e70 fix: lost messages when HTTP connection times out
When using async mode with the webhook bridge, queued messages that are
not fully processed when the connection times out could be lost. This
commit fixes this by letting the bridge return a recoverable_error when
this happen. The message send will then be retried in sync mode by the
emqx_resource_buffer_worker.

Fixes: https://emqx.atlassian.net/browse/EMQX-8974
2023-03-07 20:57:19 +01:00