See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
Current problem:
In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.
- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.
- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.
The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers. Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state. For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).
Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.
Proposed solution:
We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks. With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.
Also, as two smaller changes to accompany this one:
- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
Fixes https://emqx.atlassian.net/browse/EMQX-10001
Recently, we unified request_timeout in a single field located at the
webhook connector schema. However, the correct fix would be to use
the resource_opts.request_timeout one, as that’s the only one that
allows infinity timeout.
Fixes https://emqx.atlassian.net/browse/EMQX-9905
Since calling `telemetry` is costly in a hot path, we instead collect
metrics inside the buffer workers state and periodically flush them,
rather than immediately as events happen.
This is a performance improvement for webhook bridge.
Since this bridge is called using `async` callback mode, and `ehttpc`
frequently returns errors of the form `normal` and `{shutdown,
normal}` that are retried "for free" by `ehttpc`, we add this behavior
to async requests as well. Other errors are retried too, but they are
not "free": 3 attempts are made at a maximum.
This is important because, when using buffer workers, we should avoid
making them enter the `blocked` state, since that halts all progress
and makes throughput plummet.
This option was previously only in tests to avoid
emqx_conf app overwriting previously set configs with default values.
After a03f2dd64b, the issue for
test cases had been resolved.
This commit is to get rid of the option all together
Fixes https://emqx.atlassian.net/browse/EMQX-9325
Currently, ingress bridges referenced in the `FROM` clause of rules
are not being accounted as dependencies.
When we try to delete an ingress bridge that's referenced in a rule
like `select * from "$bridges/mqtt:ingress"`, that bridge does not
trigger an UI warning about dependent actions.
Excluding a couple of testcases which does not make much sense running
in the cluster. Also try to reduce amount of "noise" in the testcases,
making them easier to comprehend.
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
Metrics should only be exposed via the /bridges/:id/metrics endpoint,
and not in other operations such as getting the list of all bridges, or
in the response when a bridge has been created. This commit removes all
traces of metrics for the non-dedicated API endpoints.
When using async mode with the webhook bridge, queued messages that are
not fully processed when the connection times out could be lost. This
commit fixes this by letting the bridge return a recoverable_error when
this happen. The message send will then be retried in sync mode by the
emqx_resource_buffer_worker.
Fixes: https://emqx.atlassian.net/browse/EMQX-8974
When some resource manager is busy with trying to estabilish a
connection with remote, we hit the "read-from-cache" codepath so the
resource data will not contain any metrics.
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters:
- `dropped.expired` :: happens when a request expires before being
sent downstream
- `late_reply` :: when a response is receive from downstream, but the
caller is no longer for a reply because the request has expired, and
the caller might even have retried it.
Related: https://emqx.atlassian.net/browse/EMQX-8692
This should also correctly account for `retried.*` metrics for sync
requests.
Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.
Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
In order to improve the consistency with other API endpoints, we move
the enable/disable operations to a separate endpoint
/bridges/{id}/enable/[true,false].
In order for the /bridges APIs to be consistent with other APIs, we move
out metrics from GET /bridges/{id} to its own endpoint,
/bridges/{id}/metrics. We also rename /bridges/reset_metrics to
/bridges/metrics/reset.
This fixes https://emqx.atlassian.net/browse/EMQX-8648. The issue
described in `EMQX-8648` is that when deleting a non-existing bridge the
server gives a success response. See below:
```
curl --head -u admin:public2 -X 'DELETE' 'http://localhost:18083/api/v5/bridges/webhook:i_do_not_exist'
HTTP/1.1 204 No Content
date: Tue, 03 Jan 2023 16:59:01 GMT
server: Cowboy
```
After the fix, deleting a non existing bridge will give the following
response:
```
HTTP/1.1 404 Not Found
content-length: 49
content-type: application/json
date: Thu, 05 Jan 2023 12:40:35 GMT
server: Cowboy
```
Closes: EMQX-8648
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.
Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enables; ii) batch_size
> 1 <=> batch_enabled. The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`. The
batching is too ephemeral, especially considering a default batch time
of 20 ms, and is not shown in the dashboard, so it was removed.
https://emqx.atlassian.net/browse/EMQX-8548
Currently, we face several issues trying to keep resource metrics
reasonable. For example, when a resource is re-created and has its
metrics reset, but then its durable queue resumes its previous work
and leads to strange (often negative) metrics.
Instead using `counters` that are shared by more than one worker to
manage gauges, we introduce an ETS table whose key is not only scoped
by the Resource ID as before, but also by the worker ID. This way,
when a worker starts/terminates, they should set their own gauges to
their values (often 0 or `replayq:count` when resuming off a queue).
With this scoping and initialization procedure, we'll hopefully avoid
hitting those strange metrics scenarios and have better control over
the gauges.
https://emqx.atlassian.net/browse/EMQX-8445
Currently the bridge client’s client ID is prefixed with the resource
ID.
Sometimes it’s useful for users to have control of this prefix,
e.g. prefix based ACL rules in the target broker.