This removes the Kafka specific knowledge from emqx_bridge_v2 and
makes it possible to add new Bridge V2 bridges without modifying
the emqx_bridge application.
Fixes https://emqx.atlassian.net/browse/EMQX-11330
After feedback from Product team, we should rename `bridges_v2` to `actions` everywhere.
We'll start with the public facing APIs.
- HTTP API
- Hocon schema root key
Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
Co-authored-by: Stefan Strigler <stefan.strigler@emqx.io>
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Several bridges should be able to share a connector pool defined by a
single connector. The connectors should be possible to enable and
disable similar to how one can disable and enable bridges. There should
also be an API for checking the status of a connector and for
add/edit/delete connectors similar to the current bridge API.
Issues:
https://emqx.atlassian.net/browse/EMQX-10805
Fixes https://emqx.atlassian.net/browse/EMQX-10654
Despite loading the application in `nodetool`, we need to invoke `:module_info()` to force
the module to be actually loaded, especially when it's called in `call_hocon` to generate
the initial configurations. Otherwise, it'll fail to list all the bridge schemas.
See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
Current problem:
In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.
- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.
- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.
The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers. Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state. For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).
Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.
Proposed solution:
We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks. With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.
Also, as two smaller changes to accompany this one:
- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
Fixes https://emqx.atlassian.net/browse/EMQX-10001
Recently, we unified request_timeout in a single field located at the
webhook connector schema. However, the correct fix would be to use
the resource_opts.request_timeout one, as that’s the only one that
allows infinity timeout.
In order for the /bridges APIs to be consistent with other APIs, we move
out metrics from GET /bridges/{id} to its own endpoint,
/bridges/{id}/metrics. We also rename /bridges/reset_metrics to
/bridges/metrics/reset.
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.
Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enables; ii) batch_size
> 1 <=> batch_enabled. The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`. The
batching is too ephemeral, especially considering a default batch time
of 20 ms, and is not shown in the dashboard, so it was removed.