Co-authored-by: Thales Macedo Garitezi <thalesmg@gmail.com>
Co-authored-by: Stefan Strigler <stefan.strigler@emqx.io>
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
Several bridges should be able to share a connector pool defined by a
single connector. The connectors should be possible to enable and
disable similar to how one can disable and enable bridges. There should
also be an API for checking the status of a connector and for
add/edit/delete connectors similar to the current bridge API.
Issues:
https://emqx.atlassian.net/browse/EMQX-10805
We need to reverse the dependency of `emqx_bridge` and `emqx_bridge_*`, because the former
loads and starts bridges during its application startup. If the individual bridge
application being loaded has not started with its dependencies, the supervision tree will
not be ready for that.
Fixes https://emqx.atlassian.net/browse/EMQX-10278
Since Pulsar client has its own replayq that lives outside the management of the buffer
workers, we must not return disconnected status for such bridge: otherwise, the resource
manager will eventually kill the producers and data may be lost.
Fixes https://emqx.atlassian.net/browse/EMQX-10228
This is a cosmetic fix for the Pulsar Producer bridge health check status.
Pulsar connection process is asynchronous, therefore, when a bridge of this type is
created or updated (which is the same as stopping and re-creating it), the immediate
status will be connecting because it’s indeed still connecting. The bridge will connect
very soon afterwards (assuming there are no true network/config issue), but having to
refresh the UI to see the new status and/or seeing the resource alarm might annoy users.
This workaround adds a few retries to account for that effect to reduce the probability of
seeing the `connecting` state on such happy-paths.
This commit refactor the query_mode resource detection code according to
a suggestion from @zmstone. This commit should not contain any
functional change except for a change of the Kafka producer bridge
config.
See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
Current problem:
In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.
- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.
- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.
The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers. Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state. For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).
Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.
Proposed solution:
We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks. With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.
Also, as two smaller changes to accompany this one:
- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
Using `simple_one_for_one` has a potential race condition issue where
we read the PID of the resource manager before trying to remove a
resource, and then that PID changes because it was either dead at
first, or it crashed and changed, and later we use this stale PID to
try to remove it from the supervisor. Under such circumstances, the
restarting child might linger in the supervisor, leaking resources.
By using the resource ID itself as a child ID (and using `one_for_one`
restart strategy), we ensure the child is truly removed.