See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
Current problem:
In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.
- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.
- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.
The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers. Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state. For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).
Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.
Proposed solution:
We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks. With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.
Also, as two smaller changes to accompany this one:
- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
after the previous refactoring, emqx_license app is now restarted
after join/rejoin the cluster, so there is no longer a need for the
installer process which monitors the 'emqx' name registration changes
and then issue license reloading and hook re-adding etc.
* release-50:
fix(pulsar): use a binary duration as default `health_check_interval`
docs: add changelog entry
docs: clarify description of bridge username and password
chore: bump to v5.0.25
fix(limiter): adjust type for compatibility
fix(limiter): fix that update node-level limiter config will not working
chore: upgrade dashboard to v1.2.4-1 for ce
chore: upgarde rulesql to 0.1.6 to fix invaid utf8 input
chore: add changelog for 10659
fix: crash when sysmon.os.mem_check_interval = disabled
chore: bump influxdb version && update changes
refactor(influxdb): move influxdb bridge into its own app
chore: add listener default changelog
fix: ocsp cache SUITE failed
fix: ensure atom key for emqx_config:get
fix: only fill cerf_file default in server side
fix: authn init is empty
fix: bad listeners default ssl_options
* release-50:
fix(limiter): fix an error when setting `max_conn_rate` in a listener
chore: bump erlcloud dependencies vsns
chore: rename dynamo template files
refactor(dynamo): move dynamo bridge into its own app
chore: update changes && bump app versions
fix: issues with the RabbitMQ config
refactor(pgsql): move pgsql && matrix && timescale bridges into their own app
fix: the iotdb password field so it has the password format
chore: update changes
refactor(tdengine): move tdengine bridge into its own app
feat: deprecate listeners's authn http api
* master: (194 commits)
fix(limiter): update change && fix deprecated version
chore: update changes
perf(limiter): simplify the memory represent of limiter configuration
ci(perf test): update tf variable name and set job timeout
ci: fix artifact name in scheduled packages workflow
fix: build_packages_cron.yaml workflow
ci: move scheduled builds to a separate workflow
build: check mnesia compatibility when generating mria config
docs: fix a typo in api doc description
feat(./dev): use command style and added 'ctl' command
test: fix delayed-pubish test case flakyness
refactor: remove raw_with_default config load option
chore: add changelog for trace timestrap
feat: increase the time precision of trace logs to microseconds
chore: make sure topic_metrics/rewrite's default is []
docs: Update changes/ce/perf-10417.en.md
chore: bump `snabbkaffe` to 1.0.8
ci: run static checks in separate jobs
chore(schema): mark deprecated quic listener fields ?IMPORTANCE_HIDDEN
chore: remove unused mqtt cap 'subscription_identifiers'
...
(Almost?) fixes https://emqx.atlassian.net/browse/EMQX-9637
During the course of performance tests comparing the performance of
e5.0.3 and e4.4.16 regarding the webhook bridge in sync mode, we
observed that the throughput in e5.0.3 (sync) was much lower than in
e4.4.16: ~ 9 k msgs / s vs. ~ 50 k msgs / s, respectively.
Analyzing `observer_cli` output, we noticed that a lot of the time
both buffer workers and ehttpc processes was spent in `ets:info/2`.
That function was called to check the size of the inflight table when
updating metrics and checking if the inflight table was full. Other
uses of `ets:info/2` were contained inside the arguments to some
`?tp/2` macro usages (https://github.com/kafka4beam/snabbkaffe/pull/60).
By using a specific record to track the size of the table, we managed
to improve the bridge performance to ~ 45 k msgs / s in sync mode.
Before this change, a separate `manager_id` / `instance_id` was used
as resource manager id, which made connector interface somewhat
inconsistent: part of function calls to connector implementation
used instance id as first argument while the rest used resource id
itself.
* master: (279 commits)
chore: shorten ct/run.sh script
chore: rename cassandra_impl to cassandra_connector
chore: fix mix.exs checking
refactor(cassandra): move cassandra bridge into its own app
chore: apply review suggestions
chore: update changes/ce/fix-10449.en.md
test: add a test for authn {}
chore: add changlog for authn_http validation
fix: always check authn_http's header and ssl_option
chore: apply suggestions from code review
fix(emqx_bridge): validate Webhook bad URL and return 'BAD_REQUEST' if it's invalid
fix(emqx_alarm): add safe call API to activate/deactivate alarms and use it in resource_manager
perf(emqx_alarm): use dirty Mnesia operations to activate an alarm
ci: simplify find-apps.sh for ee apps
perf(emqx_resource): don't reactivate alarms on reoccurring errors
ci: check if Elixir files are formatted in pre-commit hook
fix(dynamo): fix field name errors
chore: remove *_collector for prometheus api's example
chore: make plugins config to low level
chore: re-split dynamo i18n file
...