Commit Graph

475 Commits

Author SHA1 Message Date
Zaiming (Stone) Shi e8ccdb8d0f
Merge pull request #10998 from zmstone/0609-no-batch-for-mongodb
fix(mongodb): hide batch_size for mongodb resource
2023-06-11 21:26:12 +02:00
Zaiming (Stone) Shi ddef751527 fix(mongodb): hide batch_size for mongodb resource
MongoDB connector currently does not support batching
so the batch_size option has no effect.
However we cannot remove the field, so we choose to hide it from
schema
2023-06-11 11:08:58 +02:00
Kjell Winblad 4215da12f0
Merge pull request #10970 from kjellwinblad/kjell/feat/kafka_add_async_param/EMQX-8631
feat: add sync/async option to the Kafka producer bridge
2023-06-09 12:51:23 +02:00
Kjell Winblad 2671e8ecf9 fix: dialyzer type problem 2023-06-09 11:00:05 +02:00
firest 86a7b2d69a fix(resource): improve log security when resource creation fails 2023-06-09 11:43:42 +08:00
Kjell Winblad cb3a5fdbd4 style: only callback modules should do dynamic calls 2023-06-08 16:27:04 +02:00
Kjell Winblad ed9e29e769 refactor: refacor query_mode detection code
This commit refactor the query_mode resource detection code according to
a suggestion from @zmstone. This commit should not contain any
functional change except for a change of the Kafka producer bridge
config.
2023-06-08 16:26:55 +02:00
Kjell Winblad 47fa17b3c1 feat: add sync/async option to the Kafka producer bridge
This commit makes it possible to configure if a Kafka bridge should work
in query mode sync or async by setting the resource_opts.query_mode
configuration option.

Fixes:
https://emqx.atlassian.net/browse/EMQX-8631
2023-06-08 13:16:06 +02:00
Thales Macedo Garitezi cc8631223e fix(schema): avoid `function_clause` error when compacting errors
Fixes https://emqx.atlassian.net/browse/EMQX-10168

The bridge probe API displayed the typecheck errors for the new
timeout duration types correctly, but when an user tried to create the
bridge anyway a `function_clause` error was raised when trying to
compact hocon errors:

```
09:47:19.045 [warning] [exception: :error, path: '/bridges', reason: {:case_clause, {:error, {:config_update_crashed, :function_clause}}}, stacktrace: [{:emqx_bridge_api, :create_or_update_bridge, 4, [file: '/home/thales/dev/emqx/emqx/apps/emqx_bridge/src/emqx_bridge_api.erl', line: 602]}, {:minirest_handler, :apply_callback, 3, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 111]}, {:minirest_handler, :handle, 2, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 44]}, {:minirest_handler, :init, 2, [file: '/home/thales/dev/emqx/emqx/deps/minirest/src/minirest_handler.erl', line: 27]}, {:cowboy_handler, :execute, 2, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_handler.erl', line: 41]}, {:cowboy_stream_h, :execute, 3, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_stream_h.erl', line: 318]}, {:cowboy_stream_h, :request_process, 3, [file: '/home/thales/dev/emqx/emqx/deps/cowboy/src/cowboy_stream_h.erl', line: 302]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]]
```

This fixes the issue so that both APIs return more friendly error messages.
2023-06-07 10:34:58 -03:00
Thales Macedo Garitezi 75df622426
Merge pull request #10930 from thalesmg/max-int-timeout-v50
check maximum timeout values in schema (5.1)
2023-06-05 12:51:36 -03:00
Thales Macedo Garitezi 46393343e2 chore: use `timeout_duration` types for timer fields
Fixes https://emqx.atlassian.net/browse/EMQX-10020
2023-06-05 11:46:38 -03:00
lafirest d51c658a30
Merge pull request #10908 from lafirest/feat/rocketmq_on_stop
feat(rocketmq): refactored bridge to avoid leaking resources during crashes at creation
2023-06-05 15:00:32 +08:00
firest 5921ed3d2e fix(rocketmq): improve function name 2023-06-05 10:43:28 +08:00
Thales Macedo Garitezi 0790c88aaf
refactor: use default's type as first union member
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
2023-06-02 09:08:11 -03:00
Thales Macedo Garitezi 99796224d8 refactor(resource): rename `request_timeout` -> `request_ttl`
See
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
2023-06-01 13:01:53 -03:00
Thales Macedo Garitezi f42ccb6262 feat(resource): increase default request timeout to 45 s
See
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options
2023-06-01 11:20:06 -03:00
Thales Macedo Garitezi 10425eb925 feat(resource): deprecate `auto_restart_interval` in favor of `health_check_interval`
See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options

Current problem:

In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.

- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.

- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.

The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers.  Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state.  For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).

Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.

Proposed solution:

We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks.  With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.

Also, as two smaller changes to accompany this one:

- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
2023-06-01 11:20:06 -03:00
firest 232ef23a48 feat(rocketmq): refactored bridge to avoid leaking resources during crashes at creation 2023-06-01 18:49:45 +08:00
Thales Macedo Garitezi a7f4f81c38
Merge pull request #10887 from thalesmg/fix-async-worker-down-buffer-worker-20230530-v50
fix: block buffer workers so they may retry requests
2023-05-30 17:39:18 -03:00
Andrew Mayorov a2688325e5
Merge pull request #10754 from fix/EMQX-10056/mqtt
feat(mqttconn): employ ecpool instead of single worker
2023-05-30 23:28:10 +03:00
Thales Macedo Garitezi 8c565abc84 test(cassandra): fix flaky test 2023-05-30 15:42:53 -03:00
Thales Macedo Garitezi 6be8ff378e fix(buffer_worker): make buffer worker enter `blocked` state when async worker dies
Fixes https://emqx.atlassian.net/browse/EMQX-10074

Otherwise, requests from those async workers, now retriable, might not
be retried until the buffer worker blocks for other reasons, which
might take a long time.
2023-05-30 15:34:22 -03:00
Andrew Mayorov a5fc26736d
refactor(mqttconn): split ingress/egress into 2 separate pools
Each with a more refined set of responsibilities, at the cost of slight
code duplication. Also provide two different config fields for each pool
size.
2023-05-30 17:21:44 +03:00
Thales Macedo Garitezi 75fcac9711
Merge pull request #10826 from thalesmg/test-partial-batch-expired-inflight-v50
test(buffer_worker): add assertion for inflight count after batch expiration
2023-05-30 09:05:59 -03:00
Zaiming (Stone) Shi 91cdc69976
Merge pull request #10867 from zmstone/0530-merge-release-50-to-master
0530 merge release 50 to master
2023-05-30 09:54:57 +02:00
Zaiming (Stone) Shi 9529919046 chore: bump app versions 2023-05-30 08:08:29 +02:00
Thales Macedo Garitezi 67e182e0c9
Merge pull request #10813 from thalesmg/refactor-kafka-on-stop-v50
feat(kafka): ensure allocated resources are removed on failures
2023-05-29 16:49:29 -03:00
Zaiming (Stone) Shi 36e268c933 chore: bump app versions 2023-05-26 16:05:37 +02:00
Zaiming (Stone) Shi cc5b4d3748 Merge remote-tracking branch 'origin/release-50' into 0526-ci-delete-otp-24-from-standalone-app-test 2023-05-26 15:58:16 +02:00
Thales Macedo Garitezi 32e6213ce3 fix(resource_manager_sup): use `one_for_one` instead of `simple_one_for_one`
Using `simple_one_for_one` has a potential race condition issue where
we read the PID of the resource manager before trying to remove a
resource, and then that PID changes because it was either dead at
first, or it crashed and changed, and later we use this stale PID to
try to remove it from the supervisor.  Under such circumstances, the
restarting child might linger in the supervisor, leaking resources.

By using the resource ID itself as a child ID (and using `one_for_one`
restart strategy), we ensure the child is truly removed.
2023-05-25 18:07:43 -03:00
Thales Macedo Garitezi 42b37690c7 refactor(pulsar): use macros for allocatable resources 2023-05-25 16:38:09 -03:00
Thales Macedo Garitezi db60dcbada test(buffer_worker): add assertion for inflight count after batch expiration
Fixes https://emqx.atlassian.net/browse/EMQX-9829
2023-05-25 16:11:37 -03:00
Thales Macedo Garitezi 18d57ba3eb
Merge pull request #10812 from thalesmg/test-flakiness-20230524
test: attempts to reduce flakiness (pgsql, cassandra)
2023-05-25 09:29:13 -03:00
JianBo He de7f1c8aec test: add tests for auto_restart_interval 2023-05-25 17:15:19 +08:00
JianBo He 71b636e321 fix: fix auto_restart_interval checker 2023-05-25 12:04:23 +08:00
Paulo Zulato 122ebcac24 fix: add user-friendly message when interval is out of range 2023-05-24 15:46:00 -03:00
Thales Macedo Garitezi 7f88521836 test(pgsql): reduce flakiness
Depending on timing, `t_write_timeout` was getting stuck while
checking the resource health, and the previous request timeout options
were making a response to never be sent if that process took too long.
2023-05-24 15:41:25 -03:00
Thales Macedo Garitezi fd2940cd77 feat(pulsar): ensure allocated resources are removed on failures (v5.0)
Fixes https://emqx.atlassian.net/browse/EMQX-9937
2023-05-24 12:29:00 -03:00
Zaiming (Stone) Shi 732a7be187 Merge remote-tracking branch 'origin/release-50' 2023-05-22 17:46:54 +02:00
Thales Macedo Garitezi 0559d6f639 refactor(buffer_worker): use static fn for bumping counters 2023-05-22 09:12:08 -03:00
Thales Macedo Garitezi c74c93388e refactor: rename some variables and sum type constructors for clarity 2023-05-22 09:11:23 -03:00
Thales Macedo Garitezi 7d798c10e9 perf(buffer_worker): flush metrics periodically inside buffer worker process
Fixes https://emqx.atlassian.net/browse/EMQX-9905

Since calling `telemetry` is costly in a hot path, we instead collect
metrics inside the buffer workers state and periodically flush them,
rather than immediately as events happen.
2023-05-22 09:11:23 -03:00
Andrew Mayorov ba6b208df2
fix(clickhouse): start app in tests
Otherwise, depending on the test execution order, tests might
sometimes fail.

Moreover, ensure that applications describe their dependecies
correctly and avoid starting irrelevant apps in tests.
2023-05-19 23:08:40 +03:00
Zaiming (Stone) Shi cb76e5a241 docs: add changelog for 10755 2023-05-19 20:41:26 +02:00
Zaiming (Stone) Shi 0d8ffc0d59 fix(resource-manager): ensure no false creation
Update is implemented as remove + create.
If a dleete call is made while the create is in progress
the remove call is likely to timeout too.

This causes the follwing creation to falsely succeed,
because there is alreay a running child under the supervisor.

As a result, the resource is permanently removed after
resource_manager eventually handles the remove call.
2023-05-19 18:55:16 +02:00
Zaiming (Stone) Shi f5e5c59763 refactor(resource-manager-sup): do not force kill resource manager
the shutdown timeout is now set to infinity so it will never
force kill a resource manager, otherwise there will be
resource leaks
2023-05-19 18:55:16 +02:00
Zaiming (Stone) Shi 21de0f8274 fix(buffer-worker-sup): fast stop
the timeout shutdown in child spec may
significantly slow down the deletion of a resource

this commit chagnes the shutdown to brutal kill

also, the pool worker removal code has been delete
because it's not necessary since the entier pool is
going to be force-delete later anyway
2023-05-19 18:55:16 +02:00
firest baeb96a6e4 chore: update changes 2023-05-19 15:36:18 +08:00
firest 0eea8438bf fix(resource): make some logging of the resource manager more secure 2023-05-19 15:28:19 +08:00
Paulo Zulato 5d289ade56 fix: validate range for some bridge options
Fixes https://emqx.atlassian.net/browse/EMQX-9864

Setting a very large interval can cause `erlang:start_timer` to crash.
Also, setting auto_restart_interval or health_check_interval to "0s"
causes the state machine to be in loop as time 0 is handled separately:

| state_timeout() = timeout() | integer()
| (...)
| If Time is relative and 0 no timer is actually started, instead the the
| time-out event is enqueued to ensure that it gets processed before any
| not yet received external event.
from "https://www.erlang.org/doc/man/gen_statem.html#type-state_timeout"

Therefore, both fields are now validated against the range [1ms, 1h],
which doesn't cause above issues.
2023-05-18 10:10:58 -03:00