Commit Graph

50 Commits

Author SHA1 Message Date
Thales Macedo Garitezi b9b11d8f4d fix(machine_boot): use shared list of reboot apps and add bridges to reboot list 2023-07-19 20:15:42 -03:00
JimMoen 30c931ae62
fix: pulsar flaky cluster tests 2023-07-07 12:25:37 +08:00
JimMoen b089fba100
refactor: rm ee_bridge and ee_connector application 2023-07-07 12:25:37 +08:00
Thales Macedo Garitezi b609792a90 fix(pulsar_producer): do not return `disconnected` when checking status (r5.1)
Fixes https://emqx.atlassian.net/browse/EMQX-10278

Since Pulsar client has its own replayq that lives outside the management of the buffer
workers, we must not return disconnected status for such bridge: otherwise, the resource
manager will eventually kill the producers and data may be lost.
2023-06-13 11:44:45 -03:00
Thales Macedo Garitezi c11011d857
Merge pull request #11024 from thalesmg/pulsar-cosmetic-connecting-check-r51
feat(pulsar): retry health check a bit before returning (r5.1)
2023-06-12 13:00:54 -03:00
Thales Macedo Garitezi 741c1f091e fix(pulsar): update pulsar -> 0.8.3
Fixes https://emqx.atlassian.net/browse/EMQX-10229

See https://github.com/emqx/pulsar-client-erl/pull/59

Fixes a case_clause error that could arise from race conditions.
2023-06-12 10:29:40 -03:00
Thales Macedo Garitezi db5d14d5bf feat(pulsar): retry health check a bit before returning (r5.1)
Fixes https://emqx.atlassian.net/browse/EMQX-10228

This is a cosmetic fix for the Pulsar Producer bridge health check status.

Pulsar connection process is asynchronous, therefore, when a bridge of this type is
created or updated (which is the same as stopping and re-creating it), the immediate
status will be connecting because it’s indeed still connecting.  The bridge will connect
very soon afterwards (assuming there are no true network/config issue), but having to
refresh the UI to see the new status and/or seeing the resource alarm might annoy users.

This workaround adds a few retries to account for that effect to reduce the probability of
seeing the `connecting` state on such happy-paths.
2023-06-12 10:26:07 -03:00
Zaiming (Stone) Shi 97850de524 Merge remote-tracking branch 'origin/release-51' into 0610-merge-release-51-to-master 2023-06-10 12:23:55 +02:00
Andrew Mayorov b930f4cc73
Merge pull request #10987 from keynslug/fix/EMQX-9257/sep-placeholder
refactor: tear `emqx_plugin_libs` application apart
2023-06-09 17:02:14 +02:00
Andrew Mayorov a51baaa206
refactor(pluglib): move conversion utils to `emqx_utils_conv` 2023-06-09 14:44:37 +03:00
Andrew Mayorov d6c1ee183f
refactor(pluglib): move `emqx_placeholder` to utils app
Also make user that existing code calls it directly.
2023-06-09 14:44:36 +03:00
Kjell Winblad 1c7834e056 fix: fixes due to comments from @zmstone 2023-06-08 16:47:02 +02:00
Kjell Winblad ed9e29e769 refactor: refacor query_mode detection code
This commit refactor the query_mode resource detection code according to
a suggestion from @zmstone. This commit should not contain any
functional change except for a change of the Kafka producer bridge
config.
2023-06-08 16:26:55 +02:00
Thales Macedo Garitezi 97a9bb484a test(pulsar_producer): attempt to fix flaky test 2023-06-06 09:32:38 -03:00
Thales Macedo Garitezi 46393343e2 chore: use `timeout_duration` types for timer fields
Fixes https://emqx.atlassian.net/browse/EMQX-10020
2023-06-05 11:46:38 -03:00
Thales Macedo Garitezi 3e4790edd4 test(pulsar_producer): fix flaky test 2023-06-01 13:01:58 -03:00
Thales Macedo Garitezi 10425eb925 feat(resource): deprecate `auto_restart_interval` in favor of `health_check_interval`
See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options

Current problem:

In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.

- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.

- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.

The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers.  Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state.  For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).

Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.

Proposed solution:

We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks.  With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.

Also, as two smaller changes to accompany this one:

- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
2023-06-01 11:20:06 -03:00
Thales Macedo Garitezi 0d539e91d1 test(pulsar_producer): attempt to stabilize flaky test
https://github.com/emqx/emqx/actions/runs/5125166433/jobs/9218613872?pr=10886#step:7:679
```
  =CRITICAL REPORT==== 30-May-2023::19:38:58.003170 ===
  Run stage failed: error:{badmatch,
                              {timeout,
                                  [#{msg => pulsar_producer_bridge_started,
                                     '~meta' =>
                                         #{gl => <97891.472.0>,
                                           location =>
                                               #Fun<emqx_bridge_pulsar_impl_producer.11.109752493>,
                                           node => 'autocluster_node1@127.0.0.1',
                                           pid => <97891.787.0>,
                                           time => -576460611692219}}]}}
  Stacktrace: [{emqx_bridge_pulsar_impl_producer_SUITE,'-t_cluster/1-fun-10-',
                   6,
                   [{file,
                        "/emqx/apps/emqx_bridge_pulsar/test/emqx_bridge_pulsar_impl_producer_SUITE.erl"},
                    {line,1073}]},
               {emqx_bridge_pulsar_impl_producer_SUITE,t_cluster,1,
                   [{file,
                        "/emqx/apps/emqx_bridge_pulsar/test/emqx_bridge_pulsar_impl_producer_SUITE.erl"},
                    {line,1064}]}]
```
2023-05-31 10:19:55 -03:00
Thales Macedo Garitezi 9c3f838e14
Merge pull request #10841 from thalesmg/kafka-validate-key-v50
feat({kafka,pulsar}_producer): add validation for empty message key when strategy = key_dispatch
2023-05-30 09:37:15 -03:00
Zaiming (Stone) Shi 91cdc69976
Merge pull request #10867 from zmstone/0530-merge-release-50-to-master
0530 merge release 50 to master
2023-05-30 09:54:57 +02:00
Zaiming (Stone) Shi 9529919046 chore: bump app versions 2023-05-30 08:08:29 +02:00
Thales Macedo Garitezi 67e182e0c9
Merge pull request #10813 from thalesmg/refactor-kafka-on-stop-v50
feat(kafka): ensure allocated resources are removed on failures
2023-05-29 16:49:29 -03:00
Thales Macedo Garitezi 3edbad9f56 feat(pulsar_producer): add validation for empty message key when strategy = key_dispatch 2023-05-29 10:04:19 -03:00
Zaiming (Stone) Shi 36e268c933 chore: bump app versions 2023-05-26 16:05:37 +02:00
Thales Macedo Garitezi 32e6213ce3 fix(resource_manager_sup): use `one_for_one` instead of `simple_one_for_one`
Using `simple_one_for_one` has a potential race condition issue where
we read the PID of the resource manager before trying to remove a
resource, and then that PID changes because it was either dead at
first, or it crashed and changed, and later we use this stale PID to
try to remove it from the supervisor.  Under such circumstances, the
restarting child might linger in the supervisor, leaking resources.

By using the resource ID itself as a child ID (and using `one_for_one`
restart strategy), we ensure the child is truly removed.
2023-05-25 18:07:43 -03:00
Thales Macedo Garitezi 42b37690c7 refactor(pulsar): use macros for allocatable resources 2023-05-25 16:38:09 -03:00
Thales Macedo Garitezi fd2940cd77 feat(pulsar): ensure allocated resources are removed on failures (v5.0)
Fixes https://emqx.atlassian.net/browse/EMQX-9937
2023-05-24 12:29:00 -03:00
JimMoen 28015597ee
Merge remote-tracking branch 'emqx/release-50' into merge-release-50 2023-05-24 19:34:12 +08:00
Zaiming (Stone) Shi 732a7be187 Merge remote-tracking branch 'origin/release-50' 2023-05-22 17:46:54 +02:00
Thales Macedo Garitezi 65f973044f feat(pulsar): improve authn error check time and add connect timeout
Fixes https://emqx.atlassian.net/browse/EMQX-9910
2023-05-22 11:33:16 -03:00
Zaiming (Stone) Shi 40e8d5d039 refactor: rename lib-ee/emqx_ee_conf to apps/emqx_enterprise 2023-05-22 14:51:27 +02:00
Thales Macedo Garitezi 4327e40740 fix(pulsar): redact error reason
Fixes https://emqx.atlassian.net/browse/EMQX-9940

Error log after fix:

```
2023-05-19T13:09:26.304769+00:00 [error] msg: failed_to_start_pulsar_client, mfa: emqx_bridge_pulsar_impl_producer:on_start/2, line: 104, instance_id: <<"bridge:pulsar_producer:pprodu">>, pulsar_hosts: pulsar://pulsar:6652, reason: {#{{"pulsar",6652} => #{error => 'AuthenticationError',message => "Unable to authenticate",request_id => 18446744073709551615}},{child,undefined,'pulsar_producer:pprodu:emqx@127.0.0.1',{pulsar_client,start_link,['pulsar_producer:pprodu:emqx@127.0.0.1',["pulsar://pulsar:6652"],#{conn_opts => #{auth_data => <<"******">>,auth_method_name => <<"basic">>},ssl_opts => []}]},transient,false,5000,worker,[pulsar_client]}}
```
2023-05-19 10:09:57 -03:00
Thales Macedo Garitezi 659cf64ad7 feat(pulsar): use an union member selector for better error messages 2023-05-17 17:56:53 -03:00
Thales Macedo Garitezi 447b76464b Merge branch 'release-50' into merge-r50-into-v50-a 2023-05-17 14:50:18 -03:00
Thales Macedo Garitezi dcccc0910a fix(pulsar): mark whole auth struct as sensitive (r5.0)
Fixes https://emqx.atlassian.net/browse/EMQX-9900

I tried to patch hocon itself to filter the sensitive data, but the
way it's currently structured doesn't seem to keep that field
metadata.  So, for now, we can just mark the whole auth union as
sensitive.
2023-05-17 11:50:22 -03:00
Thales Macedo Garitezi cebde87114 fix(pulsar): use a binary duration as default `health_check_interval`
Fixes https://emqx.atlassian.net/browse/EMQX-9885

The frontend needs the default value to match the duration (binary)
type to display correctly.
2023-05-16 11:29:29 -03:00
Thales Macedo Garitezi 5960cc530a fix(pulsar): use a binary duration as default `health_check_interval`
Fixes https://emqx.atlassian.net/browse/EMQX-9885

The frontend needs the default value to match the duration (binary)
type to display correctly.
2023-05-15 09:05:21 -03:00
JianBo He 38fcb7a097 docs: hide the not-ready document links 2023-05-15 11:20:23 +08:00
Andrew Mayorov 4575167607
feat(resource): drop `manager_id()` type 2023-05-02 17:29:20 +03:00
Thales Macedo Garitezi 633eacad3b test(pulsar): add more test cases for Pulsar Producer bridge
Fixes https://emqx.atlassian.net/browse/EMQX-8400
2023-04-28 10:47:03 -03:00
Thales Macedo Garitezi b56a158a54 fix(pulsar): fix function return typespec 2023-04-25 14:29:40 -03:00
Thales Macedo Garitezi f69ebdcd1a test(pulsar): teardown tls group 2023-04-25 14:27:24 -03:00
Thales Macedo Garitezi 4cc4c4ffaa style: format rebar.config file 2023-04-25 14:26:19 -03:00
Thales Macedo Garitezi 377b143325 refactor: split `parse_server` into smaller functions, improve return type to use map 2023-04-24 14:17:29 -03:00
Thales Macedo Garitezi 4f2262129b refactor: rename `{first_,}key_dispatch` partition strategy option 2023-04-24 10:28:26 -03:00
Thales Macedo Garitezi 4af6e3eb6e refactor: rm unused modules 2023-04-24 10:28:26 -03:00
Thales Macedo Garitezi 1e8dd70a11 chore: fix error message and rename variable 2023-04-24 10:28:26 -03:00
Thales Macedo Garitezi 180b6acd9e docs: remove auto-generated comment 2023-04-24 10:28:26 -03:00
Thales Macedo Garitezi f4a3affd6f docs: change phrasing after review 2023-04-24 10:28:26 -03:00
Thales Macedo Garitezi ad4be08bb2 feat: implement Pulsar Producer bridge (e5.0)
Fixes https://emqx.atlassian.net/browse/EMQX-8398
2023-04-24 10:28:26 -03:00