This will inevitably fail: it's not generally possible to update
different keys through the same cluster connection, one or more
update will fail with `MOVED` status. This testcase should serve
as a regression test later.
some telemetry events from wolff are discarded:
* dropped:
this is double counted in wolff,
we now only subscribe to the dropped_queue_full event
* retried_failed:
it has different meanings in wolff,
in wolff, it means it's the 2nd (or onward) produce attempt
in EMQX, it means it's eventually failed after some retries
* retried_success
since we are going to handle the success counters in callbac
this having this reported from wolff will only make things
harder to understand
* failed
wolff never fails (unelss drop which is a different counter)
With this, we avoid performing work or replying to callers that are no
longer waiting on a result.
Also introduces two new counters:
- `dropped.expired` :: happens when a request expires before being
sent downstream
- `late_reply` :: when a response is receive from downstream, but the
caller is no longer for a reply because the request has expired, and
the caller might even have retried it.
A new atom was created every time one did a dry run of a Kafka bridge
(that is, clicking the Test button in the settings dialog for the
bridge).
After this fix, we will only create a new atom when a bridge with a new
name is created. This should be acceptable as bridges with new names are
created relatively rarely and it seems to be useful to have an unique
atom for every Wolff producer.
Fixes: https://emqx.atlassian.net/browse/EMQX-8739
Related: https://emqx.atlassian.net/browse/EMQX-8692
This should also correctly account for `retried.*` metrics for sync
requests.
Also fixes cases where race conditions for retrying async requests
could potentially lead to inconsistent metrics.
Fixes more cases where a stale reference to `replayq` was being held
accidentally after a `pop`.
In order for the /bridges APIs to be consistent with other APIs, we move
out metrics from GET /bridges/{id} to its own endpoint,
/bridges/{id}/metrics. We also rename /bridges/reset_metrics to
/bridges/metrics/reset.
the connector schemas are shared between authn, auth and bridges,
the difference is that authn and authz configs are unions like:
[
{mongo_type = rs,
...
}
{another backend config}
]
however the brdige types are maps like, for example:
mongodb_rs.$name {
mongo_type = rs
...
}
in which case, the mongo_type is not required.
in order to keep the schema static as much as possible,
this field is chanegd to 'required => false' with a default value.
However, for authn and authz, the union selector will still raise
exception if the there is no type provided.
This makes the buffer/resource workers always use `replayq` for
queuing, along with collecting multiple requests in a single call.
This is done to avoid long message queues for the buffer workers and
rely on `replayq`'s capabilities of offloading to disk and detecting
overflow.
Also, this deprecates the `enable_batch` and `enable_queue` resource
creation options, as: i) queuing is now always enables; ii) batch_size
> 1 <=> batch_enabled. The corresponding metric
`dropped.queue_not_enabled` is dropped, along with `batching`. The
batching is too ephemeral, especially considering a default batch time
of 20 ms, and is not shown in the dashboard, so it was removed.
https://emqx.atlassian.net/browse/EMQX-8548
Currently, we face several issues trying to keep resource metrics
reasonable. For example, when a resource is re-created and has its
metrics reset, but then its durable queue resumes its previous work
and leads to strange (often negative) metrics.
Instead using `counters` that are shared by more than one worker to
manage gauges, we introduce an ETS table whose key is not only scoped
by the Resource ID as before, but also by the worker ID. This way,
when a worker starts/terminates, they should set their own gauges to
their values (often 0 or `replayq:count` when resuming off a queue).
With this scoping and initialization procedure, we'll hopefully avoid
hitting those strange metrics scenarios and have better control over
the gauges.
https://emqx.atlassian.net/browse/EMQX-8547
If a Kafka Producer bridge is given bad configuration (e.g.: bad authn
credentials), the Wolff client process is started successfully, as it
does not attempt to connect, but when the producer process is
attempted to be started, it fails (only then the client tries to
connect to Kafka). At this point, an error was thrown, but the
supervised client process remained running.
If the configuration was later fixed and the bridge updated, which
prompted its removal and recreation, the Wolff client would report to
be "already started", so it would never pick up the new (fixed)
configuration, and the producers would perpetually fail to start until
the node would be restarted.
We simply ensure the client is stopped before throwing the error,
unrolling the start-up procedure.