Commit Graph

293 Commits

Author SHA1 Message Date
ieQu1 4c76a2574d
fix(ds): Fix egress flush condition 2024-04-21 21:51:31 +02:00
Andrew Mayorov 43f8346c00
fix(dssnap): ensure idempotent write of empty chunks 2024-04-19 18:52:33 +02:00
ieQu1 93bb840365
docs(ds): Update README 2024-04-17 01:21:52 +02:00
Andrew Mayorov 5d7b2e2ce6
fix(dsrepl): attempt leadership transfer on terminate
In addition to on removal. The reasoning is basically the same: try to
avoid situations when log entries are replicated (or will be considered
replicated when the new leader is elected) but the leader terminates
before replying to the client.

To be clear: this is a stupid solution. Something much more robust is
needed.
2024-04-15 22:05:24 +02:00
Andrew Mayorov 89f42f1171
fix(dsrepl): make placeholder shard process permanent under supervisor 2024-04-15 16:43:52 +02:00
Andrew Mayorov c4d1360b96
fix(dsrepl): trigger election for new ra servers unconditionallly
Otherwise we might end up in a situation when there's no member online
yet at the time of the election trigger, and the election will never
happen.
2024-04-15 16:42:29 +02:00
Andrew Mayorov d12e907209
fix(dsrepl): correctly handle ra membership change command results
Before this change, results similar to `{error, {no_more_servers_to_try,
[{error, nodedown}, {error, not_member}]}}` were considered retryable
failures, which is incorrect.
2024-04-08 22:44:34 +02:00
Andrew Mayorov 3223797ae5
fix(dsrepl): attempt leadership transfer before server removal
This should make it much less likely to hit weird edge cases that lead
to duplicate Raft log entries because of client retries upon receiving
`shutdown` from the leader being removed.
2024-04-08 22:43:58 +02:00
Andrew Mayorov 1e95bd4da6
test(dsrepl): test unresponsive nodes removal / node restarts 2024-04-08 21:27:56 +02:00
Andrew Mayorov 7a836317ac
fix(dsrepl): trigger unfinished shard transition upon startup
Also provide a trivial API to trigger them by hand.
2024-04-08 16:12:42 +02:00
Andrew Mayorov 75bb7f5cdc
fix(dsrepl): retry only `{add, Site}` crashed membership transitions
To minimize the potential negative impact of removal transitions that
crash for some unknown and unusual reasons.
2024-04-08 16:04:33 +02:00
Andrew Mayorov 4c0cc079c2
fix(dsrepl): apply unnecessary rebalancing transitions cleanly 2024-04-08 13:25:45 +02:00
Andrew Mayorov dcde30c38a
test(dsrepl): add two more testcases for rebalancing 2024-04-08 13:22:31 +02:00
Andrew Mayorov 2ace9bb893
chore(dsrepl): sprinkle few comments and typespecs for exports 2024-04-07 22:51:56 +02:00
Andrew Mayorov ecaad348a7
chore(dsrepl): update few outdated comments / TODOs 2024-04-07 22:51:56 +02:00
Andrew Mayorov 6293efb995
fix(dsrepl): retry crashed membership transitions 2024-04-07 22:51:56 +02:00
Andrew Mayorov 826ce5806d
fix(dsrepl): ensure that new member UID matches server's UID
Before that change, UIDs supplied in the `ra:add_member/3` were not
the same as those servers were using. This haven't caused any issues
for some reason, but it's better to ensure that UIDs are the same.
2024-04-07 22:31:24 +02:00
Andrew Mayorov 556ffc78c9
feat(dsrepl): implement membership changes and rebalancing 2024-04-05 18:57:28 +02:00
Andrew Mayorov d6058b7f51
feat(dsrepl): allow to subscribe to DB metadata changes
Currently, only shard metadata changes are announced to the
subscribers.
2024-04-05 17:40:55 +02:00
Andrew Mayorov a07295d3bc
fix(ds): address shards in the supervisor properly 2024-04-05 17:40:38 +02:00
ieQu1 a62db08676
feat(ds): Add REST API for durable storage 2024-04-05 15:22:06 +02:00
ieQu1 d09787d1a6
fix(ds): Fix return types in replication_layer_meta 2024-04-05 15:22:06 +02:00
Andrew Mayorov 70396e9766
Merge pull request #12825 from keynslug/feat/EMQX-12110/repl-meta-api
feat(dsrepl): add APIs to manage DB replication sites
2024-04-04 22:32:03 +02:00
Andrew Mayorov df6c5b35fe
feat(dsrepl): add more primitive operations to modify DB sites 2024-04-04 21:22:49 +02:00
Andrew Mayorov bb8ffee18c
feat(dsrepl): add API to get current DB replication sites 2024-04-04 21:22:02 +02:00
Andrew Mayorov ad52f7838e
feat(dsrepl): add APIs to manage DB replication sites 2024-04-04 21:22:01 +02:00
Thales Macedo Garitezi c57c36adb2 feat(ds): clear all checkpoints when (re)starting storage layer
Fixes https://emqx.atlassian.net/browse/EMQX-12143
2024-04-04 14:05:52 -03:00
ieQu1 f37ed3a40a fix(ds): Limit the number of retries in egress to 0 2024-04-03 16:38:49 +02:00
ieQu1 2bbfada7af
fix(ds): Make async batches truly async 2024-04-03 11:57:47 +02:00
ieQu1 92ca90c0ca
fix(ds): Improve egress logging 2024-04-03 11:57:47 +02:00
ieQu1 ae5935e7f7
test(ds): Attempt to stabilize metrics_worker tests in CI 2024-04-02 19:14:10 +02:00
ieQu1 4382971443
fix(ds): Preserve errors in the egress 2024-04-02 16:47:43 +02:00
ieQu1 94ca7ad0f8
feat(ds): Report counters for LTS storage layout 2024-04-02 16:47:43 +02:00
ieQu1 b379f331de
fix(sessds): Handle errors when storing messages 2024-04-02 16:47:41 +02:00
ieQu1 f41e538526
feat(sessds): Observe next time 2024-04-02 16:45:52 +02:00
ieQu1 75b092bf0e
fix(ds): Actually retry sending batch 2024-04-02 16:45:49 +02:00
ieQu1 0de255cac8
feat(ds): Report egress flush time 2024-04-02 16:25:04 +02:00
ieQu1 044f3d4ef5
fix(ds): Don't reverse entries in the atomic batch 2024-04-02 16:25:04 +02:00
ieQu1 606f2a88cd
feat(ds): Add egress metrics 2024-04-02 16:25:04 +02:00
ieQu1 c9de336234
feat(ds): Add metrics worker to the builtin db supervision tree 2024-04-02 16:25:04 +02:00
Andrew Mayorov 778e897f1f
chore(dsrepl): describe snapshot ownership and few shortcomings 2024-04-02 13:48:51 +02:00
Andrew Mayorov c666c65c6a
test(ds): factor out storage iteration into helper module 2024-04-02 13:48:51 +02:00
Andrew Mayorov 7cebf598a8
chore(dsrepl): simplify snapshot transfer code a bit
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-04-02 13:48:51 +02:00
Andrew Mayorov e029b8f996
test(dsrepl): wait for whole cluster readiness
To minimize the chance of flaky tests due to the shards not being
completely online.

Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-04-02 13:48:50 +02:00
Andrew Mayorov e8b06a6a9f
chore(dsrepl): mark few more BPAPI targets as obsolete 2024-04-02 13:48:50 +02:00
Andrew Mayorov d31cd0c728
feat(ds): ensure LTS state ids are deterministic 2024-04-02 13:48:50 +02:00
Andrew Mayorov 2cd357a5bd
fix(ds): ensure store batch is idempotent wrt generations 2024-04-02 13:48:50 +02:00
Andrew Mayorov 77a022bd93
feat(dsrepl): transfer storage snapshot during ra snapshot recovery 2024-04-02 13:48:49 +02:00
Andrew Mayorov b8b9b7739b
chore(ds): slightly simplify working with storage generations 2024-04-02 13:48:08 +02:00
Andrew Mayorov fa66a640c3
fix(dsrepl): handle RPC errors gracefully when storage is down 2024-03-28 15:17:01 +01:00
Ivan Dyachkov db9efb9317 chore: bump apps versions 2024-03-28 10:19:09 +01:00
Thales Macedo Garitezi 796c04e7a8 test: fix flaky test
We should emit the trace event before replying to callers.

Example failure:

https://github.com/emqx/emqx/actions/runs/8378977952/job/22946318696#step:6:182

```
 =CRITICAL REPORT==== 21-Mar-2024::17:45:37.676024 ===
"check stage" failed: error
{assertMatch,[{module,emqx_ds_storage_bitfield_lts_SUITE},
              {line,270},
              {expression,"? of_kind ( emqx_ds_replication_layer_egress_flush , Trace )"},
              {pattern,"[ # { batch := [ _ , _ , _ ] } ]"},
              {value,[]}]}
Stacktrace: [{emqx_ds_storage_bitfield_lts_SUITE,
                 '-t_atomic_store_batch/1-fun-1-',1,
                 [{file,
                      "/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"},
                  {line,270}]},
             {emqx_ds_storage_bitfield_lts_SUITE,t_atomic_store_batch,1,
                 [{file,
                      "/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"},
                  {line,249}]}]
```
2024-03-21 15:47:29 -03:00
Thales Macedo Garitezi 68af211130 fix(ds): reply sync callers after raft store failure 2024-03-21 15:40:21 -03:00
Thales Macedo Garitezi 70737a437a fix(ds): add caller to pending replies before flushing 2024-03-21 14:39:21 -03:00
Andrew Mayorov fe50a1711b
fix(ds-egress): drop pending batch on failures
Before this commit, messages in the current batch will be retried as
part of next batch. This could have led to message duplication which is
probably not what the user wants by default.
2024-03-20 13:20:25 +01:00
Andrew Mayorov a1f5de3f5b
fix(dsrepl): turn memoize into a safer function 2024-03-20 13:20:24 +01:00
Andrew Mayorov d39ca41070
chore(dsrepl): mark per-node `add_generation` RPC target obsolete
Also annotate internal exports with comments according with their
intended use.
2024-03-20 13:20:24 +01:00
Andrew Mayorov 35b18f9125
fix(dsrepl): properly handle error conditions in generation mgmt
Also update few outdated typespecs. Also make error reasons easier
to comprehend.
2024-03-20 13:20:24 +01:00
Andrew Mayorov f2268aa69a
fix(dsrepl): use correct base dir for ra system stuff
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-03-20 13:20:24 +01:00
Andrew Mayorov 404e919494
refactor(dsrepl): make shard allocator more robust and consistent
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-03-20 13:20:24 +01:00
Andrew Mayorov 0e18bd6e80
fix(dsrepl): increase replication site id bitsize back
In order to minimize chances of site id collision to practically zero.
2024-03-20 13:20:24 +01:00
Andrew Mayorov ac9700dd28
fix(dsrepl): split shard allocator into a separate module 2024-03-20 13:20:23 +01:00
Andrew Mayorov 1b647035d0
chore(dsrepl): make dialyzer a bit happier 2024-03-20 13:20:23 +01:00
Andrew Mayorov 611b3f0e07
feat(dsrepl): use more straightforward way to drop ra shards 2024-03-20 13:20:23 +01:00
Andrew Mayorov 74881e8706
feat(dsrepl): make storage layer unaware of granularity of time
Storage also becomes a bit more pure, depending on the upper layer to
provide the timestamps, which also makes it possible to handle more
operations idempotently.
2024-03-20 13:20:23 +01:00
Andrew Mayorov 3cb36a5619
feat(ds-lts): extract timestamp from storage key itself
1. This avoids the need to deserialize the message to get the timestamp.
2. It also makes possible to decouple the storage key timestamp from the
   message timestamp, which might be useful for replication purposes.
2024-03-19 20:21:56 +01:00
Andrew Mayorov 5cc0246351
feat(dsrepl): allow to tune select ra options 2024-03-19 20:21:55 +01:00
Andrew Mayorov 54b5adf868
feat(dsrepl): allocate shards predictably
To ensure strictly optimal and fair shard allocation across
cluster. Before this commit it was quite easy to end up with
an allocation significantly skewed towards some node, because
of the nature of randomness and relatively small number of
shards.
2024-03-19 20:21:55 +01:00
Andrew Mayorov 887e151be5
fix(dsrepl): handle errors gracefully in shard egress process
Also add cooldown on timeout / unavailability.
2024-03-19 20:21:53 +01:00
Andrew Mayorov e16aee99b4
chore(dsrepl): clarify how to perform leadership transfer in runtime 2024-03-19 20:21:18 +01:00
Andrew Mayorov 00d509f27b
feat(dsrepl): prefer local replica in read path
To optimize out any unnecessary RPCs. Given the load should be
smoothed evenly across the cluster, choosing non-leader node is
not a priority.
2024-03-19 20:11:42 +01:00
Andrew Mayorov 19305c223c
fix(dsrepl): require majority for replication-related tables 2024-03-19 20:11:42 +01:00
Andrew Mayorov f89909f60c
fix(dsrepl): tolerate trigger election timeouts for existing servers 2024-03-19 20:11:42 +01:00
Andrew Mayorov 3b59cf2ebf
feat(dsrepl): move shard allocation to separate process
That starts shard and egress processes only when shards are fully
allocated.
2024-03-19 20:11:41 +01:00
Andrew Mayorov 4dafbf21f6
fix(dsrepl): make db + shard part of machine state
It doesn't feel right, but right now is the easiest way to have it
in the scope of `apply/3`, because `init/1` doesn't actually invoked
for ra machines recovered from the existing log / snapshot.
2024-03-19 20:11:41 +01:00
Andrew Mayorov d19128ed65
feat(dsrepl): cache shard metadata in persistent terms 2024-03-19 20:11:41 +01:00
Andrew Mayorov e6c2c2fb07
feat(dsrepl): manage generations / db config through ra machine 2024-03-19 20:11:39 +01:00
Andrew Mayorov 5e94bdb932
feat(dsrepl): allocate shards once predefined number of sites online
Before this commit the most likely shard allocation outcome was
that all shard are allocated to just one node.
2024-03-19 20:11:03 +01:00
Andrew Mayorov be793e4735
fix(dsrepl): reassign timestamp at the time of submission
This is needed to ensure total message order for a shard, and
guarantee that no messages will be written "in the past". which
may break replay consistency.
2024-03-19 20:11:01 +01:00
Andrew Mayorov 146f082fdc
feat(dsrepl): implement raft-based replication
Still very rough but mostly working.
2024-03-19 20:09:44 +01:00
Ivan Dyachkov f2dc940436 Merge remote-tracking branch 'upstream/release-56' into 0319-sync-release56 2024-03-19 15:20:08 +01:00
Thales Macedo Garitezi 11ae04d810 test(ds): rm unused var warning 2024-03-14 17:16:24 -03:00
Thales Macedo Garitezi 2ebc8dcc55 fix(ds): use `infinity` timeout when storing batches 2024-03-14 10:17:18 -03:00
Thales Macedo Garitezi 6af01b916e feat(ds): implement `get_delete_streams`, `make_delete_iterator` and `delete_next` callbacks for builtin storage
Part of https://emqx.atlassian.net/browse/EMQX-11841
2024-03-08 09:56:46 -03:00
Andrew Mayorov f7e3afde16
test(ds): avoid introducing new macros 2024-03-07 16:49:20 +01:00
Andrew Mayorov 69427dc42d
test(ds): add tests for error mapping and replay recovery 2024-03-07 12:59:58 +01:00
Andrew Mayorov 09905d78cd
chore(ds): make error handling slightly simpler
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-03-07 12:59:57 +01:00
Andrew Mayorov b39c710ec2
fix(ds): tidy up few typespecs 2024-03-07 12:59:57 +01:00
Andrew Mayorov 2146d9e1fe
feat(ds): introduce error classes in critical API functions
For now, only recoverable / unrecoverable errors are introduced.
2024-03-07 12:59:57 +01:00
Thales Macedo Garitezi 5d87d400f4 feat(ds): add atomic store API
Part of https://emqx.atlassian.net/browse/EMQX-11841
2024-03-06 15:24:14 -03:00
Thales Macedo Garitezi 06334798a5 fix(ds): fix `drop_generation` typespec
This typespec fix will be used downstream by other backends.
2024-03-04 14:15:59 -03:00
Ilya Averyanov b706caf294 feat(ds): export types 2024-02-29 14:27:18 +03:00
Ilya Averyanov d5ae0e5c53 feat(ds): update delete/count interface 2024-02-28 22:51:24 +03:00
Ilya Averyanov b010d34640 chore(ds): add delete callbacks 2024-02-26 17:35:13 +03:00
Zaiming (Stone) Shi 46877e979b chore: update copyright-year 2024-02-23 08:21:06 +01:00
Thales Macedo Garitezi d469f4158e chore: bump app vsns 2024-02-20 16:53:57 -03:00
ieQu1 8cfb22f0b8
fix(ds): Retry getting the shard leader 2024-02-16 12:42:48 +01:00
ieQu1 280fcd8c52
Merge pull request #12437 from ieQu1/dev/optimize_make_filter
Optimize emqx_ds_bitmask_keymapper:make_filter function.
2024-02-05 17:32:28 +01:00
ieQu1 4665837cf0 fix(ds): Apply review remarks 2024-02-05 16:52:06 +01:00
ieQu1 c7888ad1f1
Merge pull request #12475 from ieQu1/dev/lean-stream
Use a more compact data structure to represent streams
2024-02-05 13:55:24 +01:00