ieQu1
29345aaa30
fix(ds): Fix idle event generation in bitfield_lts layout
2024-05-22 18:01:33 +02:00
ieQu1
b3ded7edce
fix(ds): Fix code review remark
2024-05-22 18:01:33 +02:00
ieQu1
60edf5e9b8
fix(ds): Move responsibility of returning end_of_stream to the CBM
2024-05-22 18:01:33 +02:00
ieQu1
0ff307e789
fix(ds): Include generation ID in the storage events
...
Make sure storage events originating from generation X are handled in
the context of the same generation.
2024-05-22 18:01:33 +02:00
ieQu1
acdae4fba3
fix(ds): Workaround for the idempotency error when dropping gens
2024-05-22 18:01:33 +02:00
ieQu1
eb7c43ee9d
fix(ds): Always store messages in the current generation
2024-05-22 18:01:33 +02:00
ieQu1
074d98a14a
test(ds): Refactor ds_SUITE
2024-05-22 18:01:33 +02:00
ieQu1
e4a73f003a
feat(ds): Implement format_status callback
...
Reduce volume of logs and crash reports from DS
2024-05-22 18:01:32 +02:00
ieQu1
1526c527d0
fix(ds): Log generation operations
2024-05-22 18:01:32 +02:00
ieQu1
aca2d9586c
fix(ds): Fix return type of drop_generation
2024-05-22 18:01:32 +02:00
ieQu1
c6fc76e335
fix(ds): Perform read operations on the leader.
2024-05-22 18:01:32 +02:00
ieQu1
4580906405
fix(ds): Use erpc instead of gen_rpc for `delete_next'
2024-05-22 18:01:32 +02:00
Andrew Mayorov
e6c5c1b598
chore(dsrepl): provide more information in rebalancing log messages
2024-05-22 17:24:08 +02:00
Andrew Mayorov
c355c9ad50
fix(dsrepl): properly handle transaction abort during forget site
2024-05-22 17:22:55 +02:00
JimMoen
4ad0743f61
Merge pull request #13081 from JimMoen/fix-typo
...
chore: fix typos
2024-05-22 02:05:20 +08:00
JimMoen
bb3c66638c
chore: fix typos
2024-05-21 17:45:20 +08:00
Andrew Mayorov
86f99959b0
Merge pull request #13054 from keynslug/fix/EMQX-12365/node-leave
...
fix(dsrepl): anticipate and handle nodes leaving the cluster
2024-05-17 09:43:15 +02:00
ieQu1
ee6e7174cf
fix(sessds): Rename the durable messages DB to `messages`
2024-05-16 21:31:32 +02:00
Andrew Mayorov
5157e61418
fix(dsrepl): verify if shards already allocated first
2024-05-16 18:56:54 +02:00
Andrew Mayorov
0119728d45
feat(dsrepl): also reflect pending transitions in ds status
2024-05-16 18:56:21 +02:00
Andrew Mayorov
26c4a4f597
feat(dsrepl): reflect conflicts and inconsistencies in ds status
2024-05-16 18:32:08 +02:00
Andrew Mayorov
7e86e3e61c
fix(dsrepl): anticipate and handle nodes leaving the cluster
...
Also make `claim_site/2` safer by refusing to claim a site for a node
that is already there.
2024-05-16 18:32:07 +02:00
Andrew Mayorov
35e360fcbe
feat(api-ds): provide more information on nonexistent site leave
2024-05-14 15:57:41 +02:00
ieQu1
525e4dac95
Merge pull request #13036 from ieQu1/dev/reduce-log-spam
...
tests: Reduce log spam in the failed test suites
2024-05-14 10:53:30 +02:00
ieQu1
ac3f5a083d
test: Reduce log spam in the failed test suites
2024-05-13 22:00:33 +02:00
ieQu1
8506ca7919
Merge pull request #12998 from ieQu1/dev/improve-latency
...
Use leader's clock when calculating LTS cutoff timestamp
2024-05-13 21:54:06 +02:00
ieQu1
3da3a36863
test(ds): Add generation in the replication suite
2024-05-13 19:51:04 +02:00
ieQu1
9f7ef9f34f
fix(ds): Apply review remarks
2024-05-13 19:35:24 +02:00
ieQu1
07aa708894
test(ds): Refactor replication suite
2024-05-09 03:56:56 +02:00
ieQu1
63e51fca66
test(ds): Use streams to fill the storage
2024-05-09 02:46:57 +02:00
ieQu1
a0a3977043
feat(ds): Assign latest timestamp deterministically
2024-05-08 23:17:57 +02:00
ieQu1
2236af84ba
feat(ds): two-stage storage commit on the storage level
2024-05-08 23:17:57 +02:00
ieQu1
1ddbbca90e
feat(ds): Allow incremental update of the LTS trie
2024-05-08 23:17:57 +02:00
ieQu1
68ca891f41
test(ds): Use streams to create traffic
2024-05-08 23:17:57 +02:00
Andrew Mayorov
d84c180ccb
feat(dsrepl): avoid contacting unreachable ra servers
...
Assuming estabilished Erlang distribution channel is a reliable way to
tell whether a remote node is reachable.
2024-05-08 18:12:13 +02:00
ieQu1
3642bcd1b6
docs(ds): Fix comment for the builtin DS metrics
2024-05-06 11:21:32 +02:00
ieQu1
b2a633aca1
fix(ds): Use leader's clock for computing LTS safe cutoff time
2024-05-06 11:21:32 +02:00
ieQu1
1ff2e02fd9
feat(ds): Pass current time to the storage layer via argument
2024-05-06 11:21:32 +02:00
ieQu1
8ac9700aab
feat(ds): Add an API for DB-global variables
2024-05-06 11:21:32 +02:00
ieQu1
86d45522e3
fix(dsrepl): Don't reverse elements of batches
2024-05-06 11:21:32 +02:00
ieQu1
bcfa7b2209
fix(ds): Destroy LTS tries when the generation is dropped
2024-05-06 11:21:32 +02:00
ieQu1
9999ccd36c
feat(ds): Ignore safe cutoff time for streams without varying levels
2024-05-06 11:21:32 +02:00
ieQu1
e4c3283c9c
docs(ds): Update README with CLI and REST API endpoints
2024-04-23 16:28:35 +02:00
ieQu1
4c76a2574d
fix(ds): Fix egress flush condition
2024-04-21 21:51:31 +02:00
Andrew Mayorov
43f8346c00
fix(dssnap): ensure idempotent write of empty chunks
2024-04-19 18:52:33 +02:00
ieQu1
93bb840365
docs(ds): Update README
2024-04-17 01:21:52 +02:00
Andrew Mayorov
5d7b2e2ce6
fix(dsrepl): attempt leadership transfer on terminate
...
In addition to on removal. The reasoning is basically the same: try to
avoid situations when log entries are replicated (or will be considered
replicated when the new leader is elected) but the leader terminates
before replying to the client.
To be clear: this is a stupid solution. Something much more robust is
needed.
2024-04-15 22:05:24 +02:00
Andrew Mayorov
89f42f1171
fix(dsrepl): make placeholder shard process permanent under supervisor
2024-04-15 16:43:52 +02:00
Andrew Mayorov
c4d1360b96
fix(dsrepl): trigger election for new ra servers unconditionallly
...
Otherwise we might end up in a situation when there's no member online
yet at the time of the election trigger, and the election will never
happen.
2024-04-15 16:42:29 +02:00
Andrew Mayorov
d12e907209
fix(dsrepl): correctly handle ra membership change command results
...
Before this change, results similar to `{error, {no_more_servers_to_try,
[{error, nodedown}, {error, not_member}]}}` were considered retryable
failures, which is incorrect.
2024-04-08 22:44:34 +02:00
Andrew Mayorov
3223797ae5
fix(dsrepl): attempt leadership transfer before server removal
...
This should make it much less likely to hit weird edge cases that lead
to duplicate Raft log entries because of client retries upon receiving
`shutdown` from the leader being removed.
2024-04-08 22:43:58 +02:00
Andrew Mayorov
1e95bd4da6
test(dsrepl): test unresponsive nodes removal / node restarts
2024-04-08 21:27:56 +02:00
Andrew Mayorov
7a836317ac
fix(dsrepl): trigger unfinished shard transition upon startup
...
Also provide a trivial API to trigger them by hand.
2024-04-08 16:12:42 +02:00
Andrew Mayorov
75bb7f5cdc
fix(dsrepl): retry only `{add, Site}` crashed membership transitions
...
To minimize the potential negative impact of removal transitions that
crash for some unknown and unusual reasons.
2024-04-08 16:04:33 +02:00
Andrew Mayorov
4c0cc079c2
fix(dsrepl): apply unnecessary rebalancing transitions cleanly
2024-04-08 13:25:45 +02:00
Andrew Mayorov
dcde30c38a
test(dsrepl): add two more testcases for rebalancing
2024-04-08 13:22:31 +02:00
Andrew Mayorov
2ace9bb893
chore(dsrepl): sprinkle few comments and typespecs for exports
2024-04-07 22:51:56 +02:00
Andrew Mayorov
ecaad348a7
chore(dsrepl): update few outdated comments / TODOs
2024-04-07 22:51:56 +02:00
Andrew Mayorov
6293efb995
fix(dsrepl): retry crashed membership transitions
2024-04-07 22:51:56 +02:00
Andrew Mayorov
826ce5806d
fix(dsrepl): ensure that new member UID matches server's UID
...
Before that change, UIDs supplied in the `ra:add_member/3` were not
the same as those servers were using. This haven't caused any issues
for some reason, but it's better to ensure that UIDs are the same.
2024-04-07 22:31:24 +02:00
Andrew Mayorov
556ffc78c9
feat(dsrepl): implement membership changes and rebalancing
2024-04-05 18:57:28 +02:00
Andrew Mayorov
d6058b7f51
feat(dsrepl): allow to subscribe to DB metadata changes
...
Currently, only shard metadata changes are announced to the
subscribers.
2024-04-05 17:40:55 +02:00
Andrew Mayorov
a07295d3bc
fix(ds): address shards in the supervisor properly
2024-04-05 17:40:38 +02:00
ieQu1
a62db08676
feat(ds): Add REST API for durable storage
2024-04-05 15:22:06 +02:00
ieQu1
d09787d1a6
fix(ds): Fix return types in replication_layer_meta
2024-04-05 15:22:06 +02:00
Andrew Mayorov
70396e9766
Merge pull request #12825 from keynslug/feat/EMQX-12110/repl-meta-api
...
feat(dsrepl): add APIs to manage DB replication sites
2024-04-04 22:32:03 +02:00
Andrew Mayorov
df6c5b35fe
feat(dsrepl): add more primitive operations to modify DB sites
2024-04-04 21:22:49 +02:00
Andrew Mayorov
bb8ffee18c
feat(dsrepl): add API to get current DB replication sites
2024-04-04 21:22:02 +02:00
Andrew Mayorov
ad52f7838e
feat(dsrepl): add APIs to manage DB replication sites
2024-04-04 21:22:01 +02:00
Thales Macedo Garitezi
c57c36adb2
feat(ds): clear all checkpoints when (re)starting storage layer
...
Fixes https://emqx.atlassian.net/browse/EMQX-12143
2024-04-04 14:05:52 -03:00
ieQu1
f37ed3a40a
fix(ds): Limit the number of retries in egress to 0
2024-04-03 16:38:49 +02:00
ieQu1
2bbfada7af
fix(ds): Make async batches truly async
2024-04-03 11:57:47 +02:00
ieQu1
92ca90c0ca
fix(ds): Improve egress logging
2024-04-03 11:57:47 +02:00
ieQu1
ae5935e7f7
test(ds): Attempt to stabilize metrics_worker tests in CI
2024-04-02 19:14:10 +02:00
ieQu1
4382971443
fix(ds): Preserve errors in the egress
2024-04-02 16:47:43 +02:00
ieQu1
94ca7ad0f8
feat(ds): Report counters for LTS storage layout
2024-04-02 16:47:43 +02:00
ieQu1
b379f331de
fix(sessds): Handle errors when storing messages
2024-04-02 16:47:41 +02:00
ieQu1
f41e538526
feat(sessds): Observe next time
2024-04-02 16:45:52 +02:00
ieQu1
75b092bf0e
fix(ds): Actually retry sending batch
2024-04-02 16:45:49 +02:00
ieQu1
0de255cac8
feat(ds): Report egress flush time
2024-04-02 16:25:04 +02:00
ieQu1
044f3d4ef5
fix(ds): Don't reverse entries in the atomic batch
2024-04-02 16:25:04 +02:00
ieQu1
606f2a88cd
feat(ds): Add egress metrics
2024-04-02 16:25:04 +02:00
ieQu1
c9de336234
feat(ds): Add metrics worker to the builtin db supervision tree
2024-04-02 16:25:04 +02:00
Andrew Mayorov
778e897f1f
chore(dsrepl): describe snapshot ownership and few shortcomings
2024-04-02 13:48:51 +02:00
Andrew Mayorov
c666c65c6a
test(ds): factor out storage iteration into helper module
2024-04-02 13:48:51 +02:00
Andrew Mayorov
7cebf598a8
chore(dsrepl): simplify snapshot transfer code a bit
...
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-04-02 13:48:51 +02:00
Andrew Mayorov
e029b8f996
test(dsrepl): wait for whole cluster readiness
...
To minimize the chance of flaky tests due to the shards not being
completely online.
Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>
2024-04-02 13:48:50 +02:00
Andrew Mayorov
e8b06a6a9f
chore(dsrepl): mark few more BPAPI targets as obsolete
2024-04-02 13:48:50 +02:00
Andrew Mayorov
d31cd0c728
feat(ds): ensure LTS state ids are deterministic
2024-04-02 13:48:50 +02:00
Andrew Mayorov
2cd357a5bd
fix(ds): ensure store batch is idempotent wrt generations
2024-04-02 13:48:50 +02:00
Andrew Mayorov
77a022bd93
feat(dsrepl): transfer storage snapshot during ra snapshot recovery
2024-04-02 13:48:49 +02:00
Andrew Mayorov
b8b9b7739b
chore(ds): slightly simplify working with storage generations
2024-04-02 13:48:08 +02:00
Andrew Mayorov
fa66a640c3
fix(dsrepl): handle RPC errors gracefully when storage is down
2024-03-28 15:17:01 +01:00
Ivan Dyachkov
db9efb9317
chore: bump apps versions
2024-03-28 10:19:09 +01:00
Thales Macedo Garitezi
796c04e7a8
test: fix flaky test
...
We should emit the trace event before replying to callers.
Example failure:
https://github.com/emqx/emqx/actions/runs/8378977952/job/22946318696#step:6:182
```
=CRITICAL REPORT==== 21-Mar-2024::17:45:37.676024 ===
"check stage" failed: error
{assertMatch,[{module,emqx_ds_storage_bitfield_lts_SUITE},
{line,270},
{expression,"? of_kind ( emqx_ds_replication_layer_egress_flush , Trace )"},
{pattern,"[ # { batch := [ _ , _ , _ ] } ]"},
{value,[]}]}
Stacktrace: [{emqx_ds_storage_bitfield_lts_SUITE,
'-t_atomic_store_batch/1-fun-1-',1,
[{file,
"/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"},
{line,270}]},
{emqx_ds_storage_bitfield_lts_SUITE,t_atomic_store_batch,1,
[{file,
"/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"},
{line,249}]}]
```
2024-03-21 15:47:29 -03:00
Thales Macedo Garitezi
68af211130
fix(ds): reply sync callers after raft store failure
2024-03-21 15:40:21 -03:00
Thales Macedo Garitezi
70737a437a
fix(ds): add caller to pending replies before flushing
2024-03-21 14:39:21 -03:00
Andrew Mayorov
fe50a1711b
fix(ds-egress): drop pending batch on failures
...
Before this commit, messages in the current batch will be retried as
part of next batch. This could have led to message duplication which is
probably not what the user wants by default.
2024-03-20 13:20:25 +01:00
Andrew Mayorov
a1f5de3f5b
fix(dsrepl): turn memoize into a safer function
2024-03-20 13:20:24 +01:00
Andrew Mayorov
d39ca41070
chore(dsrepl): mark per-node `add_generation` RPC target obsolete
...
Also annotate internal exports with comments according with their
intended use.
2024-03-20 13:20:24 +01:00