yuanbiao/emqx - emqx

Commit Graph

Author	SHA1	Message	Date
Andrew Mayorov	5d7b2e2ce6	fix(dsrepl): attempt leadership transfer on terminate In addition to on removal. The reasoning is basically the same: try to avoid situations when log entries are replicated (or will be considered replicated when the new leader is elected) but the leader terminates before replying to the client. To be clear: this is a stupid solution. Something much more robust is needed.	2024-04-15 22:05:24 +02:00
Andrew Mayorov	89f42f1171	fix(dsrepl): make placeholder shard process permanent under supervisor	2024-04-15 16:43:52 +02:00
Andrew Mayorov	c4d1360b96	fix(dsrepl): trigger election for new ra servers unconditionally Otherwise we might end up in a situation when there's no member online yet at the time of the election trigger, and the election will never happen.	2024-04-15 16:42:29 +02:00
Andrew Mayorov	d12e907209	fix(dsrepl): correctly handle ra membership change command results Before this change, results similar to `{error, {no_more_servers_to_try, [{error, nodedown}, {error, not_member}]}}` were considered retryable failures, which is incorrect.	2024-04-08 22:44:34 +02:00
Andrew Mayorov	3223797ae5	fix(dsrepl): attempt leadership transfer before server removal This should make it much less likely to hit weird edge cases that lead to duplicate Raft log entries because of client retries upon receiving `shutdown` from the leader being removed.	2024-04-08 22:43:58 +02:00
Andrew Mayorov	1e95bd4da6	test(dsrepl): test unresponsive nodes removal / node restarts	2024-04-08 21:27:56 +02:00
Andrew Mayorov	7a836317ac	fix(dsrepl): trigger unfinished shard transition upon startup Also provide a trivial API to trigger them by hand.	2024-04-08 16:12:42 +02:00
Andrew Mayorov	75bb7f5cdc	fix(dsrepl): retry only `{add, Site}` crashed membership transitions To minimize the potential negative impact of removal transitions that crash for some unknown and unusual reasons.	2024-04-08 16:04:33 +02:00
Andrew Mayorov	4c0cc079c2	fix(dsrepl): apply unnecessary rebalancing transitions cleanly	2024-04-08 13:25:45 +02:00
Andrew Mayorov	dcde30c38a	test(dsrepl): add two more testcases for rebalancing	2024-04-08 13:22:31 +02:00
Andrew Mayorov	2ace9bb893	chore(dsrepl): sprinkle few comments and typespecs for exports	2024-04-07 22:51:56 +02:00
Andrew Mayorov	ecaad348a7	chore(dsrepl): update few outdated comments / TODOs	2024-04-07 22:51:56 +02:00
Andrew Mayorov	6293efb995	fix(dsrepl): retry crashed membership transitions	2024-04-07 22:51:56 +02:00
Andrew Mayorov	826ce5806d	fix(dsrepl): ensure that new member UID matches server's UID Before that change, UIDs supplied in the `ra:add_member/3` were not the same as those servers were using. This hasn't caused any issues for some reason, but it's better to ensure that UIDs are the same.	2024-04-07 22:31:24 +02:00
Andrew Mayorov	556ffc78c9	feat(dsrepl): implement membership changes and rebalancing	2024-04-05 18:57:28 +02:00
Andrew Mayorov	d6058b7f51	feat(dsrepl): allow to subscribe to DB metadata changes Currently, only shard metadata changes are announced to the subscribers.	2024-04-05 17:40:55 +02:00
Andrew Mayorov	a07295d3bc	fix(ds): address shards in the supervisor properly	2024-04-05 17:40:38 +02:00
ieQu1	a62db08676	feat(ds): Add REST API for durable storage	2024-04-05 15:22:06 +02:00
ieQu1	d09787d1a6	fix(ds): Fix return types in replication_layer_meta	2024-04-05 15:22:06 +02:00
Andrew Mayorov	70396e9766	Merge pull request #12825 from keynslug/feat/EMQX-12110/repl-meta-api feat(dsrepl): add APIs to manage DB replication sites	2024-04-04 22:32:03 +02:00
Andrew Mayorov	df6c5b35fe	feat(dsrepl): add more primitive operations to modify DB sites	2024-04-04 21:22:49 +02:00
Andrew Mayorov	bb8ffee18c	feat(dsrepl): add API to get current DB replication sites	2024-04-04 21:22:02 +02:00
Andrew Mayorov	ad52f7838e	feat(dsrepl): add APIs to manage DB replication sites	2024-04-04 21:22:01 +02:00
Thales Macedo Garitezi	c57c36adb2	feat(ds): clear all checkpoints when (re)starting storage layer Fixes https://emqx.atlassian.net/browse/EMQX-12143	2024-04-04 14:05:52 -03:00
ieQu1	f37ed3a40a	fix(ds): Limit the number of retries in egress to 0	2024-04-03 16:38:49 +02:00
ieQu1	2bbfada7af	fix(ds): Make async batches truly async	2024-04-03 11:57:47 +02:00
ieQu1	92ca90c0ca	fix(ds): Improve egress logging	2024-04-03 11:57:47 +02:00
ieQu1	ae5935e7f7	test(ds): Attempt to stabilize metrics_worker tests in CI	2024-04-02 19:14:10 +02:00
ieQu1	4382971443	fix(ds): Preserve errors in the egress	2024-04-02 16:47:43 +02:00
ieQu1	94ca7ad0f8	feat(ds): Report counters for LTS storage layout	2024-04-02 16:47:43 +02:00
ieQu1	b379f331de	fix(sessds): Handle errors when storing messages	2024-04-02 16:47:41 +02:00
ieQu1	f41e538526	feat(sessds): Observe next time	2024-04-02 16:45:52 +02:00
ieQu1	75b092bf0e	fix(ds): Actually retry sending batch	2024-04-02 16:45:49 +02:00
ieQu1	0de255cac8	feat(ds): Report egress flush time	2024-04-02 16:25:04 +02:00
ieQu1	044f3d4ef5	fix(ds): Don't reverse entries in the atomic batch	2024-04-02 16:25:04 +02:00
ieQu1	606f2a88cd	feat(ds): Add egress metrics	2024-04-02 16:25:04 +02:00
ieQu1	c9de336234	feat(ds): Add metrics worker to the builtin db supervision tree	2024-04-02 16:25:04 +02:00
Andrew Mayorov	778e897f1f	chore(dsrepl): describe snapshot ownership and few shortcomings	2024-04-02 13:48:51 +02:00
Andrew Mayorov	c666c65c6a	test(ds): factor out storage iteration into helper module	2024-04-02 13:48:51 +02:00
Andrew Mayorov	7cebf598a8	chore(dsrepl): simplify snapshot transfer code a bit Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>	2024-04-02 13:48:51 +02:00
Andrew Mayorov	e029b8f996	test(dsrepl): wait for whole cluster readiness To minimize the chance of flaky tests due to the shards not being completely online. Co-Authored-By: Thales Macedo Garitezi <thalesmg@gmail.com>	2024-04-02 13:48:50 +02:00
Andrew Mayorov	e8b06a6a9f	chore(dsrepl): mark few more BPAPI targets as obsolete	2024-04-02 13:48:50 +02:00
Andrew Mayorov	d31cd0c728	feat(ds): ensure LTS state ids are deterministic	2024-04-02 13:48:50 +02:00
Andrew Mayorov	2cd357a5bd	fix(ds): ensure store batch is idempotent wrt generations	2024-04-02 13:48:50 +02:00
Andrew Mayorov	77a022bd93	feat(dsrepl): transfer storage snapshot during ra snapshot recovery	2024-04-02 13:48:49 +02:00
Andrew Mayorov	b8b9b7739b	chore(ds): slightly simplify working with storage generations	2024-04-02 13:48:08 +02:00
Andrew Mayorov	fa66a640c3	fix(dsrepl): handle RPC errors gracefully when storage is down	2024-03-28 15:17:01 +01:00
Ivan Dyachkov	db9efb9317	chore: bump apps versions	2024-03-28 10:19:09 +01:00
Thales Macedo Garitezi	796c04e7a8	test: fix flaky test We should emit the trace event before replying to callers. Example failure: https://github.com/emqx/emqx/actions/runs/8378977952/job/22946318696#step:6:182 ``` =CRITICAL REPORT==== 21-Mar-2024::17:45:37.676024 === "check stage" failed: error {assertMatch,[{module,emqx_ds_storage_bitfield_lts_SUITE}, {line,270}, {expression,"? of_kind ( emqx_ds_replication_layer_egress_flush , Trace )"}, {pattern,"[ # { batch := [ _ , _ , _ ] } ]"}, {value,[]}]} Stacktrace: [{emqx_ds_storage_bitfield_lts_SUITE, '-t_atomic_store_batch/1-fun-1-',1, [{file, "/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"}, {line,270}]}, {emqx_ds_storage_bitfield_lts_SUITE,t_atomic_store_batch,1, [{file, "/__w/emqx/emqx/apps/emqx_durable_storage/test/emqx_ds_storage_bitfield_lts_SUITE.erl"}, {line,249}]}] ```	2024-03-21 15:47:29 -03:00
Thales Macedo Garitezi	68af211130	fix(ds): reply sync callers after raft store failure	2024-03-21 15:40:21 -03:00

240 Commits