The integration test test_replication_without_zookeeper/test.py::test_startup_without_zookeeper started failing deterministically on the antalya-25.8 branch after PR #1600 (Antalya 25.8: Bump to 25.8.21) was merged. Every MasterCI run on antalya-25.8 since the bump fails this test, across all build types (amd_asan, amd_tsan, amd_binary, arm_binary, etc.).
The failure is a kazoo.exceptions.NotEmptyError raised by zk.delete("/clickhouse", recursive=True) inside the test. It occurs because the ClickHouse DDLWorker re-creates /clickhouse/task_queue/replicas/<host_id> while the recursive delete is walking the tree.
This issue:
- Is a regression introduced by PR #1600 (Antalya 25.8: Bump to 25.8.21) on antalya-25.8. Last green commit before the bump: 49bb3f7beb5e (2026-03-22). First red: fa0d3798f20b on bump/antalya-25.8/25.8.21 (2026-03-30). Every commit on antalya-25.8 since then is 100% red on this test.
- Was traced to upstream PR ClickHouse/ClickHouse#92339 ("Check and mark the interserver IO address active in DDL worker"), the upstream change that PR #1600 backports and that alters the DDLWorker::markReplicasActive behavior.
- Has no existing GitHub issue tracking it upstream (we searched ClickHouse/ClickHouse for the test name, error signature, and related symbols).
How to reproduce the behavior
Environment
Affected versions: 25.8.22.20001.altinityantalya (commit a36e13144c06560f7e9ff2bf898b86cc432e90ad and later).
Last good Altinity version: 25.8.16.20002.altinityantalya.
Build type: Reproduces on every build type the test runs in.
Summary: start altinityinfra/clickhouse-server:1709-25.8.22.20001.altinityantalya + matching Keeper, create a ReplicatedMergeTree table, then run a tiny Kazoo loop:
```python
zk.delete("/clickhouse", recursive=True)
```
The very first delete raises NotEmptyError on the affected build.
Expected behavior
zk.delete("/clickhouse", recursive=True) should succeed (as it did on 25.8.16 and earlier), and the test should proceed to validate that ClickHouse switches the replica to read-only.
Actual behavior
The test fails at the recursive delete with:
```
test_replication_without_zookeeper/test.py:39: in drop_zk
    zk.delete(path="/clickhouse", recursive=True)
...
E   kazoo.exceptions.NotEmptyError
```
The kazoo wire log (excerpt from a real failed run) shows ClickHouse re-creating /clickhouse/task_queue/replicas/<host_id> between the recursive delete's walk operations.
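That interleaving can be reproduced without a live Keeper using a small in-memory model (FakeZK, its methods, and the local NotEmptyError class are illustrative stand-ins, not kazoo's API): the recursive delete empties the subtree first, a DDLWorker-style re-create slips in before the final delete, and the parent delete then fails exactly as ZooKeeper's would.

```python
# In-memory model of the race between a client-side recursive delete and a
# concurrent node re-creation. FakeZK is a flat path set, not a real client.

class NotEmptyError(Exception):
    """Stands in for kazoo.exceptions.NotEmptyError."""

class FakeZK:
    def __init__(self):
        self.nodes = set()

    def create(self, path):
        # Model of DDLWorker's createReplicaDirs, which creates ancestors too,
        # so we don't enforce parent existence here.
        self.nodes.add(path)

    def children(self, path):
        prefix = path + "/"
        return {n for n in self.nodes if n.startswith(prefix)}

    def delete(self, path):
        # Like ZooKeeper's plain delete: refuses to remove a non-empty node.
        if self.children(path):
            raise NotEmptyError(path)
        self.nodes.discard(path)

    def delete_recursive(self, path, interleave=None):
        # Client-side recursion, like kazoo's recursive=True: descendants
        # first (deepest first), then the node itself. `interleave` simulates
        # another client acting between the child deletes and the final delete.
        for child in sorted(self.children(path), key=len, reverse=True):
            self.delete(child)
        if interleave:
            interleave()  # DDLWorker re-creates replicas/<host_id> here
        self.delete(path)

zk = FakeZK()
for p in ["/clickhouse", "/clickhouse/task_queue",
          "/clickhouse/task_queue/replicas",
          "/clickhouse/task_queue/replicas/host1"]:
    zk.create(p)

recreate = lambda: zk.create("/clickhouse/task_queue/replicas/host1")
try:
    zk.delete_recursive("/clickhouse", interleave=recreate)
    print("delete succeeded")
except NotEmptyError as e:
    print("NotEmptyError:", e)  # the re-created child makes the root non-empty
```

Without the interleaved create, the same recursive delete removes the whole tree cleanly, which matches the pre-bump behavior of the test.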
Source diff evidence
In src/Interpreters/DDLWorker.cpp, PR ClickHouse#92339 moved createReplicaDirs(zookeeper, all_host_ids) out of initializeReplication() (called once on startup/reconnect) and into markReplicasActive(), which is called on every iteration of runMainThread:
```cpp
// runMainThread loop body — every iteration:
if (host_ids_updated.exchange(false))
    markReplicasActive(/*reinitialized=*/false); // new call site
...
markReplicasActive(reinitialized);

// markReplicasActive now does:
auto all_host_ids = getAllHostIDsFromClusters();
// ... add interserver IO host IDs ...
createReplicaDirs(zookeeper, all_host_ids); // every tick
```
Hypotheses (not yet proven)
The following items are our current understanding / hypotheses, separated from the evidence above:
Why this fails in antalya-25.8 deterministically while it is rare upstream: the same upstream PR (ClickHouse#92339) is present in upstream master and other release branches, where this test mostly passes (~5 real NotEmptyError in ~67k runs over 90 days on master). Our manual reproduction also shows the race is rare on the upstream image (25.8.22.28, ~1 in 3000) but deterministic on the Altinity 25.8.22 image (1st attempt). We don't yet have a confirmed explanation for this gap. Candidate factors include:
- The PR also bumps Keeper (src/Coordination/KeeperStorage.cpp +40/-8, KeeperConstants.cpp -6) — possibly changing tick/IO timing.
Regardless of the cause of the gap, the race itself is in upstream code — it reproduces with the upstream pure 25.8.22.28 image, just less frequently.
Suggested fix (not verified)
In src/Interpreters/DDLWorker.cpp, markReplicasActive(): only call createReplicaDirs when the worker is being reinitialized (or when host_ids_updated was just consumed), not on every main-loop iteration.
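As a sketch of the intended control flow (a Python model with names mirroring the C++ ones; illustrative only, not the actual patch), the dir-creation call becomes conditional while the active-marking still sees the full host set, including the interserver IO IDs, every tick:

```python
# Model of the proposed markReplicasActive: createReplicaDirs runs only on
# reinitialization or when the host-ID set just changed; active-marking still
# runs every tick over clusters + interserver IO host IDs. Counters stand in
# for the real ZooKeeper side effects.

calls = {"create_replica_dirs": 0, "mark_active_hosts": 0}
last_marked = set()  # host IDs seen by the most recent active-marking

def get_all_host_ids_from_clusters():
    return {"host1:9000", "host2:9000"}   # hypothetical host IDs

def interserver_io_host_ids():
    return {"host1:9009"}                  # hypothetical interserver IO ID

def create_replica_dirs(host_ids):
    calls["create_replica_dirs"] += 1      # would create replicas/<host_id> dirs

def mark_active(host_ids):
    calls["mark_active_hosts"] += 1        # would refresh the ephemeral "active" nodes
    last_marked.clear()
    last_marked.update(host_ids)

def mark_replicas_active(reinitialized, host_ids_changed):
    all_host_ids = get_all_host_ids_from_clusters() | interserver_io_host_ids()
    if reinitialized or host_ids_changed:
        create_replica_dirs(all_host_ids)  # no longer on every tick
    mark_active(all_host_ids)

# Main-loop model: one startup tick, then 100 ordinary ticks.
mark_replicas_active(reinitialized=True, host_ids_changed=False)
for _ in range(100):
    mark_replicas_active(reinitialized=False, host_ids_changed=False)

print(calls)  # create_replica_dirs stays at 1; mark_active runs every tick
```

In this model the interserver IO host ID is still part of every active-marking pass, which is the part of ClickHouse#92339 worth keeping; only the unconditional per-tick directory creation goes away.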
This restores the pre-ClickHouse#92339 invariant ("create host dirs once at startup/reconnect") without losing the interserver-IO fix that ClickHouse#92339 was actually trying to add (the interserver IO host IDs would still be added to all_host_ids and used during the active-marking).
Additional context
Related PRs
- PR #1600 — Antalya 25.8: Bump to 25.8.21 (carries the upstream backport that introduced this regression on antalya-25.8).
- Upstream PR ClickHouse/ClickHouse#92339 — Check and mark the interserver IO address active in DDL worker (the original change in DDLWorker.cpp).
- Upstream PR ClickHouse/ClickHouse#92223 — Remove no active host exception in DDL worker (also part of the bump).
No existing upstream issue
Searches in ClickHouse/ClickHouse for test_startup_without_zookeeper, test_replication_without_zookeeper, DDLWorker markReplicasActive, createReplicaDirs, and task_queue NotEmpty returned no matches.
✅ I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.
Type of problem
Bug report - something's broken
Reproduction options
Option A — Run the integration test locally
Expected: the test fails with kazoo.exceptions.NotEmptyError raised at test_replication_without_zookeeper/test.py:39 (drop_zk).
Option B — Manual reproduction (Docker + Keeper + Kazoo)
Full step-by-step reproduction (containers, configs, Python script): https://gist.github.com/CarlosFelipeOR/2e794f24cbeb175a0068d61ea1ce537b
Evidence
The following items are direct evidence (measured / observed):
CI evidence (from gh-data.checks and play.clickhouse.com)
Altinity CI, test_replication_without_zookeeper/test.py::test_startup_without_zookeeper, last 90 days:
- antalya-25.8, commits before the 25.8.21 bump (up to 49bb3f7beb5e, 2026-03-22): passing.
- antalya-25.8, ≥ 25.8.21 (after PR #1600): failing on every run, across all build types (amd_asan, amd_tsan, amd_binary, arm_binary distributed plan, etc.).
- antalya-25.3, antalya-26.1, antalya-26.3, rebase-cicd-v26.3.2.3-lts: passing.
Boundary commits on antalya-25.8: last green 49bb3f7beb5e; first red fa0d3798f20b (bump/antalya-25.8/25.8.21); subsequent reds include a36e13144c06 and 1eed78a32cf9/5c9d52363de8.
Specific failing run referenced in this report: commit a36e13144c06560f7e9ff2bf898b86cc432e90ad.
Manual reproduction evidence
Tested with three builds, identical Kazoo loop:
- altinity/clickhouse-server:25.8.16.20002.altinityantalya (pre-bump): no NotEmptyError.
- altinityinfra/clickhouse-server:1709-25.8.22.20001.altinityantalya (post-bump): NotEmptyError on the first attempt.
- clickhouse/clickhouse-server:25.8.22.28 (upstream pure): NotEmptyError, roughly 1 in 3000 attempts.
Manual reproduction
Full step-by-step instructions (Docker + Keeper + Kazoo): https://gist.github.com/CarlosFelipeOR/2e794f24cbeb175a0068d61ea1ce537b