Antalya-25.8: test_replication_without_zookeeper::test_startup_without_zookeeper consistently fails since 25.8.21 bump (#1600) #1711

@CarlosFelipeOR

Description

I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.

Type of problem

Bug report - something's broken

Describe the situation

The integration test test_replication_without_zookeeper/test.py::test_startup_without_zookeeper started failing deterministically on the antalya-25.8 branch after PR #1600 (Antalya 25.8: Bump to 25.8.21) was merged. Every MasterCI run on antalya-25.8 since the bump fails this test, across all build types (amd_asan, amd_tsan, amd_binary, arm_binary, etc.).

The failure is a kazoo.exceptions.NotEmptyError raised by zk.delete("/clickhouse", recursive=True) inside the test. It occurs because the ClickHouse DDLWorker re-creates /clickhouse/task_queue/replicas/<host_id> while the recursive delete is walking the tree.

This issue:

  • Is a regression introduced by PR #1600 (Antalya 25.8: Bump to 25.8.21) on antalya-25.8. Last green commit before the bump: 49bb3f7beb5e (2026-03-22). First red: bump/antalya-25.8/25.8.21 (fa0d3798f20b, 2026-03-30). Every commit on antalya-25.8 since then is 100% red on this test.
  • Was traced to upstream PR ClickHouse/ClickHouse#92339 ("Check and mark the interserver IO address active in DDL worker"), the upstream change that PR #1600 backports and that alters the DDLWorker::markReplicasActive behavior.
  • Has no existing GitHub issue tracking it upstream (we searched ClickHouse/ClickHouse for the test name, error signature, and related symbols).

How to reproduce the behavior

Environment

  • Affected versions: 25.8.22.20001.altinityantalya (commit a36e13144c06560f7e9ff2bf898b86cc432e90ad and later).
  • Last good Altinity version: 25.8.16.20002.altinityantalya.
  • Build type: Reproduces on every build type the test runs in.

Option A — Run the integration test locally

# Inside an antalya-25.8 (>= 25.8.21) ClickHouse checkout
cd tests/integration
./runner --binary /path/to/clickhouse \
    'test_replication_without_zookeeper/test.py::test_startup_without_zookeeper'

Expected: the test fails with kazoo.exceptions.NotEmptyError raised at
test_replication_without_zookeeper/test.py:39 (drop_zk).

Option B — Manual reproduction (Docker + Keeper + Kazoo)

Full step-by-step reproduction (containers, configs, Python script): https://gist.github.com/CarlosFelipeOR/2e794f24cbeb175a0068d61ea1ce537b

Summary: start altinityinfra/clickhouse-server:1709-25.8.22.20001.altinityantalya + matching Keeper, create a ReplicatedMergeTree table, then run a tiny Kazoo loop:

zk.delete("/clickhouse", recursive=True)

The very first delete raises NotEmptyError on the affected build.
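A minimal sketch of that loop (the full script is in the gist; the Keeper address here is an assumption):

import time
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError, NotEmptyError

# Point Kazoo at the Keeper the affected ClickHouse server is using.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

attempt = 0
while True:
    attempt += 1
    # Wait for the server's DDLWorker to re-create /clickhouse/task_queue.
    while not zk.exists("/clickhouse/task_queue"):
        time.sleep(0.05)
    try:
        zk.delete("/clickhouse", recursive=True)
    except NotEmptyError:
        print(f"NotEmptyError on attempt {attempt}")
        break
    except NoNodeError:
        pass  # tree vanished mid-walk; harmless, retry
zk.stop()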


Expected behavior

zk.delete("/clickhouse", recursive=True) should succeed (as it did on 25.8.16 and earlier), and the test should proceed to validate that ClickHouse switches the replica to read-only.
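For contrast, a sketch of the passing behavior (the table name and query are illustrative, not the test's exact assertions):

# On a healthy build the recursive delete completes and the tree is gone.
zk.delete("/clickhouse", recursive=True)   # no exception
assert zk.exists("/clickhouse") is None

# The test then verifies the replica went read-only, roughly equivalent to:
#   SELECT is_readonly FROM system.replicas WHERE table = 'test_table'
# returning 1 once the ZooKeeper metadata is gone.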


Actual behavior

The test fails at the recursive delete with:

test_replication_without_zookeeper/test.py:39: in drop_zk
    zk.delete(path="/clickhouse", recursive=True)
...
E   kazoo.exceptions.NotEmptyError

The kazoo wire log (excerpt from a real failed run) shows ClickHouse re-creating /clickhouse/task_queue/replicas/<host_id> between the recursive delete's walk operations:

xid=14: Delete /clickhouse/task_queue/replicas             -> True
xid=15: GetChildren /clickhouse/task_queue/ddl             -> []
xid=16: Delete /clickhouse/task_queue/ddl                  -> True
xid=17: Delete /clickhouse/task_queue                      -> NotEmptyError
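The same interleaving can be forced by hand against a bare Keeper, with no ClickHouse involved (a sketch assuming a local Keeper at 127.0.0.1:2181): replay kazoo's children-first walk manually and re-create a replica node just before the parent delete.

from kazoo.client import KazooClient
from kazoo.exceptions import NotEmptyError

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed Keeper address
zk.start()
zk.create("/clickhouse/task_queue/replicas/host1", makepath=True)

# Replay the walk: kazoo deletes children before their parent.
for child in zk.get_children("/clickhouse/task_queue/replicas"):
    zk.delete("/clickhouse/task_queue/replicas/" + child)
zk.delete("/clickhouse/task_queue/replicas")        # like xid=14

# A DDLWorker tick re-creating its replica node lands here ...
zk.create("/clickhouse/task_queue/replicas/host1", makepath=True)

# ... so the parent delete fails exactly like xid=17 in the log above.
try:
    zk.delete("/clickhouse/task_queue")
except NotEmptyError:
    print("NotEmptyError: task_queue has children again")
zk.stop()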

Evidence

The following items are direct evidence (measured / observed):

CI evidence (from gh-data.checks and play.clickhouse.com)

Altinity CI — test_replication_without_zookeeper/test.py::test_startup_without_zookeeper last 90 days:

| Branch | Pass | Fail | Notes |
|---|---|---|---|
| antalya-25.8 (commits before the 25.8.21 bump) | 60+ | 0 | Last green: 49bb3f7beb5e (2026-03-22) |
| antalya-25.8 (≥ 25.8.21, after PR #1600) | 0 | 63 | 100% fail across amd_asan, amd_tsan, amd_binary, arm_binary distributed plan, etc. |
| antalya-25.3 | 71 | 0 | Does not have the backport |
| antalya-26.1 | 1088 | 1 | Single isolated occurrence |
| antalya-26.3 | 121 | 0 | |
| rebase-cicd-v26.3.2.3-lts | 0 | 18 | Same regression on a 26.3 rebase branch (also has the backport) |

Boundary commits on antalya-25.8:

| Commit | Date | Result | Notes |
|---|---|---|---|
| 49bb3f7beb5e | 2026-03-22 | 4/4 PASS | Last green |
| bump/antalya-25.8/25.8.21 (fa0d3798f20b) | 2026-03-30 | 0/6 FAIL | First red |
| a36e13144c06 | 2026-04-16 | 0/4 FAIL | Merge commit of PR #1600 |
| 1eed78a32cf9 / 5c9d52363de8 | 2026-04-23+ | 0/15 FAIL | Continued |

Specific failing run referenced in this report:

Manual reproduction evidence

Tested with three builds, identical Kazoo loop:

| Image | First NotEmptyError after | Result |
|---|---|---|
| altinity/clickhouse-server:25.8.16.20002.altinityantalya (pre-bump) | ~4000+ attempts | Race exists but is rare |
| altinityinfra/clickhouse-server:1709-25.8.22.20001.altinityantalya (post-bump) | 1st attempt | Deterministic |
| clickhouse/clickhouse-server:25.8.22.28 (pure upstream) | ~3000+ attempts | Race exists but is rare |

Source diff evidence

In src/Interpreters/DDLWorker.cpp, PR ClickHouse#92339 moved createReplicaDirs(zookeeper, all_host_ids) out of initializeReplication() (called once on startup/reconnect) and into markReplicasActive(), which is called every iteration of runMainThread:

// runMainThread loop body — every iteration:
if (host_ids_updated.exchange(false))
    markReplicasActive(/*reinitialized=*/false);   // new call site
...
markReplicasActive(reinitialized);

// markReplicasActive now does:
auto all_host_ids = getAllHostIDsFromClusters();
// ... add interserver IO host IDs ...
createReplicaDirs(zookeeper, all_host_ids);        // every tick
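The effect of calling createReplicaDirs on every tick can be modeled with a background thread standing in for the DDLWorker (a sketch assuming a local Keeper; the worker is reduced to a single ensure_path call):

import itertools
import threading
import time
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError, NotEmptyError

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed Keeper address
zk.start()
stop = threading.Event()

def ddl_worker_tick():
    # Stand-in for markReplicasActive() -> createReplicaDirs() every tick.
    w = KazooClient(hosts="127.0.0.1:2181")
    w.start()
    while not stop.is_set():
        w.ensure_path("/clickhouse/task_queue/replicas/host1")
        time.sleep(0.001)
    w.stop()

threading.Thread(target=ddl_worker_tick, daemon=True).start()

for attempt in itertools.count(1):
    try:
        zk.delete("/clickhouse", recursive=True)
    except NotEmptyError:
        print(f"NotEmptyError on attempt {attempt}")  # hits almost at once
        break
    except NoNodeError:
        pass  # tree momentarily gone; the "worker" re-creates it
stop.set()
zk.stop()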

Hypotheses (not yet proven)

The following items are our current understanding / hypotheses, separated from the evidence above:

Why this fails deterministically in antalya-25.8 while it is rare upstream: the same upstream PR (ClickHouse#92339) is present in upstream master and other release branches, where this test mostly passes (~5 real NotEmptyError failures in ~67k runs over 90 days on master). Our manual reproduction likewise shows the race is rare on the pure upstream image (25.8.22.28, roughly 1 in 3000 attempts) but deterministic on the Altinity 25.8.22 image (first attempt). We do not yet have a confirmed explanation for this gap; candidate factors are still under investigation.

Regardless of the cause of the gap, the race itself is in upstream code: it reproduces with the pure upstream 25.8.22.28 image, just less frequently.


Suggested fix (not verified)

In src/Interpreters/DDLWorker.cpp, markReplicasActive(): only call createReplicaDirs when the worker is being reinitialized (or when host_ids_updated was just consumed), not on every main-loop iteration:

void DDLWorker::markReplicasActive(bool reinitialized)
{
    ...
    auto all_host_ids = getAllHostIDsFromClusters();
    // ... add interserver IO host IDs ...

    if (reinitialized)
        createReplicaDirs(zookeeper, all_host_ids);   // restore "once per reconnect" semantics
    ...
}

This restores the pre-ClickHouse#92339 invariant ("create host dirs once at startup/reconnect") without losing the interserver-IO fix that ClickHouse#92339 was actually trying to add (the interserver IO host IDs would still be added to all_host_ids and used during the active-marking).


Additional context

Related PRs

  • Antalya 25.8: Bump to 25.8.21 (#1600): the Altinity PR that introduced the regression.
  • ClickHouse/ClickHouse#92339 ("Check and mark the interserver IO address active in DDL worker"): the upstream change that #1600 backports.

No existing upstream issue

Searches in ClickHouse/ClickHouse for test_startup_without_zookeeper, test_replication_without_zookeeper, DDLWorker markReplicasActive, createReplicaDirs, and task_queue NotEmpty returned no matches.

Manual reproduction

Full step-by-step instructions (Docker + Keeper + Kazoo): https://gist.github.com/CarlosFelipeOR/2e794f24cbeb175a0068d61ea1ce537b
