Skip to content

fix(system_test): use min-HTTP health check in fuzz test instead of /ping on main HTTP#164

Open
mtopolnik wants to merge 6 commits into
mainfrom
mt_fix-fuzz-timing
Open

fix(system_test): use min-HTTP health check in fuzz test instead of /ping on main HTTP#164
mtopolnik wants to merge 6 commits into
mainfrom
mt_fix-fuzz-timing

Conversation

@mtopolnik

@mtopolnik mtopolnik commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Problem

TestQwpWsFuzz.test_all_mixed_with_bounce flakes on CI with:

producer ... failed: QWP/WebSocket close drain timed out before all published frames were acknowledged

The fixture log made it look like a QuestDB bounce restart took ~113s (vs the usual 6–40s in the same run), which consumed the producers' 120s close_drain budget.

Root cause

The server actually started in ~9s; the fixture's /ping readiness probe was then starved for ~100s.

/ping is served by the main HTTP server, which runs on the shared-network worker pool — the same pool that accepts and processes QWP ingest. On a bounce restart the reconnecting fuzz producers immediately replay their queued backlog and saturate that pool on the 4-core CI box, so the accept loop never runs and no new connection (including /ping) is accepted for ~80s.

The probe could only succeed once the producers gave up at their close_drain deadline and disconnected, freeing the workers — so the bounce's "server back up" was structurally gated on the producers failing. The server log confirms /ping answered ~1s after the producers' WebSocket disconnects.

Fix

  • Point the readiness probe at the min-HTTP server (http.min.enabled=true, dedicated http.min.worker.count=1). It runs on its own worker pool, so the probe can't be starved by ingest on the shared pool. HealthCheckProcessor answers any path with 200. Both the managed and docker fixtures now discover/publish a 4th port and probe the min-HTTP /status.
  • Cap each bounce restart at BOUNCE_RESTART_TIMEOUT_SEC = 90 (below the 120s close_drain budget) so a genuinely stuck restart fails fast as an infra error with correct attribution, instead of surfacing later as a misleading client close_drain timeout.

Test plan

  • python3 -m py_compile on the three changed files.
  • python3 -m unittest test_qwp_ws_fuzz_unit test_fixture_unit — 19/19 pass.
  • Live smoke test against a local QuestDB build (9.4.3-SNAPSHOT, JDK 25): min-HTTP comes up on a dedicated pool (custom thread pool [name=minhttp, requester=min-http, workers=1, priority=8]), /status returns 200, any path on the min port returns 200 (the useAsDefault catch-all), and the main /ping still returns 204.

Tradeoffs / scope

This isolates the readiness probe from ingest starvation. It does not change that the producers still reconnect QWP on the shared pool, nor that a bounce's graceful stop drains WAL — on a severely CPU-starved box those can still be slow. The modest per-bounce backlog drains quickly once the bounce sequence completes, and the 90s restart cap remains as a backstop for a genuinely stuck boot.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added configurable fixture startup timeouts and an option to gate readiness using either the main /ping or the additional min endpoint.
  • Bug Fixes
    • Centralized HTTP readiness/liveness probing to fail fast if the process dies, reducing flaky startup/restart; updated Docker readiness to use the shared main/min gating logic.
    • Improved robustness of Docker stop timeouts by coercing stop wait values to integers.
  • Tests
    • Tightened fuzz bounce restart budgeting with bounded restart timeouts, updated wind-down/join ceilings, and aligned client reconnect/close budgets.
    • Added unit tests covering HTTP probing and start-time probe selection.

Server start timeout shouldn't be larger than close-drain timeout, 
because the latter includes the former and will cause the test
to falsely blame close-drain, when server start was the slow one.
/ping is served from the shared thread pool, and gets starved under
severe ingestion load, making the server appear not-started to the
fuzz test harness
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR switches fixture startup readiness probing to a dedicated min-HTTP /status endpoint, adds configurable startup and restart timeouts, and threads the new restart timeout through fuzz bounce recovery. It also updates fuzz test timing constants and sender flush budgeting.

Changes

Min-HTTP readiness and bounce restart timeout

Layer / File(s) Summary
Min-HTTP port, config, and readiness
system_test/fixture.py
QuestDbFixture and QuestDbDockerFixture track a dedicated min-HTTP port, allocate and publish it, enable http.min in local and container configuration, and switch readiness checks to the selected main or min HTTP endpoint with a caller-provided timeout budget.
Readiness probe tests
system_test/test_fixture_unit.py
Unit tests cover the shared HTTP probe helper, the distinct main and min readiness checks, and the start-path selection between main and min probing.
BounceThread restart timeout
system_test/qwp_ws_fuzz.py
BounceThread accepts and stores a restart timeout, restarts the fixture with that timeout and min-HTTP probing, and records lifecycle failures with updated failure messaging during defensive recovery.
Fuzz timeout wiring
system_test/test.py
TestQwpWsFuzz defines a bounce restart timeout constant, passes it into BounceThread, and derives the sender reconnect and flush budgets from the bounce down-window.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • questdb/c-questdb-client#155: Modifies BounceThread lifecycle handling in the same fuzz path, with adjacent stop/start timeout behavior changes.
  • questdb/c-questdb-client#160: Updates QuestDbDockerFixture startup and readiness flow in fixture.py, overlapping with the Docker health probing changes here.

Suggested labels

bug

Suggested reviewers

  • jerrinot

Poem

🐇 I hop by /status, bright and new,
With min-HTTP ports in my view.
Bounce waits now wear a careful cap,
And flushes close without a trap.
The fixtures wake, the fuzz tests sing,
With timeout bounds on every spring.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main behavioral change: switching the fuzz test readiness check from main /ping to the min-HTTP health check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mt_fix-fuzz-timing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mtopolnik mtopolnik changed the title fix(system_test): isolate bounce readiness probe from QWP ingest load fix(system_test): use min-HTTP health check in fuzz test instead of /ping on main HTTP Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
system_test/qwp_ws_fuzz.py (1)

1522-1524: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Keep the defensive restart on the same bounded startup budget.

Lines 1522-1524 fall back to fixture.start()'s 300s default after a lifecycle failure. If the same slow-start condition is still present, this daemon thread can outlive the two 60s joins in TestQwpWsFuzz._run_fuzz() and keep mutating the fixture after the test has already failed. Reuse self._restart_timeout_s here so the fast-fail behavior still holds on the recovery path.

Suggested fix
                 try:
-                    self._fixture.start()
+                    self._fixture.start(
+                        start_timeout_sec=self._restart_timeout_s)
                 except Exception:
                     pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@system_test/qwp_ws_fuzz.py` around lines 1522 - 1524, The defensive restart
path in TestQwpWsFuzz._run_fuzz is falling back to fixture.start()’s default
timeout instead of staying within the same bounded startup budget. Update the
exception recovery around self._fixture.start() to pass self._restart_timeout_s
so the restart uses the same fast-fail timeout as the initial startup and does
not keep running past the test’s join budget.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@system_test/qwp_ws_fuzz.py`:
- Around line 1522-1524: The defensive restart path in TestQwpWsFuzz._run_fuzz
is falling back to fixture.start()’s default timeout instead of staying within
the same bounded startup budget. Update the exception recovery around
self._fixture.start() to pass self._restart_timeout_s so the restart uses the
same fast-fail timeout as the initial startup and does not keep running past the
test’s join budget.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f490a15-fd4f-4d9a-b69a-5791dba99c96

📥 Commits

Reviewing files that changed from the base of the PR and between db63120 and aa70079.

📒 Files selected for processing (3)
  • system_test/fixture.py
  • system_test/qwp_ws_fuzz.py
  • system_test/test.py

mtopolnik and others added 2 commits June 24, 2026 10:58
The managed and docker fixtures each carried a near-identical copy of
the min-HTTP readiness probe: same request to /status on the dedicated
min-HTTP port, same 200-range status check, and the same rationale
comment. Keeping the two in sync by hand risks fixture-type-dependent
flakiness if one copy is changed and the other is missed.

Hoist the probe into a single QuestDbFixtureBase._check_http_up(). The
only fixture-specific part, the liveness pre-check, becomes an
_assert_server_alive() hook that each subclass overrides (the managed
fixture inspects the process, the docker fixture inspects the
container). The base method uses self.host instead of a hardcoded
127.0.0.1, which QuestDbFixture already sets to that value.

The two handlers were not identical: the managed copy caught
socket.timeout and URLError, while the docker copy also caught OSError
to absorb the connection reset that happens when docker publishes the
host port before QuestDB binds. Since URLError and socket.timeout are
both OSError subclasses, a single `except OSError` is the exact union,
so the consolidated probe keeps both behaviors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The readiness probe was switched to the min-HTTP /status endpoint to
keep a bounce restart's probe from being starved by reconnecting
producers on the shared-network worker pool. But start() refines the
build version via a query to the main HTTP server right after the
probe passes, and the whole suite runs its SQL there. On a cold boot
the dedicated min-HTTP pool answers seconds before the main HTTP
server on the shared pool begins accepting, so query_version() hit a
not-yet-listening port and aborted startup with ConnectionRefusedError
on slow CI boxes.

Gate readiness on the server each start phase actually uses. The
initial start probes main HTTP /ping, restoring the guarantee that the
main server is accepting before query_version() and the test SQL run;
no ingest exists yet, so /ping cannot be starved. A bounce restart
passes probe_min_http=True to keep gating on the starvation-proof
min-HTTP pool, and that path issues no SQL after the probe.

Also bound the defensive recovery restart to restart_timeout_s and the
min-HTTP probe, so a still-reconnecting producer storm cannot starve
it and the daemon thread cannot outlive the test's join budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
system_test/qwp_ws_fuzz.py (1)

1530-1539: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Log defensive recovery failures before swallowing them.

The primary lifecycle failure is recorded, but if the best-effort stop() or start() also fails, the thread silently leaves recovery state ambiguous. Keep the non-fatal behavior, but log those exceptions for CI diagnosis.

Proposed fix
                 try:
                     self._fixture.stop(wait_timeout_sec=self._stop_timeout_s)
-                except Exception:
-                    pass
+                except Exception as recovery_stop_error:
+                    self._log(
+                        'fuzz bounce: defensive stop failed: '
+                        f'{type(recovery_stop_error).__name__}: '
+                        f'{recovery_stop_error}')
                 try:
                     self._fixture.start(
                         start_timeout_sec=self._restart_timeout_s,
                         probe_min_http=True)
-                except Exception:
-                    pass
+                except Exception as recovery_start_error:
+                    self._log(
+                        'fuzz bounce: defensive start failed: '
+                        f'{type(recovery_start_error).__name__}: '
+                        f'{recovery_start_error}')
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@system_test/qwp_ws_fuzz.py` around lines 1530 - 1539, The best-effort
recovery in the fixture restart path is swallowing exceptions from stop() and
start() without any trace, which makes recovery failures hard to diagnose. In
the recovery block around self._fixture.stop() and self._fixture.start(), keep
the existing non-fatal try/except behavior but log the caught exception before
passing, using the surrounding test/logger context so CI can see why recovery
did not complete. Reference the stop/start retry logic in qwp_ws_fuzz.py and
preserve the current fallback behavior.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@system_test/fixture.py`:
- Around line 700-704: The readiness timeout path in the managed JVM startup
flow currently re-raises without cleaning up the spawned process or log handle,
which can leak resources on failed starts. Update the startup logic around the
retry call in the managed fixture (the same flow that uses self.stop() for
Docker) so that when startup readiness times out, it explicitly stops the
JVM/process and closes any open log before re-raising. Use the existing
startup/cleanup symbols in this fixture to keep the failure path consistent with
other managed resources.

---

Nitpick comments:
In `@system_test/qwp_ws_fuzz.py`:
- Around line 1530-1539: The best-effort recovery in the fixture restart path is
swallowing exceptions from stop() and start() without any trace, which makes
recovery failures hard to diagnose. In the recovery block around
self._fixture.stop() and self._fixture.start(), keep the existing non-fatal
try/except behavior but log the caught exception before passing, using the
surrounding test/logger context so CI can see why recovery did not complete.
Reference the stop/start retry logic in qwp_ws_fuzz.py and preserve the current
fallback behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f91ce59-65ec-4379-b2d2-df1b4b0fe244

📥 Commits

Reviewing files that changed from the base of the PR and between aa70079 and 833196a.

📒 Files selected for processing (2)
  • system_test/fixture.py
  • system_test/qwp_ws_fuzz.py

Comment thread system_test/fixture.py
Comment on lines 700 to +704
retry(
check_http_up,
timeout_sec=300,
msg='Timed out waiting for HTTP service to come up.')
check_up,
timeout_sec=start_timeout_sec,
msg=f'Timed out waiting for HTTP service to come up '
f'within {start_timeout_sec}s.')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Clean up the managed JVM when readiness times out.

The new bounded retry(..., timeout_sec=start_timeout_sec) makes start failures an expected infra signal, but the managed fixture still re-raises without stopping the spawned process or closing the log. Docker already calls self.stop() on the same path; mirror that here so failed initial starts do not leak a JVM/ports.

Proposed fix
         except:
             sys.stderr.write(f'QuestDB log at `{self._log_path}`:\n')
             self.print_log()
+            try:
+                self.stop()
+            except Exception as cleanup_error:
+                sys.stderr.write(
+                    'Failed to clean up QuestDB after start failure: '
+                    f'{type(cleanup_error).__name__}: {cleanup_error}\n')
             raise
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@system_test/fixture.py` around lines 700 - 704, The readiness timeout path in
the managed JVM startup flow currently re-raises without cleaning up the spawned
process or log handle, which can leak resources on failed starts. Update the
startup logic around the retry call in the managed fixture (the same flow that
uses self.stop() for Docker) so that when startup readiness times out, it
explicitly stops the JVM/process and closes any open log before re-raising. Use
the existing startup/cleanup symbols in this fixture to keep the failure path
consistent with other managed resources.

The fuzz bounce test's timing budgets assumed only the restart leg of a
bounce could starve the client, but a bounce makes the server
unavailable for a whole stop() + start() cycle. With the producer
close_drain/reconnect budget and the restart cap both around 120s, a
slow graceful shutdown could burn a producer's entire drain window and
surface as a misleading client-side close_drain timeout, with no bounce
failure recorded.

Size the budgets to the real worst case instead:

- Derive the producer reconnect and close_drain budget from
  bounce_stop_timeout_s + BOUNCE_RESTART_TIMEOUT_SEC (plus headroom) and
  apply it to both knobs, so a healthy-but-slow bounce can never strand
  a producer mid-drain. Keeping reconnect and drain equal stops a
  reconnect from giving up mid-window while the drain still has budget.

- Size the bounce-thread wind-down join to a full cycle and record a
  failure if the thread is still alive afterward, so a stuck lifecycle
  is attributed here instead of racing the verification queries and
  teardown on shared fixture state.

Lowering bounce_stop_timeout_s was rejected: it risks force-killing a
server still draining its WAL backlog. Raising the client budgets only
slows a genuine failure (now up to ~240s), never a passing run, since
the ceilings are reached only by a genuinely stuck server.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mtopolnik

Copy link
Copy Markdown
Contributor Author

Reviewing at level 3. Note up front: this PR is entirely Python test infrastructure (system_test/fixture.py, qwp_ws_fuzz.py, test.py). None of the level-3 machinery aimed at FFI/ABI, ILP wire format, unsafe, panics, or the C/C++ headers applies — there is no Rust/C/C++ in the diff. I ran the dimensions that do apply: correctness/timing math, process/file-handle lifecycle, cross-test blast radius, test coverage, and a fresh-context adversarial pass.

The core fix is sound. The root cause (main-HTTP /ping starved because it shares the shared-network worker pool with QWP ingest, so "server back up" was structurally gated on producers failing) is correctly addressed by moving the bounce-restart probe to a dedicated-pool min-HTTP /status. The budget arithmetic is internally consistent: client_budget = wind_down = bounce_stop_timeout_s(120) + BOUNCE_RESTART_TIMEOUT_SEC(90) + 30 = 240s, which clears the worst legitimate down-window (stop <120s + restart <90s ≈ <210s) by the +30s headroom, so a healthy-but-slow bounce can't strand a producer mid-drain and an over-cap leg is attributed in the bounce thread. Managed/docker config did not drift. int() correctly keeps 240000.0 out of the conf string.

Critical

None.

Moderate

M1 — Zero unit-test coverage for the new readiness logic (in-diff). fixture.py:381-418 (_check_main_http_up, _check_min_http_up, _probe_http, the three _assert_server_alive overrides) and the start(start_timeout_sec, probe_min_http) selection wiring have no direct test. The PR's "19/19 pass" is true but every one of those 19 is a pre-existing test for stop() escalation / print_log() / is_transient_network_error — none touches the new code (verified by reading test_fixture_unit.py and grepping the symbols). The single most valuable gap is the status-predicate asymmetry — main requires exactly status == 204, min accepts 200 <= status < 300 (fixture.py:388,398). A future edit that flips these ships green and silently breaks readiness gating, which is the whole point of the PR. The existing harness makes this cheap: test_fixture_unit.py:61-64,93 already builds a QuestDbFixture from a temp dir and assigns a stand-in _proc. A server-free test can monkeypatch urllib.request.urlopen to assert the True/OSError→False branches, assign a finished _proc to assert _probe_http fails fast via _assert_server_alive before any HTTP call, and call the two _check_* methods with _probe_http captured to pin (port, path) and status_ok(204/200/500). Non-blocking (the fix is validated by the integration run + the author's live smoke test), but worth adding given how cheap it is.

Minor

m1 — QuestDbDockerFixture.stop() hands a float to docker stop -t (out-of-diff, pre-existing, currently unreachable). fixture.py:1072: str(wait_timeout_sec) with a float 120.0 produces docker stop -t 120.0, which the docker CLI rejects (-t is ParseInt). Because _remove_container runs with check=False then unconditionally docker rm -f, the failure is swallowed and the container is hard-killed — silently skipping the graceful WAL-drain the whole bounce design relies on. This PR does not trigger it (the fuzz/BounceThread path is the only float-passing caller and it skips the docker fixture; all docker-mode stop() calls use the int-30 default). It's a genuine latent foot-gun in a file this PR edits, and the trivial fix is str(int(wait_timeout_sec)). Optional to fix here since it's out of scope.

m2 — _probe_http except OSError misses http.client.HTTPException (in-diff, not a regression). fixture.py:410. BadStatusLine/IncompleteRead are HTTPException, not OSError, so a malformed/truncated early HTTP response would escape the handler and abort the whole retry() loop with a misleading traceback instead of just retrying. Low probability here (the common docker "connect-then-reset" case is RemoteDisconnected, which is an OSError and is caught; the managed fixture's port opens with the app). The old code had the same gap, so it's not a regression — but since the helper was just rewritten, adding http.client.HTTPException to the except clause is a cheap robustness win.

m3 — Defensive recovery swallows stop()/start() exceptions with no trace (in-diff). qwp_ws_fuzz.py:1536-1545: both except Exception: pass blocks discard the recovery failure. The primary failure is already recorded, but logging the swallowed recovery errors (as CodeRabbit suggested) would help CI diagnosis. Trivial.

m4 — Managed start() doesn't self-clean on failure, unlike docker (in-diff, benign). fixture.py:705-708 only prints the log and re-raises, leaving self._proc live + self._log open; the docker except (:998-1002) calls self.stop(). Verified benign: on the bounce path BounceThread's defensive stop() reaps the orphan, and on the initial-start path the caller's finally: _stop_and_maybe_wipe (test.py:4248) reaps it. Optional symmetry polish — mirror docker with a guarded self.stop() — but not a leak.

m5 — Stuck-bounce-thread wind-down: double-count + teardown race (in-diff, acknowledged, cosmetic). test.py:2563-2571. When the daemon thread is still alive after the join, record_failure correctly bumps failure_counter, and the assertion at test.py:2578 short-circuits before the verification queries (verified ordering) — so the design goal is met. Residual: the still-alive daemon can race tearDown/stop() on shared _proc/_log (e.g. AttributeError if one thread nulls _proc between another's if self._proc: and .terminate()), and the failure may be counted twice. Both are cosmetic noise on an already-failed run, and the code comments acknowledge it. Optional: self.fail(...) at the detection point to make the abort explicit.

Nits

  • n1_check_min_http_up docstring says HealthCheckProcessor "answers any path with 200," yet the code probes a specific /status and accepts a 2xx range. Mildly self-contradictory; align the comment. (Correctness depends on the min server returning 2xx for /status on the repo build — the author's smoke test confirms this; if it ever 404'd, every bounce would fail at the 90s cap.)
  • n2 — The budget expression bounce_stop_timeout_s + BOUNCE_RESTART_TIMEOUT_SEC + 30 is duplicated verbatim in three places (test.py:2558-2560, test.py:2614-2616, and relied on by qwp_ws_fuzz.py). The invariant "client_budget ≥ wind_down ≥ down-window" is held only by copy-paste. Extracting one helper/constant would protect it from silent drift.

Downgraded (false positives / overstated)

  • "Managed start() leaks the JVM on timeout" (CodeRabbit + agent) — not a leak; the orphan is always reaped by the caller's finally (initial start) or BounceThread's defensive stop() (bounce). See m4.

Summary

Verdict: approve with minor changes. The fix correctly targets a real, well-diagnosed flake and the timing/attribution redesign is internally consistent and carefully reasoned (the inline comments are excellent). No correctness regressions. The one substantive ask is M1 (add cheap unit coverage for the probe-selection and status-predicate logic, since a silent predicate flip is the most plausible future regression); everything else is optional polish, with m1 (docker-stop float) worth a one-character fix while you're in the file even though it's out of scope.

Two follow-ups from reviewing the min-HTTP readiness work.

Add server-free unit tests for the readiness probe, which had no
direct coverage. They pin the status-predicate asymmetry that the
whole change hinges on (main /ping must answer 204 exactly, the min
health endpoint counts any 2xx), the not-up-vs-fail-fast behaviour
(a refused connection returns False so the retry loop continues, a
dead process raises before any HTTP request), and start()'s
selection of the main vs min check and its timeout. A silent flip of
the 204-vs-2xx predicates is the most plausible future regression and
now fails the suite. The tests drive QuestDbFixture with stand-in
processes and a mocked urlopen, so no JVM or server is launched.

Fix QuestDbDockerFixture.stop() passing a float to `docker stop -t`.
The bounce path calls stop() with bounce_stop_timeout_s (a float,
120.0), and docker's integer -t flag rejects "120.0". Because the
call runs with check=False, the failure was swallowed and the
container force-removed, skipping the graceful grace period the
bounce design relies on to drain WAL. Coerce to int. This is not
reachable from the current fuzz test (it skips the docker fixture),
but it is a latent foot-gun in the touched path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
system_test/fixture.py (1)

406-417: 🩺 Stability & Availability | 🟡 Minor

Also catch http.client.HTTPException in _probe_http. urlopen() can surface BadStatusLine or IncompleteRead from a transient malformed response, and those bypass the current OSError handler. Add import http.client at module scope.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@system_test/fixture.py` around lines 406 - 417, The `_probe_http` retry path
only catches `OSError`, so transient malformed HTTP responses from
`urllib.request.urlopen` can still escape. Update the exception handling in
`_probe_http` to also catch `http.client.HTTPException` alongside the existing
`OSError` cases, and add the needed `http.client` import at module scope so
`BadStatusLine` and `IncompleteRead` are handled consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@system_test/fixture.py`:
- Around line 406-417: The `_probe_http` retry path only catches `OSError`, so
transient malformed HTTP responses from `urllib.request.urlopen` can still
escape. Update the exception handling in `_probe_http` to also catch
`http.client.HTTPException` alongside the existing `OSError` cases, and add the
needed `http.client` import at module scope so `BadStatusLine` and
`IncompleteRead` are handled consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88357c43-ec8f-44d6-8c74-6649fb13ff2d

📥 Commits

Reviewing files that changed from the base of the PR and between 833196a and dd8f30a.

📒 Files selected for processing (4)
  • system_test/fixture.py
  • system_test/qwp_ws_fuzz.py
  • system_test/test.py
  • system_test/test_fixture_unit.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • system_test/qwp_ws_fuzz.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant