fix(system_test): use min-HTTP health check in fuzz test instead of /ping on main HTTP#164
fix(system_test): use min-HTTP health check in fuzz test instead of /ping on main HTTP#164mtopolnik wants to merge 6 commits into
Conversation
Server start timeout shouldn't be larger than close-drain timeout, because the latter includes the former and will cause the test to falsely blame close-drain, when server start was the slow one.
/ping is served from the shared thread pool, and gets starved under severe ingestion load, making the server appear not-started to the fuzz test harness
📝 WalkthroughWalkthroughThe PR switches fixture startup readiness probing to a dedicated min-HTTP ChangesMin-HTTP readiness and bounce restart timeout
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
system_test/qwp_ws_fuzz.py (1)
1522-1524: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winKeep the defensive restart on the same bounded startup budget.
Lines 1522-1524 fall back to
fixture.start()'s 300s default after a lifecycle failure. If the same slow-start condition is still present, this daemon thread can outlive the two 60s joins inTestQwpWsFuzz._run_fuzz()and keep mutating the fixture after the test has already failed. Reuseself._restart_timeout_shere so the fast-fail behavior still holds on the recovery path.Suggested fix
try: - self._fixture.start() + self._fixture.start( + start_timeout_sec=self._restart_timeout_s) except Exception: pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@system_test/qwp_ws_fuzz.py` around lines 1522 - 1524, The defensive restart path in TestQwpWsFuzz._run_fuzz is falling back to fixture.start()’s default timeout instead of staying within the same bounded startup budget. Update the exception recovery around self._fixture.start() to pass self._restart_timeout_s so the restart uses the same fast-fail timeout as the initial startup and does not keep running past the test’s join budget.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@system_test/qwp_ws_fuzz.py`:
- Around line 1522-1524: The defensive restart path in TestQwpWsFuzz._run_fuzz
is falling back to fixture.start()’s default timeout instead of staying within
the same bounded startup budget. Update the exception recovery around
self._fixture.start() to pass self._restart_timeout_s so the restart uses the
same fast-fail timeout as the initial startup and does not keep running past the
test’s join budget.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5f490a15-fd4f-4d9a-b69a-5791dba99c96
📒 Files selected for processing (3)
system_test/fixture.pysystem_test/qwp_ws_fuzz.pysystem_test/test.py
The managed and docker fixtures each carried a near-identical copy of the min-HTTP readiness probe: same request to /status on the dedicated min-HTTP port, same 200-range status check, and the same rationale comment. Keeping the two in sync by hand risks fixture-type-dependent flakiness if one copy is changed and the other is missed. Hoist the probe into a single QuestDbFixtureBase._check_http_up(). The only fixture-specific part, the liveness pre-check, becomes an _assert_server_alive() hook that each subclass overrides (the managed fixture inspects the process, the docker fixture inspects the container). The base method uses self.host instead of a hardcoded 127.0.0.1, which QuestDbFixture already sets to that value. The two handlers were not identical: the managed copy caught socket.timeout and URLError, while the docker copy also caught OSError to absorb the connection reset that happens when docker publishes the host port before QuestDB binds. Since URLError and socket.timeout are both OSError subclasses, a single `except OSError` is the exact union, so the consolidated probe keeps both behaviors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The readiness probe was switched to the min-HTTP /status endpoint to keep a bounce restart's probe from being starved by reconnecting producers on the shared-network worker pool. But start() refines the build version via a query to the main HTTP server right after the probe passes, and the whole suite runs its SQL there. On a cold boot the dedicated min-HTTP pool answers seconds before the main HTTP server on the shared pool begins accepting, so query_version() hit a not-yet-listening port and aborted startup with ConnectionRefusedError on slow CI boxes. Gate readiness on the server each start phase actually uses. The initial start probes main HTTP /ping, restoring the guarantee that the main server is accepting before query_version() and the test SQL run; no ingest exists yet, so /ping cannot be starved. A bounce restart passes probe_min_http=True to keep gating on the starvation-proof min-HTTP pool, and that path issues no SQL after the probe. Also bound the defensive recovery restart to restart_timeout_s and the min-HTTP probe, so a still-reconnecting producer storm cannot starve it and the daemon thread cannot outlive the test's join budget. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
system_test/qwp_ws_fuzz.py (1)
1530-1539: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winLog defensive recovery failures before swallowing them.
The primary lifecycle failure is recorded, but if the best-effort
stop()orstart()also fails, the thread silently leaves recovery state ambiguous. Keep the non-fatal behavior, but log those exceptions for CI diagnosis.Proposed fix
try: self._fixture.stop(wait_timeout_sec=self._stop_timeout_s) - except Exception: - pass + except Exception as recovery_stop_error: + self._log( + 'fuzz bounce: defensive stop failed: ' + f'{type(recovery_stop_error).__name__}: ' + f'{recovery_stop_error}') try: self._fixture.start( start_timeout_sec=self._restart_timeout_s, probe_min_http=True) - except Exception: - pass + except Exception as recovery_start_error: + self._log( + 'fuzz bounce: defensive start failed: ' + f'{type(recovery_start_error).__name__}: ' + f'{recovery_start_error}')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@system_test/qwp_ws_fuzz.py` around lines 1530 - 1539, The best-effort recovery in the fixture restart path is swallowing exceptions from stop() and start() without any trace, which makes recovery failures hard to diagnose. In the recovery block around self._fixture.stop() and self._fixture.start(), keep the existing non-fatal try/except behavior but log the caught exception before passing, using the surrounding test/logger context so CI can see why recovery did not complete. Reference the stop/start retry logic in qwp_ws_fuzz.py and preserve the current fallback behavior.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@system_test/fixture.py`:
- Around line 700-704: The readiness timeout path in the managed JVM startup
flow currently re-raises without cleaning up the spawned process or log handle,
which can leak resources on failed starts. Update the startup logic around the
retry call in the managed fixture (the same flow that uses self.stop() for
Docker) so that when startup readiness times out, it explicitly stops the
JVM/process and closes any open log before re-raising. Use the existing
startup/cleanup symbols in this fixture to keep the failure path consistent with
other managed resources.
---
Nitpick comments:
In `@system_test/qwp_ws_fuzz.py`:
- Around line 1530-1539: The best-effort recovery in the fixture restart path is
swallowing exceptions from stop() and start() without any trace, which makes
recovery failures hard to diagnose. In the recovery block around
self._fixture.stop() and self._fixture.start(), keep the existing non-fatal
try/except behavior but log the caught exception before passing, using the
surrounding test/logger context so CI can see why recovery did not complete.
Reference the stop/start retry logic in qwp_ws_fuzz.py and preserve the current
fallback behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f91ce59-65ec-4379-b2d2-df1b4b0fe244
📒 Files selected for processing (2)
system_test/fixture.pysystem_test/qwp_ws_fuzz.py
| retry( | ||
| check_http_up, | ||
| timeout_sec=300, | ||
| msg='Timed out waiting for HTTP service to come up.') | ||
| check_up, | ||
| timeout_sec=start_timeout_sec, | ||
| msg=f'Timed out waiting for HTTP service to come up ' | ||
| f'within {start_timeout_sec}s.') |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Clean up the managed JVM when readiness times out.
The new bounded retry(..., timeout_sec=start_timeout_sec) makes start failures an expected infra signal, but the managed fixture still re-raises without stopping the spawned process or closing the log. Docker already calls self.stop() on the same path; mirror that here so failed initial starts do not leak a JVM/ports.
Proposed fix
except:
sys.stderr.write(f'QuestDB log at `{self._log_path}`:\n')
self.print_log()
+ try:
+ self.stop()
+ except Exception as cleanup_error:
+ sys.stderr.write(
+ 'Failed to clean up QuestDB after start failure: '
+ f'{type(cleanup_error).__name__}: {cleanup_error}\n')
raise🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@system_test/fixture.py` around lines 700 - 704, The readiness timeout path in
the managed JVM startup flow currently re-raises without cleaning up the spawned
process or log handle, which can leak resources on failed starts. Update the
startup logic around the retry call in the managed fixture (the same flow that
uses self.stop() for Docker) so that when startup readiness times out, it
explicitly stops the JVM/process and closes any open log before re-raising. Use
the existing startup/cleanup symbols in this fixture to keep the failure path
consistent with other managed resources.
The fuzz bounce test's timing budgets assumed only the restart leg of a bounce could starve the client, but a bounce makes the server unavailable for a whole stop() + start() cycle. With the producer close_drain/reconnect budget and the restart cap both around 120s, a slow graceful shutdown could burn a producer's entire drain window and surface as a misleading client-side close_drain timeout, with no bounce failure recorded. Size the budgets to the real worst case instead: - Derive the producer reconnect and close_drain budget from bounce_stop_timeout_s + BOUNCE_RESTART_TIMEOUT_SEC (plus headroom) and apply it to both knobs, so a healthy-but-slow bounce can never strand a producer mid-drain. Keeping reconnect and drain equal stops a reconnect from giving up mid-window while the drain still has budget. - Size the bounce-thread wind-down join to a full cycle and record a failure if the thread is still alive afterward, so a stuck lifecycle is attributed here instead of racing the verification queries and teardown on shared fixture state. Lowering bounce_stop_timeout_s was rejected: it risks force-killing a server still draining its WAL backlog. Raising the client budgets only slows a genuine failure (now up to ~240s), never a passing run, since the ceilings are reached only by a genuinely stuck server. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Reviewing at level 3. Note up front: this PR is entirely Python test infrastructure ( The core fix is sound. The root cause (main-HTTP CriticalNone. ModerateM1 — Zero unit-test coverage for the new readiness logic (in-diff). Minorm1 — m2 — m3 — Defensive recovery swallows stop()/start() exceptions with no trace (in-diff). m4 — Managed m5 — Stuck-bounce-thread wind-down: double-count + teardown race (in-diff, acknowledged, cosmetic). Nits
Downgraded (false positives / overstated)
SummaryVerdict: approve with minor changes. The fix correctly targets a real, well-diagnosed flake and the timing/attribution redesign is internally consistent and carefully reasoned (the inline comments are excellent). No correctness regressions. The one substantive ask is M1 (add cheap unit coverage for the probe-selection and status-predicate logic, since a silent predicate flip is the most plausible future regression); everything else is optional polish, with m1 (docker-stop float) worth a one-character fix while you're in the file even though it's out of scope. |
Two follow-ups from reviewing the min-HTTP readiness work. Add server-free unit tests for the readiness probe, which had no direct coverage. They pin the status-predicate asymmetry that the whole change hinges on (main /ping must answer 204 exactly, the min health endpoint counts any 2xx), the not-up-vs-fail-fast behaviour (a refused connection returns False so the retry loop continues, a dead process raises before any HTTP request), and start()'s selection of the main vs min check and its timeout. A silent flip of the 204-vs-2xx predicates is the most plausible future regression and now fails the suite. The tests drive QuestDbFixture with stand-in processes and a mocked urlopen, so no JVM or server is launched. Fix QuestDbDockerFixture.stop() passing a float to `docker stop -t`. The bounce path calls stop() with bounce_stop_timeout_s (a float, 120.0), and docker's integer -t flag rejects "120.0". Because the call runs with check=False, the failure was swallowed and the container force-removed, skipping the graceful grace period the bounce design relies on to drain WAL. Coerce to int. This is not reachable from the current fuzz test (it skips the docker fixture), but it is a latent foot-gun in the touched path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
system_test/fixture.py (1)
406-417: 🩺 Stability & Availability | 🟡 MinorAlso catch
http.client.HTTPExceptionin_probe_http.urlopen()can surfaceBadStatusLineorIncompleteReadfrom a transient malformed response, and those bypass the currentOSErrorhandler. Addimport http.clientat module scope.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@system_test/fixture.py` around lines 406 - 417, The `_probe_http` retry path only catches `OSError`, so transient malformed HTTP responses from `urllib.request.urlopen` can still escape. Update the exception handling in `_probe_http` to also catch `http.client.HTTPException` alongside the existing `OSError` cases, and add the needed `http.client` import at module scope so `BadStatusLine` and `IncompleteRead` are handled consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@system_test/fixture.py`:
- Around line 406-417: The `_probe_http` retry path only catches `OSError`, so
transient malformed HTTP responses from `urllib.request.urlopen` can still
escape. Update the exception handling in `_probe_http` to also catch
`http.client.HTTPException` alongside the existing `OSError` cases, and add the
needed `http.client` import at module scope so `BadStatusLine` and
`IncompleteRead` are handled consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 88357c43-ec8f-44d6-8c74-6649fb13ff2d
📒 Files selected for processing (4)
system_test/fixture.pysystem_test/qwp_ws_fuzz.pysystem_test/test.pysystem_test/test_fixture_unit.py
🚧 Files skipped from review as they are similar to previous changes (1)
- system_test/qwp_ws_fuzz.py
Problem
TestQwpWsFuzz.test_all_mixed_with_bounceflakes on CI with:The fixture log made it look like a QuestDB bounce restart took ~113s (vs the usual 6–40s in the same run), which consumed the producers' 120s
close_drainbudget.Root cause
The server actually started in ~9s; the fixture's
/pingreadiness probe was then starved for ~100s./pingis served by the main HTTP server, which runs on theshared-networkworker pool — the same pool that accepts and processes QWP ingest. On a bounce restart the reconnecting fuzz producers immediately replay their queued backlog and saturate that pool on the 4-core CI box, so the accept loop never runs and no new connection (including/ping) is accepted for ~80s.The probe could only succeed once the producers gave up at their
close_draindeadline and disconnected, freeing the workers — so the bounce's "server back up" was structurally gated on the producers failing. The server log confirms/pinganswered ~1s after the producers' WebSocket disconnects.Fix
http.min.enabled=true, dedicatedhttp.min.worker.count=1). It runs on its own worker pool, so the probe can't be starved by ingest on the shared pool.HealthCheckProcessoranswers any path with200. Both the managed and docker fixtures now discover/publish a 4th port and probe the min-HTTP/status.BOUNCE_RESTART_TIMEOUT_SEC = 90(below the 120sclose_drainbudget) so a genuinely stuck restart fails fast as an infra error with correct attribution, instead of surfacing later as a misleading clientclose_draintimeout.Test plan
python3 -m py_compileon the three changed files.python3 -m unittest test_qwp_ws_fuzz_unit test_fixture_unit— 19/19 pass.custom thread pool [name=minhttp, requester=min-http, workers=1, priority=8]),/statusreturns 200, any path on the min port returns 200 (theuseAsDefaultcatch-all), and the main/pingstill returns 204.Tradeoffs / scope
This isolates the readiness probe from ingest starvation. It does not change that the producers still reconnect QWP on the shared pool, nor that a bounce's graceful stop drains WAL — on a severely CPU-starved box those can still be slow. The modest per-bounce backlog drains quickly once the bounce sequence completes, and the 90s restart cap remains as a backstop for a genuinely stuck boot.
🤖 Generated with Claude Code
Summary by CodeRabbit
/pingor the additional min endpoint.