
perf: tighten buildkitd readiness polling and gate it in regression workflow #100

Draft

taha-au wants to merge 1 commit into main from perf/buildkitd-readiness-polling

Conversation


@taha-au taha-au commented Apr 25, 2026

Summary

Two related improvements + a regression-workflow extension:

1. Tighten the buildkitd readiness polling

After buildkitd is launched, the action polls buildctl debug workers until the OCI worker is up. The loop used a flat 1s sleep between polls, so we paid up to ~1s of pure polling discretization on every job, regardless of how fast buildkitd actually came up.

Replace that with exponential backoff: 100→200→400→800ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
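The schedule described above can be sketched as a pure function (a minimal illustration; the function name and shape are assumptions, not the action's actual code — the real loop interleaves `buildctl debug workers` calls between these delays):

```typescript
// Compute the poll-delay schedule: start at 100ms, double each poll,
// cap at 1000ms, and stop generating once the cumulative wait would
// exceed the hard timeout (30s in the action; parameterized here).
function backoffDelays(totalTimeoutMs: number): number[] {
  const delays: number[] = [];
  let delay = 100;
  let elapsed = 0;
  while (elapsed < totalTimeoutMs) {
    delays.push(delay);
    elapsed += delay;
    delay = Math.min(delay * 2, 1000);
  }
  return delays;
}

console.log(backoffDelays(3000)); // [100, 200, 400, 800, 1000, 1000]
```

Because the first four delays sum to 1.5s, a worker that comes up in, say, 150ms is now observed at the ~300ms mark instead of the 1s mark; past the fifth poll the schedule degenerates to the old flat 1s cadence, so slow startups lose nothing.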

2. Surface the slow path

Today, when readiness takes a long time, the action's stdout gives you no idea why — the explanation lives in /tmp/buildkitd.log on the runner.

  • Add a buildkitd workers ready in <N>ms after <K> poll(s) log line so the readiness window is directly measurable from the action's own output.
  • When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.
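The two bullets above amount to a small decision: always emit the telemetry line, and only tail the daemon log on the slow path. A sketch, with illustrative identifiers (the constants and function name are assumptions, not the action's real code):

```typescript
// Thresholds per the PR description: tail /tmp/buildkitd.log only when
// readiness exceeded 2s, and limit the tail to the last 50 lines.
const SLOW_PATH_MS = 2000;
const TAIL_LINES = 50;

function readinessTelemetry(readyMs: number, polls: number) {
  return {
    // The exact line the regression workflow later greps for.
    line: `buildkitd workers ready in ${readyMs}ms after ${polls} poll(s)`,
    // Non-null only on the slow path, so fast jobs aren't spammed.
    tailCommand:
      readyMs > SLOW_PATH_MS ? `tail -n ${TAIL_LINES} /tmp/buildkitd.log` : null,
  };
}

console.log(readinessTelemetry(169, 1).line);
// buildkitd workers ready in 169ms after 1 poll(s)
```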

3. Extend the step-duration regression workflow

Adds a third gated metric to step-duration-regression.yml alongside setup/post step durations: the buildkitd readiness window in milliseconds, parsed out of the new telemetry line. Threshold defaults to BUILDKITD_READY_MAX_MS=8000.
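The gate logic reduces to parse-then-compare. A hedged sketch (the function is hypothetical; the actual workflow step may implement this in shell, but the telemetry-line format and the pass/fail/skip behavior are as described in this PR):

```typescript
// Parse the readiness telemetry out of a fetched job log and gate it.
// Returns "skip" when the line is absent (older action versions), so the
// check stays informational rather than failing spuriously.
function gateReadiness(
  jobLog: string,
  maxMs: number = 8000 // BUILDKITD_READY_MAX_MS default
): "pass" | "fail" | "skip" {
  const m = jobLog.match(/buildkitd workers ready in (\d+)ms after \d+ poll\(s\)/);
  if (!m) return "skip";
  return Number(m[1]) <= maxMs ? "pass" : "fail";
}

console.log(gateReadiness("... buildkitd workers ready in 169ms after 1 poll(s) ..."));
// pass
```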

This catches both:

  • regressions in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
  • regressions in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.

Validation

Manual run on this branch via workflow_dispatch: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24924523539

Setup step ("Setup Docker Builder under test"): 3s   (threshold 5s)
Post  step ("Post Setup Docker Builder under test"):  2s   (threshold 5s)
buildkitd readiness:          169ms in 1 poll(s)   (threshold 8000ms)

All thresholds passed.

A real production job (useblacksmith/web, large sticky disk with ~382 GB of pre-existing cache) was previously seeing ~6.8s of buildkitd readiness time. We can't shrink the buildkitd warm-up itself from this action, but we can now (a) get to the worker as soon as it's actually ready (no ~1s overshoot) and (b) see directly from CI logs how much of the readiness window was buildkitd vs polling.

Test plan

  • Unit tests pass (pnpm test)
  • Typecheck passes (pnpm typecheck)
  • dist/ rebuilt and committed
  • Manual workflow_dispatch run on this branch shows the new telemetry line, the parsed value reaches the validator, and the threshold check fires
  • Regression workflow runs green against this PR (will appear in checks below)


Note

Medium Risk
Adds a new CI-gated performance metric by scraping job logs, which could introduce occasional flakiness if log output or GitHub log retrieval changes; otherwise changes are isolated to workflow/buildkitd readiness polling and diagnostics.

Overview
Extends the step-duration-regression.yml CI gate to also enforce a hard ceiling on buildkitd readiness time (default BUILDKITD_READY_MAX_MS=8000), by fetching the target job’s logs, parsing a buildkitd workers ready in <N>ms after <K> poll(s) telemetry line, and failing when the threshold is exceeded (skipping with a warning if the telemetry is missing).

Updates the action’s buildkitd worker readiness polling to use exponential backoff (100ms→1s cap), emits the new readiness telemetry, and tails the last 50 lines of /tmp/buildkitd.log when readiness is slow (>2s) to make regressions diagnosable from CI output.

Reviewed by Cursor Bugbot for commit 978b795.

The setup step's "wait for buildkitd workers" loop polled with a flat
1s backoff. In practice buildkitd's OCI worker comes up in well under
a second on most runs, so we paid up to ~1s of polling discretization
for nothing on every job. On slow runs, we had no signal at all about
what buildkitd was doing during the wait - it all lived in
/tmp/buildkitd.log on the runner.

Changes:

- Replace the fixed 1000ms backoff with exponential 100->200->400->800
  ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups
  observe the worker up to ~900ms sooner; cold/slow startups behave
  no worse than before.
- Emit a "buildkitd workers ready in <N>ms after <K> poll(s)" log line
  so the readiness window is directly measurable from the action's own
  output instead of having to subtract two adjacent log timestamps by
  hand.
- When readiness takes >2s, automatically tail the last 50 lines of
  /tmp/buildkitd.log so the slow path is self-explanatory in CI logs
  without spamming the fast path.

Also extends the step-duration-regression workflow to parse the new
telemetry line out of the job log and gate it on
BUILDKITD_READY_MAX_MS (default 8000ms). This catches both:

- a regression in the action's polling backoff (would push readiness
  ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s
  ceiling regardless of polling).

The check is informational on older action versions that don't emit
the telemetry line.
@taha-au taha-au marked this pull request as draft April 25, 2026 06:23