perf: tighten buildkitd readiness polling and gate it in regression workflow#100
Draft
The setup step's "wait for buildkitd workers" loop polled with a flat 1s backoff. In practice buildkitd's OCI worker comes up in well under a second on most runs, so we paid up to ~1s of polling discretization for nothing on every job. On slow runs, we had no signal at all about what buildkitd was doing during the wait - it all lived in /tmp/buildkitd.log on the runner.

Changes:
- Replace the fixed 1000ms backoff with exponential 100->200->400->800 ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
- Emit a "buildkitd workers ready in <N>ms after <K> poll(s)" log line so the readiness window is directly measurable from the action's own output instead of having to subtract two adjacent log timestamps by hand.
- When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.

Also extends the step-duration-regression workflow to parse the new telemetry line out of the job log and gate it on BUILDKITD_READY_MAX_MS (default 8000ms). This catches both:
- a regression in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.
Summary
Two related improvements + a regression-workflow extension:
1. Tighten the buildkitd readiness polling
After `buildkitd` is launched, the action polls `buildctl debug workers` until the OCI worker is up. The loop used a flat 1s sleep between polls, so we paid up to ~1s of pure polling discretization on every job, regardless of how fast buildkitd actually came up.

Replace that with exponential backoff: 100→200→400→800ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
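The backoff schedule described above can be sketched roughly as follows. This is a minimal illustration, not the action's actual code: `backoffDelayMs`, `waitForWorkers`, and the injected `probe`, `sleep`, and `now` parameters are hypothetical names, and `probe` merely stands in for running `buildctl debug workers` and checking the result.

```typescript
// Constants taken from the PR description; all names are hypothetical.
const INITIAL_DELAY_MS = 100;
const MAX_DELAY_MS = 1000;
const TIMEOUT_MS = 30000;

// Delay before the next poll, given how many polls have already failed:
// 0 -> 100, 1 -> 200, 2 -> 400, 3 -> 800, 4 and above -> 1000 (capped).
function backoffDelayMs(failedPolls: number): number {
  return Math.min(INITIAL_DELAY_MS * 2 ** failedPolls, MAX_DELAY_MS);
}

interface ReadyResult {
  elapsedMs: number; // time from the first poll until the worker was seen
  polls: number; // number of probe calls, including the successful one
}

// `probe` is an assumed stand-in for "run `buildctl debug workers` and
// report whether a worker is listed". It is injected, along with `sleep`
// and `now`, so the loop can be exercised without a real buildkitd.
async function waitForWorkers(
  probe: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => number = Date.now,
): Promise<ReadyResult> {
  const start = now();
  let polls = 0;
  while (now() - start < TIMEOUT_MS) {
    polls += 1;
    if (await probe()) {
      return { elapsedMs: now() - start, polls };
    }
    // polls - 1 is the number of failed polls before this one.
    await sleep(backoffDelayMs(polls - 1));
  }
  throw new Error(`buildkitd workers not ready after ${TIMEOUT_MS}ms`);
}
```

Returning both `elapsedMs` and `polls` from the loop keeps the measurement in one place, so a caller could log the readiness window without recomputing it from timestamps.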
2. Surface the slow path
Today, when readiness takes a long time, the action's stdout gives you no idea why; the explanation lives in `/tmp/buildkitd.log` on the runner.

- Emit a `buildkitd workers ready in <N>ms after <K> poll(s)` log line so the readiness window is directly measurable from the action's own output.
- When readiness takes >2s, automatically tail the last 50 lines of `/tmp/buildkitd.log` so the slow path is self-explanatory in CI logs without spamming the fast path.

3. Extend the step-duration regression workflow
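Adds a third gated metric to `step-duration-regression.yml` alongside setup/post step durations: the buildkitd readiness window in milliseconds, parsed out of the new telemetry line. Threshold defaults to `BUILDKITD_READY_MAX_MS=8000`.

This catches both:
- a regression in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).
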
The check is informational on older action versions that don't emit the telemetry line.
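
As a rough illustration of the gate described above, the sketch below parses the telemetry line out of a job log and compares it against the threshold. It assumes the line has exactly the shape quoted in this PR; the regular expression, the `parseReadiness` and `gateReadiness` names, and the three result labels are hypothetical and not taken from the workflow itself.

```typescript
// Assumed pattern for the telemetry line quoted in this PR. It is left
// unanchored so any prefix the log adds to the line does not matter.
const READY_LINE = /buildkitd workers ready in (\d+)ms after (\d+) poll\(s\)/;

interface Readiness {
  readyMs: number;
  polls: number;
}

// Returns the first readiness measurement found in the log, or null if the
// telemetry line is absent (for example, an older action version).
function parseReadiness(log: string): Readiness | null {
  const match = READY_LINE.exec(log);
  if (match === null) {
    return null;
  }
  return { readyMs: Number(match[1]), polls: Number(match[2]) };
}

type GateResult = "pass" | "fail" | "skipped";

// Hypothetical gate: the default mirrors BUILDKITD_READY_MAX_MS=8000.
function gateReadiness(log: string, maxMs: number = 8000): GateResult {
  const parsed = parseReadiness(log);
  if (parsed === null) {
    return "skipped"; // informational only: no telemetry to check
  }
  return parsed.readyMs > maxMs ? "fail" : "pass";
}
```

Treating a missing line as `"skipped"` rather than `"fail"` matches the stated intent that the check stays informational for action versions that predate the telemetry.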
Validation
Manual run on this branch via `workflow_dispatch`: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24924523539

A real production job (`useblacksmith/web`, large sticky disk with ~382 GB of pre-existing cache) was previously seeing ~6.8s of buildkitd readiness time. We can't shrink the buildkitd warm-up itself from this action, but we can now (a) get to the worker as soon as it's actually ready (no ~1s overshoot) and (b) see directly from CI logs how much of the readiness window was buildkitd vs polling.

Test plan
- `pnpm test`
- `pnpm typecheck`
- `dist/` rebuilt and committed

Note
Medium Risk
Adds a new CI-gated performance metric by scraping job logs, which could introduce occasional flakiness if log output or GitHub log retrieval changes; otherwise changes are isolated to workflow/buildkitd readiness polling and diagnostics.
Overview
Extends the `step-duration-regression.yml` CI gate to also enforce a hard ceiling on buildkitd readiness time (default `BUILDKITD_READY_MAX_MS=8000`), by fetching the target job's logs, parsing a `buildkitd workers ready in <N>ms after <K> poll(s)` telemetry line, and failing when the threshold is exceeded (skipping with a warning if the telemetry is missing).

Updates the action's buildkitd worker readiness polling to use exponential backoff (100ms→1s cap), emits the new readiness telemetry, and tails the last 50 lines of `/tmp/buildkitd.log` when readiness is slow (>2s) to make regressions diagnosable from CI output.

Reviewed by Cursor Bugbot for commit 978b795.
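
The slow-path diagnostics mentioned above (tail the last 50 lines only when readiness exceeds 2s) could look something like the sketch below. It operates on log text already read into memory; `tailLines`, `slowPathDiagnostics`, and the constant names are hypothetical, and how the action actually reads `/tmp/buildkitd.log` is not shown in this PR text.

```typescript
// Thresholds taken from the PR description; names are hypothetical.
const SLOW_READY_MS = 2000;
const TAIL_LINE_COUNT = 50;

// Returns the last `count` lines of `text`, ignoring the empty element that
// a trailing newline would otherwise produce.
function tailLines(text: string, count: number): string[] {
  const lines = text.split("\n");
  if (lines[lines.length - 1] === "") {
    lines.pop();
  }
  // slice(-0) would return everything, so handle non-positive counts here.
  return count <= 0 ? [] : lines.slice(-count);
}

// Returns the log lines to print for a given readiness time: nothing on the
// fast path, the tail of the buildkitd log on the slow path.
function slowPathDiagnostics(elapsedMs: number, logText: string): string[] {
  return elapsedMs > SLOW_READY_MS ? tailLines(logText, TAIL_LINE_COUNT) : [];
}
```

Keeping the decision in a pure function makes the "no spam on the fast path" behavior easy to check in a unit test without touching the filesystem.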