
perf: tighten buildkitd readiness polling and gate it in regression workflow #100

Draft

taha-au wants to merge 1 commit into main from perf/buildkitd-readiness-polling

Conversation


@taha-au taha-au commented Apr 25, 2026

Summary

Two related improvements + a regression-workflow extension:

1. Tighten the buildkitd readiness polling

After buildkitd is launched, the action polls buildctl debug workers until the OCI worker is up. The loop used a flat 1s sleep between polls, so we paid up to ~1s of pure polling discretization on every job, regardless of how fast buildkitd actually came up.

Replace that with exponential backoff: 100→200→400→800ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
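The schedule described above can be sketched as a pure function (a minimal illustration; the function name and shape are assumptions, not the action's actual code — the real loop interleaves `buildctl debug workers` calls between these delays):

```typescript
// Compute the poll-delay schedule: start at 100ms, double each poll,
// cap at 1000ms, and stop generating once the cumulative wait would
// exceed the hard timeout (30s in the action; parameterized here).
function backoffDelays(totalTimeoutMs: number): number[] {
  const delays: number[] = [];
  let delay = 100;
  let elapsed = 0;
  while (elapsed < totalTimeoutMs) {
    delays.push(delay);
    elapsed += delay;
    delay = Math.min(delay * 2, 1000);
  }
  return delays;
}

console.log(backoffDelays(3000)); // [100, 200, 400, 800, 1000, 1000]
```

Because the first four delays sum to 1.5s, a worker that comes up in, say, 150ms is now observed at the ~300ms mark instead of the 1s mark; past the fifth poll the schedule degenerates to the old flat 1s cadence, so slow startups lose nothing.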

2. Surface the slow path

Today, when readiness takes a long time, the action's stdout gives you no idea why — the explanation lives in /tmp/buildkitd.log on the runner.

  • Add a buildkitd workers ready in <N>ms after <K> poll(s) log line so the readiness window is directly measurable from the action's own output.
  • When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.
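The two bullets above amount to a small decision: always emit the telemetry line, and only tail the daemon log on the slow path. A sketch, with illustrative identifiers (the constants and function name are assumptions, not the action's real code):

```typescript
// Thresholds per the PR description: tail /tmp/buildkitd.log only when
// readiness exceeded 2s, and limit the tail to the last 50 lines.
const SLOW_PATH_MS = 2000;
const TAIL_LINES = 50;

function readinessTelemetry(readyMs: number, polls: number) {
  return {
    // The exact line the regression workflow later greps for.
    line: `buildkitd workers ready in ${readyMs}ms after ${polls} poll(s)`,
    // Non-null only on the slow path, so fast jobs aren't spammed.
    tailCommand:
      readyMs > SLOW_PATH_MS ? `tail -n ${TAIL_LINES} /tmp/buildkitd.log` : null,
  };
}

console.log(readinessTelemetry(169, 1).line);
// buildkitd workers ready in 169ms after 1 poll(s)
```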

3. Extend the step-duration regression workflow

Adds a third gated metric to step-duration-regression.yml alongside setup/post step durations: the buildkitd readiness window in milliseconds, parsed out of the new telemetry line. Threshold defaults to BUILDKITD_READY_MAX_MS=8000.
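The gate logic reduces to parse-then-compare. A hedged sketch (the function is hypothetical; the actual workflow step may implement this in shell, but the telemetry-line format and the pass/fail/skip behavior are as described in this PR):

```typescript
// Parse the readiness telemetry out of a fetched job log and gate it.
// Returns "skip" when the line is absent (older action versions), so the
// check stays informational rather than failing spuriously.
function gateReadiness(
  jobLog: string,
  maxMs: number = 8000 // BUILDKITD_READY_MAX_MS default
): "pass" | "fail" | "skip" {
  const m = jobLog.match(/buildkitd workers ready in (\d+)ms after \d+ poll\(s\)/);
  if (!m) return "skip";
  return Number(m[1]) <= maxMs ? "pass" : "fail";
}

console.log(gateReadiness("... buildkitd workers ready in 169ms after 1 poll(s) ..."));
// pass
```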

This catches both:

  • regressions in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
  • regressions in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.

Validation

Manual run on this branch via workflow_dispatch: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24924523539

Setup step ("Setup Docker Builder under test"): 3s   (threshold 5s)
Post  step ("Post Setup Docker Builder under test"):  2s   (threshold 5s)
buildkitd readiness:          169ms in 1 poll(s)   (threshold 8000ms)

All thresholds passed.

A real production job (useblacksmith/web, large sticky disk with ~382 GB of pre-existing cache) was previously seeing ~6.8s of buildkitd readiness time. We can't shrink the buildkitd warm-up itself from this action, but we can now (a) get to the worker as soon as it's actually ready (no ~1s overshoot) and (b) see directly from CI logs how much of the readiness window was buildkitd vs polling.

Test plan

  • Unit tests pass (pnpm test)
  • Typecheck passes (pnpm typecheck)
  • dist/ rebuilt and committed
  • Manual workflow_dispatch run on this branch shows the new telemetry line, the parsed value reaches the validator, and the threshold check fires
  • Regression workflow runs green against this PR (will appear in checks below)


Note

Medium Risk
Adds a new CI-gated performance metric by scraping job logs, which could introduce occasional flakiness if log output or GitHub log retrieval changes; otherwise changes are isolated to workflow/buildkitd readiness polling and diagnostics.

Overview
Extends the step-duration-regression.yml CI gate to also enforce a hard ceiling on buildkitd readiness time (default BUILDKITD_READY_MAX_MS=8000), by fetching the target job’s logs, parsing a buildkitd workers ready in <N>ms after <K> poll(s) telemetry line, and failing when the threshold is exceeded (skipping with a warning if the telemetry is missing).

Updates the action’s buildkitd worker readiness polling to use exponential backoff (100ms→1s cap), emits the new readiness telemetry, and tails the last 50 lines of /tmp/buildkitd.log when readiness is slow (>2s) to make regressions diagnosable from CI output.

Reviewed by Cursor Bugbot for commit 978b795.

The setup step's "wait for buildkitd workers" loop polled with a flat
1s backoff. In practice buildkitd's OCI worker comes up in well under
a second on most runs, so we paid up to ~1s of polling discretization
for nothing on every job. On slow runs, we had no signal at all about
what buildkitd was doing during the wait - it all lived in
/tmp/buildkitd.log on the runner.

Changes:

- Replace the fixed 1000ms backoff with exponential 100->200->400->800
  ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups
  observe the worker up to ~900ms sooner; cold/slow startups behave
  no worse than before.
- Emit a "buildkitd workers ready in <N>ms after <K> poll(s)" log line
  so the readiness window is directly measurable from the action's own
  output instead of having to subtract two adjacent log timestamps by
  hand.
- When readiness takes >2s, automatically tail the last 50 lines of
  /tmp/buildkitd.log so the slow path is self-explanatory in CI logs
  without spamming the fast path.

Also extends the step-duration-regression workflow to parse the new
telemetry line out of the job log and gate it on
BUILDKITD_READY_MAX_MS (default 8000ms). This catches both:

- a regression in the action's polling backoff (would push readiness
  ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s
  ceiling regardless of polling).

The check is informational on older action versions that don't emit
the telemetry line.
@taha-au taha-au marked this pull request as draft April 25, 2026 06:23