Skip to content

fix: data races, resource leaks, and validation gaps#125

Merged
alexec merged 2 commits into
mainfrom
fix/concurrency-and-leaks
Jun 26, 2026
Merged

fix: data races, resource leaks, and validation gaps#125
alexec merged 2 commits into
mainfrom
fix/concurrency-and-leaks

Conversation

@alexec

@alexec alexec commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes a batch of correctness bugs found in a review of the task runner, grouped into three areas. All findings verified; tests pass with go test -race ./....

Concurrency (data races)

  • TaskNode Phase/Message/Metrics/cancel were written by task/timer/metrics goroutines while read by the main loop and the HTTP server. Added a status sync.RWMutex with accessors and a locking MarshalJSON; routed every access through them. (-race was failing before this.)
  • logWriter.buffer mutated without a lock — the same writer is passed as both stdout and stderr, which os/exec copies on separate goroutines. Added a mutex.
  • k8s.pods appended in the informer goroutine while ranged in the metrics goroutine. Guarded with a mutex; GetMetrics snapshots under lock.

Resource leaks

  • Probe HTTP response body never closed on the success path.
  • SSE /events and /logs handlers never returned on client disconnect (goroutine + memory leak); now select on r.Context().Done(). Broadcast made non-blocking so one slow client can't stall others.
  • Docker client closed via defer when Run returned, but the metrics goroutine kept using it → use-after-close. Now closed on ctx-done.
  • Host process leaked if Getpgid failed after Start; now killed.
  • Envfiles closed per-iteration instead of via deferred close accumulating across the loop.

Correctness

  • Env precedence inverted: spec env now overrides the inherited environment (was the other way around).
  • EnvVar.Unstring used Split(s,"="), rejecting values containing = (URLs, base64). Now SplitN.
  • kit -v could nil-panic when build info is absent; check the ok bool.
  • Dependencies on unknown tasks and dependency cycles were not detected (cycle → silent deadlock). Added validation + DFS cycle detection at load time, with a test.
  • Probe.URL() nil-derefed when neither TCPSocket nor HTTPGet was set.
  • Shutdown could deadlock when goroutines blocked on a full event channel; the main loop now drains events while waiting.

Test plan

  • go build ./...
  • go vet ./...
  • go test -race ./... (passes; added TestDAG_findCycle)

🤖 Generated with Claude Code

Concurrency:
- guard TaskNode Phase/Message/Metrics/cancel with an RWMutex and route all
  access through accessors; add locking MarshalJSON (fixes -race failures)
- lock logWriter.buffer (shared stdout/stderr writer)
- guard k8s.pods with a mutex

Leaks:
- close probe HTTP response body
- return SSE/log handlers on client disconnect; non-blocking broadcast
- close docker client on ctx-done, not when Run returns (was use-after-close)
- kill process if Getpgid fails after start
- close envfiles per-iteration

Correctness:
- env precedence: spec env now overrides inherited environment
- EnvVar.Unstring uses SplitN so values may contain '='
- check ReadBuildInfo ok bool (no nil panic on -v)
- validate dependencies exist + detect dependency cycles at load time
- guard nil TCPSocket/HTTPGet in Probe.URL
- drain event channels during shutdown to avoid deadlock

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the task runner by addressing concurrency races, resource leaks, and validation/correctness gaps that could cause hangs, panics, or inconsistent UI/task state reporting.

Changes:

  • Adds synchronization around mutable task status/logging and improves shutdown behavior to avoid deadlocks.
  • Fixes multiple resource-lifecycle issues (HTTP bodies, SSE/log streaming disconnects, docker client lifetime, envfile handling).
  • Adds workflow dependency validation (unknown deps + cycle detection) and tightens parsing/edge-case handling.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
main.go Avoids nil panic when build info is unavailable (kit -v prints unknown).
internal/types/probe.go Makes Probe.URL() safe when no action is configured.
internal/types/envfile.go Prevents deferred-close accumulation by reading envfiles per-iteration via helper.
internal/types/env_var.go Allows EnvVar values containing = by using SplitN.
internal/terminal.go Routes task phase reads through synchronized accessor.
internal/task_node.go Introduces status locking + accessors and a locking MarshalJSON to eliminate task status races.
internal/server.go Makes SSE broadcast non-blocking and exits /events + /logs on client disconnect.
internal/run.go Validates unknown deps + cycles; routes task status/cancel/metrics through accessors; drains event channels during shutdown wait.
internal/proc/kubernetes.go Guards concurrent access to the pod list and snapshots for metrics.
internal/proc/host.go Fixes env precedence and kills started process when pgid capture fails.
internal/proc/container.go Defers docker client close until ctx cancellation to avoid use-after-close by metrics goroutine.
internal/probe.go Closes probe HTTP response bodies on the success path.
internal/log_writer.go Adds a mutex to avoid data races when the same writer is used for stdout+stderr.
internal/dag.go Adds DFS-based cycle detection helper.
internal/dag_test.go Adds a test covering cycle detection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/proc/host.go
Comment thread internal/server.go
Resolved main.go conflict by taking the upstream rewrite (#124), which
already includes the debug.ReadBuildInfo ok-check fix.
@alexec alexec marked this pull request as ready for review June 26, 2026 00:39
@alexec alexec merged commit 87d80c2 into main Jun 26, 2026
2 checks passed
@alexec alexec deleted the fix/concurrency-and-leaks branch June 26, 2026 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants