
feat: per-model datastore locks for concurrent model operations#713

Merged
stack72 merged 6 commits into main from fix/per-model-datastore-locks-706 on Mar 16, 2026

Conversation


@stack72 stack72 commented Mar 15, 2026

Summary

Replaces the single global datastore lock with per-model locks so that operations on different models can run concurrently. This fixes the core issue where running `model method run` on one model blocked all other model operations, even on completely unrelated models.

  • Introduces per-model lock files at data/{modelType}/{modelId}/.lock — operations on different models no longer block each other
  • Refactors the sync coordinator from module-level singletons to a Map<string, SyncEntry> supporting multiple concurrent locks
  • Adds model-scoped S3 sync (pullChangedForModel) so S3 datastores only pull/push the relevant model's files
  • Extracts model references from workflows to acquire only the locks needed, with automatic fallback to global lock for dynamic CEL references
  • Updates breakglass commands (lock status, lock release --model) to show and manage per-model locks

Closes #706

Why per-model locks?

The global lock was designed for simplicity, but it creates an O(n) bottleneck for fleet operations. When managing 100+ VMs where each method call takes 30-90 seconds, a single global lock means:

  • Before: Running healthCheck on prod-vm-1 blocks a quick checkService on dev-vm-2 — the second command waits or times out
  • After: Both run concurrently because they acquire separate locks (data/command/shell/prod-vm-1/.lock and data/command/shell/dev-vm-2/.lock)

Structural commands (data gc, repo init, model create/delete) still use the global lock, and per-model lock acquisition checks for a held global lock first — so data integrity is preserved.
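
A minimal sketch of the lock layout and the integrity check described above, assuming illustrative helper names (`lockPathFor` and `canAcquireModelLock` are not the PR's actual API):

```typescript
import * as path from "node:path";

// Per-model lock lives beside the model's data: data/{modelType}/{modelId}/.lock
function lockPathFor(dataDir: string, modelType: string, modelId: string): string {
  return path.posix.join(dataDir, modelType, modelId, ".lock");
}

// Per-model acquisition yields to a held global lock (structural commands),
// which is what preserves data integrity across the two lock scopes.
function canAcquireModelLock(globalLockHeld: boolean): boolean {
  return !globalLockHeld;
}

console.log(lockPathFor("data", "command/shell", "prod-vm-1"));
// → data/command/shell/prod-vm-1/.lock
```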

Impact on users

  • model method run on different models runs concurrently (the primary win)
  • workflow run acquires only the locks for models referenced in the workflow
  • model evaluate and workflow evaluate use per-model locks for single-model/workflow evaluation
  • S3 datastores benefit equally — model-scoped pull on lock acquisition, brief global lock only during push (seconds, not minutes)
  • No breaking changes — the lock file format and CLI interface are unchanged; existing repos work without migration
  • Deadlock prevention — locks are always acquired in sorted order when multiple models are involved
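
The deadlock-prevention bullet can be illustrated with a tiny sketch: every process sorts its lock keys into the same total order before acquiring, so circular waits cannot form (the key format here is assumed for illustration):

```typescript
// With a shared total order, two processes needing overlapping lock sets
// always contend on the first shared key instead of deadlocking.
function lockAcquisitionOrder(modelKeys: string[]): string[] {
  return [...modelKeys].sort();
}

// Two workflows needing the same pair of models acquire in identical order,
// regardless of how the references were listed in each workflow.
const w1 = lockAcquisitionOrder(["command/shell/prod-vm-1", "command/shell/dev-vm-2"]);
const w2 = lockAcquisitionOrder(["command/shell/dev-vm-2", "command/shell/prod-vm-1"]);
```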

What changed

File | Change
datastore_sync_coordinator.ts | Refactored to Map<string, SyncEntry> with named lock support
repo_context.ts | Added requireInitializedRepoUnlocked(), createModelLock(), acquireModelLocks()
s3_cache_sync.ts | Added pullChangedForModel() for model-scoped S3 sync
model_method_run.ts | Uses per-model lock instead of global lock
workflow_run.ts | Extracts model refs, acquires per-model locks (falls back to global for dynamic refs)
model_evaluate.ts | Single-model evaluate uses per-model lock; --all keeps global
workflow_evaluate.ts | Single-workflow evaluate uses per-model lock; --all keeps global
datastore_lock.ts | lock status scans per-model locks; lock release --model added
datastore_output.ts | Shows lock scope (global vs per-model) in status output
model_reference_extractor.ts | New: extracts model references from workflows for lock targeting

Test plan

  • All 2968 existing tests pass
  • deno check, deno lint, deno fmt clean
  • New unit tests for coordinator, repo context helpers, and model reference extractor
  • Manual verification: two concurrent model method run on different models both proceed without blocking, each with its own .lock file
  • Binary compiles successfully

🤖 Generated with Claude Code


stack72 commented Mar 15, 2026

Manual testing of compiled binary

Tested the compiled binary against a fresh filesystem datastore to verify per-model locks work end-to-end.

Setup

mkdir /tmp/swamp-lock-test
cd /tmp/swamp-lock-test
swamp repo init

Created two command/shell models with a 5-second sleep:

# models/command/shell/model-a.yaml
type: "command/shell"
typeVersion: 1
id: "a0000000-0000-4000-a000-000000000001"
name: model-a
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"
# models/command/shell/model-b.yaml
type: "command/shell"
typeVersion: 1
id: "b0000000-0000-4000-b000-000000000002"
name: model-b
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"

Concurrent execution test

Launched both model method runs in the background simultaneously:

swamp model method run model-a execute &
swamp model method run model-b execute &

After 2 seconds (while both were still running), checked for lock files:

$ find .swamp -name ".lock"
.swamp/data/command/shell/a0000000-0000-4000-a000-000000000001/.lock
.swamp/data/command/shell/b0000000-0000-4000-b000-000000000002/.lock

Two separate per-model lock files — one per model, no global lock.

Results

Both processes started at the same time and completed at the same time — no blocking:

# model-a
00:16:47.540 INF model·method·run·model-a·execute Found model "model-a" ("command/shell")
00:16:47.545 INF model·method·run·model-a·execute Executing method "execute"
00:16:52.729 INF model·method·run·model-a·execute Method executed
00:16:52.733 INF model·method·run·model-a·execute Method "execute" completed on "model-a"

# model-b
00:16:47.540 INF model·method·run·model-b·execute Found model "model-b" ("command/shell")
00:16:47.545 INF model·method·run·model-b·execute Executing method "execute"
00:16:52.723 INF model·method·run·model-b·execute Method executed
00:16:52.727 INF model·method·run·model-b·execute Method "execute" completed on "model-b"

Both started at 00:16:47.540 and finished at ~00:16:52.7 — running fully in parallel. With the old global lock, model-b would have waited ~5 seconds for model-a to release the lock before starting.

@stack72 stack72 marked this pull request as ready for review March 16, 2026 15:46

stack72 and others added 2 commits March 16, 2026 16:19
Replace the single global datastore lock with per-model locks so that
operations on different models can run concurrently. Previously, running
`model method run` on one model blocked all other model operations, even
on completely unrelated models. With per-model locks, only operations on
the same model wait for each other.

Closes #706

Co-authored-by: Blake Irvin <blakeirvin@me.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Wrap lock-acquiring code paths in try/finally blocks so per-model locks
are always released even when method execution, evaluation, or workflow
evaluation throws. Also add staleness detection to the global lock
polling loop so per-model commands don't hang indefinitely when a
structural command crashes without releasing its lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stack72 stack72 force-pushed the fix/per-model-datastore-locks-706 branch from 1e75c71 to c8ebcc7 Compare March 16, 2026 16:21

1. Global lock now waits for held per-model locks before acquiring,
   preventing races between structural commands and in-progress
   per-model operations.

2. S3 sync failures (pull and push) now throw instead of silently
   continuing with stale/lost data.

3. workflow_run.ts: flushModelLocks() in catch block wrapped in
   try/catch so original error is preserved if lock release fails.

4. model_method_run.ts: flushModelLocks() in finally block wrapped
   in try/catch so original error is preserved if lock release fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Move per-model lock release AFTER S3 push completes. Previously
   locks were released before push, so if push failed another process
   could acquire the lock, pull stale S3 data, and overwrite local
   changes.

2. Re-check global lock after acquiring each per-model lock. If a
   structural command acquired the global lock between the initial
   check and per-model acquisition, release all per-model locks,
   wait for the global lock, and retry from scratch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
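
The re-check-and-retry in point 2 can be sketched synchronously; the Locks interface and a bounded retry cap are illustrative additions (the real code is async, file-based, and waits on the global lock rather than failing):

```typescript
interface Locks {
  tryAcquireModel(key: string): boolean;
  releaseModel(key: string): void;
  globalHeld(): boolean;
}

// Acquire all per-model locks in sorted (deadlock-safe) order, re-checking
// the global lock at each step. If a structural command grabbed the global
// lock in between, release everything acquired so far and start over.
function acquireModelLocks(locks: Locks, keys: string[], maxRetries = 10): boolean {
  const sorted = [...keys].sort();
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const held: string[] = [];
    let conflicted = false;
    for (const key of sorted) {
      if (locks.globalHeld() || !locks.tryAcquireModel(key)) {
        conflicted = true;
        break;
      }
      held.push(key);
    }
    if (!conflicted) return true;
    for (const key of held) locks.releaseModel(key); // back off and retry
  }
  return false;
}
```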

1. pullChanged() and pullChangedForModel() now track failed downloads
   and throw with the list of failed files instead of returning a
   partial count. Prevents operating on incomplete/stale data.

2. pushChanged() now tracks failed uploads and throws instead of
   silently dropping files. Prevents data loss when S3 uploads fail.

3. Global lock path in registerDatastoreSyncNamed() now re-throws
   S3 pull errors instead of logging a warning and continuing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
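
The failure-tracking behavior in points 1–2 can be sketched as follows; `downloadOne` is a hypothetical stand-in for the S3 transfer call, not the PR's actual helper:

```typescript
// Instead of returning a partial count when some downloads fail, collect the
// failed keys and throw, so callers never operate on incomplete/stale data.
async function pullChanged(
  keys: string[],
  downloadOne: (key: string) => Promise<void>,
): Promise<number> {
  const failed: string[] = [];
  let pulled = 0;
  for (const key of keys) {
    try {
      await downloadOne(key);
      pulled++;
    } catch {
      failed.push(key);
    }
  }
  if (failed.length > 0) {
    throw new Error(`S3 pull failed for ${failed.length} file(s): ${failed.join(", ")}`);
  }
  return pulled;
}
```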

Wrap the flush function body in try/finally so per-model locks are
released regardless of whether the S3 push succeeds. Previously, a
push failure would throw before reaching lock release, leaving locks
held until TTL expiry and blocking other operations on those models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@github-actions github-actions bot left a comment


Adversarial Review

I've traced through all code paths, edge cases, and failure scenarios in this per-model locking PR. The implementation is well-designed with appropriate safeguards.

Medium

  1. src/cli/repo_context.ts:468-474 — waitForPerModelLocks unbounded wait

    • The while(true) loop waits indefinitely for per-model locks to be released before acquiring the global lock.
    • Breaking scenario: If another process holds a per-model lock with an extremely large custom TTL (e.g., 24 hours configured via LockOptions), or if the staleness check has a clock skew issue (system clock jumps backward), this function blocks indefinitely.
    • Mitigation: The staleness check at line 448 (Date.now() - acquiredAt <= info.ttlMs) means stale locks are ignored. Default TTL is 30s. This is acceptable given typical usage.
    • Suggested improvement (optional): Add a configurable maximum wait time with fallback to warning + proceed.
  2. src/cli/repo_context.ts:592 — Recursive retry without depth limit

    • If acquireModelLocks detects a global lock during per-model acquisition, it releases everything and recursively calls itself.
    • Breaking scenario: Pathological case where a structural command repeatedly acquires/releases the global lock at exactly the wrong timing, causing unbounded recursion.
    • Mitigation: The wait loop at lines 577-589 blocks until global lock is released or stale, making the pathological case extremely unlikely in practice.
    • Suggested improvement (optional): Add max retry count (e.g., 10) before throwing.
  3. src/cli/repo_context.ts:608-617 — Partial lock acquisition on S3 pull failure

    • If S3 pullChangedForModel fails mid-way through the lock acquisition loop, some per-model locks are held but the returned flush function is never constructed.
    • Breaking scenario: Process A acquires lock for model/X, S3 pull for model/Y fails, error propagates up. Locks for model/X remain in the global entries map until CLI exit or SIGINT.
    • Why it's OK: The locks ARE registered in the coordinator's entries map (line 558), so flushDatastoreSync at CLI exit releases them. SIGINT handler also covers this. TTL expiration handles truly orphaned locks. This is acceptable latency, not a leak.
  4. src/cli/commands/datastore_lock.ts:128 — Per-model lock status only works for filesystem

    • scanModelLocks only scans filesystem datastores. S3 per-model locks are not shown in lock status.
    • Impact: Users with S3 datastores cannot inspect per-model lock state via the breakglass command.
    • Why acceptable: S3 locks can be inspected directly via AWS console/CLI. This is a feature gap, not a bug.
  5. src/cli/commands/workflow_run.ts:278-283 — No locks when model references don't resolve

    • If modelRefs is non-empty but findDefinitionByIdOrName returns null for all (models don't exist), resolvedModels is empty and no locks are acquired.
    • Why acceptable: The workflow will fail at runtime when the model_method task executes, surfacing a clear error. Acquiring no locks is the correct behavior for this error case.
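
The staleness rule cited in item 1's mitigation can be written out as a small predicate; field names follow the review's snippet, but this is a sketch, not the PR's code:

```typescript
interface LockInfo {
  acquiredAt: number; // epoch ms when the holder took the lock
  ttlMs: number;      // default 30s per the review
}

// A lock is honored only while its age is within the TTL; anything older is
// treated as left behind by a crashed holder and ignored by waiters.
function isLockStale(info: LockInfo, now: number = Date.now()): boolean {
  return now - info.acquiredAt > info.ttlMs;
}
```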

Low

  1. src/domain/workflows/model_reference_extractor.ts:83-93 — Missing nested workflow silently skipped

    • If a workflow references a nested workflow that doesn't exist, nestedWorkflow is null and its model references aren't extracted.
    • Impact: Potential under-locking if the nested workflow is created after extraction but before execution.
    • Why acceptable: The workflow will fail at runtime anyway since the nested workflow doesn't exist. This is a deferred error, not a correctness issue.
  2. Potential race between S3 index mutation and persistence

    • pullChangedForModel mutates this.index.entries[rel].localMtime in memory (line 271) but doesn't persist immediately.
    • Why acceptable: The index is best-effort optimization. Worst case is redundant file transfers, not data corruption.

Verified Correctness

  • Deadlock prevention: Lock ordering via sorted acquisition (line 526-529) is correct.
  • TOCTOU mitigation: Re-checking global lock after each per-model acquisition (lines 564-593) minimizes but doesn't eliminate the race. The remaining window is microseconds. Acceptable.
  • Path traversal protection: assertSafePath validates all paths derived from S3 index entries.
  • SIGINT cleanup: Handler at lines 66-76 releases all held locks with 5s timeout before force exit.
  • Lock heartbeat management: Recursive retry properly releases locks via flushDatastoreSyncNamed, which stops heartbeats.
  • S3 push atomicity: Global lock held during push (line 628-649) prevents concurrent pushers from clobbering each other.

Verdict

PASS — The per-model locking implementation is solid. The medium-severity items are edge cases with appropriate mitigations (TTL expiration, CLI-level cleanup, SIGINT handler). No blocking issues found.

@stack72 stack72 dismissed github-actions[bot]’s stale review March 16, 2026 17:37

it's been passed but not updated

@stack72 stack72 enabled auto-merge (squash) March 16, 2026 17:37
@stack72 stack72 disabled auto-merge March 16, 2026 17:37
@stack72 stack72 merged commit 1abd500 into main Mar 16, 2026
5 of 6 checks passed
@stack72 stack72 deleted the fix/per-model-datastore-locks-706 branch March 16, 2026 17:38


Development

Successfully merging this pull request may close these issues.

Datastore lock: allow concurrent reads across different models
