fix: data GC never fully removes expired data from disk #722

Merged
stack72 merged 4 commits into main from fix/data-gc-hard-delete-expired on Mar 17, 2026
Conversation


@stack72 stack72 commented Mar 16, 2026

Summary

Fixes #720

swamp data gc reported expired data but never reclaimed any disk space. The deleteExpiredData() method only called removeLatestMarker(), which removed the latest symlink but left all version directories and content files on disk.

This was a two-part problem:

  1. Phase 1 (expired data cleanup) only performed a soft-delete — no version directories were removed, zero bytes freed
  2. Phase 2 (collectGarbage()) always protects the latest version, and the single-version delete() path recreates the latest marker, effectively undoing Phase 1's soft-delete

User impact

Before this fix, running swamp data gc would report expired entries but disk usage would remain unchanged. Users had no way to reclaim space from expired data without manually deleting directories from the datastore.

After this fix, swamp data gc fully removes expired data directories, correctly reports versionsDeleted and bytesReclaimed, and actually frees disk space.

What changed

  • data_lifecycle_service.ts: Replaced removeLatestMarker() (soft-delete) with delete(type, modelId, dataName) (hard-delete) in deleteExpiredData(). Before deleting, stats each version's content file to report accurate bytes reclaimed. Updated LifecycleGCResult doc comments to reflect the new behavior.

  • data_lifecycle_service_test.ts: Added delete and getContentPath mocks to MockDataRepository. Added two new tests:

    • Verifies expired data triggers delete() (not removeLatestMarker()) and reports correct versionsDeleted
    • Verifies dry run does not call delete()

Why this is correct

The no-version delete(type, modelId, dataName) path in UnifiedDataRepository does a clean Deno.remove(dataNameDir, { recursive: true }) — it removes the entire data directory without recreating any markers. After Phase 1 hard-deletes expired entries, Phase 2's findAllForModel() naturally skips them since their directories no longer exist. No changes needed to collectGarbage() or the delete() implementation.
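The Phase 1 change described above can be sketched as follows. This is a hypothetical stand-in, not the real `DataLifecycleService`: the repository interface is reduced to the two calls the PR description mentions, and the in-memory behavior only models the hard-delete semantics (the whole data directory disappears, no marker is left behind).

```typescript
// Hypothetical sketch of the Phase 1 hard-delete. Method names follow the
// PR description; the repository shape is a stand-in for UnifiedDataRepository.
interface DataRepository {
  listVersions(name: string): Promise<number[]>;
  delete(name: string): Promise<void>; // removes the entire data directory
}

async function deleteExpiredData(
  repo: DataRepository,
  expired: string[],
  dryRun: boolean,
): Promise<{ versionsDeleted: number }> {
  let versionsDeleted = 0;
  for (const name of expired) {
    // Count versions before deletion so the result reflects what was removed.
    const versions = await repo.listVersions(name);
    if (!dryRun) {
      await repo.delete(name); // hard-delete: no latest marker is recreated
    }
    versionsDeleted += versions.length;
  }
  return { versionsDeleted };
}
```

In the real service, Phase 2's `collectGarbage()` then skips these entries naturally because their directories no longer exist.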

Test plan

  • deno check — type checking passes
  • deno lint — no lint errors
  • deno fmt — properly formatted
  • deno run test — all 2997 tests pass (including 2 new tests)
  • deno run compile — binary compiles successfully

🤖 Generated with Claude Code

…#720)

`deleteExpiredData()` only called `removeLatestMarker()` (soft-delete),
which removed the latest symlink but left all version directories on
disk. This meant `swamp data gc` reported expired data but never freed
any space.

Replace the soft-delete with a full `delete(type, modelId, dataName)`
call that recursively removes the entire data directory. Before deleting,
stat each version's content file to report accurate bytes reclaimed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

keeb commented Mar 16, 2026

Performance observation from real-world GC run

Tested this fix on a repo with ~4GB of accumulated data (game server metrics collected every 2 min via cron). The expired data phase works correctly — delete() with rm -rf is fast. However, the version GC phase in collectGarbage is very slow in practice.

Root cause

In collectGarbage, old versions are removed one at a time (line 1371–1387):

for (const version of versionsToRemove) {
  const contentPath = this.getContentPath(type, modelId, data.name, version);
  const stat = await Deno.stat(contentPath);            // 1 syscall per version
  bytesReclaimed += stat.size;                          // byte accounting for the GC report
  await this.delete(type, modelId, data.name, version); // 1+ syscalls per version
  versionsRemoved++;
}

With terraria/minecraft metrics at version ~14,000+, this means ~14,000 individual stat + delete calls per data item. On NVMe the disk isn't the bottleneck — it's the per-call overhead from thousands of sequential async Deno syscalls.

Additional overhead

  • findAllGlobal() is called twice in deleteExpiredData — once in findExpiredData() and again at line 269 for the version GC pass
  • collectGarbage calls findAllForModel() internally even though the caller just loaded all data

Suggested fix direction

For the version GC path: when removing N consecutive old versions from a data-name directory, a single Deno.remove(versionDir, { recursive: true }) per batch (or even Deno.remove on each version dir directly rather than going through the full delete() abstraction) would be much faster. The byte count can still be accumulated with a single du-style walk before deletion.

For the double scan: the second findAllGlobal() call could reuse the results already loaded in findExpiredData(), or both phases could be merged into one pass.

In our case GC went from 3.6GB → ~1GB over ~5 minutes on NVMe — functional but noticeably slow given the hardware.
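To make the suggested fix direction concrete, here is a minimal sketch of the batch-removal idea: accumulate bytes with one du-style pass, then remove each version directory with a single recursive delete instead of going through the full `delete()` abstraction per version. The filesystem operations are injected so the sketch stays runtime-agnostic; in Deno they would be `Deno.stat` and `Deno.remove(dir, { recursive: true })`. All names here are hypothetical.

```typescript
// Hypothetical sketch: one byte-accounting walk, then recursive removal
// per version directory, rather than a stat + full delete() per version.
type FsOps = {
  fileSize(path: string): Promise<number>;   // e.g. (await Deno.stat(p)).size
  removeTree(path: string): Promise<void>;   // e.g. Deno.remove(p, { recursive: true })
};

async function removeVersionsBatch(
  fs: FsOps,
  versionDirs: string[],
): Promise<{ versionsRemoved: number; bytesReclaimed: number }> {
  // du-style pass: accumulate sizes before anything is deleted
  let bytesReclaimed = 0;
  for (const dir of versionDirs) {
    bytesReclaimed += await fs.fileSize(`${dir}/content`);
  }
  // one recursive remove per version directory, issued together
  await Promise.all(versionDirs.map((dir) => fs.removeTree(dir)));
  return { versionsRemoved: versionDirs.length, bytesReclaimed };
}
```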

Eliminate double findAllGlobal() directory walk in deleteExpiredData() by
inlining expired-data detection and reusing the result for version GC.
Batch version removals in parallel (concurrency=20) with a single
latest-marker update per data name, reducing syscalls from ~6/version to
~2/version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
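The concurrency-limited batching this commit describes can be sketched as below. The limit of 20 comes from the commit message; the worker function is a placeholder for the per-version remove call, and the settled-result shape mirrors what `Promise.allSettled` would produce so failures stay visible.

```typescript
// Hypothetical sketch of bounded-concurrency batching (limit=20 per the
// commit message). Results preserve input order and never reject, so the
// caller can inspect per-item failures.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    // Each runner pulls the next unclaimed index until the queue is drained.
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { status: "fulfilled", value: await worker(items[i]) };
      } catch (reason) {
        results[i] = { status: "rejected", reason };
      }
    }
  }
  const runners = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runners }, () => run()));
  return results;
}
```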
@stack72 stack72 marked this pull request as ready for review March 16, 2026 23:56

keeb commented Mar 16, 2026

Follow-up: tested the perf commit (d27c322)

Recompiled and redeployed after the follow-up commit ("perf: optimize data GC with single scan and parallel batch deletion"). Results on the same repo:

  • First run (original fix commit): ~5-10 min to clean 3.6GB → 2.4GB — functional but slow
  • Second run (perf commit): cleaned 2.4GB → 12.5MB very quickly

The single-scan + parallel batch deletion (concurrency=20, ~2 syscalls/version vs ~6) addressed exactly the bottlenecks I described. No further issues observed.


Previously, Promise.allSettled rejected results (e.g. PermissionDenied)
were silently ignored, making it appear GC succeeded when it partially
failed. Now logs errors for rejected batch items so failures are visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
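The error-surfacing pattern this commit describes can be sketched as follows; `remove` here is a stand-in for the real per-path removal call, and the point is only that rejected `Promise.allSettled` results are logged rather than silently dropped.

```typescript
// Hypothetical sketch: run deletions in a batch and log each rejection
// (e.g. PermissionDenied) so partial GC failures are visible.
async function deleteBatch(
  paths: string[],
  remove: (path: string) => Promise<void>, // stand-in for the real remove call
): Promise<number> {
  const results = await Promise.allSettled(paths.map(remove));
  let removed = 0;
  results.forEach((r, i) => {
    if (r.status === "fulfilled") {
      removed++;
    } else {
      // Previously these rejections were ignored; now each one is reported.
      console.error(`GC: failed to remove ${paths[i]}:`, r.reason);
    }
  });
  return removed;
}
```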

After parallel batch deletions, re-read listVersions() instead of
computing remaining from the pre-deletion snapshot. Prevents a race
where concurrent writes create new versions that get hidden by the
marker being set to a stale value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
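The race fix can be sketched like this. `listVersions` and `setLatestMarker` are assumed repository operations (not the exact signatures in the codebase); the essential point is that the marker is computed from a fresh read taken after the deletions, never from the pre-deletion snapshot.

```typescript
// Hypothetical sketch of the post-deletion re-read: versions written
// concurrently during GC are picked up instead of being hidden behind
// a marker computed from stale data.
async function updateLatestAfterGC(
  listVersions: () => Promise<number[]>,
  setLatestMarker: (version: number) => Promise<void>,
): Promise<void> {
  // Fresh read after the parallel batch deletions have settled.
  const remaining = await listVersions();
  if (remaining.length > 0) {
    await setLatestMarker(Math.max(...remaining));
  }
}
```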
@github-actions github-actions bot left a comment


Review Summary

The PR correctly fixes the data GC issue where expired data was soft-deleted but never actually removed from disk. The fix is well-implemented and properly tested.

No Blocking Issues

Code adheres to all project requirements:

  • TypeScript strict mode: No any types
  • Named exports throughout (no default exports)
  • License headers present
  • Unit tests added next to source files
  • DDD principles followed: DataLifecycleService is a proper domain service, repository abstraction maintained

Suggestions (non-blocking)

  1. Code duplication in deleteExpiredData(): The expired data detection logic is now inlined (duplicated from findExpiredData()). This is an acceptable trade-off for the performance benefit of avoiding a second findAllGlobal() call. Consider extracting to a shared helper in the future if maintenance becomes burdensome.

  2. Test coverage for bytes calculation: The deleteExpiredData - hard-deletes expired data via delete() test doesn't verify bytesReclaimed calculation. Since the mock's getContentPath returns a non-existent path, Deno.stat() fails silently and bytesReclaimed is 0. Consider adding a test that verifies bytes are correctly accumulated when content files exist (could be an integration test).

What was reviewed

  • src/domain/data/data_lifecycle_service.ts: Hard-delete via delete() instead of removeLatestMarker(), optimized single scan
  • src/infrastructure/persistence/unified_data_repository.ts: Parallel batch deletions in collectGarbage(), proper marker updates
  • src/domain/data/data_lifecycle_service_test.ts: New tests for hard-delete and dry-run behavior

The parallel batch execution in collectGarbage() with Promise.allSettled and the post-deletion version re-scan are good improvements for correctness and performance.

🤖 Generated with Claude Code

@github-actions github-actions bot left a comment


Adversarial Review

I've traced through every code path in this PR looking for logic errors, race conditions, error handling gaps, and security issues. The core fix is correct - replacing the soft-delete (removeLatestMarker) with hard-delete (delete) properly reclaims disk space.

Critical / High

None found. The implementation is sound for the stated purpose.

Medium

  1. Race condition in Phase 1 metrics (data_lifecycle_service.ts:264-288)

    Between listVersions() and the final delete() call, the code sequentially stats each version's content file. If another process creates or deletes a version during this window:

    • New version created: Gets deleted but not counted in versionsDeleted or bytesReclaimed
    • Version deleted by another process: Over-counted in metrics

    Breaking example: Concurrent swamp data gc invocations. Process A stats versions [1,2,3], Process B deletes version 3, Process A reports versionsDeleted: 3 but only removed 2.

    Impact: Metrics inaccuracy, not data loss. Acceptable for GC operations.

    Suggested fix: If accurate metrics are important, could use a single atomic stat-then-delete pattern, or accept that GC metrics are best-effort.

Low

  1. Test gap for bytesReclaimed (data_lifecycle_service_test.ts:457-487)

    The mock returns "/tmp/fake-content-path" which doesn't exist, so Deno.stat() always fails silently. The test asserts versionsDeleted: 3 but never verifies bytesReclaimed for the non-dry-run case.

    Impact: No code bug, but bytesReclaimed counting is untested. Consider adding an integration test that creates real files.

  2. Sequential stat calls (data_lifecycle_service.ts:272-285)

    Stats are called sequentially for each version. Could be parallelized with Promise.all for marginal performance improvement on data with many versions.

    Impact: Negligible - typical data has few versions.

  3. GC parallel batching counts already-deleted versions (unified_data_repository.ts:1381-1395)

    If Deno.remove gets NotFound (version already deleted by concurrent process), the code still returns bytes and counts as versionsRemoved. This is pre-existing behavior from the old sequential code, not introduced by this PR.

Verdict

PASS - The fix correctly solves the reported issue (#720). The soft-delete to hard-delete change is implemented correctly, Phase 2 GC correctly reuses the scan to avoid double traversal, and the new parallel batch processing in collectGarbage() is well-structured with proper error handling via Promise.allSettled. No blocking issues found.

@stack72 stack72 merged commit 2f69cc5 into main Mar 17, 2026
6 checks passed
@stack72 stack72 deleted the fix/data-gc-hard-delete-expired branch March 17, 2026 00:50


Development

Successfully merging this pull request may close these issues.

Data garbage collection never fully removes expired data from disk
