fix: data GC never fully removes expired data from disk #722
Conversation
…#720) `deleteExpiredData()` only called `removeLatestMarker()` (soft-delete), which removed the latest symlink but left all version directories on disk. This meant `swamp data gc` reported expired data but never freed any space. Replace the soft-delete with a full `delete(type, modelId, dataName)` call that recursively removes the entire data directory. Before deleting, stat each version's content file to report accurate bytes reclaimed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Performance observation from real-world GC run

Tested this fix on a repo with ~4GB of accumulated data (game server metrics collected every 2 min via cron). The expired-data phase works correctly.

Root cause

The version GC loop stats and deletes each version one at a time:

```typescript
for (const version of versionsToRemove) {
  const stat = await Deno.stat(contentPath); // 1 syscall per version
  await this.delete(type, modelId, data.name, version); // 1+ syscalls per version
  versionsRemoved++;
}
```

With terraria/minecraft metrics at version ~14,000+, this means ~14,000 individual stat + delete calls per data item. On NVMe the disk isn't the bottleneck; it's the per-call overhead from thousands of sequential async Deno syscalls.

Additional overhead

`deleteExpiredData()` also walks the entire datastore twice: a second `findAllGlobal()` scan covers the same directories as the first.
Suggested fix direction

For the version GC path: when removing N consecutive old versions from a data-name directory, a single batched pass with the deletes issued concurrently would cut the per-version round-trips. For the double scan: the second `findAllGlobal()` walk could reuse the results of the first.

In our case GC went from 3.6GB → ~1GB over ~5 minutes on NVMe; functional but noticeably slow given the hardware.
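The batched approach can be sketched with a generic helper; `deleteInBatches` and its parameters are hypothetical names for illustration, not the actual swamp code:

```typescript
// Sketch: bounded-concurrency batch deletion (hypothetical helper, not the
// actual swamp implementation). Processes items in chunks of `concurrency`,
// so at most that many async delete operations are in flight at once.
async function deleteInBatches<T>(
  items: T[],
  deleteOne: (item: T) => Promise<void>,
  concurrency = 20,
): Promise<{ removed: number; failed: number }> {
  let removed = 0;
  let failed = 0;
  for (let i = 0; i < items.length; i += concurrency) {
    const chunk = items.slice(i, i + concurrency);
    // allSettled: one failed delete must not abort the whole batch
    const results = await Promise.allSettled(chunk.map(deleteOne));
    for (const r of results) {
      if (r.status === "fulfilled") removed++;
      else failed++;
    }
  }
  return { removed, failed };
}
```

Within each chunk, `Promise.allSettled` keeps one failed delete from aborting the rest, and failures are counted so the caller can surface them.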
Eliminate double findAllGlobal() directory walk in deleteExpiredData() by inlining expired-data detection and reusing the result for version GC. Batch version removals in parallel (concurrency=20) with a single latest-marker update per data name, reducing syscalls from ~6/version to ~2/version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up: tested the perf commit (d27c322)

Recompiled and redeployed after the perf commit landed.
The single-scan + parallel batch deletion (concurrency=20, ~2 syscalls/version vs ~6) addressed exactly the bottlenecks I described. No further issues observed.
Previously, Promise.allSettled rejected results (e.g. PermissionDenied) were silently ignored, making it appear GC succeeded when it partially failed. Now logs errors for rejected batch items so failures are visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
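A minimal sketch of that change, with a hypothetical `runBatch`/`logError` in place of the real repository code: each settled result is inspected, and rejections are logged rather than dropped.

```typescript
// Sketch: surfacing Promise.allSettled rejections instead of discarding them.
// `tasks` and the log target are illustrative, not the actual swamp code.
async function runBatch(
  tasks: Array<() => Promise<void>>,
  logError: (msg: string) => void,
): Promise<number> {
  const results = await Promise.allSettled(tasks.map((t) => t()));
  let failures = 0;
  for (const r of results) {
    if (r.status === "rejected") {
      failures++;
      // e.g. PermissionDenied errors previously vanished here
      logError(`GC batch item failed: ${r.reason}`);
    }
  }
  return failures;
}
```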
After parallel batch deletions, re-read listVersions() instead of computing remaining from the pre-deletion snapshot. Prevents a race where concurrent writes create new versions that get hidden by the marker being set to a stale value. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
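The race and the fix can be illustrated with an in-memory stand-in for the repository (all names hypothetical); the key point is that the marker comes from a fresh `listVersions()` after the deletes, not from the pre-deletion snapshot:

```typescript
// Sketch of the stale-snapshot race, using an in-memory version store as a
// stand-in for the real repository.
class VersionStore {
  private versions = new Set<number>();
  latest = 0;

  write(v: number) { this.versions.add(v); }
  listVersions(): number[] { return [...this.versions].sort((a, b) => a - b); }
  remove(v: number) { this.versions.delete(v); }

  async gcKeepLatest(keep: number) {
    const snapshot = this.listVersions(); // pre-deletion snapshot
    const toRemove = snapshot.slice(0, Math.max(0, snapshot.length - keep));
    for (const v of toRemove) {
      this.remove(v);
      await Promise.resolve(); // deletes are async in the real repo
    }
    // Fix: derive the marker from a fresh listVersions(), not `snapshot`,
    // so a version written during the deletes is not hidden.
    const remaining = this.listVersions();
    this.latest = remaining[remaining.length - 1] ?? 0;
  }
}
```

If `latest` were taken from `snapshot` instead, a version written while the deletes were in flight would be hidden behind a stale marker.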
Review Summary
The PR correctly fixes the data GC issue where expired data was soft-deleted but never actually removed from disk. The fix is well-implemented and properly tested.
No Blocking Issues
Code adheres to all project requirements:
- TypeScript strict mode: no `any` types
- Named exports throughout (no default exports)
- License headers present
- Unit tests added next to source files
- DDD principles followed: `DataLifecycleService` is a proper domain service, repository abstraction maintained
Suggestions (non-blocking)
- Code duplication in `deleteExpiredData()`: The expired-data detection logic is now inlined (duplicated from `findExpiredData()`). This is an acceptable trade-off for the performance benefit of avoiding a second `findAllGlobal()` call. Consider extracting to a shared helper in the future if maintenance becomes burdensome.
- Test coverage for bytes calculation: The `deleteExpiredData - hard-deletes expired data via delete()` test doesn't verify the `bytesReclaimed` calculation. Since the mock's `getContentPath` returns a non-existent path, `Deno.stat()` fails silently and `bytesReclaimed` is 0. Consider adding a test that verifies bytes are correctly accumulated when content files exist (could be an integration test).
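A sketch of such an integration test, using `node:` specifiers (available in both Deno and Node) and real temp files; the lifecycle-service wiring is elided, so this only demonstrates the real-file byte accounting the mock cannot exercise:

```typescript
import { mkdtempSync, rmSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch: measure bytes against real files instead of a mock path that
// Deno.stat() can never find. All names here are illustrative.
function bytesInFiles(paths: string[]): number {
  let total = 0;
  for (const p of paths) total += statSync(p).size;
  return total;
}

const dir = mkdtempSync(join(tmpdir(), "gc-test-"));
const files = ["v1", "v2", "v3"].map((v, i) => {
  const p = join(dir, v);
  writeFileSync(p, "x".repeat((i + 1) * 10)); // 10, 20, 30 bytes
  return p;
});

// This is the value a non-dry-run GC should report as bytesReclaimed.
const expected = bytesInFiles(files);

for (const p of files) rmSync(p);
rmSync(dir, { recursive: true });
```

An actual test would run `deleteExpiredData()` against a repository rooted in the temp directory and assert its reported `bytesReclaimed` equals the measured total.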
What was reviewed
- `src/domain/data/data_lifecycle_service.ts`: Hard-delete via `delete()` instead of `removeLatestMarker()`, optimized single scan
- `src/infrastructure/persistence/unified_data_repository.ts`: Parallel batch deletions in `collectGarbage()`, proper marker updates
- `src/domain/data/data_lifecycle_service_test.ts`: New tests for hard-delete and dry-run behavior
The parallel batch execution in collectGarbage() with Promise.allSettled and the post-deletion version re-scan are good improvements for correctness and performance.
🤖 Generated with Claude Code
Adversarial Review
I've traced through every code path in this PR looking for logic errors, race conditions, error handling gaps, and security issues. The core fix is correct - replacing the soft-delete (removeLatestMarker) with hard-delete (delete) properly reclaims disk space.
Critical / High
None found. The implementation is sound for the stated purpose.
Medium
- Race condition in Phase 1 metrics (`data_lifecycle_service.ts:264-288`): Between `listVersions()` and the final `delete()` call, the code sequentially stats each version's content file. If another process creates or deletes a version during this window:
  - New version created: gets deleted but not counted in `versionsDeleted` or `bytesReclaimed`
  - Version deleted by another process: over-counted in metrics

  Breaking example: concurrent `swamp data gc` invocations. Process A stats versions [1,2,3], Process B deletes version 3, Process A reports `versionsDeleted: 3` but only removed 2.

  Impact: metrics inaccuracy, not data loss. Acceptable for GC operations.

  Suggested fix: if accurate metrics are important, use a single stat-then-delete pattern per version, or accept that GC metrics are best-effort.
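If tighter accounting were ever wanted, one option (a sketch, not the PR's code; `statSize` and `remove` stand in for `Deno.stat` and `Deno.remove`) is to treat stat-then-delete as one unit per version and skip anything that is already gone:

```typescript
// Sketch: stat-then-delete as one unit per version, tolerating concurrent
// deletion. A version is counted only if *this* process removed it.
async function deleteAndMeasure(
  paths: string[],
  statSize: (p: string) => Promise<number>, // throws if missing
  remove: (p: string) => Promise<void>,     // throws if missing
): Promise<{ deleted: number; bytes: number }> {
  let deleted = 0;
  let bytes = 0;
  for (const p of paths) {
    try {
      const size = await statSize(p);
      await remove(p);
      deleted++;
      bytes += size;
    } catch {
      // NotFound: another process got there first; skip, don't count
    }
  }
  return { deleted, bytes };
}
```

Even this leaves a small window between the stat and the delete, so GC metrics stay best-effort, as the review concludes.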
Low
- Test gap for `bytesReclaimed` (`data_lifecycle_service_test.ts:457-487`): The mock returns `"/tmp/fake-content-path"`, which doesn't exist, so `Deno.stat()` always fails silently. The test asserts `versionsDeleted: 3` but never verifies `bytesReclaimed` for the non-dry-run case. Impact: no code bug, but `bytesReclaimed` counting is untested. Consider adding an integration test that creates real files.
- Sequential stat calls (`data_lifecycle_service.ts:272-285`): Stats are issued sequentially for each version. They could be parallelized with `Promise.all` for a marginal improvement on data with many versions. Impact: negligible; typical data has few versions.
- GC parallel batching counts already-deleted versions (`unified_data_repository.ts:1381-1395`): If `Deno.remove` gets `NotFound` (version already deleted by a concurrent process), the code still returns `bytes` and counts it in `versionsRemoved`. This is pre-existing behavior from the old sequential code, not introduced by this PR.
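The parallel-stat suggestion is small enough to sketch generically; `sizeOf` stands in for a `Deno.stat` call, and `allSettled` (rather than `Promise.all`) preserves the existing fail-silent behavior by counting failed stats as 0 bytes:

```typescript
// Sketch: stat all version content files in parallel instead of one by one.
// A rejected stat contributes 0 bytes, matching the current fail-silent path.
async function totalBytes(
  paths: string[],
  sizeOf: (p: string) => Promise<number>,
): Promise<number> {
  const settled = await Promise.allSettled(paths.map((p) => sizeOf(p)));
  return settled.reduce(
    (sum, r) => sum + (r.status === "fulfilled" ? r.value : 0),
    0,
  );
}
```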
Verdict
PASS - The fix correctly solves the reported issue (#720). The soft-delete to hard-delete change is implemented correctly, Phase 2 GC correctly reuses the scan to avoid double traversal, and the new parallel batch processing in collectGarbage() is well-structured with proper error handling via Promise.allSettled. No blocking issues found.
Summary
Fixes #720
`swamp data gc` reported expired data but never reclaimed any disk space. The `deleteExpiredData()` method only called `removeLatestMarker()`, which removed the `latest` symlink but left all version directories and content files on disk.

This was a two-part problem:

- Phase 1 (`deleteExpiredData()`) only soft-deleted expired entries, removing the `latest` marker while leaving every version directory in place
- Phase 2 version GC (`collectGarbage()`) always protects the latest version, and the single-version `delete()` path recreates the `latest` marker, effectively undoing Phase 1's soft-delete

User impact
Before this fix, running `swamp data gc` would report expired entries but disk usage would remain unchanged. Users had no way to reclaim space from expired data without manually deleting directories from the datastore.

After this fix, `swamp data gc` fully removes expired data directories, correctly reports `versionsDeleted` and `bytesReclaimed`, and actually frees disk space.

What changed
- `data_lifecycle_service.ts`: Replaced `removeLatestMarker()` (soft-delete) with `delete(type, modelId, dataName)` (hard-delete) in `deleteExpiredData()`. Before deleting, stats each version's content file to report accurate bytes reclaimed. Updated `LifecycleGCResult` doc comments to reflect the new behavior.
- `data_lifecycle_service_test.ts`: Added `delete` and `getContentPath` mocks to `MockDataRepository`. Added two new tests: one verifying expired data is hard-deleted via `delete()` (not `removeLatestMarker()`) and reports the correct `versionsDeleted`, and one verifying dry-run mode never calls `delete()`.

Why this is correct
The no-version `delete(type, modelId, dataName)` path in `UnifiedDataRepository` does a clean `Deno.remove(dataNameDir, { recursive: true })`: it removes the entire data directory without recreating any markers. After Phase 1 hard-deletes expired entries, Phase 2's `findAllForModel()` naturally skips them since their directories no longer exist. No changes needed to `collectGarbage()` or the `delete()` implementation.

Test plan
- `deno check` — type checking passes
- `deno lint` — no lint errors
- `deno fmt` — properly formatted
- `deno run test` — all 2997 tests pass (including 2 new tests)
- `deno run compile` — binary compiles successfully

🤖 Generated with Claude Code