Stale activity cleanup: TTL or background sweep for undeliverable worker queue items #3

@affandar

Problem

Activities scheduled with a tag that no worker is configured to handle (or activities whose target worker goes offline permanently) will sit in the worker queue indefinitely. The orchestration that scheduled them hangs forever unless the user manually implements a select2(activity, timer) starvation guard.

This is especially relevant now that activity tagging is implemented -- it is easy to schedule a .with_tag("gpu") activity in an environment where no GPU worker is running.

Desired Behavior

Undeliverable or stale activities should not block orchestrations forever. The runtime should detect activities that exceed a configurable time limit and fail them back to the orchestrator with a clear error.

Proposed Approaches

Option A: Activity TTL (per-item expiry)

  • Add an optional expires_at timestamp to worker queue items (set at enqueue time based on a configurable TTL)
  • Provider fetch_work_item() skips expired items
  • A periodic sweep (or check at fetch time) marks expired items as failed
  • The orchestration receives an ActivityExpired error it can match on
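A minimal sketch of Option A, using an in-memory queue rather than a real provider. Everything here is hypothetical and only illustrates the shape of the idea: the `WorkItem` and `WorkerQueue` types, the millisecond timestamps, the `sweep_expired` name, and the fields on `ActivityExpired` are assumptions, not the existing Provider trait or event model.

```rust
use std::collections::VecDeque;

/// Hypothetical worker queue item. `expires_at_ms` is the proposed optional expiry.
#[derive(Debug, Clone, PartialEq)]
struct WorkItem {
    id: u64,
    tag: Option<String>,
    enqueued_at_ms: u64,
    /// `None` means "never expires", which matches the current behavior.
    expires_at_ms: Option<u64>,
}

/// Hypothetical error the orchestration could match on.
#[derive(Debug, PartialEq)]
enum ActivityError {
    ActivityExpired { id: u64, waited_ms: u64 },
}

#[derive(Default)]
struct WorkerQueue {
    items: VecDeque<WorkItem>,
}

impl WorkerQueue {
    /// The expiry is fixed at enqueue time from an optional TTL.
    fn enqueue(&mut self, id: u64, tag: Option<&str>, now_ms: u64, ttl_ms: Option<u64>) {
        self.items.push_back(WorkItem {
            id,
            tag: tag.map(str::to_owned),
            enqueued_at_ms: now_ms,
            expires_at_ms: ttl_ms.map(|ttl| now_ms.saturating_add(ttl)),
        });
    }

    /// Returns the first live item whose tag matches the worker's tag.
    /// Expired items are skipped and left in place for the sweep.
    fn fetch_work_item(&mut self, worker_tag: Option<&str>, now_ms: u64) -> Option<WorkItem> {
        let pos = self.items.iter().position(|item| {
            let live = item.expires_at_ms.map_or(true, |t| now_ms < t);
            live && item.tag.as_deref() == worker_tag
        })?;
        self.items.remove(pos)
    }

    /// Removes expired items and returns one failure per item, to be
    /// delivered back to the orchestration that scheduled it.
    fn sweep_expired(&mut self, now_ms: u64) -> Vec<ActivityError> {
        let mut failed = Vec::new();
        self.items.retain(|item| match item.expires_at_ms {
            Some(t) if now_ms >= t => {
                failed.push(ActivityError::ActivityExpired {
                    id: item.id,
                    waited_ms: now_ms - item.enqueued_at_ms,
                });
                false
            }
            _ => true,
        });
        failed
    }
}
```

In this sketch, skipping at fetch time and failing at sweep time are separate steps, so a fetch never has to write failure events; a provider could equally fold both into the fetch path.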

Option B: Background cleanup process

  • A runtime background task periodically scans for worker queue items older than a configurable threshold
  • Stale items are failed back to the orchestrator with a timeout error
  • Simpler to implement, but less granular: one global threshold instead of a per-activity TTL
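The scan in Option B could look roughly like the following. This is a sketch under assumptions: `QueuedActivity`, `TimedOut`, and `sweep_stale` are illustrative names, the queue is a plain `Vec`, and time is passed in as a `Duration` since an arbitrary epoch so the function stays deterministic. A real runtime would call something like this from a periodic background task.

```rust
use std::time::Duration;

/// Hypothetical view of a worker queue item: only what the sweep needs.
#[derive(Debug, Clone, PartialEq)]
struct QueuedActivity {
    id: u64,
    /// Enqueue time, measured from an arbitrary fixed epoch.
    enqueued_at: Duration,
}

/// Hypothetical timeout failure reported back to the orchestrator.
#[derive(Debug, PartialEq)]
struct TimedOut {
    id: u64,
    age: Duration,
}

/// Removes every item older than `max_age` and returns a failure for each.
/// One global threshold applies to all activities.
fn sweep_stale(
    queue: &mut Vec<QueuedActivity>,
    now: Duration,
    max_age: Duration,
) -> Vec<TimedOut> {
    let mut timed_out = Vec::new();
    queue.retain(|item| {
        let age = now.saturating_sub(item.enqueued_at);
        if age > max_age {
            timed_out.push(TimedOut { id: item.id, age });
            false // drop the stale item from the queue
        } else {
            true
        }
    });
    timed_out
}
```

Because the threshold is supplied by the sweeper rather than stored on each item, this variant needs no new field on queue items, only an enqueue timestamp.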

Option C: Hybrid

  • Default global TTL from RuntimeOptions (e.g., 1 hour)
  • Per-activity override via .with_ttl(Duration) on the activity builder
  • Background sweep handles the cleanup
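For Option C, the core rule is how the effective TTL is resolved: a per-activity override wins, otherwise the global default applies. The sketch below uses stand-in types. `RuntimeOptions` here is a local stub with an assumed `default_activity_ttl` field, and `ActivityCall` with its `with_ttl` and `effective_ttl` methods is hypothetical, not the actual activity builder.

```rust
use std::time::Duration;

/// Stand-in for the runtime options; the field name is an assumption.
struct RuntimeOptions {
    /// `None` keeps the current behavior: activities never expire.
    default_activity_ttl: Option<Duration>,
}

/// Stand-in for the activity builder.
struct ActivityCall {
    name: String,
    ttl: Option<Duration>,
}

impl ActivityCall {
    fn new(name: &str) -> Self {
        Self { name: name.to_owned(), ttl: None }
    }

    /// Per-activity override of the global default.
    fn with_ttl(mut self, ttl: Duration) -> Self {
        self.ttl = Some(ttl);
        self
    }

    /// The override wins; otherwise fall back to the global default.
    fn effective_ttl(&self, opts: &RuntimeOptions) -> Option<Duration> {
        self.ttl.or(opts.default_activity_ttl)
    }
}
```

With a one-hour default, `ActivityCall::new("train").with_ttl(Duration::from_secs(60))` resolves to 60 seconds, while an activity without an override resolves to the full hour. The resolved value would be stamped on the queue item at enqueue time, and the background sweep would act on it.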

Design Considerations

  • Provider trait changes: Worker queue items need an expires_at field or equivalent
  • Event model: Either add a new ActivityExpired event or reuse the existing error infrastructure
  • Backward compatibility: TTL should be optional, default to no expiry (current behavior) for existing users
  • CosmosDB / Postgres providers: Both need the expiry field; CosmosDB has native TTL support that could be leveraged
  • Interaction with retries: Should the TTL apply per attempt or to the total? Probably to the total time elapsed since the first enqueue
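The retry question comes down to which timestamp the expiry is computed from. A small sketch of the two choices, with a hypothetical `TtlMode` enum and `expiry_for_attempt` helper (neither exists in the codebase) and timestamps in milliseconds:

```rust
/// Hypothetical switch between the two retry semantics discussed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TtlMode {
    /// The deadline is fixed at the first enqueue and shared by all retries.
    Total,
    /// Every retry gets a fresh TTL window.
    PerAttempt,
}

/// Computes the expiry timestamp for one attempt of an activity.
fn expiry_for_attempt(
    mode: TtlMode,
    first_enqueued_ms: u64,
    attempt_enqueued_ms: u64,
    ttl_ms: u64,
) -> u64 {
    match mode {
        TtlMode::Total => first_enqueued_ms.saturating_add(ttl_ms),
        TtlMode::PerAttempt => attempt_enqueued_ms.saturating_add(ttl_ms),
    }
}
```

For an activity first enqueued at 0 ms with a 60,000 ms TTL and retried at 50,000 ms, `Total` keeps the deadline at 60,000 ms, while `PerAttempt` moves it to 110,000 ms. Under `PerAttempt`, an activity that keeps being retried never expires, which is why the total-elapsed reading fits the goal of not blocking orchestrations forever.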

Related

  • Activity tagging feature (.with_tag() / TagFilter)
  • select2(activity, timer) starvation-safe pattern (current workaround)
  • TODO.md entry added for tracking

Labels: enhancement