Open
Labels: enhancement (New feature or request)
Description
Problem
Activities scheduled with a tag that no worker is configured to handle (or activities whose target worker goes offline permanently) will sit in the worker queue indefinitely. The orchestration that scheduled them hangs forever unless the user manually implements a select2(activity, timer) starvation guard.
This is especially relevant now that activity tagging is implemented -- it is easy to schedule a .with_tag("gpu") activity in an environment where no GPU worker is running.
Desired Behavior
Undeliverable or stale activities should not block orchestrations forever. The runtime should detect activities that exceed a configurable time limit and fail them back to the orchestrator with a clear error.
Proposed Approaches
Option A: Activity TTL (per-item expiry)
- Add an optional `expires_at` timestamp to worker queue items (set at enqueue time based on a configurable TTL)
- Provider `fetch_work_item()` skips expired items
- A periodic sweep (or check at fetch time) marks expired items as failed
- The orchestration receives an `ActivityExpired` error it can match on
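A minimal in-memory sketch of Option A, to make the moving parts concrete. Only `expires_at`, `fetch_work_item`, and the `ActivityExpired` idea come from the proposal; `WorkItem`, `InMemoryQueue`, the `expired` list, and the explicit `now` parameter are hypothetical stand-ins, and a real provider would persist this state and report the failure through the runtime's event model instead.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical worker queue item; only `expires_at` is named in the proposal.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkItem {
    pub activity_id: u64,
    pub tag: String,
    /// `None` means "never expires" (the current behavior).
    pub expires_at: Option<Instant>,
}

/// Hypothetical in-memory stand-in for a provider's worker queue.
#[derive(Default)]
pub struct InMemoryQueue {
    items: VecDeque<WorkItem>,
    /// Ids of expired activities; a stand-in for raising `ActivityExpired`.
    pub expired: Vec<u64>,
}

impl InMemoryQueue {
    /// Stamps the deadline once, at enqueue time, from an optional TTL.
    pub fn enqueue(&mut self, activity_id: u64, tag: &str, ttl: Option<Duration>, now: Instant) {
        let expires_at = ttl.map(|d| now + d);
        self.items.push_back(WorkItem { activity_id, tag: tag.to_string(), expires_at });
    }

    /// Check-at-fetch variant: sweeps first, then returns the oldest live item for `tag`.
    pub fn fetch_work_item(&mut self, tag: &str, now: Instant) -> Option<WorkItem> {
        self.sweep(now);
        let pos = self.items.iter().position(|item| item.tag == tag)?;
        self.items.remove(pos)
    }

    /// Removes every item whose deadline has passed, regardless of tag.
    pub fn sweep(&mut self, now: Instant) {
        let expired = &mut self.expired;
        self.items.retain(|item| match item.expires_at {
            Some(deadline) if deadline <= now => {
                expired.push(item.activity_id);
                false
            }
            _ => true,
        });
    }
}
```

Because the sweep ignores tags, a fetch by any worker is enough to expire a `"gpu"` item that no worker will ever claim. If no worker fetches at all, a periodic sweep is still needed.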
Option B: Background cleanup process
- A runtime background task periodically scans for worker queue items older than a configurable threshold
- Stale items are failed back to the orchestrator with a timeout error
- Simpler to implement but less granular (global threshold vs per-activity)
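One pass of the Option B sweep could look roughly like the following. This is a sketch under assumptions: `QueuedActivity`, `ActivityTimedOut`, and `sweep_stale` are hypothetical names, and it assumes the provider records an enqueue timestamp for each item. The runtime would call something like this on a timer.

```rust
use std::time::{Duration, Instant};

/// Hypothetical queue row; assumes the provider records `enqueued_at`.
#[derive(Debug, Clone, PartialEq)]
pub struct QueuedActivity {
    pub activity_id: u64,
    pub enqueued_at: Instant,
}

/// Hypothetical timeout error handed back to the orchestration.
#[derive(Debug, Clone, PartialEq)]
pub struct ActivityTimedOut {
    pub activity_id: u64,
    pub waited: Duration,
}

/// One sweep pass: drops items that waited longer than the global
/// `threshold` and returns one failure per dropped item.
pub fn sweep_stale(
    queue: &mut Vec<QueuedActivity>,
    threshold: Duration,
    now: Instant,
) -> Vec<ActivityTimedOut> {
    let mut failures = Vec::new();
    queue.retain(|item| {
        // Saturates to zero if `enqueued_at` is somehow later than `now`.
        let waited = now.saturating_duration_since(item.enqueued_at);
        if waited > threshold {
            failures.push(ActivityTimedOut { activity_id: item.activity_id, waited });
            false
        } else {
            true
        }
    });
    failures
}
```

The single `threshold` argument is what makes this option less granular: every activity shares it, whatever its expected queue time.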
Option C: Hybrid
- Default global TTL from `RuntimeOptions` (e.g., 1 hour)
- Per-activity override via `.with_ttl(Duration)` on the activity builder
- Background sweep handles the cleanup
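The precedence rule in Option C (per-activity override first, then the global default, else no expiry) can be sketched as below. The types here are illustrative only: the `default_activity_ttl` field, this `ActivityBuilder` shape, and `effective_ttl` are assumptions, not the crate's actual API. Only the names `RuntimeOptions` and `.with_ttl(Duration)` come from the proposal.

```rust
use std::time::Duration;

/// Hypothetical slice of the runtime options; the field name is an assumption.
#[derive(Debug, Clone, Default)]
pub struct RuntimeOptions {
    /// `None` keeps the current behavior: activities never expire.
    pub default_activity_ttl: Option<Duration>,
}

/// Hypothetical activity builder carrying an optional per-activity TTL.
#[derive(Debug, Clone)]
pub struct ActivityBuilder {
    pub name: String,
    pub ttl: Option<Duration>,
}

impl ActivityBuilder {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), ttl: None }
    }

    /// Per-activity override, as proposed in Option C.
    pub fn with_ttl(mut self, ttl: Duration) -> Self {
        self.ttl = Some(ttl);
        self
    }

    /// The override wins; otherwise fall back to the global default.
    pub fn effective_ttl(&self, opts: &RuntimeOptions) -> Option<Duration> {
        self.ttl.or(opts.default_activity_ttl)
    }
}
```

With `RuntimeOptions::default()` and no `.with_ttl(...)` call, `effective_ttl` returns `None`, which matches the backward-compatible "no expiry" default.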
Design Considerations
- Provider trait changes: Need `expires_at` field or equivalent on worker queue items
- Event model: New `ActivityExpired` or reuse existing error infrastructure
- Backward compatibility: TTL should be optional, default to no expiry (current behavior) for existing users
- CosmosDB / Postgres providers: Both need the expiry field; CosmosDB has native TTL support that could be leveraged
- Interaction with retries: Should TTL apply per-attempt or total? Probably total elapsed since first enqueue
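If the TTL is meant to cover total elapsed time since first enqueue, one way to get that is to compute the deadline once and carry it unchanged through every retry. The sketch below illustrates that choice; `Attempt`, `first_attempt`, and `retry` are hypothetical names, and the real retry path may look quite different.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of one delivery attempt of an activity.
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub attempt: u32,
    /// Absolute deadline fixed at first enqueue; `None` means no expiry.
    pub expires_at: Option<Instant>,
}

/// First enqueue: the only place where the TTL becomes a deadline.
pub fn first_attempt(ttl: Option<Duration>, now: Instant) -> Attempt {
    Attempt { attempt: 1, expires_at: ttl.map(|d| now + d) }
}

/// Re-enqueue for a retry, keeping the original deadline.
/// Returns `None` when the total budget is already spent.
pub fn retry(prev: &Attempt, now: Instant) -> Option<Attempt> {
    match prev.expires_at {
        Some(deadline) if deadline <= now => None,
        _ => Some(Attempt { attempt: prev.attempt + 1, expires_at: prev.expires_at }),
    }
}
```

A per-attempt TTL would instead recompute `expires_at` from `now` inside `retry`, which lets a repeatedly failing activity outlive the configured limit.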
Related
- Activity tagging feature (`.with_tag()` / `TagFilter`)
- `select2(activity, timer)` starvation-safe pattern (current workaround)
- TODO.md entry added for tracking