Stale activity cleanup: TTL or background sweep for undeliverable worker queue items

## Problem

Activities scheduled with a tag that no worker is configured to handle (or activities whose target worker goes offline permanently) will sit in the worker queue **indefinitely**. The orchestration that scheduled them hangs forever unless the user manually implements a `select2(activity, timer)` starvation guard.

This is especially relevant now that activity tagging is implemented: it is easy to schedule a `.with_tag("gpu")` activity in an environment where no GPU worker is running.

## Desired Behavior

Undeliverable or stale activities should **not** block orchestrations forever. The runtime should detect activities that exceed a configurable time limit and fail them back to the orchestrator with a clear error.

## Proposed Approaches

### Option A: Activity TTL (per-item expiry)
- Add an optional `expires_at` timestamp to worker queue items (set at enqueue time based on a configurable TTL)
- Provider `fetch_work_item()` skips expired items
- A periodic sweep (or check at fetch time) marks expired items as failed
- The orchestration receives an `ActivityExpired` error it can match on
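A minimal sketch of Option A, assuming an in-memory queue. `WorkItem`, `WorkerQueue`, `enqueue` and `ActivityError` are hypothetical stand-ins, not the actual provider trait. The sketch stamps `expires_at` at enqueue time and uses the "check at fetch time" variant, where expired items met during a fetch scan are failed rather than skipped silently.

```rust
use std::time::{Duration, Instant};

/// Hypothetical worker queue item carrying the proposed optional `expires_at`.
#[derive(Debug, Clone)]
struct WorkItem {
    id: u64,
    tag: Option<String>,
    expires_at: Option<Instant>, // None = never expires (current behavior)
}

/// Hypothetical error delivered back to the orchestration.
#[derive(Debug, PartialEq)]
enum ActivityError {
    ActivityExpired { id: u64 },
}

#[derive(Default)]
struct WorkerQueue {
    items: Vec<WorkItem>,
    /// Failures produced by expiry, pending delivery to their orchestrations.
    expired: Vec<ActivityError>,
}

impl WorkerQueue {
    /// Stamps `expires_at` at enqueue time from an optional TTL.
    fn enqueue(&mut self, id: u64, tag: Option<&str>, ttl: Option<Duration>, now: Instant) {
        let expires_at = ttl.and_then(|d| now.checked_add(d));
        self.items.push(WorkItem { id, tag: tag.map(str::to_string), expires_at });
    }

    /// Returns the first live item whose tag matches `tag`. Expired items met
    /// during the scan are removed and recorded as `ActivityExpired`, whether
    /// or not their tag matches, so undeliverable items still get failed.
    fn fetch_work_item(&mut self, tag: Option<&str>, now: Instant) -> Option<WorkItem> {
        let mut i = 0;
        while i < self.items.len() {
            let is_expired = self.items[i].expires_at.map_or(false, |t| now >= t);
            if is_expired {
                let item = self.items.remove(i);
                self.expired.push(ActivityError::ActivityExpired { id: item.id });
            } else if self.items[i].tag.as_deref() == tag {
                return Some(self.items.remove(i));
            } else {
                i += 1;
            }
        }
        None
    }
}
```

Items after the first match are not inspected on that fetch, which is one reason the proposal pairs the fetch-time check with a periodic sweep. The orchestration would then match on `Err(ActivityError::ActivityExpired { .. })` instead of waiting forever.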

### Option B: Background cleanup process
- A runtime background task periodically scans for worker queue items older than a configurable threshold
- Stale items are failed back to the orchestrator with a timeout error
- Simpler to implement but less granular (global threshold vs per-activity)
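One pass of the Option B sweep could look roughly like the sketch below. `QueuedActivity`, `ActivityTimedOut` and `sweep_stale` are hypothetical names, and the queue is modeled as a plain `Vec`. A single global `threshold` is compared against each item's age, which is exactly the granularity trade-off noted above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical queue entry; only the enqueue timestamp matters for the sweep.
#[derive(Debug, Clone)]
struct QueuedActivity {
    id: u64,
    enqueued_at: Instant,
}

/// Hypothetical timeout failure to be delivered to the owning orchestration.
#[derive(Debug, PartialEq)]
struct ActivityTimedOut {
    id: u64,
    age: Duration,
}

/// One sweep pass: removes every item older than the global `threshold`
/// and returns a timeout failure for each removed item.
fn sweep_stale(
    queue: &mut Vec<QueuedActivity>,
    threshold: Duration,
    now: Instant,
) -> Vec<ActivityTimedOut> {
    let mut failures = Vec::new();
    queue.retain(|item| {
        let age = now.saturating_duration_since(item.enqueued_at);
        if age > threshold {
            failures.push(ActivityTimedOut { id: item.id, age });
            false // drop from the queue
        } else {
            true // keep
        }
    });
    failures
}
```

In the runtime this function would presumably run on a fixed interval from a background task. How that task is scheduled, and how the failures are appended to orchestration history, is left open here.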

### Option C: Hybrid
- Default global TTL from `RuntimeOptions` (e.g., 1 hour)
- Per-activity override via `.with_ttl(Duration)` on the activity builder
- Background sweep handles the cleanup
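For Option C, the core rule is how the effective TTL is resolved: a per-activity override wins, otherwise the global default applies, and `None` keeps today's never-expire behavior. The sketch below assumes a hypothetical `default_activity_ttl` field on `RuntimeOptions` and a hypothetical `ActivityBuilder`. Only the `with_tag`/`with_ttl` method names come from the proposal.

```rust
use std::time::Duration;

/// Hypothetical subset of `RuntimeOptions`; `None` means no expiry.
#[derive(Debug, Default)]
struct RuntimeOptions {
    default_activity_ttl: Option<Duration>,
}

/// Hypothetical activity builder with the existing tag and the proposed TTL.
#[derive(Debug, Default)]
struct ActivityBuilder {
    name: String,
    tag: Option<String>,
    ttl: Option<Duration>,
}

impl ActivityBuilder {
    fn new(name: &str) -> Self {
        Self { name: name.to_string(), ..Default::default() }
    }

    fn with_tag(mut self, tag: &str) -> Self {
        self.tag = Some(tag.to_string());
        self
    }

    /// Proposed per-activity override of the global default TTL.
    fn with_ttl(mut self, ttl: Duration) -> Self {
        self.ttl = Some(ttl);
        self
    }

    /// Per-activity TTL if set, else the runtime default, else no expiry.
    fn effective_ttl(&self, opts: &RuntimeOptions) -> Option<Duration> {
        self.ttl.or(opts.default_activity_ttl)
    }
}
```

The resolved value would be stamped onto the queue item at enqueue time, so the background sweep only ever compares against a concrete per-item deadline.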

## Design Considerations

- **Provider trait changes**: Need `expires_at` field or equivalent on worker queue items
- **Event model**: New `ActivityExpired` or reuse existing error infrastructure
- **Backward compatibility**: TTL should be optional, default to no expiry (current behavior) for existing users
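- **CosmosDB / Postgres providers**: Both need the expiry field; CosmosDB has native TTL support that could be leveraged, though native TTL deletes expired items outright, so the runtime would still need to surface the failure to the orchestration separately
- **Interaction with retries**: Should TTL apply per-attempt or total? Probably total elapsed time since first enqueue, so that retries cannot extend an activity's lifetime indefinitely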
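If TTL is measured as total elapsed time since first enqueue, every retry attempt shares one deadline fixed when the activity was first enqueued. The sketch below shows that rule in isolation. `expires_at_for_attempt` and `BudgetExhausted` are hypothetical names, and treating an overflowing deadline as "no expiry" is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical signal that the total TTL budget is spent, so the retry
/// should be failed back to the orchestration instead of re-enqueued.
#[derive(Debug, PartialEq)]
struct BudgetExhausted;

/// Computes `expires_at` for a retry attempt under "total elapsed" semantics.
fn expires_at_for_attempt(
    first_enqueued_at: Instant,
    ttl: Option<Duration>,
    now: Instant,
) -> Result<Option<Instant>, BudgetExhausted> {
    let ttl = match ttl {
        Some(t) => t,
        None => return Ok(None), // no TTL configured: never expires (current behavior)
    };
    match first_enqueued_at.checked_add(ttl) {
        Some(deadline) if now >= deadline => Err(BudgetExhausted),
        Some(deadline) => Ok(Some(deadline)), // same deadline for every attempt
        None => Ok(None), // overflow: treat as effectively unbounded
    }
}
```

A per-attempt TTL would instead compute the deadline from `now` on each attempt, which bounds each wait but lets a retrying activity live arbitrarily long overall.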

## Related

- Activity tagging feature (`.with_tag()` / `TagFilter`)
- `select2(activity, timer)` starvation-safe pattern (current workaround)
- TODO.md entry added for tracking
