Runtime should explicitly handle orphan orchestrator queue messages #4

@affandar

Problem

Queue messages (e.g., a QueueMessage from enqueue_event) can arrive in the orchestrator queue before the StartOrchestration for a new instance. When that happens, the runtime's orchestration dispatcher receives a batch with no instance, no history, and no StartOrchestration/ContinueAsNew message. The current behavior is:

  1. fetch_orchestration_item returns the batch with orchestration_name="Unknown"
  2. The runtime logs "completion messages for unstarted instance" and "empty effective batch"
  3. The runtime acks the batch, which permanently deletes the queue rows
  4. The events are lost forever

This was discovered via the sample_config_hot_reload_persistent_events_fs e2e test, which enqueues events before starting an orchestration.

Current Provider-Side Workaround

Both duroxide-pg and duroxide-pg-opt have implemented a provider-side fix in their fetch_orchestration_item stored procedure:

  1. Scan ALL messages for StartOrchestration/ContinueAsNew (not just messages[0]), matching the SQLite provider's work_items.iter().find() behavior
  2. If no start item found: release locks and return nothing, leaving messages in the queue until StartOrchestration arrives
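
The scan in step 1 can be sketched in Rust as follows. This is a minimal illustration, not the providers' actual code: the `WorkItem` enum and its fields are hypothetical stand-ins for the real queue item type, and only the `iter().find()` shape is taken from the description above.

```rust
/// Hypothetical stand-in for items in the orchestrator queue.
/// The real type has more variants and fields.
#[derive(Debug, Clone, PartialEq)]
pub enum WorkItem {
    StartOrchestration { instance: String, orchestration: String },
    ContinueAsNew { instance: String, orchestration: String },
    QueueMessage { instance: String, name: String },
}

/// Scan the whole batch for a start item instead of inspecting only the
/// first message, so a StartOrchestration that was enqueued after an
/// event is still found.
pub fn find_start_item(work_items: &[WorkItem]) -> Option<&WorkItem> {
    work_items.iter().find(|item| {
        matches!(
            item,
            WorkItem::StartOrchestration { .. } | WorkItem::ContinueAsNew { .. }
        )
    })
}
```

A `None` result corresponds to step 2: the batch holds only orphan messages, so the provider releases its locks and returns nothing.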

This works but pushes responsibility to the provider, which:

  • Is fragile (providers must each implement this correctly)
  • Cannot add a visible_at delay to prevent tight re-fetching (any delay risks events being lost if the orchestration completes before the delay expires)
  • Relies on LISTEN/NOTIFY for backpressure to prevent tight-looping

Proposed Runtime-Level Fix

The runtime's orchestration dispatcher should handle this case explicitly:

  1. When fetch_orchestration_item returns a batch with no instance and no StartOrchestration/ContinueAsNew in the messages, the runtime should abandon the batch (not ack it)
  2. The abandon should use a reasonable delay (e.g., 500ms) so items become available again later
  3. This keeps the contract simple: providers return whatever is in the queue, and the runtime decides what to do

This would also allow removing the provider-side workarounds.
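
A rough sketch of the dispatcher-side decision, assuming the runtime can tell whether the instance already exists and can inspect the kinds of the fetched messages. `MessageKind`, `BatchDecision`, `decide`, and `ORPHAN_ABANDON_DELAY` are hypothetical names introduced for illustration; the 500ms value is the example delay from step 2, and the real call to abandon_orchestration_item is not shown because its exact signature is not given here.

```rust
use std::time::Duration;

/// Hypothetical, simplified view of the messages in a fetched batch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MessageKind {
    StartOrchestration,
    ContinueAsNew,
    QueueMessage,
}

/// What the dispatcher should do with a fetched batch.
#[derive(Debug, PartialEq)]
pub enum BatchDecision {
    /// Run the normal processing path and ack afterwards.
    Process,
    /// Do not ack; hand the batch back so it becomes visible again later.
    AbandonWithDelay(Duration),
}

/// Example delay from the proposal; an assumed value, not a tuned one.
pub const ORPHAN_ABANDON_DELAY: Duration = Duration::from_millis(500);

/// A batch is an orphan when the instance does not exist yet and none of
/// its messages can create it.
pub fn decide(instance_exists: bool, messages: &[MessageKind]) -> BatchDecision {
    let has_start = messages.iter().any(|m| {
        matches!(m, MessageKind::StartOrchestration | MessageKind::ContinueAsNew)
    });
    if instance_exists || has_start {
        BatchDecision::Process
    } else {
        BatchDecision::AbandonWithDelay(ORPHAN_ABANDON_DELAY)
    }
}
```

In the dispatcher, an `AbandonWithDelay` result would replace the current ack on the "completion messages for unstarted instance" path, so the queue rows survive until StartOrchestration arrives.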

Affected Code

  • Runtime: dispatchers/orchestration.rs - the "completion messages for unstarted instance" code path
  • Provider trait: abandon_orchestration_item is already available for this purpose

References

  • duroxide-pg-opt migration 0006_fix_orphan_queue_messages.sql
  • duroxide-pg migration 0016_fix_orphan_queue_messages.sql
  • Test: sample_config_hot_reload_persistent_events_fs
