Problem
When queue messages (e.g., QueueMessage from enqueue_event) arrive in the orchestrator queue before StartOrchestration for a new instance, the runtime's orchestration dispatcher encounters a batch with no instance, no history, and no StartOrchestration/ContinueAsNew message. The current behavior:
- fetch_orchestration_item returns the batch with orchestration_name="Unknown"
- The runtime logs "completion messages for unstarted instance" and "empty effective batch"
- The runtime acks the batch, which permanently deletes the queue rows
- The events are lost forever
This was discovered via the sample_config_hot_reload_persistent_events_fs e2e test, which enqueues events before starting an orchestration.
Current Provider-Side Workaround
Both duroxide-pg and duroxide-pg-opt have implemented a provider-side fix in their fetch_orchestration_item stored procedure:
- Scan ALL messages for StartOrchestration/ContinueAsNew (not just messages[0]), matching the SQLite provider's work_items.iter().find() behavior
- If no start item is found: release locks and return nothing, leaving messages in the queue until StartOrchestration arrives
This works but pushes responsibility to the provider, which:
- Is fragile (providers must each implement this correctly)
- Cannot add a visible_at delay to prevent tight re-fetching (any delay risks events being lost if the orchestration completes before the delay expires)
- Relies on LISTEN/NOTIFY for backpressure to prevent tight-looping
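The scan described in the workaround can be sketched in Rust as follows. This is a minimal illustration, not the actual provider code: the `WorkItem` enum and `find_start_item` function are hypothetical stand-ins, and only the `iter().find()` pattern is taken from the description of the SQLite provider above.

```rust
/// Hypothetical, simplified work item. The real duroxide types carry more data.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum WorkItem {
    StartOrchestration { instance: String },
    ContinueAsNew { instance: String },
    QueueMessage { instance: String, payload: String },
}

/// Return the first start-like item anywhere in the batch, not just at index 0.
/// `None` means the batch holds only messages for an instance that has not started.
fn find_start_item(batch: &[WorkItem]) -> Option<&WorkItem> {
    batch.iter().find(|item| {
        matches!(
            item,
            WorkItem::StartOrchestration { .. } | WorkItem::ContinueAsNew { .. }
        )
    })
}
```

With this shape, a batch where a QueueMessage precedes StartOrchestration still yields the start item, while a batch containing only QueueMessage entries yields `None`, which is the case where the provider releases its locks and returns nothing.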
Proposed Runtime-Level Fix
The runtime's orchestration dispatcher should handle this case explicitly:
- When fetch_orchestration_item returns a batch with no instance and no StartOrchestration/ContinueAsNew in the messages, the runtime should abandon the batch (not ack it)
- The abandon should use a reasonable delay (e.g., 500ms) so items become available again later
- This keeps the contract simple: providers return whatever is in the queue, and the runtime decides what to do
This would also allow removing the provider-side workarounds.
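A rough sketch of the proposed dispatcher decision, under stated assumptions: `FetchedBatch`, `BatchAction`, `decide`, and `ORPHAN_RETRY_DELAY` are hypothetical names invented for illustration and do not reflect the real runtime types or the actual signature of abandon_orchestration_item. The 500ms value is the example delay suggested above.

```rust
use std::time::Duration;

/// What the dispatcher should do with a fetched batch (hypothetical type).
#[derive(Debug, PartialEq)]
enum BatchAction {
    /// Run the normal processing path; the batch is acked when it completes.
    Process,
    /// Release the batch without acking so its messages reappear after `delay`.
    Abandon { delay: Duration },
}

/// Hypothetical summary of what fetch_orchestration_item returned.
struct FetchedBatch {
    /// True if the instance already exists (it has history or metadata).
    instance_exists: bool,
    /// True if the batch contains StartOrchestration or ContinueAsNew.
    has_start_item: bool,
}

/// Example delay from the proposal; the right value is an open question.
const ORPHAN_RETRY_DELAY: Duration = Duration::from_millis(500);

/// Abandon only when the messages target an instance that has not started
/// and nothing in the batch would start it.
fn decide(batch: &FetchedBatch) -> BatchAction {
    if !batch.instance_exists && !batch.has_start_item {
        BatchAction::Abandon { delay: ORPHAN_RETRY_DELAY }
    } else {
        BatchAction::Process
    }
}
```

In the real dispatcher, the `Abandon` branch would presumably call the provider's abandon_orchestration_item with the delay instead of taking the "empty effective batch" ack path.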
Affected Code
- Runtime: dispatchers/orchestration.rs - the "completion messages for unstarted instance" code path
- Provider trait: abandon_orchestration_item is already available for this purpose
References
- duroxide-pg-opt migration 0006_fix_orphan_queue_messages.sql
- duroxide-pg migration 0016_fix_orphan_queue_messages.sql
- Test: sample_config_hot_reload_persistent_events_fs