Turn recovery can't resume an interruption that lands mid-tool-input (input-streaming) — exhausts to a stable-timeout instead of regenerating the step #1781

@rwdaigle

Summary

When an in-flight chat turn is interrupted (the client WebSocket drops mid-stream) and the last streamed assistant part is a tool call still in the input-streaming state, the turn-recovery path cannot make progress. In that state the model had begun emitting a tool-use block but never finished streaming its input, so the call was never finalized or dispatched. Recovery uses a "continue"/resume strategy that tries to resume the in-flight generation, but a non-finalized tool call has no resumption point. Every attempt therefore yields zero new tokens and is judged "no progress" (stable). After a fixed number of attempts the turn terminates with a stable-timeout and surfaces a "Session interrupted — send a new message to continue" error to the user.

This state is logically recoverable — the transcript before the partial tool call is fully valid — so recovery should succeed automatically rather than dead-ending.

Production evidence

Observed once in production (a long, multi-step app-build turn):

  • Interruption was a client WebSocket close mid-stream; the persisted partial ended on a tool call in input-streaming (input never completed; tool never executed).
  • Recovery made 7 resume attempts over ~88s, each producing no new tokens; the persisted stream status stayed partial throughout, and the turn ended in a stable-timeout.
  • The agent/Durable Object itself was healthy — sibling turns in the same session completed successfully both shortly before and ~8s after the exhausted turn. The failure was isolated to recovering this one mid-tool-input interruption, not a crashed or evicted agent.
  • A platform deploy was rolling out concurrently, but this agent instance was never reset by it (it kept running and completed other turns) — the deploy only contributed by nudging the client to reconnect. The root cause is the recovery strategy, not the interruption source.

Net user impact: one turn's output is permanently lost and the user must manually re-send, despite the state being recoverable.

Why "continue" can't work here

A tool call that is still input-streaming has no continuation token — there is nothing to resume. The resume strategy implicitly assumes the persisted stream ends at a resumable boundary (completed text, completed tool call, or a tool result). When it ends mid-tool-input, "continue" is a no-op that loops until the attempt budget is exhausted.
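
The boundary assumption can be made explicit with a small check. This is a minimal sketch, not the project's actual code: the `Part` shape, the `tool-call` type name, and the `endsAtResumableBoundary` helper are hypothetical simplifications of whatever the persisted stream really stores. Only the `input-streaming` state and the `step-start` part are taken from this issue.

```typescript
// Hypothetical, simplified part shapes; the real persisted schema may differ.
type ToolState = "input-streaming" | "input-available" | "output-available";

type Part =
  | { type: "step-start" }
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; state: ToolState };

// Returns false only for the case described in this issue: the persisted
// stream ends on a tool call whose input never finished streaming.
// Every other ending is treated as resumable in this sketch.
function endsAtResumableBoundary(parts: Part[]): boolean {
  const last = parts[parts.length - 1];
  const unfinishedToolCall =
    last !== undefined &&
    last.type === "tool-call" &&
    last.state === "input-streaming";
  return !unfinishedToolCall;
}
```

If a check like this ran before the first resume attempt, recovery could skip the attempt budget entirely for this case rather than discovering "no progress" one attempt at a time.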

Proposed resolution

Add a fallback so recovery routes a mid-tool-input interruption to regenerate-from-last-valid-step instead of resume:

  1. Detect the non-continuable boundary. On recovery, if the trailing persisted assistant part is an unfinished tool call (input-streaming, no finalized input, no result), classify it as not resumable rather than attempting to continue it.
  2. Truncate to the last clean boundary. Drop that partial tool part (and any orphaned step-start) back to the last complete part, yielding a valid message history.
  3. Re-run inference (regenerate the step) from the truncated history, rather than resuming a stream that has no continuation point.
  4. No reconciliation needed in this case. An input-streaming tool call never executed, so there are no side effects to undo — this is a pure regenerate. (The harder case — recovery landing after a tool has already executed — is out of scope here and would need idempotency handling.)

In short: when the interruption lands mid-tool-input, fall back from "continue" to "regenerate from the last valid step" instead of spinning to a stable-timeout.
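
The proposed steps could look roughly like the following. It is a sketch under stated assumptions: the `Part` and `RecoveryPlan` types and the `planRecovery` function are hypothetical names introduced for illustration, and the real recovery code may organize this differently. It inspects only the single trailing part, so several unfinished tool calls in a row are not handled.

```typescript
// Hypothetical, simplified part shapes; the real persisted schema may differ.
type ToolState = "input-streaming" | "input-available" | "output-available";

type Part =
  | { type: "step-start" }
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; state: ToolState };

// Hypothetical result type: which strategy to use and which history to use it on.
type RecoveryPlan = { strategy: "continue" | "regenerate"; parts: Part[] };

// Step 1: detect the non-continuable boundary.
function isUnfinishedToolCall(part: Part | undefined): boolean {
  return (
    part !== undefined &&
    part.type === "tool-call" &&
    part.state === "input-streaming"
  );
}

function planRecovery(parts: Part[]): RecoveryPlan {
  if (!isUnfinishedToolCall(parts[parts.length - 1])) {
    // Existing behavior: the stream ends at a resumable boundary.
    return { strategy: "continue", parts };
  }

  // Step 2: drop the partial tool part...
  let end = parts.length - 1;
  // ...and any step-start left orphaned directly before it.
  while (end > 0 && parts[end - 1]?.type === "step-start") {
    end -= 1;
  }

  // Step 3: the caller re-runs inference from the truncated history.
  // Step 4: no reconciliation, because the tool never executed.
  return { strategy: "regenerate", parts: parts.slice(0, end) };
}
```

For example, a history of `[text, step-start, tool-call (input-streaming)]` yields a `regenerate` plan containing only the text part, while a history ending in a completed tool call is passed through unchanged with the `continue` strategy.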
