Turn recovery can't resume an interruption that lands mid-tool-input (`input-streaming`): it exhausts to a stable-timeout instead of regenerating the step #1781

### Summary

When an in-flight chat turn is interrupted (the client WebSocket drops mid-stream) while the **last streamed assistant part is a tool call still in the `input-streaming` state**, the turn-recovery path cannot make progress. In that state the model had begun emitting a tool-use block but never finished streaming its input, so the call was never finalized or dispatched. Recovery uses a "continue"/resume strategy that tries to resume the in-flight generation, but a non-finalized tool call has no resumption point. Every attempt therefore yields zero new tokens and is judged "no progress" (stable). After a fixed number of attempts the turn terminates with a stable-timeout and surfaces a *"Session interrupted — send a new message to continue"* error to the user.

This state is **logically recoverable** — the transcript *before* the partial tool call is fully valid — so recovery should succeed automatically rather than dead-ending.

### Production evidence

Observed once in production (a long, multi-step app-build turn):

- Interruption was a client WebSocket close mid-stream; the persisted partial ended on a tool call in `input-streaming` (input never completed; tool never executed).
- Recovery attempted **7 resume attempts over ~88s**, every one producing no new tokens; the persisted stream status stayed `partial` the entire time, ending in a stable-timeout.
- **The agent/Durable Object itself was healthy** — sibling turns in the *same* session completed successfully both shortly before and ~8s *after* the exhausted turn. The failure was isolated to recovering this one mid-tool-input interruption, not a crashed or evicted agent.
- A platform deploy was rolling out concurrently, but this agent instance was never reset by it (it kept running and completed other turns) — the deploy only contributed by nudging the client to reconnect. The root cause is the recovery strategy, not the interruption source.

Net user impact: one turn's output is permanently lost and the user must manually re-send, despite the state being recoverable.

### Why "continue" can't work here

A tool call that is still `input-streaming` has no continuation token — there is nothing to resume. The resume strategy implicitly assumes the persisted stream ends at a resumable boundary (completed text, completed tool call, or a tool result). When it ends mid-tool-input, "continue" is a no-op that loops until the attempt budget is exhausted.
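To make the failure mode concrete, here is a minimal sketch of a resume loop with a fixed attempt budget. All names (`recoverByResume`, `ResumeOutcome`, `RecoveryResult`) are hypothetical and are not taken from the actual recovery code; the budget of 7 mirrors the attempt count observed in production. It only illustrates why a resume that never yields tokens must end in a stable-timeout.

```typescript
// Hypothetical types; the real recovery path may be shaped differently.
interface ResumeOutcome {
  newTokens: number;
}

type RecoveryResult =
  | { kind: "recovered"; attempts: number }
  | { kind: "stable-timeout"; attempts: number };

// Retry the resume strategy until it makes progress or the budget is spent.
function recoverByResume(
  resumeOnce: () => ResumeOutcome,
  maxAttempts: number,
): RecoveryResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { newTokens } = resumeOnce();
    if (newTokens > 0) {
      return { kind: "recovered", attempts: attempt };
    }
    // No new tokens: the attempt is judged "no progress" (stable), so retry.
  }
  return { kind: "stable-timeout", attempts: maxAttempts };
}

// A stream that ends mid-tool-input has nothing to resume,
// so every attempt is a no-op that produces zero tokens.
const midToolInputResume = (): ResumeOutcome => ({ newTokens: 0 });

const result = recoverByResume(midToolInputResume, 7);
// result is { kind: "stable-timeout", attempts: 7 }
console.log(result);
```

Because the loop's only exit besides exhaustion is "new tokens arrived", no increase in the attempt budget helps here; the boundary itself has to be classified differently.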

### Proposed resolution

Add a fallback so recovery routes a mid-tool-input interruption to **regenerate-from-last-valid-step** instead of resume:

1. **Detect the non-continuable boundary.** On recovery, if the trailing persisted assistant part is an unfinished tool call (`input-streaming`, no finalized input, no result), classify it as *not resumable* rather than attempting to continue it.
2. **Truncate to the last clean boundary.** Drop that partial tool part (and any orphaned `step-start`) back to the last complete part, yielding a valid message history.
3. **Re-run inference (regenerate the step)** from the truncated history, rather than resuming a stream that has no continuation point.
4. **No reconciliation needed in this case.** An `input-streaming` tool call never executed, so there are no side effects to undo — this is a pure regenerate. (The harder case — recovery landing *after* a tool has already executed — is out of scope here and would need idempotency handling.)
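The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the intended implementation: the `Part` shape, the `tool-call` type tag, the state names other than `input-streaming`, the `writeFile` tool name, and the functions `endsMidToolInput`, `truncateToCleanBoundary`, and `planRecovery` are all hypothetical. The sketch stops at choosing a strategy and a truncated history; actually re-running inference is left to the caller.

```typescript
// Hypothetical part shapes, loosely modeled on the states named in this issue.
type Part =
  | { type: "step-start" }
  | { type: "text"; text: string }
  | {
      type: "tool-call";
      state: "input-streaming" | "input-available" | "output-available";
      toolName: string;
    };

interface RecoveryPlan {
  strategy: "resume" | "regenerate";
  parts: Part[];
}

// Step 1: an unfinished tool call at the tail is not a resumable boundary.
function endsMidToolInput(parts: Part[]): boolean {
  const last = parts[parts.length - 1];
  return (
    last !== undefined &&
    last.type === "tool-call" &&
    last.state === "input-streaming"
  );
}

// Step 2: drop the partial tool part and any step-start it leaves orphaned.
function truncateToCleanBoundary(parts: Part[]): Part[] {
  const kept = parts.slice(0, -1);
  for (;;) {
    const tail = kept[kept.length - 1];
    if (tail === undefined || tail.type !== "step-start") break;
    kept.pop();
  }
  return kept;
}

// Steps 3 and 4: route to regenerate. The tool never ran, so there is
// nothing to reconcile before re-running inference on the truncated history.
function planRecovery(parts: Part[]): RecoveryPlan {
  if (endsMidToolInput(parts)) {
    return { strategy: "regenerate", parts: truncateToCleanBoundary(parts) };
  }
  return { strategy: "resume", parts };
}

const persisted: Part[] = [
  { type: "step-start" },
  { type: "text", text: "Scaffolding the app." },
  { type: "step-start" },
  { type: "tool-call", state: "input-streaming", toolName: "writeFile" },
];

const plan = planRecovery(persisted);
// Logs "regenerate 2": the partial tool call and its orphaned step-start are gone.
console.log(plan.strategy, plan.parts.length);
```

Keeping detection and truncation as pure functions over the persisted parts makes the fallback easy to unit-test without a live stream.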

In short: when the interruption lands mid-tool-input, fall back from "continue" to "regenerate from the last valid step" instead of spinning to a stable-timeout.

