[Bugfix] Enforce strict timeout for Posix AIO #1022

Merged
ygwpz merged 18 commits into ModelEngine-Group:develop from dante159753:aio-strict-timeout on Jun 17, 2026

Conversation

@dante159753 dante159753 commented Jun 12, 2026

Contributor

Summary

Enforce hard timeouts for Posix AIO cache operations so a stuck mount no longer leaves vLLM requests waiting forever on external KV cache loads.

Changes

  • Add task-level terminal timeout handling for Posix AIO tasks.
  • Make AIO submit/completion, task wait/check, and timeout cleanup tolerate stuck kernel IO.
  • Add timeout metrics and vLLM integration handling so load failures can be rescheduled through kv_load_failure_policy=recompute.
  • Keep AIO timeout configuration simple by using the existing timeout_ms default instead of separate AIO timeout knobs.
  • Add Posix store tests for open timeout, IO timeout, repeated timeout, and saturated stuck-task behavior.
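The task-level terminal timeout in the first two bullets can be illustrated with a minimal Python sketch. This is not the PR's C++ implementation: the `Task` and `Status` names are hypothetical, and the sketch only shows the general idea that a task reaches exactly one terminal state, so a bounded wait can force a timeout and a late IO completion cannot overwrite it.

```python
import threading
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    OK = "ok"
    TIMEOUT = "timeout"


class Task:
    """Hypothetical task whose terminal state is set exactly once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = threading.Event()
        self.status = Status.PENDING

    def complete(self, status):
        # First caller wins: a late IO completion cannot overwrite a timeout.
        with self._lock:
            if self.status is not Status.PENDING:
                return False
            self.status = status
        self._done.set()
        return True

    def wait(self, timeout_ms):
        # Bounded wait: force a terminal TIMEOUT if nothing completed in time.
        if not self._done.wait(timeout_ms / 1000.0):
            self.complete(Status.TIMEOUT)
        return self.status


def demo():
    stuck = Task()                       # simulated IO that never completes
    first = stuck.wait(timeout_ms=50)    # gives up after about 50 ms
    late = stuck.complete(Status.OK)     # late completion is rejected
    fast = Task()
    threading.Timer(0.01, fast.complete, args=(Status.OK,)).start()
    second = fast.wait(timeout_ms=1000)  # completes well before the deadline
    return first, late, stuck.status, second
```

In this sketch the waiter, not the IO path, decides when the task is over, which is what lets a caller make progress even if the underlying request never returns.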

Why

When a mounted cache backend stops responding, existing AIO requests can be submitted successfully but never complete. Before this change, requests that hit external KV cache could remain blocked until the mount recovered, making inference requests hang under storage incidents.

Impact

With the default timeout_ms, stuck AIO load tasks are force-completed as timeout failures. In vLLM, deployments using kv_load_failure_policy=recompute can continue serving by recomputing the affected KV blocks instead of waiting indefinitely. If the storage backend remains stuck, cache loads may be slower because they time out and fall back to recompute, but the service remains available.
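The fallback described above can be sketched roughly as follows. This does not reproduce vLLM's or UCM's actual code path: `KVLoadTimeout`, `fetch_blocks`, `load_block`, and `recompute_block` are hypothetical names, and re-raising for any policy other than `recompute` is an assumption made only for illustration.

```python
class KVLoadTimeout(Exception):
    """Hypothetical error raised when an external cache load times out."""


def fetch_blocks(block_ids, load_block, recompute_block, policy="recompute"):
    """Return {block_id: (source, value)}, falling back per block on timeout."""
    result = {}
    for block_id in block_ids:
        try:
            result[block_id] = ("cache", load_block(block_id))
        except KVLoadTimeout:
            if policy != "recompute":
                # Assumption: any other policy surfaces the failure to the caller.
                raise
            # Slower than a cache hit, but the request still completes.
            result[block_id] = ("recompute", recompute_block(block_id))
    return result
```

The point of the sketch is that a timeout becomes an ordinary, recoverable failure: the caller pays the recompute cost for the affected blocks instead of blocking on the stuck mount.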

Verification

  • A2 online vLLM+UCM hang simulation:
    • Setup: loop-backed ext4 mount through device mapper, UCM storage_backends on that mount, posix_io_engine: aio, io_direct: true, no explicit timeout_ms.
    • Current branch: dmsetup suspend for ~75s during external cache load; /v1/models stayed responsive; 8 concurrent completion requests returned 200 OK in ~60.8s; post-resume requests completed in ~0.64s.
    • Baseline origin/develop: same setup; 8 concurrent completion requests all hit client timeout at ~55s while UCM repeatedly logged "Task(1) has not finished after (2000) ms"; post-resume requests completed after the mount recovered.
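A measurement like the 8 concurrent requests above could be taken with a small harness along these lines. This is a sketch, not the script used for the verification: `probe` and `send_request` are hypothetical names, and `send_request` stands in for whatever client call issues one completion request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def probe(send_request, concurrency=8):
    """Run send_request(i) concurrently; return [(outcome, elapsed_seconds)]."""

    def timed(i):
        start = time.monotonic()
        try:
            outcome = send_request(i)
        except Exception as exc:  # record failures such as client timeouts
            outcome = exc
        return outcome, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(concurrency)))
```

Each entry pairs the outcome (a response or the raised exception) with that request's latency, so the branch and baseline runs can be compared on the same terms.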

dante159753 marked this pull request as ready for review June 12, 2026 03:09
Comment thread ucm/store/posix/cc/aio_impl.cc
Comment thread ucm/store/posix/cc/aio_impl.cc
Comment thread ucm/store/posix/cc/io_engine_aio.h
Comment thread ucm/store/posix/cc/io_engine_aio.h
Comment thread ucm/shared/infra/thread/latch.h
Comment thread ucm/store/posix/cc/backend_health.h Outdated
Comment thread ucm/store/detail/template/task_wrapper.h
Comment thread ucm/integration/vllm/ucm_connector.py
qyh111 previously approved these changes Jun 16, 2026
Comment thread ucm/store/detail/template/task_wrapper.h Outdated
Comment thread ucm/store/posix/cc/aio_impl.cc
Comment thread ucm/store/detail/template/task_wrapper.h
qyh111 self-requested a review June 17, 2026 02:53
ygwpz merged commit ea4c9f0 into ModelEngine-Group:develop Jun 17, 2026
10 checks passed
3 participants