[Bugfix] Enforce strict timeout for Posix AIO by dante159753 · Pull Request #1022 · ModelEngine-Group/unified-cache-management

dante159753 · 2026-06-12T02:00:57Z

Summary

Enforce hard timeouts for Posix AIO cache operations so a stuck mount no longer leaves vLLM requests waiting forever on external KV cache loads.

Changes

- Add task-level terminal timeout handling for Posix AIO tasks.
- Make AIO submit/completion, task wait/check, and timeout cleanup tolerate stuck kernel IO.
- Add timeout metrics and vLLM integration handling so load failures can be rescheduled through kv_load_failure_policy=recompute.
- Keep AIO timeout configuration simple by using the existing timeout_ms default instead of separate AIO timeout knobs.
- Add Posix store tests for open timeout, IO timeout, repeated timeout, and saturated stuck-task behavior.
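
To illustrate what "task-level terminal timeout" means, here is a minimal Python sketch of the idea, not the C++ code in this PR. The names `TimedTask` and `TaskStatus` are hypothetical; only the `timeout_ms` parameter name comes from the description above. The point it shows is that the first terminal transition wins, so a completion that arrives after the deadline cannot flip a timed-out task back to success.

```python
import threading
import time
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    DONE = "done"
    TIMEOUT = "timeout"


class TimedTask:
    """Hypothetical task handle whose timeout is a terminal state."""

    def __init__(self, timeout_ms: int) -> None:
        # Use a monotonic clock so wall-clock adjustments cannot move the deadline.
        self._deadline = time.monotonic() + timeout_ms / 1000.0
        self._lock = threading.Lock()
        self._finished = threading.Event()
        self.status = TaskStatus.PENDING

    def _finish(self, status: TaskStatus) -> bool:
        # The first terminal transition wins; later ones are ignored.
        with self._lock:
            if self.status is not TaskStatus.PENDING:
                return False
            self.status = status
        self._finished.set()
        return True

    def complete(self) -> bool:
        """Called by the IO completion path; returns False after a timeout."""
        return self._finish(TaskStatus.DONE)

    def wait(self) -> TaskStatus:
        """Block until completion or the deadline, whichever comes first."""
        remaining = max(0.0, self._deadline - time.monotonic())
        if not self._finished.wait(remaining):
            # Deadline passed with no completion: force a terminal timeout.
            self._finish(TaskStatus.TIMEOUT)
        return self.status


if __name__ == "__main__":
    stuck = TimedTask(timeout_ms=50)  # nothing ever completes this task
    print(stuck.wait())               # the waiter is released at the deadline
    print(stuck.complete())           # a late completion is rejected
```

In this sketch the waiter is always released by the deadline, which is the property the PR describes for stuck AIO load tasks. How the real implementation reclaims buffers still owned by in-flight kernel IO is not covered here.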

Why

When a mounted cache backend stops responding, existing AIO requests can be submitted successfully but never complete. Before this change, requests that hit external KV cache could remain blocked until the mount recovered, making inference requests hang under storage incidents.

Impact

With the default timeout_ms, stuck AIO load tasks are force-completed as timeout failures. In vLLM, deployments using kv_load_failure_policy=recompute can continue serving by recomputing the affected KV blocks instead of waiting indefinitely. If the storage backend remains stuck, cache loads may be slower because they time out and fall back to recompute, but the service remains available.
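
The fallback described above can be sketched roughly as follows. This is an illustrative Python model only: `KVLoadTimeout`, `plan_blocks`, and the `load_block` callback are hypothetical names, not the UCM or vLLM API. The only identifier taken from this PR is the `recompute` policy value; any other policy value is assumed here to simply propagate the failure.

```python
from typing import Callable, Iterable, List, Tuple


class KVLoadTimeout(Exception):
    """Hypothetical error raised when an external KV cache load times out."""


def plan_blocks(
    block_ids: Iterable[int],
    load_block: Callable[[int], None],
    policy: str = "recompute",
) -> Tuple[List[int], List[int]]:
    """Split blocks into those loaded from cache and those to recompute."""
    loaded: List[int] = []
    recompute: List[int] = []
    for block_id in block_ids:
        try:
            load_block(block_id)
            loaded.append(block_id)
        except KVLoadTimeout:
            if policy != "recompute":
                # Assumption: any non-recompute policy surfaces the failure.
                raise
            # Timed-out blocks are rescheduled instead of blocking the request.
            recompute.append(block_id)
    return loaded, recompute


if __name__ == "__main__":
    def fake_load(block_id: int) -> None:
        # Pretend blocks 2 and 3 live on a stuck mount.
        if block_id in {2, 3}:
            raise KVLoadTimeout(block_id)

    print(plan_blocks([1, 2, 3, 4], fake_load))
```

The trade-off matches the Impact note: a request touching stuck blocks pays the timeout plus the recompute cost, but it finishes instead of waiting for the mount to recover.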

Verification

A2 online vLLM+UCM hang simulation:
- Setup: loop-backed ext4 mount through device mapper, UCM storage_backends on that mount, posix_io_engine: aio, io_direct: true, no explicit timeout_ms.
- Current branch: dmsetup suspend for ~75s during external cache load; /v1/models stayed responsive; 8 concurrent completion requests returned 200 OK in ~60.8s; post-resume requests completed in ~0.64s.
- Baseline origin/develop: same setup; all 8 concurrent completion requests hit the client timeout at ~55s while UCM repeatedly logged "Task(1) has not finished after (2000) ms"; post-resume requests completed only after the mount recovered.

…d-cache-management into aio-strict-timeout

This reverts commit 3711bac.

This reverts commit a0c53cd.

This reverts commit 5f386ef.

# Conflicts: # ucm/integration/vllm/hma_connector.py

dante159753 added 5 commits June 10, 2026 11:41

Enforce strict timeout for posix AIO

067ec8b

Add strict AIO timeout metrics

93a2cb3

Merge branch 'develop' of https://github.com/ModelEngine-Group/unifie…

b961dab

…d-cache-management into aio-strict-timeout

Simplify AIO timeout configuration

9756aa3

Fix AIO timeout lint formatting

3c78b99

dante159753 marked this pull request as ready for review June 12, 2026 03:09

dante159753 requested review from FangRun2, Infinite666, Tarrei, harrisonyhq, mag1c-h, qyh111 and ygwpz as code owners June 12, 2026 03:09

dante159753 added 3 commits June 12, 2026 16:56

Pause Posix AIO backend after timeouts

a0c53cd

Merge origin/develop into aio-strict-timeout

04e88d6

Fix Posix AIO clang-format issues

3711bac