UI/UX: Poor Error Handling UX in Async Dispatcher#1823
Conversation
In `async_utils.py`, the `wait_all` function cancels pending tasks on timeout but uses a generic `asyncio.TimeoutError` without context about which tasks timed out or how many were pending. This provides poor UX for debugging distributed training failures. Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request improves the timeout error reporting in wait_all by including details about the pending tasks. The reviewer suggests a valuable improvement to extract the actual coroutine names instead of generic task names, which will make the timeout error messages significantly more informative.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| pending_ids = [task.get_name() for task in pending] | ||
| raise asyncio.TimeoutError( | ||
| f"Timeout waiting for {len(pending)} pending task(s) out of {len(tasks)} total. " | ||
| f"Pending task IDs: {pending_ids}" | ||
| ) |
There was a problem hiding this comment.
Since the tasks are created inside wait_all using asyncio.create_task(c) without explicit names, task.get_name() will return generic names like 'Task-1', 'Task-2', etc. This does not provide useful context about which coroutines actually timed out.
To make the error message much more informative, you can extract the name of the underlying coroutine using task.get_coro().__name__ with a fallback to task.get_name().
pending_names = [
getattr(task.get_coro(), "__name__", None) or task.get_name()
for task in pending
]
raise asyncio.TimeoutError(
f"Timeout waiting for {len(pending)} pending task(s) out of {len(tasks)} total. "
f"Pending tasks: {pending_names}
)
Summary
UI/UX: Poor Error Handling UX in Async Dispatcher
Problem
Severity:
Medium| File:skyrl-agent/skyrl_agent/dispatcher/async_utils.py:L53In
async_utils.py, thewait_allfunction cancels pending tasks on timeout but uses a genericasyncio.TimeoutErrorwithout context about which tasks timed out or how many were pending. This provides poor UX for debugging distributed training failures.Solution
Include the number of pending tasks and their identifiers in the exception message. Consider creating a custom
TimeoutErrorsubclass with metadata about pending vs. completed tasks.Changes
skyrl-agent/skyrl_agent/dispatcher/async_utils.py(modified)