Skip to content

UI/UX: Poor Error Handling UX in Async Dispatcher#1823

Open
tomaioo wants to merge 1 commit into
NovaSky-AI:mainfrom
tomaioo:fix/ui/poor-error-handling-ux-in-async-dispatch
Open

UI/UX: Poor Error Handling UX in Async Dispatcher#1823
tomaioo wants to merge 1 commit into
NovaSky-AI:mainfrom
tomaioo:fix/ui/poor-error-handling-ux-in-async-dispatch

Conversation

@tomaioo

@tomaioo tomaioo commented Jun 22, 2026

Copy link
Copy Markdown

Summary

UI/UX: Poor Error Handling UX in Async Dispatcher

Problem

Severity: Medium | File: skyrl-agent/skyrl_agent/dispatcher/async_utils.py:L53

In async_utils.py, the wait_all function cancels pending tasks on timeout but uses a generic asyncio.TimeoutError without context about which tasks timed out or how many were pending. This provides poor UX for debugging distributed training failures.

Solution

Include the number of pending tasks and their identifiers in the exception message. Consider creating a custom TimeoutError subclass with metadata about pending vs. completed tasks.

Changes

  • skyrl-agent/skyrl_agent/dispatcher/async_utils.py (modified)

In `async_utils.py`, the `wait_all` function cancels pending tasks on timeout but uses a generic `asyncio.TimeoutError` without context about which tasks timed out or how many were pending. This provides poor UX for debugging distributed training failures.

Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the timeout error reporting in wait_all by including details about the pending tasks. The reviewer suggests a valuable improvement to extract the actual coroutine names instead of generic task names, which will make the timeout error messages significantly more informative.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +71 to +75
pending_ids = [task.get_name() for task in pending]
raise asyncio.TimeoutError(
f"Timeout waiting for {len(pending)} pending task(s) out of {len(tasks)} total. "
f"Pending task IDs: {pending_ids}"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the tasks are created inside wait_all using asyncio.create_task(c) without explicit names, task.get_name() will return generic names like 'Task-1', 'Task-2', etc. This does not provide useful context about which coroutines actually timed out.

To make the error message much more informative, you can extract the name of the underlying coroutine using task.get_coro().__name__ with a fallback to task.get_name().

        pending_names = [
            getattr(task.get_coro(), "__name__", None) or task.get_name()
            for task in pending
        ]
        raise asyncio.TimeoutError(
            f"Timeout waiting for {len(pending)} pending task(s) out of {len(tasks)} total. "
            f"Pending tasks: {pending_names}
        )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant