feat(beta): Network tracing #2895

Open: marklysze wants to merge 9 commits into main from feat/beta-network-tracing

@marklysze

Collaborator

Why are these changes needed?

This PR adds tracing for what happens between agents (the hub's envelope dispatch, channel lifecycles, agent registration, and tasks) and stitches it together with the existing agent-level tracing, so a single trace can follow a message from hub dispatch, into the receiving agent's LLM calls and tools, and back out.

In a multi-agent network the interesting failures live in the seams — a message that never gets dispatched, a channel that expires before a reply, a task that silently exceeds its TTL, an agent that stalls under inbox pressure. Agent-level spans can't see any of that because the Envelope never reaches middleware. We needed hub-emitted spans on the same TracerProvider the agents use, so an existing OTLP backend (Jaeger, Tempo, Datadog, Honeycomb, Langfuse) shows one coherent picture instead of two disconnected ones.

It is fully opt-in: nothing is traced unless you hand the hub a TracerProvider and register a HubTelemetryListener. With tracing off, the network packages never import OpenTelemetry.

Includes:

Shared, OTel-free vocabulary

  • New autogen/beta/_telemetry_consts.py — a single source of truth for every telemetry string that crosses a package boundary (the propagation key, tracer identity, the closed span-type vocabulary, link kinds, and the ag2.network.* / ag2.agent.* attribute keys).

Agent ↔ network trace stitching

  • TelemetryMiddleware now parents the invoke_agent span under the inbound envelope's span when the turn was triggered by the network. The hub stamps a W3C traceparent onto Envelope.trace_id before the WAL write; the network dispatch handler relays it via context.dependencies[TRACEPARENT_DEP_KEY]. When that key is absent, the middleware starts a fresh root span, exactly as before.
  • The middleware also switched its hardcoded span strings over to the shared constants module — no behavioural change, just deduplication.
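The stitching described above relies on the W3C traceparent format (`version-traceid-parentid-flags`, lowercase hex). The following is a minimal, OTel-free sketch of how a relayed traceparent could decide between a child span and a fresh root. It is not the PR's implementation: `start_turn`, `ParentContext`, and the placeholder value given to `TRACEPARENT_DEP_KEY` are assumptions made for illustration only.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Placeholder value: the real key lives in autogen/beta/_telemetry_consts.py
# and its actual string is not shown in this PR description.
TRACEPARENT_DEP_KEY = "example.traceparent"

# W3C Trace Context, version 00: 2 hex, 32 hex trace id, 16 hex span id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent string."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: Optional[str]) -> Optional[ParentContext]:
    """Return the parent context, or None if the value is missing or invalid."""
    if not value:
        return None
    match = _TRACEPARENT_RE.fullmatch(value)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span ids.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_turn(dependencies: dict) -> dict:
    """Describe the span for an agent turn: child of the envelope span, or a new root."""
    parent = parse_traceparent(dependencies.get(TRACEPARENT_DEP_KEY))
    if parent is None:
        # No usable traceparent relayed: start a new trace with no parent.
        return {"trace_id": secrets.token_hex(16), "parent_span_id": None,
                "span_id": secrets.token_hex(8)}
    # Traceparent relayed by the dispatch handler: join the envelope's trace.
    return {"trace_id": parent.trace_id, "parent_span_id": parent.span_id,
            "span_id": secrets.token_hex(8)}
```

In this sketch a malformed value is treated the same as a missing one, so a bad header degrades to a root span instead of failing the turn. Whether the middleware makes the same choice is not stated in the description.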

Hub-side spans

  • New HubTelemetryListener (autogen/beta/network/hub/telemetry.py) emits spans as a HubListener, mirroring the existing AuditLog pattern. Each entity is its own bounded trace:
      ◦ network.channel {type}: open from created to closed/expired, with expectation fires, rejections, and dispatch/turn failures attached.
      ◦ agent.lifetime {name}: open from registered to unregistered, with resume/skill/rule changes nested and inbox-pressure events attached.
      ◦ network.task {capability}: single-shot at the terminal event, backdated to the task's started_at.
  • New autogen/beta/network/hub/_envelope_tracing.py centralises every OpenTelemetry import so core.py can guard a single try/except ImportError and stay OTel-free when tracing isn't configured. It owns the network.envelope span the hub brackets around WAL append + dispatch, and the span→JSONL serialisation shared by the hub and the listener.
  • hub/core.py wires the tracer_provider through and brackets envelope dispatch; hub/layout.py adds telemetry_root() / spans_path() for the /telemetry/spans.jsonl mirror on the hub's KnowledgeStore; hub/__init__.py exports HubTelemetryListener behind the standard missing_optional_dependency("…", "tracing") fallback.
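The "single guarded import" idea above can be illustrated with a lazy import that is only attempted once a provider has been supplied. This is a rough sketch of the pattern, not the code in core.py: the function name `load_tracing_module`, the `module_name` parameter (added so the sketch can be exercised without OpenTelemetry installed), and the choice to return `None` on a missing dependency are all assumptions.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


def load_tracing_module(
    tracer_provider: Optional[Any],
    module_name: str = "opentelemetry.trace",
) -> Optional[ModuleType]:
    """Import the tracing module only when tracing has been opted into.

    Returns None when no provider was supplied (tracing off, no import
    attempted) or when the optional dependency is not installed.
    """
    if tracer_provider is None:
        # Tracing is off: never touch the optional dependency.
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Assumption: degrade to "no tracing" rather than raising.
        return None
```

A caller would check the result once at construction time and skip all span bracketing when it is `None`, which keeps the untraced path free of OpenTelemetry imports.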

Task TTL enforcement on the network (bug fix)

  • A networked task's TTL was never enforced: the hub stored expires_at=None because the agent-side absolute deadline was never carried across. TaskStarted now carries an expires_at field, Task.__aenter__ populates it, and TaskMirror hands it to the hub so the TTL sweeper (expire_due) can actually expire the task.
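To make the bug concrete, here is a small sketch of a deadline sweeper. The name `expire_due` comes from the description, but the signature, the `TaskRecord` type, and the status strings are hypothetical. The point it illustrates is that a record stored with `expires_at=None` can never be swept, which is why the deadline has to be carried from the agent to the hub.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    """Hypothetical hub-side task record."""
    task_id: str
    expires_at: Optional[float]  # absolute deadline in epoch seconds; None means no deadline
    status: str = "running"


def expire_due(tasks: Dict[str, TaskRecord], now: float) -> List[str]:
    """Mark running tasks whose deadline has passed as expired; return their ids."""
    expired: List[str] = []
    for task in tasks.values():
        if task.status != "running" or task.expires_at is None:
            # No deadline recorded: the sweeper has nothing to compare against.
            continue
        if task.expires_at <= now:
            task.status = "expired"
            expired.append(task.task_id)
    return expired


# Before the fix the hub effectively stored the first shape; after it, the second.
tasks = {
    "t-old": TaskRecord("t-old", expires_at=None),
    "t-new": TaskRecord("t-new", expires_at=100.0),
}
swept = expire_due(tasks, now=150.0)
```

Here only `t-new` is swept, while `t-old` stays `running` indefinitely even though the sweep time is past any reasonable TTL.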

Related issue number

N/A

Checks

AI assistance

  • I understand the changes in this PR and can explain them in my own words.
  • I have verified that the PR description accurately reflects the actual diff.
  • If AI assistance was used, I reviewed, tested, and validated the generated code/text before submitting.

@github-actions (bot) added the documentation and beta labels on May 22, 2026
marklysze and others added 7 commits June 16, 2026 13:53
…cing

# Conflicts:
#	autogen/beta/network/hub/core.py
#	website/mint-json-template.json.jinja
Checkpoints bypass the envelope path (direct store writes), so they were
invisible to tracing. HubBackedCheckpointStore now pins checkpoint.write /
checkpoint.read span-events on the active span with task_id, byte size, and
hit/miss — the read event's task_id is the link back to a resumed-from run.

Soft OTel guard: no-op when the tracing extras aren't installed or no
provider is configured, so checkpointing carries no cost unless opted in.
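
A soft guard of this kind might look like the sketch below, which attaches a span-event to the active span and silently does nothing otherwise. `record_checkpoint_event` and the attribute keys are illustrative assumptions, not the names used in HubBackedCheckpointStore. The only OpenTelemetry calls used are `trace.get_current_span()`, `Span.is_recording()`, and `Span.add_event()`.

```python
from typing import Optional

try:
    from opentelemetry import trace as otel_trace
except ImportError:
    # Tracing extras not installed: every call below becomes a no-op.
    otel_trace = None


def record_checkpoint_event(
    kind: str,
    task_id: str,
    size_bytes: int,
    hit: Optional[bool] = None,
) -> bool:
    """Pin a checkpoint.<kind> event on the active span; return True if recorded."""
    if otel_trace is None:
        return False
    span = otel_trace.get_current_span()
    if not span.is_recording():
        # No provider configured, or no active span: nothing to attach to.
        return False
    # Attribute keys here are assumptions made for this sketch.
    attributes = {"task_id": task_id, "size_bytes": size_bytes}
    if hit is not None:
        attributes["hit"] = hit
    span.add_event(f"checkpoint.{kind}", attributes=attributes)
    return True
```

Called outside any recording span, the function returns `False` and does no further work, which matches the "no cost unless opted in" intent.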

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the emitted-but-undocumented attributes (audience, causation_id,
dispatch_failures, creator_id, owner_id, agent.capability/outcome/skill_removed),
document the dispatch_failed event payload, and cross-link the checkpoint
events section. Verified against live span dumps across 10 network topologies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@marklysze marklysze marked this pull request as ready for review June 25, 2026 10:19
@marklysze marklysze requested a review from Lancetnik as a code owner June 25, 2026 10:19
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 75.84270% with 86 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| autogen/beta/network/hub/telemetry.py | 62.50% | 46 Missing and 14 partials ⚠️ |
| autogen/beta/network/hub/_envelope_tracing.py | 87.32% | 5 Missing and 4 partials ⚠️ |
| autogen/beta/network/hub/core.py | 82.92% | 3 Missing and 4 partials ⚠️ |
| autogen/beta/network/client/checkpoint.py | 71.42% | 2 Missing and 2 partials ⚠️ |
| autogen/beta/middleware/builtin/telemetry.py | 76.92% | 3 Missing ⚠️ |
| autogen/beta/network/hub/__init__.py | 60.00% | 2 Missing ⚠️ |
| autogen/beta/network/hub/layout.py | 75.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| autogen/beta/_telemetry_consts.py | 100.00% <100.00%> (ø) |
| autogen/beta/events/task_events.py | 96.66% <100.00%> (+0.11%) ⬆️ |
| autogen/beta/network/client/handlers.py | 93.26% <100.00%> (+0.19%) ⬆️ |
| autogen/beta/network/task_mirror.py | 55.37% <ø> (ø) |
| autogen/beta/task.py | 87.63% <ø> (ø) |
| autogen/beta/network/hub/layout.py | 82.97% <75.00%> (-0.75%) ⬇️ |
| autogen/beta/network/hub/__init__.py | 83.33% <60.00%> (-16.67%) ⬇️ |
| autogen/beta/middleware/builtin/telemetry.py | 83.44% <76.92%> (+0.34%) ⬆️ |
| autogen/beta/network/client/checkpoint.py | 81.81% <71.42%> (-18.19%) ⬇️ |
| autogen/beta/network/hub/core.py | 80.48% <82.92%> (+0.34%) ⬆️ |

... and 2 more

... and 52 files with indirect coverage changes

Labels

beta, documentation

1 participant