feat(beta): Network tracing #2895
Open
marklysze wants to merge 9 commits into
…cing # Conflicts: # autogen/beta/network/hub/core.py # website/mint-json-template.json.jinja
Checkpoints bypass the envelope path (direct store writes), so they were invisible to tracing. HubBackedCheckpointStore now pins checkpoint.write / checkpoint.read span-events on the active span with task_id, byte size, and hit/miss — the read event's task_id is the link back to a resumed-from run. Soft OTel guard: no-op when the tracing extras aren't installed or no provider is configured, so checkpointing carries no cost unless opted in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the emitted-but-undocumented attributes (audience, causation_id, dispatch_failures, creator_id, owner_id, agent.capability/outcome/skill_removed), document the dispatch_failed event payload, and cross-link the checkpoint events section. Verified against live span dumps across 10 network topologies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why are these changes needed?
This PR adds tracing of what happens between agents (the hub's envelope dispatch, channel lifecycles, agent registration, and tasks) and stitches it to the existing agent-level tracing, so a single trace can follow a message from hub dispatch, into the receiving agent's LLM calls and tools, and back out.
In a multi-agent network the interesting failures live in the seams: a message that never gets dispatched, a channel that expires before a reply, a task that silently exceeds its TTL, an agent that stalls under inbox pressure. Agent-level spans can't see any of that because the Envelope never reaches middleware. We needed hub-emitted spans on the same TracerProvider the agents use, so an existing OTLP backend (Jaeger, Tempo, Datadog, Honeycomb, Langfuse) shows one coherent picture instead of two disconnected ones.

It is fully opt-in: nothing is traced unless you hand the hub a TracerProvider and register a HubTelemetryListener. With tracing off, the network packages never import OpenTelemetry.

Includes:

Shared, OTel-free vocabulary
autogen/beta/_telemetry_consts.py is a single source of truth for every telemetry string that crosses a package boundary (the propagation key, tracer identity, the closed span-type vocabulary, link kinds, and the ag2.network.* / ag2.agent.* attribute keys).

Agent ↔ network trace stitching
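As a rough illustration of the stitching in this section, the sketch below formats and parses a W3C traceparent string and picks a parent for the agent turn. It uses only the standard library and is not the PR's code: the value given to TRACEPARENT_DEP_KEY, the ParentContext tuple, and the start_turn helper are assumptions made for the example, and only version 00 of the header is accepted.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

# Hypothetical value: the PR names this constant but does not show its string.
TRACEPARENT_DEP_KEY = "example.traceparent"


class ParentContext(NamedTuple):
    """Illustrative stand-in for a parsed remote span context."""
    trace_id: str
    span_id: str
    sampled: bool


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent string from hex ids."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: Optional[str]) -> Optional[ParentContext]:
    """Return the parsed context, or None if the value is missing or malformed."""
    if not value:
        return None
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the Trace Context specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_turn(dependencies: dict) -> dict:
    """Describe the span for one agent turn (hypothetical helper).

    With a relayed traceparent, the turn joins the envelope's trace;
    without one, it becomes a fresh root span.
    """
    parent = parse_traceparent(dependencies.get(TRACEPARENT_DEP_KEY))
    span_id = secrets.token_hex(8)
    if parent is None:
        return {"trace_id": secrets.token_hex(16), "span_id": span_id, "parent_span_id": None}
    return {"trace_id": parent.trace_id, "span_id": span_id, "parent_span_id": parent.span_id}
```

A malformed or missing value falls back to the root-span path rather than raising, which matches the "absent means fresh root span" behaviour this PR describes.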
TelemetryMiddleware now parents the invoke_agent span under the inbound envelope's span when the turn was triggered by the network. The hub stamps a W3C traceparent onto Envelope.trace_id before the WAL write; the network dispatch handler relays it via context.dependencies[TRACEPARENT_DEP_KEY]. When it is absent, a fresh root span is started, exactly as before.

Hub-side spans
HubTelemetryListener (autogen/beta/network/hub/telemetry.py) emits spans as a HubListener, mirroring the existing AuditLog pattern. Each entity is its own bounded trace: network.channel {type} (open from created to closed/expired, with expectation fires, rejections, and dispatch/turn failures attached), agent.lifetime {name} (open from registered to unregistered, with resume/skill/rule changes nested and inbox-pressure events attached), and network.task {capability} (single-shot at the terminal event, backdated to the task's started_at).

autogen/beta/network/hub/_envelope_tracing.py centralises every OpenTelemetry import so core.py can guard a single try/except ImportError and stay OTel-free when tracing isn't configured. It owns the network.envelope span the hub brackets around WAL append + dispatch, and the span→JSONL serialisation shared by the hub and the listener.

hub/core.py wires the tracer_provider through and brackets envelope dispatch; hub/layout.py adds telemetry_root() / spans_path() for the /telemetry/spans.jsonl mirror on the hub's KnowledgeStore; hub/__init__.py exports HubTelemetryListener behind the standard missing_optional_dependency("…", "tracing") fallback.

Task TTL enforcement on the network (bug fix)
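To make the bug in this section concrete, here is a minimal sketch of a TTL sweeper that can only act on tasks whose deadline it knows. The TaskRecord shape and this expire_due signature are assumptions for illustration; the real hub-side types are not shown in this PR description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    """Hypothetical hub-side view of a task."""
    task_id: str
    started_at: float
    expires_at: Optional[float] = None  # absolute deadline in seconds, if known
    status: str = "running"


def expire_due(tasks: Dict[str, TaskRecord], now: float) -> List[str]:
    """Mark running tasks whose deadline has passed; return their ids."""
    expired: List[str] = []
    for task in tasks.values():
        if task.status != "running":
            continue
        if task.expires_at is None:
            # No deadline was carried across, so the sweeper can never expire this task.
            continue
        if task.expires_at <= now:
            task.status = "expired"
            expired.append(task.task_id)
    return expired
```

In this sketch a record with expires_at=None stays running no matter how late the sweep happens, which is why the deadline has to travel with the task-started event.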
Tasks mirrored to the hub previously had expires_at=None because the agent-side absolute deadline was never carried across. TaskStarted now carries an expires_at field, Task.__aenter__ populates it, and TaskMirror hands it to the hub so the TTL sweeper (expire_due) can actually expire the task.

Related issue number
N/A
Checks
AI assistance