Skip to content

feat(observability): propagate W3C trace context across nico-api's network boundaries#2700

Open
adnandnv wants to merge 2 commits into
NVIDIA:mainfrom
adnandnv:feature/2438-w3c-trace-propagation
Open

feat(observability): propagate W3C trace context across nico-api's network boundaries#2700
adnandnv wants to merge 2 commits into
NVIDIA:mainfrom
adnandnv:feature/2438-w3c-trace-propagation

Conversation

@adnandnv

Copy link
Copy Markdown
Contributor

Implements W3C Trace Context propagation for nico-api so a request traced upstream keeps one trace as
it passes through nico-api and into the services it calls.

  • Propagator: install the standard W3C TraceContextPropagator at startup (the OTel default is no-op).
  • Ingress (REST + gRPC): one shared per-request layer extracts the inbound traceparent/tracestate
    and parents the request span to it; no valid inbound context → fresh root.
  • Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest clients go through
    reqwest-tracing's middleware; the two OAuth2 token providers inject manually (the oauth2 crate
    builds their requests).
  • Enable flag unchanged: the local tracing-enabled switch stays the master control — an inbound
    sampled flag can't force local recording.

Per-client details and how to add a new client are in docs/observability/tracing.md.

Related issues

#2438

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

  • Scope is the nico-api process. bmc-proxy and hardware-health are intentionally not instrumented:
    they build no span exporter, so injecting there would be dead code. docs/observability/tracing.md §1.7
    documents how to add propagation to a new client if that changes.
  • New crate trace-propagation holds the shared glue + tower middleware.
  • tracing.md marker references are a doc-staleness fix, not part of this feature. The doc still
    described the sampler's old code.namespace marker; fix: Fixes for OTLP tracing #2268 replaced it with carbide.trace_root
    without updating the doc. The code.namespacecarbide.trace_root doc edits just correct that.
    (The doc's ParentBased mention is also updated, since this PR unwraps the sampler.)
  • Dependencies added: opentelemetry-http 0.31, reqwest-middleware 0.5, reqwest-tracing 0.7.

…twork boundaries

nico-api now accepts and produces W3C Trace Context headers (traceparent,
tracestate) at its network boundaries, so a request already traced upstream
stays one trace as it flows through nico-api and on to the services it calls.

- Install the standard W3C TraceContextPropagator at startup (the OTel default
  is no-op).
- Ingress (REST + gRPC): one shared per-request layer extracts the inbound
  context and parents the request span to it; with no valid inbound context the
  span is a fresh root.
- Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest
  clients are built through reqwest-tracing's middleware; the two OAuth2 token
  providers inject manually, since the oauth2 crate builds their requests.
- The local tracing-enabled flag stays the master switch: an inbound sampled
  flag never forces recording here.

Shared extract/inject glue and the tower middleware live in a new
trace-propagation crate, with unit tests and an ingress->egress round-trip.
Per-client details and maintainer guidance are in docs/observability/tracing.md.

Closes NVIDIA#2438.

Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
@adnandnv adnandnv requested review from a team and Coco-Ben as code owners June 19, 2026 02:00
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eee0af96-1193-4691-b6ba-57dc10d7f220

📥 Commits

Reviewing files that changed from the base of the PR and between 426cd76 and 9de4ad7.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • crates/api-core/Cargo.toml
  • crates/component-manager/Cargo.toml
  • crates/component-manager/src/nsm.rs
  • crates/component-manager/src/psm.rs
💤 Files with no reviewable changes (3)
  • crates/component-manager/Cargo.toml
  • crates/component-manager/src/nsm.rs
  • crates/component-manager/src/psm.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/api-core/Cargo.toml
  • Cargo.toml

Summary by CodeRabbit

  • New Features

    • Enable end-to-end distributed tracing by propagating W3C traceparent/tracestate across REST and gRPC boundaries, including HTTP middleware and gRPC transports.
    • Outbound requests now carry the correct trace identifiers even when span recording is disabled.
  • Bug Fixes

    • Improved trace span parenting and sampling behavior based on inbound context validity and the carbide.trace_root marker.
    • Updated trace handling to safely behave as a no-op when no tracing propagator is configured.
  • Documentation

    • Expanded tracing documentation covering propagation, sampling semantics, and troubleshooting.

Walkthrough

Adds a new trace-propagation crate and wires W3C trace context propagation through logging, gRPC, HTTP, and OAuth2 request paths. It also installs a global TraceContextPropagator, updates CarbideSpanSampler, and refreshes tracing documentation.

Changes

W3C Trace Context Propagation

Layer / File(s) Summary
New trace-propagation crate: primitives, TraceInjectService, and tests
crates/trace-propagation/Cargo.toml, crates/trace-propagation/src/lib.rs, crates/trace-propagation/tests/no_global_propagator.rs
Creates the trace-propagation crate with ingress/egress helpers, a TraceInjectService<S> wrapper, and tests covering extraction, injection, propagation, and no-global-propagator behavior.
Global TraceContextPropagator installation and CarbideSpanSampler rewrite
Cargo.toml, crates/api-core/src/logging/setup.rs
setup_logging installs TraceContextPropagator globally. CarbideSpanSampler::should_sample is rewritten to use parent_context, inherit from valid in-process parents, record only carbide.trace_root-marked spans, and propagate parent tracestate.
Ingress: span parent adoption from inbound W3C headers
crates/api-core/src/logging/api_logs.rs
LogService::call uses set_span_parent_from_headers on the incoming request headers immediately after creating the request span.
gRPC egress: TraceInjectService wiring across component-manager, libnmxc, and rpc
crates/component-manager/Cargo.toml, crates/component-manager/src/tls.rs, crates/component-manager/src/nsm.rs, crates/component-manager/src/psm.rs, crates/libnmxc/Cargo.toml, crates/libnmxc/src/lib.rs, crates/libnmxc/src/nmxc_api.rs, crates/rpc/Cargo.toml, crates/rpc/src/forge_tls_client.rs
build_channel returns TraceInjectService<Channel>, and that traced service type propagates through NsmSwitchBackend, PsmPowerShelfBackend, NmxcApi, and both ForgeTlsClient build paths.
HTTP egress: reqwest-middleware + TracingMiddleware across all HTTP clients
Cargo.toml, crates/firmware/Cargo.toml, crates/firmware/src/downloader.rs, crates/libmlx/Cargo.toml, crates/libmlx/src/firmware/source.rs, crates/libnmxm/Cargo.toml, crates/libnmxm/src/lib.rs, crates/nras/Cargo.toml, crates/nras/src/client.rs, crates/nras/src/lib.rs, crates/api-core/Cargo.toml, crates/api-core/src/handlers/redfish.rs, crates/api-core/src/machine_identity/token_exchange.rs
All updated HTTP clients now use reqwest_middleware::ClientBuilder with TracingMiddleware::default(), and the affected return types, field types, error conversions, and tests move to reqwest_middleware types.
OAuth2 manual trace header injection
crates/api-web/Cargo.toml, crates/api-web/src/auth.rs, crates/mqttea/Cargo.toml, crates/mqttea/src/auth/oauth2_provider.rs
api-web and mqttea now mutate outgoing OAuth2 requests and inject the current trace context before the HTTP exchange.
Observability documentation update
docs/observability/tracing.md
docs/observability/tracing.md is expanded to describe W3C trace context propagation, CarbideSpanSampler behavior, forwarding versus recording, the exporter limitation, and client integration guidance.

Sequence Diagram(s)

sequenceDiagram
  participant InboundRequest
  participant extract_context
  participant request_span
  participant TraceInjectService
  participant TracingMiddleware
  participant OutboundRequest

  InboundRequest->>extract_context: traceparent / tracestate headers
  extract_context->>request_span: set_span_parent_from_headers
  request_span->>TraceInjectService: gRPC request
  TraceInjectService->>OutboundRequest: inject trace headers

  request_span->>TracingMiddleware: HTTP request
  TracingMiddleware->>OutboundRequest: inject trace headers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: W3C trace context propagation across nico-api network boundaries.
Description check ✅ Passed The description is directly aligned with the implementation and accurately summarizes ingress, egress, sampling, and documentation changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@github-actions

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/nras/src/client.rs`:
- Around line 52-57: The HTTP client created in the new_with_config method lacks
timeout configuration, which can cause indefinite blocking during network issues
or NRAS service outages. Add a timeout field to the Config struct to make it
configurable, then apply this timeout to the reqwest client builder by calling
the .timeout() method on the ClientBuilder chain before calling .build(). Ensure
the timeout value is propagated from Config and applied either at the client
level in new_with_config or on individual requests made through methods like
attest_gpu to enforce bounded wait times for external service calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 931e99d9-7c3a-48f6-9587-76664d251c90

📥 Commits

Reviewing files that changed from the base of the PR and between 47e42bc and 426cd76.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (32)
  • Cargo.toml
  • crates/api-core/Cargo.toml
  • crates/api-core/src/handlers/redfish.rs
  • crates/api-core/src/logging/api_logs.rs
  • crates/api-core/src/logging/setup.rs
  • crates/api-core/src/machine_identity/token_exchange.rs
  • crates/api-web/Cargo.toml
  • crates/api-web/src/auth.rs
  • crates/component-manager/Cargo.toml
  • crates/component-manager/src/nsm.rs
  • crates/component-manager/src/psm.rs
  • crates/component-manager/src/tls.rs
  • crates/firmware/Cargo.toml
  • crates/firmware/src/downloader.rs
  • crates/libmlx/Cargo.toml
  • crates/libmlx/src/firmware/source.rs
  • crates/libnmxc/Cargo.toml
  • crates/libnmxc/src/lib.rs
  • crates/libnmxc/src/nmxc_api.rs
  • crates/libnmxm/Cargo.toml
  • crates/libnmxm/src/lib.rs
  • crates/mqttea/Cargo.toml
  • crates/mqttea/src/auth/oauth2_provider.rs
  • crates/nras/Cargo.toml
  • crates/nras/src/client.rs
  • crates/nras/src/lib.rs
  • crates/rpc/Cargo.toml
  • crates/rpc/src/forge_tls_client.rs
  • crates/trace-propagation/Cargo.toml
  • crates/trace-propagation/src/lib.rs
  • crates/trace-propagation/tests/no_global_propagator.rs
  • docs/observability/tracing.md

Comment thread crates/nras/src/client.rs
Comment on lines 52 to +57
pub fn new_with_config(config: &crate::Config) -> NrasVerifierClient {
NrasVerifierClient {
config: config.clone(),
http_client: reqwest::Client::new(),
http_client: reqwest_middleware::ClientBuilder::new(reqwest::Client::new())
.with(reqwest_tracing::TracingMiddleware::default())
.build(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify timeout configuration on the NRAS egress path.
rg -n "NrasVerifierClient|Client::new\\(|Client::builder\\(|connect_timeout\\(|\\.timeout\\(" crates/nras/src/client.rs -C3

Repository: NVIDIA/infra-controller

Length of output: 1072


🏁 Script executed:

#!/bin/bash
# Search for Config definition and any timeout fields
rg -n "struct Config|timeout|Timeout" crates/nras/src/ -A2 -B2 | head -100

Repository: NVIDIA/infra-controller

Length of output: 329


🏁 Script executed:

#!/bin/bash
# Search for attest_gpu implementation and how the http_client is used
rg -n "async fn attest_gpu" crates/nras/src/client.rs -A 30

Repository: NVIDIA/infra-controller

Length of output: 2719


🏁 Script executed:

#!/bin/bash
# Check if reqwest::Client has any builder methods or default timeout behavior
fd -e rs -e toml . crates/nras/ | xargs rg -l "reqwest" | head -5

Repository: NVIDIA/infra-controller

Length of output: 167


🏁 Script executed:

#!/bin/bash
# Get the full Config struct definition
rg -n "pub struct Config" crates/nras/src/lib.rs -A 20

Repository: NVIDIA/infra-controller

Length of output: 699


🏁 Script executed:

#!/bin/bash
# Search across all crates for any timeout configuration patterns for reqwest clients
rg -n "\.timeout\(|connect_timeout\(" --type rs crates/ | grep -i "nras\|http_client" | head -20

Repository: NVIDIA/infra-controller

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Check Cargo.toml for reqwest version and any timeout-related dependencies
cat crates/nras/Cargo.toml

Repository: NVIDIA/infra-controller

Length of output: 1427


🏁 Script executed:

#!/bin/bash
# Verify reqwest default timeout behavior and check if any global timeout patterns exist elsewhere
rg -n "timeout\(" crates/nras/ -i

Repository: NVIDIA/infra-controller

Length of output: 49


Configure explicit timeout bounds on the NRAS HTTP client.

The reqwest::Client::new() at line 55 lacks any timeout configuration, permitting indefinite blocking during NRAS outages or network degradation. The Config struct contains no timeout fields, and individual requests in attest_gpu (line 81) are not wrapped with timeout constraints. Propagate a configurable timeout through Config and apply it via .timeout() on the client or per-request to ensure bounded wait times during external service calls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/nras/src/client.rs` around lines 52 - 57, The HTTP client created in
the new_with_config method lacks timeout configuration, which can cause
indefinite blocking during network issues or NRAS service outages. Add a timeout
field to the Config struct to make it configurable, then apply this timeout to
the reqwest client builder by calling the .timeout() method on the ClientBuilder
chain before calling .build(). Ensure the timeout value is propagated from
Config and applied either at the client level in new_with_config or on
individual requests made through methods like attest_gpu to enforce bounded wait
times for external service calls.

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 265 6 24 98 7 130
machine-validation-runner 717 32 188 267 36 194
machine_validation 717 32 188 267 36 194
machine_validation-aarch64 717 32 188 267 36 194
nvmetal-carbide 717 32 188 267 36 194
TOTAL 3139 134 776 1172 151 906

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@ajf

ajf commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

@akorobkov-nvda do you want to take a look at this?

@ajf ajf requested a review from akorobkov-nvda June 22, 2026 20:48

@FrankSpitulski FrankSpitulski left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants