
MAF-20095: feat(deploy): collect AIGateway pod logs via Vector #133

Merged
hhk7734 merged 4 commits into main from MAF-20095-aigateway-log-vector-collection
Jun 5, 2026

Conversation


@seongsu-dev seongsu-dev commented Jun 5, 2026

Contributor

Summary

AIGateway pods emit JSON logs (scorer scores, picker decisions, request.completed) to stdout, but they could not be viewed in Grafana/Loki. Vector only collects pods carrying the mif.moreh.io/log.collect=true opt-in label, and the AIGateway CRD exposes no field to set pod labels (CR labels are not propagated either). This left logs asymmetric with AIGateway metrics and traces, which the Heimdall controller auto-provisions (ServiceMonitor/PodMonitor) with no per-pod opt-in.

This PR collects AIGateway logs automatically by selecting on the immutable app.kubernetes.io/name=aigateway label that the controller always sets. The change lives entirely within the MIF chart, with no change to heimdall / heimdall-aigateway.

Ticket: MAF-20095 (epic MAF-19497)

Changes

1. feat(deploy): Vector config (values.yaml) (84ac8b9)

  • Add a second kubernetes_logs source aigateway_logs selecting app.kubernetes.io/name=aigateway.
  • Feed it into the existing mif_log_transform (add to inputs).
  • OR the JSON-parse gate with .app == "aigateway" so AIGateway's flat JSON (timestamp/level/target/message plus request_id/trace_id span fields) is always parsed. {app="aigateway"} lands in Loki with level populated and trace_id in the log body.
  • Regenerated README.md via make helm-docs.
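
As a rough illustration of the values.yaml changes listed above, the fragment below sketches a label-selected source and the widened JSON-parse gate. It is not the chart's actual config. The `vector.customConfig` nesting, the `mif_logs` source name, the way `.app` is derived, and the `mif.moreh.io/log.format` label lookup are assumptions made for this example. Only `kubernetes_logs`, `aigateway_logs`, `mif_log_transform`, the two label selectors, and the `.app == "aigateway"` condition come from the description.

```yaml
# Hypothetical values.yaml fragment; key nesting is assumed, not taken from the chart.
vector:
  customConfig:
    sources:
      mif_logs:                      # assumed name of the existing opt-in source
        type: kubernetes_logs
        extra_label_selector: "mif.moreh.io/log.collect=true"
      aigateway_logs:                # new source: no opt-in label required
        type: kubernetes_logs
        extra_label_selector: "app.kubernetes.io/name=aigateway"
    transforms:
      mif_log_transform:
        type: remap
        inputs:
          - mif_logs
          - aigateway_logs
        source: |
          # Assumed derivation of .app from the pod label.
          .app = .kubernetes.pod_labels."app.kubernetes.io/name"
          # Assumed opt-in format check for the existing Go components.
          is_json = .kubernetes.pod_labels."mif.moreh.io/log.format" == "json"
          # Parse JSON for opted-in pods OR unconditionally for AIGateway.
          if is_json || .app == "aigateway" {
            parsed, err = parse_json(.message)
            if err == null && is_object(parsed) {
              . = merge!(., parsed)
            }
          }
```

Keeping AIGateway on its own source means the opt-in selector for other components stays untouched, which is what the negative control in the verification relies on.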

2. fix(deploy): preserve message/timestamp in the transform (31e785f)

  • Found during the kind-cluster verification below: AIGateway log bodies were stored with "message": null.
  • Root cause: the transform promoted Go slog msg/time into message/timestamp under an if err == null guard on the result of get(., ["msg"]). VRL get returns (null, null), that is, a null value without an error, when the key is absent. The guard therefore fired even for logs lacking msg/time and overwrote the values that merge! had populated from the parsed JSON. This was harmless for Go components, which always emit msg, but AIGateway's Rust logs use message/timestamp, so those fields were nulled.
  • Fix: guard the slog promotion on field existence (exists(.msg) / exists(.time)). AIGateway's already-correct message/timestamp (placed by merge!) are preserved; the Go slog path is unchanged.
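
The difference between the two guards can be sketched as follows. This is a minimal reconstruction from the description above, not the chart's actual transform: the exact statements inside `mif_log_transform` are assumed, and only the `get(., ["msg"])` / `err == null` pattern and the `exists(.msg)` / `exists(.time)` guards are taken from the PR text.

```yaml
# Hypothetical fragment of the remap transform; surrounding config omitted.
transforms:
  mif_log_transform:
    type: remap
    source: |
      # Before (buggy pattern): `get` yields null with no error when "msg"
      # is absent, so `err == null` holds and .message is overwritten with null.
      #   msg, err = get(., ["msg"])
      #   if err == null { .message = msg }

      # After: promote Go slog fields only when they are actually present.
      if exists(.msg) {
        .message = .msg
      }
      if exists(.time) {
        .timestamp = .time
      }
```

With the `exists` guards, an event that already carries `message`/`timestamp` (the AIGateway shape) passes through unchanged, while a Go slog event with `msg`/`time` is promoted as before.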

3. docs: operations/monitoring/logs guide

  • New "Automatically collected components" section explaining AIGateway needs no opt-in label.
  • Field-mapping table now covers both msg/time (Go) and message/timestamp (Rust / AIGateway).
  • Grafana Explore LogQL examples ({app="aigateway"} | json, level/request_id filters, Tempo trace_id correlation). No dashboard panel added (Explore-only, by design).
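
Jumping from a log line's trace_id to Tempo in Grafana Explore generally relies on a derived field on the Loki datasource, which, per the later review follow-up in this PR, the chart does not set up. The provisioning fragment below is only a sketch of what such an entry could look like. The datasource name, the Loki URL, and the `tempo` UID are hypothetical, and the regex assumes the flat JSON body shape described above.

```yaml
# Grafana datasource provisioning sketch (hypothetical names, URL, and UIDs).
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway          # assumed in-cluster Loki address
    jsonData:
      derivedFields:
        - name: trace_id
          # Capture the trace_id value from the flat JSON log body.
          matcherRegex: '"trace_id":\s*"(\w+)"'
          # $$ escapes $ so provisioning does not treat it as an env variable.
          url: '$${__value.raw}'
          # Must match the uid of the Tempo datasource (assumed to be "tempo").
          datasourceUid: tempo
```

Without an entry like this, trace_id is still visible and filterable in the log body, but Explore shows no link to the trace.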

Verification (kind cluster)

Deployed the chart slice minio+loki+vector (all else disabled) on a kind cluster and drove it with a synthetic pod that mirrors the real AIGateway pod label set and emits the exact flat-JSON shape of aigateway's JsonTraceFormatter. The real aigateway image is private and needs the controller + a GPU backend, so a faithful stand-in exercises the change (the Vector source + transform) directly.

| Check | Result |
| --- | --- |
| Vector loads aigateway_logs source + the \|\| .app == "aigateway" gate (config valid) | |
| Pod labelled app.kubernetes.io/name=aigateway with no opt-in label is collected | |
| Queryable as {app="aigateway"}; level / inference_service / namespace (cross-namespace) labels populated | |
| JSON parsed: message / target / request_id / trace_id present in body | ✅ (after fix 31e785f) |
| Loki entry time equals the event time | |
| Negative control: app=vllm pod with no opt-in is not collected | ✅ (0 lines) |
| Regression: existing Go slog opt-in path (log.collect=true + log.format=json) still parses | |

The root cause of the message: null defect was confirmed with vector vrl on the real Vector 0.43.1 binary: the current transform yields message: null, the exists-guarded transform yields message: "picker.decision", and a slog event is unchanged. helm lint and helm template validate rendering only, not VRL runtime semantics, so the kind run is what surfaced the defect.

Notes

  • No heimdall / heimdall-aigateway change: app.kubernetes.io/name=aigateway is a hardcoded, immutable Deployment selector label set by the controller (selectorLabels()), so it is always present on AIGateway pods.
  • A design alternative (adding a podLabels field to the AIGateway CRD for explicit opt-in) was considered and deferred in favor of this mif-only approach that is symmetric with how AIGateway metrics/traces are already auto-collected.

🤖 Generated with Claude Code

AIGateway pods emit JSON logs to stdout but cannot carry the
mif.moreh.io/log.collect opt-in label (the AIGateway CRD has no pod-label
field and CR labels are not propagated), so Vector never collected them.

Add a second kubernetes_logs source selecting app.kubernetes.io/name=aigateway
-- the immutable label the Heimdall controller always stamps on AIGateway pods
-- and feed it through the existing mif_log_transform, parsing JSON
unconditionally for AIGateway. AIGateway logs are now auto-collected into Loki
and queryable as {app="aigateway"}, symmetric with its metrics/traces which the
controller auto-provisions without per-pod opt-in.

- values.yaml: add aigateway_logs source, add it to mif_log_transform inputs,
  and OR the JSON-parse gate with `.app == "aigateway"`.
- README.md: regenerated via `make helm-docs`.
- docs (monitoring/logs): document AIGateway auto-collection, the dual
  msg/message + time/timestamp field mapping, and the Grafana Explore LogQL
  query ({app="aigateway"} | json).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 07:51
@seongsu-dev seongsu-dev requested a review from a team as a code owner June 5, 2026 07:51

Copilot AI left a comment

Contributor


Pull request overview

Adds automatic Loki/Vector log collection for AIGateway pods (which cannot opt in via mif.moreh.io/log.collect) by selecting them via the controller-stamped app.kubernetes.io/name=aigateway label, and updates the documentation to describe the behavior and how to query AIGateway logs in Grafana.

Changes:

  • Helm: Add a dedicated Vector kubernetes_logs source for AIGateway pods and include it in the existing mif_log_transform.
  • Helm: Always attempt JSON parsing for AIGateway events in the remap transform.
  • Docs: Update the log collection guide with an “Automatically collected components” section and AIGateway LogQL examples; regenerate chart README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| deploy/helm/moai-inference-framework/values.yaml | Adds aigateway_logs Vector source and expands JSON-parse gating to include AIGateway. |
| website/docs/operations/monitoring/logs/index.mdx | Documents AIGateway automatic collection and adds Grafana Explore query examples and field mapping notes. |
| deploy/helm/moai-inference-framework/README.md | Regenerated helm-docs output reflecting new Vector values. |

Comment thread deploy/helm/moai-inference-framework/values.yaml Outdated
Comment thread website/docs/operations/monitoring/logs/index.mdx Outdated
Comment thread website/docs/operations/monitoring/logs/index.mdx Outdated
Comment thread website/docs/operations/monitoring/logs/index.mdx Outdated
seongsu-dev and others added 3 commits June 5, 2026 17:17
…ector transform

mif_log_transform promoted Go slog `msg`/`time` into `message`/`timestamp`
under an `if err == null` guard from `get(., ["msg"])`. But VRL `get` returns
(null, null) — null WITHOUT an error — when the key is absent, so the guard
fired even for logs lacking `msg`/`time` and overwrote the values merge! had
just populated from the parsed JSON. Harmless for Go components (always emit
`msg`), but AIGateway's Rust logs use `message`/`timestamp`, so their message
and timestamp were stored as null in Loki.

Guard the slog promotion on field existence (`exists(.msg)` / `exists(.time)`)
so it runs only when those keys are present; AIGateway's already-correct
`message`/`timestamp` (placed by merge!) are then preserved.

Verified on a kind cluster (minio+loki+vector): a pod labelled
app.kubernetes.io/name=aigateway emitting the aigateway JSON shape, with no
opt-in label, is collected and queryable as {app="aigateway"} with
message/level/target/request_id/trace_id populated and the event time kept as
the Loki entry timestamp. The Go slog opt-in path and the negative control
(uncollected) both still hold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…trics

ServiceMonitor/PodMonitor are metrics-scraping resources; they do not expose
traces. Narrow the aigateway_logs comment to metrics so it does not imply the
controller exposes traces via ServiceMonitor/PodMonitor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ogs guide

- Scope the auto-collection note to metrics: ServiceMonitor/PodMonitor are
  metrics-scraping resources and do not expose traces.
- Soften the trace_id correlation claim: log-to-trace linking needs a
  derivedFields entry on the Loki datasource, which the chart does not set up.
- Highlight the variable line in the LogQL example (request_id="<requestId>")
  per website/AGENTS.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 08:35

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@seongsu-dev seongsu-dev added the "ready for review" label (Agentic review process is over. Human reviewer can review this PR.) Jun 5, 2026
@hhk7734 hhk7734 merged commit 38335a5 into main Jun 5, 2026
4 checks passed
@hhk7734 hhk7734 deleted the MAF-20095-aigateway-log-vector-collection branch June 5, 2026 10:45
