MAF-20095: feat(deploy): collect AIGateway pod logs via Vector #133
Merged
AIGateway pods emit JSON logs to stdout but cannot carry the
mif.moreh.io/log.collect opt-in label (the AIGateway CRD has no pod-label
field and CR labels are not propagated), so Vector never collected them.
Add a second kubernetes_logs source selecting app.kubernetes.io/name=aigateway
-- the immutable label the Heimdall controller always stamps on AIGateway pods
-- and feed it through the existing mif_log_transform, parsing JSON
unconditionally for AIGateway. AIGateway logs are now auto-collected into Loki
and queryable as {app="aigateway"}, symmetric with its metrics/traces which the
controller auto-provisions without per-pod opt-in.
- values.yaml: add aigateway_logs source, add it to mif_log_transform inputs,
and OR the JSON-parse gate with `.app == "aigateway"`.
- README.md: regenerated via `make helm-docs`.
- docs (monitoring/logs): document AIGateway auto-collection, the dual
msg/message + time/timestamp field mapping, and the Grafana Explore LogQL
query ({app="aigateway"} | json).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
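The collection rule described in this commit can be modelled in a few lines of Python. This is only a toy sketch of the intended behaviour, not the chart's Vector configuration: the `format` field standing in for the existing JSON opt-in hint is a hypothetical name, and the label values are the ones quoted in the description.

```python
# Toy model of the two Vector sources and the JSON-parse gate.
# Assumption: the real selection happens in Vector's kubernetes_logs sources;
# this only mirrors the rule in plain Python for illustration.

OPT_IN_LABEL = "mif.moreh.io/log.collect"
AIGATEWAY_LABEL = "app.kubernetes.io/name"
AIGATEWAY_VALUE = "aigateway"


def is_collected(pod_labels: dict) -> bool:
    """A pod is collected if it opted in OR carries the AIGateway label."""
    opted_in = pod_labels.get(OPT_IN_LABEL) == "true"
    is_aigateway = pod_labels.get(AIGATEWAY_LABEL) == AIGATEWAY_VALUE
    return opted_in or is_aigateway


def should_parse_json(event: dict) -> bool:
    """JSON parsing runs for opted-in JSON logs OR for any AIGateway event.

    `format` is a hypothetical field name for the existing opt-in hint.
    """
    return event.get("format") == "json" or event.get("app") == "aigateway"


if __name__ == "__main__":
    print(is_collected({AIGATEWAY_LABEL: AIGATEWAY_VALUE}))  # AIGateway pod, no opt-in
    print(is_collected({"app": "vllm"}))                     # negative control
    print(should_parse_json({"app": "aigateway"}))
```

The second branch in each function is the part this commit adds; the first branch is the pre-existing opt-in path, which stays unchanged.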
Pull request overview
Adds automatic Loki/Vector log collection for AIGateway pods (which can't opt in via mif.moreh.io/log.collect) by selecting them via the controller-stamped app.kubernetes.io/name=aigateway label, and updates the documentation to describe the behavior and how to query AIGateway logs in Grafana.
Changes:
- Helm: Add a dedicated Vector `kubernetes_logs` source for AIGateway pods and include it in the existing `mif_log_transform`.
- Helm: Always attempt JSON parsing for AIGateway events in the remap transform.
- Docs: Update the log collection guide with an “Automatically collected components” section and AIGateway LogQL examples; regenerate chart README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `deploy/helm/moai-inference-framework/values.yaml` | Adds `aigateway_logs` Vector source and expands JSON-parse gating to include AIGateway. |
| `website/docs/operations/monitoring/logs/index.mdx` | Documents AIGateway automatic collection and adds Grafana Explore query examples and field mapping notes. |
| `deploy/helm/moai-inference-framework/README.md` | Regenerated helm-docs output reflecting new Vector values. |
…ector transform
mif_log_transform promoted Go slog `msg`/`time` into `message`/`timestamp`
under an `if err == null` guard from `get(., ["msg"])`. But VRL `get` returns
(null, null) — null WITHOUT an error — when the key is absent, so the guard
fired even for logs lacking `msg`/`time` and overwrote the values merge! had
just populated from the parsed JSON. Harmless for Go components (always emit
`msg`), but AIGateway's Rust logs use `message`/`timestamp`, so their message
and timestamp were stored as null in Loki.
Guard the slog promotion on field existence (`exists(.msg)` / `exists(.time)`)
so it runs only when those keys are present; AIGateway's already-correct
`message`/`timestamp` (placed by merge!) are then preserved.
Verified on a kind cluster (minio+loki+vector): a pod labelled
app.kubernetes.io/name=aigateway emitting the aigateway JSON shape, with no
opt-in label, is collected and queryable as {app="aigateway"} with
message/level/target/request_id/trace_id populated and the event time kept as
the Loki entry timestamp. The Go slog opt-in path and the negative control
(uncollected) both still hold.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
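The guard defect is easy to reproduce outside Vector. The Python sketch below imitates the behaviour this commit message attributes to VRL `get` (an absent key yields a null value and a null error) and compares the old error-based guard with the new existence-based one. The helper names and sample log lines are hypothetical; only the field names and the `picker.decision` message come from the description, and the sketch copies `msg`/`time` without deleting them, which may differ from the real transform.

```python
import json


def vrl_get(event: dict, key: str):
    """Stand-in for VRL `get` as described: absent key -> (None, None), no error."""
    return event.get(key), None


def transform_before(line: str) -> dict:
    event = json.loads(line)          # stands in for parse_json + merge!
    msg, err = vrl_get(event, "msg")
    if err is None:                   # true even when "msg" is absent
        event["message"] = msg        # overwrites a valid message with None
    ts, err = vrl_get(event, "time")
    if err is None:
        event["timestamp"] = ts
    return event


def transform_after(line: str) -> dict:
    event = json.loads(line)
    if "msg" in event:                # plays the role of exists(.msg)
        event["message"] = event["msg"]
    if "time" in event:               # plays the role of exists(.time)
        event["timestamp"] = event["time"]
    return event


# Hypothetical sample lines in the Rust (AIGateway) and Go slog shapes.
rust_line = json.dumps({"timestamp": "2026-06-05T00:00:00Z", "level": "INFO",
                        "target": "picker", "message": "picker.decision"})
go_line = json.dumps({"time": "2026-06-05T00:00:00Z", "level": "INFO",
                      "msg": "started"})

if __name__ == "__main__":
    print(transform_before(rust_line)["message"])  # None
    print(transform_after(rust_line)["message"])   # picker.decision
    print(transform_after(go_line)["message"])     # started
```

For the Go-shaped line both versions produce the same event, which matches the statement that the slog path is unaffected by the fix.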
…trics
ServiceMonitor/PodMonitor are metrics-scraping resources; they do not expose traces.
Narrow the aigateway_logs comment to metrics so it does not imply the controller
exposes traces via ServiceMonitor/PodMonitor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ogs guide
- Scope the auto-collection note to metrics: ServiceMonitor/PodMonitor are
metrics-scraping resources and do not expose traces.
- Soften the trace_id correlation claim: log-to-trace linking needs a
derivedFields entry on the Loki datasource, which the chart does not set up.
- Highlight the variable line in the LogQL example (request_id="<requestId>")
per website/AGENTS.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hhk7734
approved these changes
Jun 5, 2026
Summary
AIGateway pods emit JSON logs (scorer scores, picker decisions, `request.completed`) to stdout, but they could not be viewed in Grafana/Loki. Vector only collects pods carrying the `mif.moreh.io/log.collect=true` opt-in label, and the AIGateway CRD exposes no field to set pod labels (CR labels are not propagated either). This left logs asymmetric with AIGateway metrics and traces, which the Heimdall controller auto-provisions (ServiceMonitor/PodMonitor) with no per-pod opt-in.
This PR collects AIGateway logs automatically via the immutable `app.kubernetes.io/name=aigateway` label the controller always sets, entirely within the MIF chart, with no change to heimdall / heimdall-aigateway.
Ticket: MAF-20095 (epic MAF-19497)
Changes
1. `feat(deploy)`: Vector config (`values.yaml`), commit `84ac8b9`
   - Add a `kubernetes_logs` source `aigateway_logs` selecting `app.kubernetes.io/name=aigateway`.
   - Feed it through the existing `mif_log_transform` (add to `inputs`).
   - OR the JSON-parse gate with `.app == "aigateway"` so AIGateway's flat JSON (`timestamp`/`level`/`target`/`message` plus `request_id`/`trace_id` span fields) is always parsed.
   - Result: `{app="aigateway"}` lands in Loki with `level` populated and `trace_id` in the log body.
   - Regenerate `README.md` via `make helm-docs`.
2. `fix(deploy)`: preserve message/timestamp in the transform, commit `31e785f`
   - Symptom: AIGateway entries were stored with `"message": null`.
   - Cause: `mif_log_transform` promoted Go `slog` `msg`/`time` into `message`/`timestamp` under an `if err == null` guard from `get(., ["msg"])`. VRL `get` returns `(null, null)` (null without an error) when the key is absent, so the guard fired even for logs lacking `msg`/`time` and overwrote the values `merge!` had populated from the parsed JSON. This was harmless for Go components (which always emit `msg`), but AIGateway's Rust logs use `message`/`timestamp`, so their message/timestamp were nulled.
   - Fix: guard the promotion on field existence (`exists(.msg)` / `exists(.time)`). AIGateway's already-correct `message`/`timestamp` (placed by `merge!`) are preserved; the Go `slog` path is unchanged.
3. `docs`: `operations/monitoring/logs` guide
   - Document AIGateway auto-collection and the dual field mapping: `msg`/`time` (Go) and `message`/`timestamp` (Rust / AIGateway).
   - Add Grafana Explore LogQL examples (`{app="aigateway"} | json`, level/`request_id` filters, Tempo `trace_id` correlation). No dashboard panel added (Explore-only, by design).
Verification (kind cluster)
Deployed the chart slice `minio` + `loki` + `vector` (all else disabled) on a kind cluster and drove it with a synthetic pod that mirrors the real AIGateway pod label set and emits the exact flat-JSON shape of aigateway's `JsonTraceFormatter`. The real aigateway image is private and needs the controller + a GPU backend, so a faithful stand-in exercises the change (the Vector source + transform) directly.
Checks performed:
- Vector accepts the `aigateway_logs` source + the `|| .app == "aigateway"` gate (config valid).
- A pod labelled `app.kubernetes.io/name=aigateway` with no opt-in label is collected.
- Logs are queryable as `{app="aigateway"}`; `level`/`inference_service`/`namespace` (cross-namespace) labels populated.
- `message`/`target`/`request_id`/`trace_id` present in body (after `31e785f`).
- Negative control: an `app=vllm` pod with no opt-in is not collected.
- Go `slog` opt-in path (`log.collect=true` + `log.format=json`) still parses.
Root cause of the `message: null` defect was confirmed with `vector vrl` on the real Vector 0.43.1 binary (current transform → `message: null`; `exists`-guarded transform → `message: "picker.decision"`; slog event → unchanged). `helm lint` / `helm template` validate rendering only, not VRL runtime semantics; the kind run is what surfaced it.
Notes
- `app.kubernetes.io/name=aigateway` is a hardcoded, immutable Deployment selector label set by the controller (`selectorLabels()`), so it is always present on AIGateway pods.
- An alternative (adding a `podLabels` field to the AIGateway CRD for explicit opt-in) was considered and deferred in favor of this mif-only approach, which is symmetric with how AIGateway metrics/traces are already auto-collected.
🤖 Generated with Claude Code