feat: add productionstack-status-reporter component by rambohe-ch · Pull Request #105 · kaito-project/production-stack

rambohe-ch · 2026-06-21T23:31:08Z

Reason for Change:

Introduce the productionstack-status-reporter, a leader-elected controller-runtime component that surfaces the health of a Production Stack deployment as aggregated Kubernetes Events.

On every resync the reporter scrapes vLLM pod metrics and evaluates a reason catalogue across three layers (control plane, cluster/harness, and model deployment), applies a priority ordering and cross-layer suppression so that a single root cause is reported instead of a cascade of downstream symptoms, and emits the result as aggregated Events in kube-system. It requires only read-only API access to the resources it inspects, plus permission to create Events.
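The priority ordering and cross-layer suppression described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: the `Finding` type, the `rootCauses` function, the numeric priority convention, the assignment of reasons to layers, and the `modelPodNotReady` reason name are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Layer identifies where a finding originates. Lower values are further
// upstream, so a failure there can explain failures in later layers.
type Layer int

const (
	ControlPlane Layer = iota
	ClusterHarness
	ModelDeployment
)

// Finding is a hypothetical record for one active reason from the catalogue.
type Finding struct {
	Reason   string
	Layer    Layer
	Priority int // lower value = reported first within a layer (assumed convention)
}

// rootCauses keeps only the findings from the most upstream failing layer
// (cross-layer suppression) and sorts them by priority.
func rootCauses(active []Finding) []Finding {
	if len(active) == 0 {
		return nil
	}
	top := active[0].Layer
	for _, f := range active[1:] {
		if f.Layer < top {
			top = f.Layer
		}
	}
	var out []Finding
	for _, f := range active {
		if f.Layer == top {
			out = append(out, f)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Priority < out[j].Priority })
	return out
}

func main() {
	// Layer assignments below are illustrative guesses, not taken from the PR.
	active := []Finding{
		{Reason: "modelPodNotReady", Layer: ModelDeployment, Priority: 1},
		{Reason: "inferencesetRouteNotReady", Layer: ClusterHarness, Priority: 2},
		{Reason: "clusterCRDMissing", Layer: ClusterHarness, Priority: 1},
	}
	for _, f := range rootCauses(active) {
		fmt.Println(f.Reason)
	}
}
```

In this sketch the model-deployment finding is dropped because an upstream layer is already failing, which is the "single root cause instead of a cascade" behaviour the description refers to.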

To keep the event stream quiet during the legitimately long startup window, transient states are not reported as Warnings: cluster, harness, EPP and route findings are gated by a configurable startup grace period (object-age gating when the backing resource's age is known, otherwise a debounce), while Workspace and model-pod findings discriminate terminal failures from in-progress provisioning/download states.
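One way to picture the two gating modes is the sketch below. It is an assumption-laden illustration rather than the reporter's implementation: the `Gate` type, its methods, the `defaultGrace` value (the real default of `--startup-grace-seconds` is not stated here), the fixed reference time `t0`, and the `eppNotReady` key are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace is a hypothetical value used only for this example.
const defaultGrace = 300 * time.Second

// t0 is a fixed reference time so the example is deterministic.
var t0 = time.Date(2026, time.June, 22, 0, 0, 0, 0, time.UTC)

// Gate suppresses Warnings for findings still inside the startup grace period.
type Gate struct {
	grace     time.Duration
	firstSeen map[string]time.Time // debounce state, keyed by finding
}

func NewGate(grace time.Duration) *Gate {
	return &Gate{grace: grace, firstSeen: map[string]time.Time{}}
}

// ShouldWarn reports whether a finding has outlasted the grace period.
// created is the backing object's creation time, or nil when it is unknown.
func (g *Gate) ShouldWarn(key string, created *time.Time, now time.Time) bool {
	if created != nil {
		// Object-age gating: compare against the resource's own age.
		return now.Sub(*created) >= g.grace
	}
	// Debounce: measure from the first resync that observed the finding.
	first, seen := g.firstSeen[key]
	if !seen {
		g.firstSeen[key] = now
		first = now
	}
	return now.Sub(first) >= g.grace
}

// Clear forgets debounce state once a finding is no longer active.
func (g *Gate) Clear(key string) { delete(g.firstSeen, key) }

func main() {
	g := NewGate(defaultGrace)
	created := t0
	fmt.Println(g.ShouldWarn("inferencesetRouteNotReady", &created, t0.Add(time.Minute)))    // false
	fmt.Println(g.ShouldWarn("inferencesetRouteNotReady", &created, t0.Add(10*time.Minute))) // true
	fmt.Println(g.ShouldWarn("eppNotReady", nil, t0))                                        // false
	fmt.Println(g.ShouldWarn("eppNotReady", nil, t0.Add(10*time.Minute)))                    // true
}
```

Calling `Clear` when a finding disappears restarts the debounce window, so a condition that flaps does not accumulate age across unrelated occurrences.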

Includes:

- cmd/ entrypoint and pkg/ controllers, reason, and scraper packages with unit tests
- startup grace gating to suppress transient startup-state Warnings, configurable via --startup-grace-seconds / startupGraceSeconds
- productionstack-status-reporter Helm subchart (deployment, RBAC, serviceaccount, values schema) wired into the productionstack chart
- Dockerfile.status-reporter and Makefile build/image targets
- e2e tests covering cluster status, control-plane errors, harness status, upstream gating, weight-download slowness, and event message hygiene

Requirements

added unit tests and e2e tests (if applicable).

Issue Fixed:

Fixes #87

Notes for Reviewers:

Includes:

- cmd/ entrypoint and pkg/ controllers, reason, and scraper packages with unit tests
- startup grace gating to suppress transient startup-state Warnings, configurable via --startup-grace-seconds / startupGraceSeconds
- productionstack-status-reporter Helm subchart (deployment, RBAC, serviceaccount, values schema) wired into the productionstack chart
- least-privilege RBAC scoped to the resources the reporter actually reads as objects: CRD presence is probed via the discovery API, and KAITO NodeClaim health is read from Workspace status, so neither customresourcedefinitions, keda.sh, karpenter.sh, nor unused core/apps subresources are granted
- Dockerfile.status-reporter and Makefile build/image targets
- e2e wiring that builds, pushes, and installs the reporter image via the productionstack chart (prepare-image.sh, install-components.sh, validate-components.sh), with the reporter's default config aligned to the e2e install topology (istio-system / kaito-system namespaces)
- e2e tests covering cluster status, control-plane errors, harness status, upstream gating, weight-download slowness, and event message hygiene; each owns a dedicated namespace and runs as part of the standard (non-nightly) e2e suite

Signed-off-by: rambohe-ch <rambohe.ch@gmail.com>

…-scoped objects pass validation

The emitter publishes control-plane Events into kube-system with a cluster-scoped involvedObject (Namespace/CRD, so involvedObject.namespace is empty). Without EventTime set, the apiserver's legacyValidateEvent takes the old-style branch, which requires the Event namespace to be "" or "default" for cluster-scoped objects and therefore rejects kube-system with "involvedObject.namespace: Invalid value: \"\": does not match event.namespace". As a result, no reporter Events were ever created and all StatusReporter e2e tests timed out.

Set EventTime on the created Event so the new-style validation branch applies, which permits kube-system for cluster-scoped involvedObjects.
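The namespace rule described in this commit can be modelled in a few lines. The sketch below is a simplified stand-in, not the apiserver's code: `Event`, `validateClusterScoped`, and `stamped` are hypothetical names, the struct carries only the fields this rule reads, and the set of namespaces accepted in the new-style branch ("default" and "kube-system") is an assumption about upstream behaviour. The real validation checks more fields than this, and in the real API type `EventTime` is a `metav1.MicroTime` rather than a `time.Time`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Event is a pared-down stand-in for corev1.Event.
type Event struct {
	Namespace               string    // namespace the Event is written to
	InvolvedObjectNamespace string    // empty for cluster-scoped involved objects
	EventTime               time.Time // zero value means an "old-style" Event
}

var errNamespaceMismatch = errors.New(
	`involvedObject.namespace: Invalid value: "": does not match event.namespace`)

// validateClusterScoped models the namespace check for Events whose
// involvedObject is cluster-scoped. Namespaced objects are out of scope here.
func validateClusterScoped(e Event) error {
	if e.InvolvedObjectNamespace != "" {
		return nil
	}
	if e.EventTime.IsZero() {
		// Old-style branch: only "" or "default" is accepted.
		if e.Namespace != "" && e.Namespace != "default" {
			return errNamespaceMismatch
		}
		return nil
	}
	// New-style branch (assumed): "default" and "kube-system" are accepted.
	if e.Namespace != "default" && e.Namespace != "kube-system" {
		return errNamespaceMismatch
	}
	return nil
}

// stamped returns a copy of e with EventTime set, as the fix does.
func stamped(e Event) Event {
	e.EventTime = time.Now()
	return e
}

func main() {
	e := Event{Namespace: "kube-system"} // cluster-scoped involvedObject
	fmt.Println(validateClusterScoped(e))          // rejected: old-style branch
	fmt.Println(validateClusterScoped(stamped(e))) // accepted: new-style branch
}
```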

…etect deleted parent Gateway

- requiredCRDs probed inferenceobjectives in the graduated inference.networking.k8s.io/v1alpha2 group, which never exists (only InferencePool graduated; InferenceObjective stays experimental in inference.networking.x-k8s.io/v1alpha2). This made clusterCRDMissing perpetually active so clusterReady never fired, timing out the BBR recovery e2e test. Probe the correct x-k8s.io group/version.
- routeNotReady only inspected HTTPRoute status.parents conditions for Accepted/ResolvedRefs=False. When the parent Gateway is deleted, Istio removes the status.parents entry instead of flipping it to False, so no inferencesetRouteNotReady was emitted. Add missingParentGateway() to resolve each HTTPRoute parentRef Gateway and report not-ready when it is NotFound.

Revert the routeNotReady Gateway-existence probe in the status reporter and instead drive inferencesetRouteNotReady from the test using a mechanism the existing detection already handles. Deleting the per-case Gateway is undetectable because Istio removes the HTTPRoute status.parents[] entry instead of flipping Accepted=False. Deleting the InferencePool (the HTTPRoute backendRef) instead makes the route's parent status report ResolvedRefs=False (BackendNotFound), which routeNotReady already surfaces. The InferencePool is rendered by the modeldeployment chart (not KAITO-reconciled), so it stays deleted until the case is reinstalled.
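The difference between the two failure modes is easier to see with a small model of the status check. This is a sketch under stated assumptions: `Condition` and `ParentStatus` are pared-down stand-ins for the Gateway API HTTPRoute status types, and this `routeNotReady` only mirrors the behaviour described in the commit messages, not the reporter's actual function.

```go
package main

import "fmt"

// Condition is a minimal stand-in for a status condition.
type Condition struct {
	Type   string // e.g. "Accepted", "ResolvedRefs"
	Status string // "True", "False", or "Unknown"
	Reason string
}

// ParentStatus is a minimal stand-in for one HTTPRoute status.parents[] entry.
type ParentStatus struct {
	ParentName string
	Conditions []Condition
}

// routeNotReady reports not-ready when any parent entry carries
// Accepted=False or ResolvedRefs=False, and returns a short explanation.
func routeNotReady(parents []ParentStatus) (bool, string) {
	for _, p := range parents {
		for _, c := range p.Conditions {
			if (c.Type == "Accepted" || c.Type == "ResolvedRefs") && c.Status == "False" {
				return true, c.Type + "=False (" + c.Reason + ") on parent " + p.ParentName
			}
		}
	}
	return false, ""
}

func main() {
	// Parent Gateway deleted: the status.parents entry is removed entirely,
	// so there is no condition left to inspect and nothing is detected.
	fmt.Println(routeNotReady(nil))

	// InferencePool backendRef deleted: the parent entry stays and reports
	// ResolvedRefs=False, which the check picks up.
	fmt.Println(routeNotReady([]ParentStatus{{
		ParentName: "example-gateway", // hypothetical name
		Conditions: []Condition{
			{Type: "Accepted", Status: "True", Reason: "Accepted"},
			{Type: "ResolvedRefs", Status: "False", Reason: "BackendNotFound"},
		},
	}}))
}
```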

InferenceObjective is only served from the experimental inference.networking.x-k8s.io/v1alpha2 group (InferencePool graduated to the stable inference.networking.k8s.io/v1 group, InferenceObjective did not). Probing it as a required CRD made clusterCRDMissing perpetually active so clusterReady never fired, and coupling the probe to an experimental alpha version is fragile across GAIE bumps. Remove the entry; only the stable InferencePool is probed for GAIE. Also drop the now-stale inferenceobjectives mention from the RBAC comment.
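To illustrate why a wrong entry keeps clusterCRDMissing active forever, here is a rough model of a required-resource probe. The `GVR` type, the `missingCRDs` function, and the use of a map in place of a real discovery API query are assumptions for the example; only the group, version, and resource strings come from the text above.

```go
package main

import "fmt"

// GVR names one API resource by group, version, and plural resource name.
type GVR struct {
	Group, Version, Resource string
}

// requiredCRDs is an illustrative list holding only the stable GAIE entry.
var requiredCRDs = []GVR{
	{Group: "inference.networking.k8s.io", Version: "v1", Resource: "inferencepools"},
}

// missingCRDs returns the required resources the cluster does not serve.
// served stands in for the result of a discovery API query.
func missingCRDs(required []GVR, served map[GVR]bool) []GVR {
	var missing []GVR
	for _, r := range required {
		if !served[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	served := map[GVR]bool{
		{Group: "inference.networking.k8s.io", Version: "v1", Resource: "inferencepools"}:               true,
		{Group: "inference.networking.x-k8s.io", Version: "v1alpha2", Resource: "inferenceobjectives"}: true,
	}
	fmt.Println(len(missingCRDs(requiredCRDs, served))) // 0

	// The earlier entry pointed at a group/version that is never served,
	// so it is reported missing no matter how healthy the cluster is.
	wrong := append([]GVR{}, requiredCRDs...)
	wrong = append(wrong, GVR{Group: "inference.networking.k8s.io", Version: "v1alpha2", Resource: "inferenceobjectives"})
	fmt.Println(missingCRDs(wrong, served))
}
```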

rambohe-ch requested review from Fei-Guo, techworldhello and zhuangqh as code owners June 21, 2026 23:31

rambohe-ch had a problem deploying to e2e-test June 21, 2026 23:31 — with GitHub Actions Failure

rambohe-ch marked this pull request as draft June 21, 2026 23:31

rambohe-ch force-pushed the dev-issue-87 branch from 8031139 to 1178721 Compare June 22, 2026 08:55

rambohe-ch had a problem deploying to e2e-test June 22, 2026 08:55 — with GitHub Actions Failure

rambohe-ch force-pushed the dev-issue-87 branch from 1178721 to 2bdf635 Compare June 22, 2026 11:07

rambohe-ch had a problem deploying to e2e-test June 22, 2026 11:07 — with GitHub Actions Error

rambohe-ch marked this pull request as ready for review June 22, 2026 11:08

rambohe-ch force-pushed the dev-issue-87 branch from 2bdf635 to 8c1d78a Compare June 22, 2026 11:27

rambohe-ch had a problem deploying to e2e-test June 22, 2026 11:27 — with GitHub Actions Failure

rambohe-ch force-pushed the dev-issue-87 branch from 8c1d78a to 3232b47 Compare June 22, 2026 11:45

rambohe-ch had a problem deploying to e2e-test June 22, 2026 11:46 — with GitHub Actions Failure

rambohe-ch had a problem deploying to e2e-test June 22, 2026 12:40 — with GitHub Actions Failure

rambohe-ch added 3 commits June 22, 2026 23:33

rambohe-ch had a problem deploying to e2e-test June 22, 2026 13:47 — with GitHub Actions Failure

rambohe-ch wants to merge 5 commits into kaito-project:main from rambohe-ch:dev-issue-87
