Skip to content

feat: add productionstack-status-reporter component#105

Open
rambohe-ch wants to merge 5 commits into
kaito-project:mainfrom
rambohe-ch:dev-issue-87
Open

feat: add productionstack-status-reporter component#105
rambohe-ch wants to merge 5 commits into
kaito-project:mainfrom
rambohe-ch:dev-issue-87

Conversation

@rambohe-ch

@rambohe-ch rambohe-ch commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Reason for Change:

Introduce the productionstack-status-reporter, a leader-elected controller-runtime component that surfaces the health of a Production Stack deployment as aggregated Kubernetes Events.

On every resync the reporter scrapes vLLM pod metrics and evaluates a reason catalogue across three layers (control plane, cluster/harness, and model deployment), applies a priority ordering and cross-layer suppression so that a single root cause is reported instead of a cascade of downstream symptoms, and emits the result as aggregated Events in kube-system. It requires only read-only API access to the resources it inspects, plus permission to create Events.

To keep the event stream quiet during the legitimately long startup window, transient states are not reported as Warnings: cluster, harness, EPP and route findings are gated by a configurable startup grace period(object-age gating when the backing resource's age is known, otherwise a debounce), while Workspace and model-pod findings discriminate terminal failures from in-progress provisioning/download states.

Includes:

  • cmd/ entrypoint and pkg/ controllers, reason, and scraper packages with unit tests
  • startup grace gating to suppress transient startup-state Warnings, configurable via --startup-grace-seconds / startupGraceSeconds
  • productionstack-status-reporter Helm subchart (deployment, RBAC, serviceaccount, values schema) wired into the productionstack chart
  • Dockerfile.status-reporter and Makefile build/image targets
  • e2e tests covering cluster status, control-plane errors, harness status, upstream gating, weight-download slowness, and event message hygiene

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Fixes #87

Notes for Reviewers:

Introduce the productionstack-status-reporter, a leader-elected
controller-runtime component that surfaces the health of a Production
Stack deployment as aggregated Kubernetes Events.

On every resync the reporter scrapes vLLM pod metrics and evaluates
a reason catalogue across three layers (control plane, cluster/harness,
and model deployment), applies a priority ordering and cross-layer
suppression so that a single root cause is reported instead of a cascade
of downstream symptoms, and emits the result as aggregated Events in
kube-system. It requires only read-only API access to the resources it
inspects, plus permission to create Events.

To keep the event stream quiet during the legitimately long startup
window, transient states are not reported as Warnings: cluster, harness,
EPP and route findings are gated by a configurable startup grace period
(object-age gating when the backing resource's age is known, otherwise a
debounce), while Workspace and model-pod findings discriminate terminal
failures from in-progress provisioning/download states.

Includes:
- cmd/ entrypoint and pkg/ controllers, reason, and scraper packages
  with unit tests
- startup grace gating to suppress transient startup-state Warnings,
  configurable via --startup-grace-seconds / startupGraceSeconds
- productionstack-status-reporter Helm subchart (deployment, RBAC,
  serviceaccount, values schema) wired into the productionstack chart
- least-privilege RBAC scoped to the resources the reporter actually
  reads as objects: CRD presence is probed via the discovery API, and
  KAITO NodeClaim health is read from Workspace status, so neither
  customresourcedefinitions, keda.sh, karpenter.sh, nor unused core/apps
  subresources are granted
- Dockerfile.status-reporter and Makefile build/image targets
- e2e wiring that builds, pushes, and installs the reporter image via
  the productionstack chart (prepare-image.sh, install-components.sh,
  validate-components.sh), with the reporter's default config aligned to
  the e2e install topology (istio-system / kaito-system namespaces)
- e2e tests covering cluster status, control-plane errors, harness
  status, upstream gating, weight-download slowness, and event message
  hygiene; each owns a dedicated namespace and runs as part of the
  standard (non-nightly) e2e suite

Signed-off-by: rambohe-ch <rambohe.ch@gmail.com>
…-scoped objects pass validation

The emitter publishes control-plane Events into kube-system with a
cluster-scoped involvedObject (Namespace/CRD, so involvedObject.namespace
is empty). Without EventTime set, the apiserver's legacyValidateEvent takes
the old-style branch which requires the Event namespace to be "" or
"default" for cluster-scoped objects, rejecting kube-system with
"involvedObject.namespace: Invalid value: \"\": does not match
event.namespace". As a result no reporter Events were ever created and all
StatusReporter e2e tests timed out.

Set EventTime on the created Event so the new-style validation branch
applies, which permits kube-system for cluster-scoped involvedObjects.
…etect deleted parent Gateway

- requiredCRDs probed inferenceobjectives in the graduated
  inference.networking.k8s.io/v1alpha2 group, which never exists (only
  InferencePool graduated; InferenceObjective stays experimental in
  inference.networking.x-k8s.io/v1alpha2). This made clusterCRDMissing
  perpetually active so clusterReady never fired, timing out the BBR
  recovery e2e test. Probe the correct x-k8s.io group/version.

- routeNotReady only inspected HTTPRoute status.parents conditions for
  Accepted/ResolvedRefs=False. When the parent Gateway is deleted, Istio
  removes the status.parents entry instead of flipping it to False, so no
  inferencesetRouteNotReady was emitted. Add missingParentGateway() to
  resolve each HTTPRoute parentRef Gateway and report not-ready when it is
  NotFound.
Revert the routeNotReady Gateway-existence probe in the status reporter and
instead drive inferencesetRouteNotReady from the test using a mechanism the
existing detection already handles.

Deleting the per-case Gateway is undetectable because Istio removes the
HTTPRoute status.parents[] entry instead of flipping Accepted=False. Deleting
the InferencePool (the HTTPRoute backendRef) instead makes the route's parent
status report ResolvedRefs=False (BackendNotFound), which routeNotReady already
surfaces. The InferencePool is rendered by the modeldeployment chart (not
KAITO-reconciled), so it stays deleted until the case is reinstalled.
InferenceObjective is only served from the experimental
inference.networking.x-k8s.io/v1alpha2 group (InferencePool graduated to the
stable inference.networking.k8s.io/v1 group, InferenceObjective did not).
Probing it as a required CRD made clusterCRDMissing perpetually active so
clusterReady never fired, and coupling the probe to an experimental alpha
version is fragile across GAIE bumps. Remove the entry; only the stable
InferencePool is probed for GAIE. Also drop the now-stale inferenceobjectives
mention from the RBAC comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

productionstack-status-reporter: new control-plane event reporter Deployment

1 participant