feat: add productionstack-status-reporter component #105
Open
rambohe-ch wants to merge 5 commits into
Introduce the productionstack-status-reporter, a leader-elected controller-runtime component that surfaces the health of a Production Stack deployment as aggregated Kubernetes Events.

On every resync the reporter scrapes vLLM pod metrics and evaluates a reason catalogue across three layers (control plane, cluster/harness, and model deployment), applies a priority ordering and cross-layer suppression so that a single root cause is reported instead of a cascade of downstream symptoms, and emits the result as aggregated Events in kube-system. It requires only read-only API access to the resources it inspects, plus permission to create Events.

To keep the event stream quiet during the legitimately long startup window, transient states are not reported as Warnings: cluster, harness, EPP and route findings are gated by a configurable startup grace period (object-age gating when the backing resource's age is known, otherwise a debounce), while Workspace and model-pod findings discriminate terminal failures from in-progress provisioning/download states.

Includes:
- cmd/ entrypoint and pkg/ controllers, reason, and scraper packages with unit tests
- startup grace gating to suppress transient startup-state Warnings, configurable via --startup-grace-seconds / startupGraceSeconds
- productionstack-status-reporter Helm subchart (deployment, RBAC, serviceaccount, values schema) wired into the productionstack chart
- least-privilege RBAC scoped to the resources the reporter actually reads as objects: CRD presence is probed via the discovery API, and KAITO NodeClaim health is read from Workspace status, so neither customresourcedefinitions, keda.sh, karpenter.sh, nor unused core/apps subresources are granted
- Dockerfile.status-reporter and Makefile build/image targets
- e2e wiring that builds, pushes, and installs the reporter image via the productionstack chart (prepare-image.sh, install-components.sh, validate-components.sh), with the reporter's default config aligned to the e2e install topology (istio-system / kaito-system namespaces)
- e2e tests covering cluster status, control-plane errors, harness status, upstream gating, weight-download slowness, and event message hygiene; each owns a dedicated namespace and runs as part of the standard (non-nightly) e2e suite

Signed-off-by: rambohe-ch <rambohe.ch@gmail.com>
…-scoped objects pass validation

The emitter publishes control-plane Events into kube-system with a cluster-scoped involvedObject (Namespace/CRD, so involvedObject.namespace is empty). Without EventTime set, the apiserver's legacyValidateEvent takes the old-style branch which requires the Event namespace to be "" or "default" for cluster-scoped objects, rejecting kube-system with "involvedObject.namespace: Invalid value: \"\": does not match event.namespace". As a result no reporter Events were ever created and all StatusReporter e2e tests timed out.

Set EventTime on the created Event so the new-style validation branch applies, which permits kube-system for cluster-scoped involvedObjects.
…etect deleted parent Gateway

- requiredCRDs probed inferenceobjectives in the graduated inference.networking.k8s.io/v1alpha2 group, which never exists (only InferencePool graduated; InferenceObjective stays experimental in inference.networking.x-k8s.io/v1alpha2). This made clusterCRDMissing perpetually active so clusterReady never fired, timing out the BBR recovery e2e test. Probe the correct x-k8s.io group/version.
- routeNotReady only inspected HTTPRoute status.parents conditions for Accepted/ResolvedRefs=False. When the parent Gateway is deleted, Istio removes the status.parents entry instead of flipping it to False, so no inferencesetRouteNotReady was emitted. Add missingParentGateway() to resolve each HTTPRoute parentRef Gateway and report not-ready when it is NotFound.
Revert the routeNotReady Gateway-existence probe in the status reporter and instead drive inferencesetRouteNotReady from the test using a mechanism the existing detection already handles. Deleting the per-case Gateway is undetectable because Istio removes the HTTPRoute status.parents[] entry instead of flipping Accepted=False. Deleting the InferencePool (the HTTPRoute backendRef) instead makes the route's parent status report ResolvedRefs=False (BackendNotFound), which routeNotReady already surfaces. The InferencePool is rendered by the modeldeployment chart (not KAITO-reconciled), so it stays deleted until the case is reinstalled.
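To illustrate why the two deletions behave differently, here is a minimal sketch of a condition-based check like the one described. The `Condition` and `ParentStatus` types are hypothetical stand-ins for the HTTPRoute status.parents entries, and this `routeNotReady` is an assumption about the shape of the detection, not the reporter's actual function.

```go
package main

import "fmt"

// Condition and ParentStatus are hypothetical, trimmed-down stand-ins for
// the entries under an HTTPRoute's status.parents.
type Condition struct {
	Type   string
	Status string
	Reason string
}

type ParentStatus struct {
	Conditions []Condition
}

// routeNotReady reports true when any parent status carries
// Accepted=False or ResolvedRefs=False.
func routeNotReady(parents []ParentStatus) bool {
	for _, p := range parents {
		for _, c := range p.Conditions {
			if (c.Type == "Accepted" || c.Type == "ResolvedRefs") && c.Status == "False" {
				return true
			}
		}
	}
	return false
}

func main() {
	// Gateway deleted: the status.parents entry is removed, so there is
	// nothing to inspect and the check stays false.
	fmt.Println(routeNotReady(nil))

	// InferencePool deleted: the parent entry remains and reports
	// ResolvedRefs=False with reason BackendNotFound.
	fmt.Println(routeNotReady([]ParentStatus{{
		Conditions: []Condition{
			{Type: "Accepted", Status: "True"},
			{Type: "ResolvedRefs", Status: "False", Reason: "BackendNotFound"},
		},
	}}))
}
```

An empty parents list is indistinguishable from a route that has not been reconciled yet, which is why the test drives the condition through the backendRef instead.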
InferenceObjective is only served from the experimental inference.networking.x-k8s.io/v1alpha2 group (InferencePool graduated to the stable inference.networking.k8s.io/v1 group; InferenceObjective did not). Probing it as a required CRD made clusterCRDMissing perpetually active so clusterReady never fired, and coupling the probe to an experimental alpha version is fragile across GAIE bumps. Remove the entry; only the stable InferencePool is probed for GAIE. Also drop the now-stale inferenceobjectives mention from the RBAC comment.
Reason for Change:
Introduce the productionstack-status-reporter, a leader-elected controller-runtime component that surfaces the health of a Production Stack deployment as aggregated Kubernetes Events.
On every resync the reporter scrapes vLLM pod metrics and evaluates a reason catalogue across three layers (control plane, cluster/harness, and model deployment), applies a priority ordering and cross-layer suppression so that a single root cause is reported instead of a cascade of downstream symptoms, and emits the result as aggregated Events in kube-system. It requires only read-only API access to the resources it inspects, plus permission to create Events.
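The priority ordering and cross-layer suppression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reporter's implementation: the `Finding` type, the `Suppress` function, the numeric priorities, and the layer assigned to each reason are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Layer orders the three layers from most upstream (lowest value) to most
// downstream. The numeric ordering is an assumption for this sketch.
type Layer int

const (
	ControlPlane Layer = iota
	ClusterHarness
	ModelDeployment
)

// Finding is a hypothetical evaluated reason from the catalogue.
type Finding struct {
	Reason   string
	Layer    Layer
	Priority int // lower value is reported first within a layer
}

// Suppress keeps only the findings from the most upstream failing layer and
// orders them by priority, so downstream symptoms are dropped.
func Suppress(findings []Finding) []Finding {
	if len(findings) == 0 {
		return nil
	}
	root := findings[0].Layer
	for _, f := range findings[1:] {
		if f.Layer < root {
			root = f.Layer
		}
	}
	var kept []Finding
	for _, f := range findings {
		if f.Layer == root {
			kept = append(kept, f)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return kept[i].Priority < kept[j].Priority
	})
	return kept
}

func main() {
	findings := []Finding{
		{Reason: "inferencesetRouteNotReady", Layer: ModelDeployment, Priority: 1},
		{Reason: "clusterCRDMissing", Layer: ClusterHarness, Priority: 0},
	}
	for _, f := range Suppress(findings) {
		fmt.Println(f.Reason)
	}
}
```

With these inputs only the cluster-layer finding survives, since the route finding sits in a downstream layer.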
To keep the event stream quiet during the legitimately long startup window, transient states are not reported as Warnings: cluster, harness, EPP and route findings are gated by a configurable startup grace period (object-age gating when the backing resource's age is known, otherwise a debounce), while Workspace and model-pod findings discriminate terminal failures from in-progress provisioning/download states.
Includes:
Requirements
Issue Fixed:
Fixes #87
Notes for Reviewers: