feat: prepare helm chart for kubernetes deployment options by crookedstorm · Pull Request #319 · NVIDIA-NeMo/nemo-platform

crookedstorm · 2026-06-12T21:11:06Z

Summary by CodeRabbit

New Features
- Full Helm chart for the platform, including ingress/HTTPRoute/OpenShift Route, optional Envoy proxy, PostgreSQL options, and Kyverno-based multi-node GPU networking.
- Added an NCCL multi-node allreduce test suite (orchestrated for CI validation).
Documentation
- Expanded Helm README and richer post-install instructions, including encryption-key guidance.
Bug Fixes
- Improved API environment secret rendering and upgrade-time encryption-key safeguards; enhanced image-pull secret handling.
Chores
- CI now detects Helm changes and runs dedicated Helm lint/verification jobs; lint/build tooling and Helm packaging defaults updated.

github-actions · 2026-06-12T21:22:13Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	19462/25829	75.3%	60.9%
Integration Tests	11372/24601	46.2%	20.2%

coderabbitai · 2026-06-12T21:28:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds complete k8s/helm Helm chart with templates for API, core controller, PostgreSQL, Envoy proxy, ingress/routes, and Kyverno policies. Includes NCCL multi-node test orchestration, CI validation fixtures, and change-detection-gated linting/verification jobs. Repository paths and tooling updated to align with k8s/helm location.

Changes

Helm chart implementation and CI integration

Layer / File(s)	Summary
CI Helm detection and validation `.github/actions/changes/action.yaml`, `.github/workflows/ci.yaml`	Change detector marks `k8s/**` as Helm-related; CI adds conditional `helm-lint` and `helm-chart-verifier` jobs, includes them in `ci-status`.
Chart scaffold, packaging, and docs `k8s/helm/Chart.yaml`, `k8s/helm/.helmignore`, `k8s/helm/helm-docs-template/*`, `k8s/helm/README.md`, `.gitignore`, `.pre-commit-config.yaml`, `AGENTS.md`	Chart metadata/dependency, packaging ignore rules, README template and rendered output, repo ignores switched to `k8s/helm`.
Global defaults and shared helpers `k8s/helm/values.yaml`, `k8s/helm/templates/_helpers.tpl`, `k8s/helm/templates/_config-render.tpl`	Comprehensive defaults for API/core/Envoy/PostgreSQL; 40+ helpers for naming, config/env rendering, security context merging, telemetry, backend selection, secret handling.
Platform config, secrets, and lifecycle `k8s/helm/templates/platform-configmap.yaml`, `k8s/helm/templates/api-env-secret*.yaml`, `k8s/helm/templates/NOTES.txt`	Calculated ConfigMap, conditional Secret with generated/explicit key paths, pre-install key-generator Job, upgrade validation, release NOTES.
API service stack `k8s/helm/templates/api/*`	Deployment with conditional DB/env wiring, Service, ServiceAccount, HPA, PDB, ServiceMonitor, platform-seed Job hook.
Core controller and data layer `k8s/helm/templates/core/`, `k8s/helm/templates/postgres/`	Controller Deployment/RBAC/service, ServiceAccounts, shared PVC; PostgreSQL StatefulSet/Service/Secret with persistence toggle.
Envoy proxy and external routing `k8s/helm/templates/proxy/*`, `k8s/helm/templates/ingress.yaml`, `k8s/helm/templates/httproute.yaml`, `k8s/helm/templates/openshift-route.yaml`	Envoy ConfigMap/Deployment/Service/HPA/monitor; Ingress/HTTPRoute/OpenShift Route conditional on auth/enablement.
Kyverno multinode policies `k8s/helm/templates/networking/kyverno-policy.yaml`, `k8s/helm/templates/networking/nccl-topology-configmap.yaml`	Four cloud-specific Policies (AWS/Azure/GCP/OCI) mutate Pods for GPU networking; Azure topology ConfigMap.
NCCL allreduce test `k8s/helm/files/nccl-test/*`, `k8s/helm/templates/tests/nccl-test.yaml`	Environment setup, hardware discovery entrypoint, allreduce benchmark, Python orchestrator for node discovery/pod coordination; Helm test hooks with RBAC/ConfigMap/Jobs.
CI validation fixtures `k8s/helm/ci/*.yaml`	20 CI values files exercising auth, security contexts, cloud toggles, env injection, topology spread, ingress/route variants, PDBs, encryption-key paths.
Repository tooling updates `tools/lint/lint-helm.sh`, `web/packages/studio/scripts/feature-flag-matrix.ts`, `services/core/jobs/README.md`, `pyproject.toml`	Lint script defaults to `k8s/helm`; feature-flag matrix supports optional `HELM_DEPLOYMENT_VALUES_DIR`; docs/image names updated; type-check excludes NCCL test files.

Suggested labels

chore

Suggested reviewers

mckornfield
matthewgrossman

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch setup-helm/crookedstorm

coderabbitai

Actionable comments posted: 2

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (21)

k8s/helm/files/nccl-test/orchestrator.py-547-555 (1)

547-555: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Track pods in created_names immediately after creation.

If Kyverno assertion or wait logic fails before append, created pods are not cleaned up.

Suggested fix

         print(f"Creating pod {name0} on node {n0}")
         v1.create_namespaced_pod(namespace, body=pod0)
+        created_names.append(name0)
         if worker_net and wn_name and not injection_disabled:
             _wait_assert_kyverno_interconnect(v1, namespace, name0, wn_name, wn_req, wn_lim)
         p0 = _wait_pod_running(v1, namespace, name0, timeout_s=300)
@@
-        created_names.append(name0)
@@
             print(f"Creating pod {name} on node {hostname}")
             v1.create_namespaced_pod(namespace, body=pod)
+            created_names.append(name)
             if worker_net and wn_name and not injection_disabled:
                 _wait_assert_kyverno_interconnect(v1, namespace, name, wn_name, wn_req, wn_lim)
-            created_names.append(name)

Also applies to: 562-565

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/files/nccl-test/orchestrator.py` around lines 547 - 555, The code
creates pods via v1.create_namespaced_pod but only appends the pod name to
created_names after subsequent waits/assertions
(_wait_assert_kyverno_interconnect and _wait_pod_running), so if those fail the
pod is not tracked for cleanup; immediately append the pod identifier (e.g.,
name0) to created_names right after v1.create_namespaced_pod(...) and before
calling _wait_assert_kyverno_interconnect/_wait_pod_running, and make the same
change for the other symmetric creation block (the one that uses name1) so all
created pods are recorded even if later waits fail.

.github/workflows/ci.yaml-102-103 (1)

102-103: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Disable credential persistence in checkout steps for Helm jobs.

These jobs run repository-controlled scripts; leaving auth in .git/config unnecessarily expands token exposure on PR runs.

Suggested fix

   - name: Checkout code
     uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
+    with:
+      persist-credentials: false

Also applies to: 152-153

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yaml around lines 102 - 103, The checkout steps
currently use actions/checkout (the "Checkout code" step with uses:
actions/checkout@...) and leave credentials persisted in .git/config; update
those checkout steps (including the similar occurrences around the later lines
noted) to set persist-credentials: false so the runner does not write repository
auth into .git/config for Helm/PR jobs, keeping the step name "Checkout code"
and the uses: actions/checkout entries but adding persist-credentials: false to
the step inputs.

Source: Linters/SAST tools

k8s/helm/files/nccl-test/orchestrator.py-170-183 (1)

170-183: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cleanup exceptions in finally can mask the real test outcome.

If _delete_workers times out, it can override the original return/error path from config_test.

Suggested fix

     finally:
-        _delete_workers(v1, namespace, created_names)
+        try:
+            _delete_workers(v1, namespace, created_names)
+        except Exception as cleanup_error:
+            print(f"WARN: worker cleanup failed: {cleanup_error}", file=sys.stderr)

Also applies to: 587-588

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/files/nccl-test/orchestrator.py` around lines 170 - 183, _fix:
cleanup code in _delete_workers can raise (e.g. _wait_pod_absent timeout) and
thereby mask the original failure from config_test; change _delete_workers to
never let cleanup exceptions propagate: wrap the call to _wait_pod_absent (and
the delete_namespaced_pod block) in try/except that catches ApiException and
generic Exception, log warnings to stderr with context (pod name and exception),
and continue rather than re-raising; apply the same defensive try/except pattern
to the other cleanup site referenced (lines 587-588) so cleanup never hides the
original test error.

.github/workflows/ci.yaml-112-117 (1)

112-117: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Verify release asset integrity before executing downloaded binaries.

kubeconform and chart-verifier are downloaded and executed without checksum/signature validation. That is a supply-chain risk.

Suggested fix pattern

           curl -fsSL \
             -o "${RUNNER_TEMP}/kubeconform.tar.gz" \
             "https://github.com/yannh/kubeconform/releases/download/${KUBECONFORM_VERSION}/kubeconform-linux-amd64.tar.gz"
+          echo "${KUBECONFORM_SHA256}  ${RUNNER_TEMP}/kubeconform.tar.gz" | sha256sum -c -
           tar -xzf "${RUNNER_TEMP}/kubeconform.tar.gz" -C "${RUNNER_TEMP}/kubeconform" kubeconform

           curl -fsSL \
             -o "${RUNNER_TEMP}/chart-verifier.tar.gz" \
             "https://github.com/redhat-certification/chart-verifier/releases/download/${CHART_VERIFIER_VERSION}/chart-verifier-${CHART_VERIFIER_VERSION}.tgz"
+          echo "${CHART_VERIFIER_SHA256}  ${RUNNER_TEMP}/chart-verifier.tar.gz" | sha256sum -c -
           tar -xzf "${RUNNER_TEMP}/chart-verifier.tar.gz" -C "${RUNNER_TEMP}/chart-verifier" chart-verifier

Also applies to: 162-167

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yaml around lines 112 - 117, The workflow downloads and
executes kubeconform (and chart-verifier at the other block) without verifying
release integrity; update the CI steps that fetch
"${RUNNER_TEMP}/kubeconform.tar.gz" (KUBECONFORM_VERSION) and the chart-verifier
download to also fetch the corresponding checksum or signature file from the
release, verify it before extracting/executing (e.g., download .sha256 or .asc
and validate with shasum -c or gpg --verify), and fail the job if verification
fails so the subsequent tar, chmod and adding to "${GITHUB_PATH}" only run after
a successful integrity/signature check.

k8s/helm/files/nccl-test/entrypoint.sh-72-81 (1)

72-81: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Strict IB mode is bypassed when ibv_devinfo fails.

In strict mode, a failed ibv_devinfo call should fail fast; currently it falls through silently.

Suggested fix

     if command -v ibv_devinfo >/dev/null 2>&1; then
       if ibv_devinfo > /tmp/ibv_precheck.out 2>&1; then
         if ! grep -q "PORT_ACTIVE" /tmp/ibv_precheck.out; then
           echo "ERROR: InfiniBand devices present but ibv_devinfo shows no PORT_ACTIVE (see network-operator README)."
           cat /tmp/ibv_precheck.out
           exit 1
         fi
         echo "✓ ibv_devinfo reports PORT_ACTIVE"
+      else
+        echo "ERROR: NCCL_TEST_STRICT_IB_PORT_ACTIVE=true but ibv_devinfo failed"
+        cat /tmp/ibv_precheck.out 2>/dev/null || true
+        exit 1
       fi
     fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/files/nccl-test/entrypoint.sh` around lines 72 - 81, The script
currently ignores failures from ibv_devinfo; modify the ibv_devinfo check in
entrypoint.sh so that if the ibv_devinfo invocation itself fails and strict IB
mode is enabled (use the existing strict-mode flag/variable used elsewhere in
the script, e.g., STRICT_IB or similar), the script should print an error
including the ibv_devinfo output and exit non-zero; otherwise (non-strict mode)
continue as before. Specifically, update the block that runs `ibv_devinfo >
/tmp/ibv_precheck.out 2>&1` to test its exit status, and on failure when strict
mode is true call `echo` (with context) and `exit 1`, while preserving the
existing PORT_ACTIVE grep path when ibv_devinfo succeeds.

k8s/helm/templates/networking/kyverno-policy.yaml-187-188 (1)

187-188: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

NCCL_TOPO_FILE points to a placeholder, not the mounted file.

Line 188 should reference the actual mounted ConfigMap path, otherwise NCCL topology injection is ineffective.

Suggested fix

                 - name: "NCCL_TOPO_FILE"
-                  value: NcclTopoFilePath
+                  value: "/etc/nccl/nccl-topo.xml"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 187 - 188,
The environment variable NCCL_TOPO_FILE currently contains the literal
placeholder "NcclTopoFilePath" instead of pointing at the actual mounted
ConfigMap file; change the value to the actual mount path or wire it to the
ConfigMap key so the container can read the topology file. Locate the
NCCL_TOPO_FILE entry in networking/kyverno-policy.yaml and either set value to
the exact file path where the ConfigMap is mounted (for example the folder used
by your volumeMount) or replace the value with a valueFrom configMapKeyRef that
references the ConfigMap name and key that holds the topology file so the
process reads the real file at runtime.

k8s/helm/templates/networking/kyverno-policy.yaml-263-288 (1)

263-288: 🎯 Functional Correctness | 🟠 Major

GCP mutate patchStrategicMerge writes Pod annotations at the wrong level—use metadata.annotations.

In k8s/helm/templates/networking/kyverno-policy.yaml (lines 263-288), the Pod patchStrategicMerge sets annotations: at the patch root; for Pod resources these must be under metadata.annotations so Kyverno can apply them.

Suggested fix

           patchStrategicMerge:
-            annotations:
-              devices.gke.io/container.tcpxo-daemon: |
+            metadata:
+              annotations:
+                devices.gke.io/container.tcpxo-daemon: |
                 - path: /dev/nvidia0
                 - path: /dev/nvidia1
                 - path: /dev/nvidia2
                 - path: /dev/nvidia3
                 - path: /dev/nvidia4
                 - path: /dev/nvidia5
                 - path: /dev/nvidia6
                 - path: /dev/nvidia7
                 - path: /dev/nvidiactl
                 - path: /dev/nvidia-uvm
                 - path: /dev/dmabuf_import_helper
-              networking.gke.io/default-interface: eth0
-              networking.gke.io/interfaces: |
+                networking.gke.io/default-interface: eth0
+                networking.gke.io/interfaces: |
                 [
                   {"interfaceName":"eth0","network":"default"},
                   {"interfaceName":"eth1","network":"gpu-nic0"},
                   {"interfaceName":"eth2","network":"gpu-nic1"},
                   {"interfaceName":"eth3","network":"gpu-nic2"},
                   {"interfaceName":"eth4","network":"gpu-nic3"},
                   {"interfaceName":"eth5","network":"gpu-nic4"},
                   {"interfaceName":"eth6","network":"gpu-nic5"},
                   {"interfaceName":"eth7","network":"gpu-nic6"},
                   {"interfaceName":"eth8","network":"gpu-nic7"}
                 ]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 263 - 288,
The Kyverno patch under patchStrategicMerge is placing annotations at the patch
root instead of under metadata.annotations; update the patch used in the Kyverno
policy (patchStrategicMerge) so that the existing annotations block (including
devices.gke.io/container.tcpxo-daemon, networking.gke.io/default-interface and
networking.gke.io/interfaces) is nested beneath metadata.annotations for the Pod
template/resource; keep the annotation keys and values unchanged but move them
into metadata.annotations so Kyverno will apply them correctly.

k8s/helm/templates/tests/nccl-test.yaml-187-189 (1)

187-189: 🩺 Stability & Availability | 🟠 Major

Pin Kubernetes Python client version in helm test.

k8s/helm/templates/tests/nccl-test.yaml installs kubernetes>=28.0.0 at test runtime, making helm test non-deterministic (different runs can pull different kubernetes client versions). Pin an exact version (or bake it into the orchestrator image and remove the runtime pip install).

Suggested change

-              pip install --no-cache-dir 'kubernetes>=28.0.0' >/dev/null
+              pip install --no-cache-dir 'kubernetes==<pinned-version>' >/dev/null

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 187 - 189, The test
YAML currently runs pip install 'kubernetes>=28.0.0' at runtime (the pip install
line in nccl-test.yaml) which makes helm test nondeterministic; change that line
to pin an exact client version (e.g. pip install --no-cache-dir
'kubernetes==28.0.0') or remove the runtime pip install entirely and ensure
/scripts/orchestrator.py runs with a prebaked image that already contains the
pinned kubernetes client; update the nccl-test.yaml pip install invocation or
the orchestrator image accordingly so tests always use a fixed kubernetes client
version.

k8s/helm/templates/proxy/envoy-service.yaml-13-22 (1)

13-22: 🔒 Security & Privacy | 🟠 Major

Gate Envoy admin Service exposure behind an explicit opt-in (prevents NodePort/LoadBalancer leakage).

k8s/helm/templates/proxy/envoy-service.yaml always publishes the Envoy admin port via a Service whose type is configurable (.Values.envoyProxy.service.type). Envoy binds the admin listener to 0.0.0.0, so with NodePort/LoadBalancer the admin endpoint can become externally reachable.
Proposed direction
   ports:
     - name: proxy
       port: {{ .Values.envoyProxy.service.port }}
       targetPort: proxy
       protocol: TCP
+    {{- if .Values.envoyProxy.service.exposeAdminPort }}
     - name: admin
       port: {{ .Values.envoyProxy.adminPort }}
       targetPort: admin
       protocol: TCP
+    {{- end }}
Add envoyProxy.service.exposeAdminPort defaulting to false and enable only where needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/proxy/envoy-service.yaml` around lines 13 - 22, The
Service currently always exposes the Envoy admin port (port name
"admin"/template envoy-service.yaml) which can leak externally when
.Values.envoyProxy.service.type is NodePort/LoadBalancer; add a new boolean
value .Values.envoyProxy.service.exposeAdminPort (default false in values.yaml)
and change the template to include the "admin" port entry only when that flag is
true (keep the existing "proxy" port unconditional), ensuring you reference the
env var names admin port (.Values.envoyProxy.adminPort) and service type
(.Values.envoyProxy.service.type) so maintainers can opt-in to exposing admin.

k8s/helm/templates/ingress.yaml-25-30 (1)

25-30: 🎯 Functional Correctness | 🟠 Major

Guard .Values.ingress.defaultHost against empty .Values.ingress.hosts to prevent render-time Helm failure.

k8s/helm/templates/ingress.yaml indexes (index .Values.ingress.hosts 0).paths when defaultHost is set, without checking that ingress.hosts has at least one element. This can crash helm template if users override ingress.hosts: [].

Proposed fix

   rules:
     {{- if .Values.ingress.defaultHost }}
+    {{- if eq (len .Values.ingress.hosts) 0 }}
+    {{- fail "ingress.hosts must contain at least one item when ingress.defaultHost is set" }}
+    {{- end }}
     - host: {{ .Values.ingress.defaultHost | quote }}
       http:
         paths:
           {{- range (index .Values.ingress.hosts 0).paths }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/ingress.yaml` around lines 25 - 30, The template accesses
(index .Values.ingress.hosts 0).paths when .Values.ingress.defaultHost is
present, which crashes if .Values.ingress.hosts is empty; update the ingress
template to guard that .Values.ingress.hosts has at least one element before
indexing (e.g., check with "and .Values.ingress.defaultHost (gt (len
.Values.ingress.hosts) 0)" or equivalent conditional) and only iterate over
(index .Values.ingress.hosts 0).paths when that condition is true so helm
template won't fail when ingress.hosts is [].

k8s/helm/templates/proxy/_helpers.tpl-21-21 (1)

21-21: 🎯 Functional Correctness | 🟠 Major

Truncate fullname before appending -envoy in nmp-envoy.serviceAccountName

File: k8s/helm/templates/proxy/_helpers.tpl
Lines: 21-21
{{- default (printf "%s-envoy" (include "nemo-platform.fullname" .)) .Values.envoyProxy.serviceAccount.name }}
nemo-platform.fullname is truncated to 63 chars, but nmp-envoy.serviceAccountName appends -envoy without truncation, so the resulting metadata.name can exceed the DNS label 63-char limit and fail resource creation. Same un-truncated pattern also exists for nmp-api.apiServiceAccountName and nmp-core.*ServiceAccountName helpers.
Proposed fix
-{{- default (printf "%s-envoy" (include "nemo-platform.fullname" .)) .Values.envoyProxy.serviceAccount.name }}
+{{- default (printf "%s-envoy" (include "nemo-platform.fullname" . | trunc 57)) .Values.envoyProxy.serviceAccount.name }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/proxy/_helpers.tpl` at line 21, Truncate the fullname
before appending suffixes so the final serviceAccount name stays within the
63-char DNS label limit; replace the current expression with one that truncates
the result of (include "nemo-platform.fullname" .) to 57 chars then appends
"-envoy" (63 - len("-envoy") = 57), e.g. use the template truncation function on
nemo-platform.fullname inside the printf; apply the same pattern to the
nmp-api.apiServiceAccountName and all nmp-core.*ServiceAccountName helpers,
adjusting the truncation length by subtracting the suffix length from 63 for
each suffix.

k8s/helm/templates/openshift-route.yaml-20-20 (1)

20-20: 🎯 Functional Correctness | 🟠 Major

Don’t coerce OpenShift Route targetPort to int

route.openshift.io/v1 spec.port.targetPort is IntOrString (named or numeric), but the template forces | int, so a named port becomes 0 and breaks routing.

Suggested fix

-    targetPort: {{ tpl .Values.openshiftRoute.targetPort $ | int }}
+    targetPort: {{ tpl .Values.openshiftRoute.targetPort $ }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/openshift-route.yaml` at line 20, The template coerces
.Values.openshiftRoute.targetPort to an int which breaks named ports; update the
openshift-route.yaml so spec.port.targetPort uses the rendered tpl value
directly (remove the | int coercion) and preserve the original type/string
(e.g., use {{ tpl .Values.openshiftRoute.targetPort $ }} or quote it) so that
.Values.openshiftRoute.targetPort can be a named port or a numeric port without
being cast to 0.

k8s/helm/values.yaml-22-25 (1)

22-25: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Secret defaults disable chart-generated secrets by default.

Line 22 and Line 25 are non-empty, so ngc-api-secret.yaml and imagepull-secret.yaml do not render, while workloads still reference those secret names. Fresh installs fail unless both secrets already exist. Default these to "" or hard-fail with explicit validation when missing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/values.yaml` around lines 22 - 25, The values existingSecret and
existingImagePullSecret are set to non-empty defaults which disable rendering of
the chart templates ngc-api-secret.yaml and imagepull-secret.yaml while
workloads still reference those secret names; change the defaults for
existingSecret and existingImagePullSecret in values.yaml to "" (empty string)
so the chart generates the secrets, or alternatively add explicit validation in
the chart (e.g., in _helpers or a pre-install hook) that fails fast when those
values are non-empty but the referenced secrets are missing; update references
to the symbols existingSecret and existingImagePullSecret and the templates
ngc-api-secret.yaml/imagepull-secret.yaml accordingly so fresh installs do not
fail.

k8s/helm/templates/_helpers.tpl-237-247 (1)

237-247: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

External DB path can render an invalid empty secret reference.

When postgresql.enabled=false and no externalDatabase.existingSecret/uriSecret is set, Line 268 renders secretKeyRef.name empty. Fail early with required instead of emitting invalid manifests.

Suggested fix

 {{- define "nemo-common.postgresql.password" -}}
 {{- if not (and .Values.externalDatabase.uriSecret .Values.externalDatabase.uriSecret.name .Values.externalDatabase.uriSecret.key) }}
+{{- if and (not .Values.postgresql.enabled) (not .Values.externalDatabase.existingSecret) -}}
+{{- required "externalDatabase.existingSecret is required when postgresql.enabled=false and externalDatabase.uriSecret is not set" .Values.externalDatabase.existingSecret -}}
+{{- end }}
 - name: POSTGRES_DB_PASSWORD
   valueFrom:
     secretKeyRef:
       name: {{ include "nemo-common.postgresql.secret-name" .}}
       key: {{ include "nemo-common.postgresql.password-key" .}}
 {{- end }}
 {{- end -}}

Also applies to: 252-270

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/_helpers.tpl` around lines 237 - 247, The helper template
"nemo-common.postgresql.secret-name" can return an empty string when
postgresql.enabled is false and neither externalDatabase.existingSecret nor
uriSecret (or the postgresql auth existingSecret) is set, which leads to invalid
manifests (empty secretKeyRef.name); change the template to fail early using
Helm's required function so it errors when no secret is provided (e.g., wrap the
externalDatabase.existingSecret or uriSecret lookup with required and a clear
message), and apply the same fix to the other helper occurrences referenced
around lines 252-270 so any missing secret causes a clear required error instead
of emitting an empty name.

k8s/helm/templates/_helpers.tpl-183-185 (1)

183-185: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

PostgreSQL service account name can exceed 63 chars.

Line 184 appends -postgres to nemo-platform.fullname without truncation. Long release names will generate invalid ServiceAccount names.

Suggested fix

 {{- define "nemo-common.postgresql.serviceAccountName" -}}
 {{- if .Values.postgresql.serviceAccount.create -}}
-{{- default (printf "%s-postgres" (include "nemo-platform.fullname" .)) .Values.postgresql.serviceAccount.name }}
+{{- default (include "nemo-common.postgresql.fullname" .) .Values.postgresql.serviceAccount.name }}
 {{- else -}}
 {{- default "default" .Values.postgresql.serviceAccount.name }}
 {{- end -}}
 {{- end -}}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/_helpers.tpl` around lines 183 - 185, The ServiceAccount
name currently appends "-postgres" to (include "nemo-platform.fullname" .)
without truncation, which can exceed the 63-char DNS limit; update the template
producing the default name to truncate the fullname to 54 characters before
appending "-postgres" (63 - len("-postgres") = 54) so the final name is <=63
chars, e.g. replace the default expression that uses (include
"nemo-platform.fullname" .) with one that calls the truncation function (e.g.
trunc/transpose appropriate Helm/Sprig trunc function) on that include and then
printf "%s-postgres" against the truncated value, keeping the existing fallback
to .Values.postgresql.serviceAccount.name.

k8s/helm/templates/_helpers.tpl-294-297 (1)

294-297: 🩺 Stability & Availability | 🟠 Major

Sort map keys before rendering env vars to keep output deterministic.

These range blocks iterate over Go template maps ($merged is built via dict + set, and other .Values.*.env fields are kindIs "map"), so env var ordering can flap between renders.
{{- range $key, $val := $merged }}
- name: {{ $key }}
  value: {{ $val | quote }}
{{- end }}
Sort keys before rendering (same issue also at 355-363, 376-384, 397-405, 417-425).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/_helpers.tpl` around lines 294 - 297, The env var
rendering iterates over the map $merged which makes output order
nondeterministic; change to iterate over a sorted list of keys (use keys $merged
piped to sortAlpha) and then index into $merged for each key so the - name/value
blocks are emitted in deterministic alphabetical order; apply the same pattern
for the other range blocks mentioned (around the occurrences at lines ~355-363,
~376-384, ~397-405, ~417-425) using the sorted keys and index function.

k8s/helm/templates/api/api-servicemonitor.yaml-21-24 (1)

21-24: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Selector mismatch: using full labels instead of selectorLabels.

Line 24 uses nemo-platform.labels which includes chart version and managed-by. ServiceMonitor selector should use nemo-platform.selectorLabels to match the Service selector (line 22 in api-service.yaml).

🔧 Fix selector

   selector:
     matchLabels:
       app.kubernetes.io/component: nmp-api
-      {{- include "nemo-platform.labels" . | nindent 6 }}
+      {{- include "nemo-platform.selectorLabels" . | nindent 6 }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/api-servicemonitor.yaml` around lines 21 - 24, The
ServiceMonitor selector is using the full label set via the template helper
nemo-platform.labels (inside matchLabels) which includes version/managed-by keys
and therefore mismatches the Service's selector; change the selector in
api-servicemonitor.yaml to use the selector-only helper
nemo-platform.selectorLabels instead of nemo-platform.labels so matchLabels
exactly matches the Service selector (update the template reference where
matchLabels includes {{- include "nemo-platform.labels" . | nindent 6 }} to use
{{- include "nemo-platform.selectorLabels" . | nindent 6 }}).

k8s/helm/templates/api/api-hpa.yaml-21-37 (1)

21-37: 🎯 Functional Correctness | 🟠 Major

Guard against rendering an HPA with an empty metrics list
The template only emits CPU/memory metrics when api.autoscaling.targetCPUUtilizationPercentage / ...targetMemoryUtilizationPercentage are set; if both are unset while api.autoscaling.enabled=true, metrics: becomes empty and Kubernetes rejects the HPA. Current defaults are safe (api.autoscaling.enabled=false, targetCPUUtilizationPercentage=80), but add a guard/required (or ensure one default target is set) to prevent the misconfigured invalid manifest.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/api-hpa.yaml` around lines 21 - 37, The HPA template
can render an empty metrics: block if neither
.Values.api.autoscaling.targetCPUUtilizationPercentage nor
.Values.api.autoscaling.targetMemoryUtilizationPercentage are set; update the
api-hpa.yaml template to guard rendering of the entire metrics: section by
checking that at least one of those values exists (or provide a safe default
metric) and ensure .Values.api.autoscaling.enabled is considered when deciding
to emit metrics; locate the metrics block using the symbols
.Values.api.autoscaling.targetCPUUtilizationPercentage and
.Values.api.autoscaling.targetMemoryUtilizationPercentage and change the
template logic so metrics: is only emitted when it will contain at least one
valid entry.

k8s/helm/templates/api/api-deployment.yaml-95-99 (1)

95-99: 🩺 Stability & Availability | 🟠 Major

Add fail-fast/documentation for the NGC_API_KEY secret dependency

k8s/helm/templates/api/api-deployment.yaml (lines 95-99) sets NGC_API_KEY via secretKeyRef using {{ .Values.existingSecret | default "ngc-api" }}; k8s/helm/values.yaml defaults existingSecret: ngc-api. The chart provides no install-time warning guidance in k8s/helm/templates/NOTES.txt for this dependency. Document the prerequisite (or add validation/fail-fast) so installs don’t crashloop on a missing Secret.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/api-deployment.yaml` around lines 95 - 99, The chart
currently mounts NGC_API_KEY from a Secret named by .Values.existingSecret
(default "ngc-api") but provides no install-time guidance or validation; add a
NOTES.txt entry that documents the required Secret name and example create
command, and add a simple template validation that uses the Helm lookup function
to check for the Secret (using .Values.existingSecret) during render and emits a
clear failure (using the fail function) or a prominent NOTE if missing;
reference the NGC_API_KEY env var, the .Values.existingSecret value/default, and
update k8s/helm/templates/NOTES.txt and add a new validation template (e.g.,
templates/_validate-secret.yaml) to implement the check.

k8s/helm/templates/core/_helpers.tpl-37-42 (1)

37-42: 🔒 Security & Privacy | 🟠 Major

Remove "default" ServiceAccount fallback when create=false (controller RBAC/security)

nmp-core.controllerServiceAccountName falls back to "default" when .Values.core.controller.serviceAccount.create=false and .Values.core.controller.serviceAccount.name is unset. That helper drives both the controller Deployment serviceAccountName and the controller-role.yaml RoleBinding subjects[].name, so a misconfigured release can bind controller RBAC to the namespace’s default ServiceAccount.

Same unsafe fallback pattern exists in nmp-core.apiServiceAccountName and nmp-core.jobsServiceAccountName (core/_helpers.tpl); handle create=false consistently (no "default" implicit name).

nmp-core.database-migrations-servicename appends -core-migrations to nemo-platform.fullname without truncation; if used for a Kubernetes Service.metadata.name, it can exceed the 63-char DNS label limit.
Suggested fix
 {{- define "nmp-core.controllerServiceAccountName" -}}
 {{- if .Values.core.controller.serviceAccount.create }}
 {{- default (printf "%s-core-controller" (include "nemo-platform.fullname" .)) .Values.core.controller.serviceAccount.name }}
 {{- else }}
-{{- default "default" .Values.core.controller.serviceAccount.name }}
+{{- required "core.controller.serviceAccount.name must be set when core.controller.serviceAccount.create=false" .Values.core.controller.serviceAccount.name }}
 {{- end }}
 {{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/core/_helpers.tpl` around lines 37 - 42, The helpers are
unsafe: nmp-core.apiServiceAccountName (and similarly
nmp-core.jobsServiceAccountName and nmp-core.controllerServiceAccountName)
currently fall back to the literal "default" when
.Values.*.serviceAccount.create is false and .Values.*.serviceAccount.name is
unset; remove that implicit "default" fallback so that when create=false and no
explicit name is provided the helper returns empty (or nil) and callers must not
bind RBAC to the namespace default SA. Update those three helpers to: if
.Values.*.serviceAccount.create is true return the provided name or the
generated fullname-core, otherwise return empty/omit value. Also update
nmp-core.database-migrations-servicename to generate the fullname with
"-core-migrations" but ensure the result is truncated to a DNS-1123 label (max
63 chars) — use the chart's truncation utility (or substring) to cap length
while preserving the suffix "-core-migrations".

k8s/helm/templates/postgres/postgres-service.yaml-9-10 (1)

9-10: 🩺 Stability & Availability | 🟠 Major

Make the governing PostgreSQL Service headless for the StatefulSet.

postgres-statefulset.yaml sets spec.serviceName to the same fullname as postgres-service.yaml’s Service, but postgres-service.yaml renders spec.type: ClusterIP and there is no clusterIP: None headless Service under k8s/helm/templates/postgres, which can break StatefulSet pod DNS/stable identity.
Suggested fix
 spec:
-  type: ClusterIP
+  clusterIP: None
   ports:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/postgres/postgres-service.yaml` around lines 9 - 10, The
Service rendered by postgres-service.yaml is currently a non-headless ClusterIP
which breaks StatefulSet stable network identity because
postgres-statefulset.yaml sets spec.serviceName to the Service fullname; change
the Service to be headless by setting clusterIP: None (and remove/ignore any
external type requirements) so it becomes a governing headless Service for the
StatefulSet; update postgres-service.yaml to set clusterIP: None while keeping
the same metadata name used by postgres-statefulset.yaml (reference
spec.serviceName in postgres-statefulset.yaml and spec.type/clusterIP in
postgres-service.yaml).

🟡 Minor comments (2)

services/guardrails/callouts/README.md-12-12 (1)
12-12: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Vague product documentation reference.

"product documentation" lacks a specific URL. Local developers may struggle to find the correct Guardrails configuration examples.

Consider adding a direct link or verifying the product docs contain the necessary examples for local setup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/guardrails/callouts/README.md` at line 12, Update the README line
"Create nemoguard configs using the Guardrails configuration examples in the
product documentation." to include a direct link (or internal doc anchor) to the
exact "Guardrails configuration examples" page or section and/or the specific
example file names needed for local setup, and confirm the referenced product
docs actually contain the example snippets required for creating nemoguard
configs.
k8s/helm/templates/api/api-servicemonitor.yaml-13-19 (1)
13-19: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Empty annotations block when both annotation sources are unset.

If both api.annotations and api.serviceMonitor.annotations are empty, line 13 renders annotations: with no keys, producing invalid YAML.
🔧 Wrap annotations in conditional
+  {{- if or .Values.api.annotations .Values.api.serviceMonitor.annotations }}
   annotations:
   {{- with .Values.api.annotations }}
     {{- toYaml . | nindent 4 }}
   {{- end }}
   {{- with .Values.api.serviceMonitor.annotations }}
     {{- toYaml . | nindent 4 }}
   {{- end }}
+  {{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/api-servicemonitor.yaml` around lines 13 - 19, Wrap
the annotations section in a conditional so "annotations:" is only rendered when
either .Values.api.annotations or .Values.api.serviceMonitor.annotations is
non-empty; replace the current block with an if checking "or
.Values.api.annotations .Values.api.serviceMonitor.annotations" and keep the
existing toYaml | nindent 4 lines for each with block inside that conditional so
indentation and existing annotation sources remain unchanged.

🧹 Nitpick comments (3)

k8s/helm/templates/api/_helpers.tpl (2)

42-48: 🩺 Stability & Availability | ⚖️ Poor tradeoff

Risk: Fragile nested YAML access.

Line 44 accesses .files.default_storage_config.type after a single nil check on .files. If default_storage_config or type is missing, the template will fail at render time.

Consider adding defensive checks:

{{- if and (include "nemo-platform.calculatedConfig" . | fromYaml).files (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config -}}
{{- eq (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config.type "local" -}}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/_helpers.tpl` around lines 42 - 48, The template
nmp-core.localStorageEnabled is accessing nested fields on (include
"nemo-platform.calculatedConfig" . | fromYaml) without checking for existence of
default_storage_config or type; update the template to first assign the
included/calculated config to a local variable (e.g., $cfg) and then use
defensive checks like and $cfg.files and $cfg.files.default_storage_config
before evaluating $cfg.files.default_storage_config.type == "local" so the
template won't fail when those keys are missing.

53-57: 🩺 Stability & Availability | ⚖️ Poor tradeoff

Risk: Missing null check on nested path.

Line 55 accesses .files.default_storage_config.path without verifying default_storage_config exists. Template fails if structure is incomplete.

Add safety check for default_storage_config:

{{- if and (include "nemo-platform.calculatedConfig" . | fromYaml).files (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config -}}
{{ (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config.path | default "" }}
{{- end -}}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api/_helpers.tpl` around lines 53 - 57, The template
define "nmp-core.localStoragePath" accesses .files.default_storage_config.path
without null-checking default_storage_config, causing failures when that key is
missing; update the template to guard the access by adding an existence check
(e.g., use an if with and to ensure (include "nemo-platform.calculatedConfig" .
| fromYaml).files and (include "nemo-platform.calculatedConfig" . |
fromYaml).files.default_storage_config are present) before reading
.files.default_storage_config.path so the template returns the default "" when
missing.

k8s/helm/templates/core/_helpers.tpl (1)

30-32: 🎯 Functional Correctness

Harden the migrations servicename helper (but it’s currently unused in this chart).

"nmp-core.database-migrations-servicename" is defined without truncation, but repo/template searches show it has no usages in this chart—so the installation/upgrade failure risk is currently unproven. Still, truncate it to keep the helper safe for DNS-1123 name length if it’s used later.
Suggested fix
 {{- define "nmp-core.database-migrations-servicename" }}
-{{- printf "%s-core-migrations" ( include "nemo-platform.fullname" .) }}
+{{- printf "%s-core-migrations" ( include "nemo-platform.fullname" . | trunc 47 | trimSuffix "-" ) }}
 {{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/core/_helpers.tpl` around lines 30 - 32, The helper
"nmp-core.database-migrations-servicename" currently builds a service name from
( include "nemo-platform.fullname" .) without truncation, risking DNS-1123
length violations; update that helper to truncate the resulting
"%s-core-migrations" to 63 characters (DNS-1123 max) and strip any trailing
hyphen so the output is a valid DNS name—use the existing fullname include, then
apply Helm truncation (e.g., trunc 63) and a trimSuffix "-" operation to produce
the final service name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e134744b-485d-4b31-bc6c-311900c9fbe4

📥 Commits

Reviewing files that changed from the base of the PR and between f9e01c4 and d04ad8a.

📒 Files selected for processing (79)

.github/actions/changes/action.yaml
.github/workflows/ci.yaml
.gitignore
.pre-commit-config.yaml
AGENTS.md
k8s/helm/.helmignore
k8s/helm/Chart.yaml
k8s/helm/README.md
k8s/helm/ci/01-preexisting-imagepullsecret.yaml
k8s/helm/ci/02-empty-imagepullsecret.yaml
k8s/helm/ci/03-global-security-context.yaml
k8s/helm/ci/04-security-context-override.yaml
k8s/helm/ci/05-enable-auth.yaml
k8s/helm/ci/06-kyverno-aws.yaml
k8s/helm/ci/07-kyverno-azure.yaml
k8s/helm/ci/08-kyverno-gcp.yaml
k8s/helm/ci/09-kyverno-oci.yaml
k8s/helm/ci/10-override-image-registry.yaml
k8s/helm/ci/11-setting-env.yaml
k8s/helm/ci/12-topology-spread-constraints.yaml
k8s/helm/ci/13-openshift.yaml
k8s/helm/ci/14-pdb-min-available.yaml
k8s/helm/ci/14-secretfromenv.yaml
k8s/helm/ci/15-pdb-max-unavailable.yaml
k8s/helm/ci/16-enable-ingress.yaml
k8s/helm/ci/17-ingress-default-host.yaml
k8s/helm/ci/18-enable-httproute.yaml
k8s/helm/ci/19-envoy-extra-args.yaml
k8s/helm/files/nccl-test/entrypoint.sh
k8s/helm/files/nccl-test/nccl-env.sh
k8s/helm/files/nccl-test/nccl_test.py
k8s/helm/files/nccl-test/orchestrator.py
k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl
k8s/helm/templates/NOTES.txt
k8s/helm/templates/_config-render.tpl
k8s/helm/templates/_helpers.tpl
k8s/helm/templates/api-env-secret.yaml
k8s/helm/templates/api/_helpers.tpl
k8s/helm/templates/api/api-deployment.yaml
k8s/helm/templates/api/api-hpa.yaml
k8s/helm/templates/api/api-pdb.yaml
k8s/helm/templates/api/api-service.yaml
k8s/helm/templates/api/api-serviceaccount.yaml
k8s/helm/templates/api/api-servicemonitor.yaml
k8s/helm/templates/core/_helpers.tpl
k8s/helm/templates/core/controller-deployment.yaml
k8s/helm/templates/core/controller-role.yaml
k8s/helm/templates/core/controller-service-headless.yaml
k8s/helm/templates/core/controller-serviceaccount.yaml
k8s/helm/templates/core/controller-servicemonitor.yaml
k8s/helm/templates/core/jobs-serviceaccount.yaml
k8s/helm/templates/core/shared-pvc.yaml
k8s/helm/templates/httproute.yaml
k8s/helm/templates/imagepull-secret.yaml
k8s/helm/templates/ingress.yaml
k8s/helm/templates/models-files-auth-secret.yaml
k8s/helm/templates/networking/kyverno-policy.yaml
k8s/helm/templates/networking/nccl-topology-configmap.yaml
k8s/helm/templates/ngc-api-secret.yaml
k8s/helm/templates/openshift-route.yaml
k8s/helm/templates/platform-configmap.yaml
k8s/helm/templates/platform-seed-job.yaml
k8s/helm/templates/postgres/postgres-secret.yaml
k8s/helm/templates/postgres/postgres-service.yaml
k8s/helm/templates/postgres/postgres-serviceaccount.yaml
k8s/helm/templates/postgres/postgres-statefulset.yaml
k8s/helm/templates/proxy/_helpers.tpl
k8s/helm/templates/proxy/envoy-configmap.yaml
k8s/helm/templates/proxy/envoy-deployment.yaml
k8s/helm/templates/proxy/envoy-hpa.yaml
k8s/helm/templates/proxy/envoy-service.yaml
k8s/helm/templates/proxy/envoy-serviceaccount.yaml
k8s/helm/templates/proxy/envoy-servicemonitor.yaml
k8s/helm/templates/tests/nccl-test.yaml
k8s/helm/values.yaml
services/core/jobs/README.md
services/guardrails/callouts/README.md
tools/lint/lint-helm.sh
web/packages/studio/scripts/feature-flag-matrix.ts

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

k8s/helm/templates/tests/nccl-test.yaml (3)
37-39: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Namespace is missing from cluster-scoped hook names.

$ncclTestBase feeds the ClusterRole and ClusterRoleBinding names, but it is derived only from the release fullname. Two installs with the same release name in different namespaces will render identical cluster-scoped hook resources, so concurrent helm test runs can fail with AlreadyExists or Helm ownership conflicts. Add namespace-specific entropy here before deriving the cluster-scoped names.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 37 - 39, Cluster-scoped
hook names derived by $ncclTestBase are namespace-agnostic and can collide
across namespaces; include namespace-specific entropy when building
$ncclTestBase (for example incorporate .Release.Namespace or a short hash of it)
so the ClusterRole/ClusterRoleBinding names become unique per namespace (adjust
the printf used to build $ncclTestBase and keep existing truncation logic to
preserve length limits).
169-183: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

The test hook pods bypass the chart security-context contract.

Neither job threads the shared pod/container security-context settings into its pod spec. That leaves helm test outside the hardening knobs added elsewhere in this rollout, so restricted clusters can admit the release and then reject the test pods. Wire the same security-context helpers/values into both hook jobs.

Also applies to: 287-300
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 169 - 183, The test
hook PodSpec is missing the shared pod/container security-context settings so
test pods bypass chart hardening; update the nccl-test hook templates
(template.spec and the orchestrator container spec) to inject the same
helpers/values used elsewhere by adding the chart's pod-level securityContext
and the container-level securityContext into the PodSpec and the orchestrator
container (e.g. call the existing pod security-context helper and the container
security-context helper the chart uses, such as the common podSecurityContext
and containerSecurityContext helpers/values) so both hook jobs honor the same
locked-down settings.
184-189: 🩺 Stability & Availability | 🟠 Major

Avoid runtime pip install kubernetes>=28.0.0 in helm test

k8s/helm/templates/tests/nccl-test.yaml installs kubernetes>=28.0.0 at runtime before running python -u /scripts/orchestrator.py, and k8s/helm/files/nccl-test/orchestrator.py imports/uses the kubernetes client. This makes helm test depend on outbound PyPI/egress, pip, and a nondeterministic dependency resolution. Bake a pinned kubernetes version into ncclTest.orchestrator.image (or otherwise include it in the image) and remove the runtime pip install.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 184 - 189, The helm
test currently runs a runtime pip install in nccl-test.yaml (the args block that
executes "pip install --no-cache-dir 'kubernetes>=28.0.0'") before invoking
python -u /scripts/orchestrator.py; remove that pip install line and ensure the
container image referenced by ncclTest.orchestrator.image includes a pinned
kubernetes client (install a specific version, e.g. kubernetes==X.Y.Z) baked
into the image so orchestrator.py can import kubernetes without runtime pip,
then update image build (Dockerfile or image pipeline) to install the pinned
dependency and bump ncclTest.orchestrator.image to the new image tag.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@k8s/helm/templates/_helpers.tpl`:
- Around line 198-203: The helper
nemo-platform.defaultEncryptionKeyGeneratorServiceAccountName currently falls
back to "default" when create=false and name is empty; change it so that when
.Values.secrets.defaultEncryptionKey.generated.serviceAccount.create is false
you require an explicit
.Values.secrets.defaultEncryptionKey.generated.serviceAccount.name (use Helm's
required function with a clear message) instead of defaulting to "default"; keep
the existing behavior of using include
"nemo-platform.defaultEncryptionKeyGeneratorName" only when create is true and
the generated SA is created.

In `@k8s/helm/templates/api-env-secret-generator.yaml`:
- Line 15: The job must always have the service account token mounted; change
the template so automountServiceAccountToken is not driven by
secrets.defaultEncryptionKey.generated.serviceAccount.automount. Replace the
conditional value at automountServiceAccountToken with a hardcoded true (or a
default true via the template default function) so the job can always read the
service-account token and CA files used in the code paths that read those files
(the service-account token/CA reads referenced around the service-account token
and CA file handling).
- Around line 73-80: The hook only applies
.Values.secrets.defaultEncryptionKey.generated.{podSecurityContext,securityContext}
and must instead merge chart-wide defaults; update the api-env-keygen container
and pod securityContext blocks to merge the chart/global helpers with the
generated values (i.e., call the existing chart helper functions used by the
main workloads such as the global podSecurityContext and securityContext merge
helpers) so that
.Values.secrets.defaultEncryptionKey.generated.podSecurityContext and
.Values.secrets.defaultEncryptionKey.generated.securityContext are overlaid on
top of the chart-wide defaults rather than replacing them.

In `@k8s/helm/templates/api-env-secret.yaml`:
- Around line 2-12: Validate the provided defaultEncryptionKey before rendering:
check .Values.secrets.defaultEncryptionKey.value is present (use required or
fail) and matches base64 format and decodes to at least 32 bytes; if the check
fails, abort template rendering with a clear error message referencing the
secret name produced by include "nemo-platform.apiEnvSecretName" and the env var
include "nemo-platform.defaultEncryptionKeyEnvName" so the chart fails fast
instead of producing an invalid NMP_SECRETS_DEFAULT_ENCRYPTION_KEY that will
break at runtime.

---

Outside diff comments:
In `@k8s/helm/templates/tests/nccl-test.yaml`:
- Around line 37-39: Cluster-scoped hook names derived by $ncclTestBase are
namespace-agnostic and can collide across namespaces; include namespace-specific
entropy when building $ncclTestBase (for example incorporate .Release.Namespace
or a short hash of it) so the ClusterRole/ClusterRoleBinding names become unique
per namespace (adjust the printf used to build $ncclTestBase and keep existing
truncation logic to preserve length limits).
- Around line 169-183: The test hook PodSpec is missing the shared pod/container
security-context settings so test pods bypass chart hardening; update the
nccl-test hook templates (template.spec and the orchestrator container spec) to
inject the same helpers/values used elsewhere by adding the chart's pod-level
securityContext and the container-level securityContext into the PodSpec and the
orchestrator container (e.g. call the existing pod security-context helper and
the container security-context helper the chart uses, such as the common
podSecurityContext and containerSecurityContext helpers/values) so both hook
jobs honor the same locked-down settings.
- Around line 184-189: The helm test currently runs a runtime pip install in
nccl-test.yaml (the args block that executes "pip install --no-cache-dir
'kubernetes>=28.0.0'") before invoking python -u /scripts/orchestrator.py;
remove that pip install line and ensure the container image referenced by
ncclTest.orchestrator.image includes a pinned kubernetes client (install a
specific version, e.g. kubernetes==X.Y.Z) baked into the image so
orchestrator.py can import kubernetes without runtime pip, then update image
build (Dockerfile or image pipeline) to install the pinned dependency and bump
ncclTest.orchestrator.image to the new image tag.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81a1b96c-bc51-46f4-a48d-126f62eb6887

📥 Commits

Reviewing files that changed from the base of the PR and between cefd558 and 02cae3e.

📒 Files selected for processing (12)

k8s/helm/README.md
k8s/helm/ci/14-secretfromenv.yaml
k8s/helm/ci/20-explicit-default-encryption-key.yaml
k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl
k8s/helm/templates/_helpers.tpl
k8s/helm/templates/api-env-secret-generator.yaml
k8s/helm/templates/api-env-secret-upgrade-check.yaml
k8s/helm/templates/api-env-secret.yaml
k8s/helm/templates/api/api-deployment.yaml
k8s/helm/templates/platform-seed-job.yaml
k8s/helm/templates/tests/nccl-test.yaml
k8s/helm/values.yaml

✅ Files skipped from review due to trivial changes (2)

k8s/helm/ci/20-explicit-default-encryption-key.yaml
k8s/helm/README.md

🚧 Files skipped from review as they are similar to previous changes (3)

k8s/helm/templates/platform-seed-job.yaml
k8s/helm/templates/api/api-deployment.yaml
k8s/helm/values.yaml

Signed-off-by: Brooke Storm <brookes@nvidia.com>

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

services/core/jobs/README.md (1)
56-63: ⚠️ Potential issue | 🟡 Minor

Add missing Field import in the Python example.

The code uses Field(...) on lines 59–63 but doesn't import it from pydantic, causing a runtime NameError.
Suggested patch
 from pydantic_settings import BaseSettings
+from pydantic import Field
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/core/jobs/README.md` around lines 56 - 63, The Python example in the
README uses the Field function from pydantic in the
MyFunctionalMicroserviceSettings class definition (lines 59-63) but does not
import it. Add Field to the import statement by importing it from the pydantic
module alongside the existing BaseSettings import from pydantic_settings to
resolve the NameError that would occur at runtime.
Source: Coding guidelines

♻️ Duplicate comments (3)

k8s/helm/templates/api-env-secret-generator.yaml (2)
73-80: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Merge chart-wide security contexts into this hook.

Lines 73-80 apply only secrets.defaultEncryptionKey.generated.* contexts. This skips chart/global security defaults and can break hardened clusters or weaken expected posture for hook pods.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api-env-secret-generator.yaml` around lines 73 - 80, The
api-env-keygen container and its pod are only using security contexts from
secrets.defaultEncryptionKey.generated.* paths, which skips chart-wide security
defaults. Merge the chart-wide security context values (typically from
.Values.podSecurityContext and .Values.securityContext or similar chart-level
paths) with the specific generated ones at both the pod level
(podSecurityContext) and container level (securityContext) to ensure the hook
respects global security policies. This requires updating both the pod-level
securityContext template at line 74 and the container-level securityContext
template at line 79 to include or merge with the appropriate chart-wide
defaults.
15-15: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not make service-account token mount optional for this job.

Line 15 allows automountServiceAccountToken: false, but Lines 99-105 always read token/CA files. That makes the hook fail whenever automount is disabled.

Also applies to: 99-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api-env-secret-generator.yaml` at line 15, The
automountServiceAccountToken field on line 15 is configured as optional via a
Helm value, allowing it to be set to false. However, lines 99-105
unconditionally read the service account token and CA files, which requires the
token to be mounted. Change line 15 to always set automountServiceAccountToken
to true (hardcode it as true instead of using the configurable Helm value) to
ensure the service account token is always available for the job to read the
required token and CA files.
k8s/helm/templates/api-env-secret.yaml (1)
2-13: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate secrets.defaultEncryptionKey.value before rendering it.

Line 12 writes the user-provided key directly. Invalid/short material still passes templating and fails later when encryption is used. Add fail-fast validation for expected format/strength before this branch renders.

Also applies to: 20-20
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/api-env-secret.yaml` around lines 2 - 13, The encryption
key from Values.secrets.defaultEncryptionKey.value is rendered directly without
validation, allowing invalid or weak keys to pass through templating and only
fail later during actual encryption. Add validation logic within the first
branch (after the if condition that checks if the value exists) to verify that
the encryption key meets expected format and strength requirements, and fail the
template rendering early with a clear error message if validation fails. This
should occur before the line where the key is written to the Secret stringData.

🧹 Nitpick comments (2)

k8s/helm/templates/networking/kyverno-policy.yaml (1)
465-467: ⚡ Quick win

Unused variable nodeGpuProduct.

nodeGpuProduct is computed but never referenced in the OCI policy template. Dead code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 465 - 467,
The variable `nodeGpuProduct` defined in the kyverno-policy.yaml file is
computed using jmesPath but is never referenced elsewhere in the policy
template, making it dead code. Remove the entire `nodeGpuProduct` variable
definition block (including the name, variable, and jmesPath fields) from the
OCI policy template to clean up the unused code.
k8s/helm/README.md (1)
1-25: ⚡ Quick win

Add Prerequisites at the top and Next Steps at the end.

This page is useful reference content, but it currently misses the required framing sections for navigation and adoption.

As per coding guidelines: "Always list prerequisites at the top of documentation pages before other content" and "Include 'Next Steps' section at the end with cross-links to related documentation content."

Also applies to: 325-325
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s/helm/README.md` around lines 1 - 25, Add a Prerequisites section at the
top of the README.md file before the existing content, listing any required
knowledge or setup needed before using the Helm chart. Then add a Next Steps
section at the end of the document with cross-links to related documentation
content, such as installation guides, configuration examples, or troubleshooting
pages. These sections should follow the standard documentation guidelines for
navigation and adoption framing.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yaml:
- Around line 111-117: Add SHA-256 verification for both kubeconform and
chart-verifier downloads before extraction. For each binary, download the
corresponding SHA-256 checksum file from the GitHub release (using the
KUBECONFORM_VERSION and CHART_VERIFIER_VERSION variables respectively), verify
the downloaded tar.gz archive against the checksum using sha256sum with the -c
flag, and only proceed with tar extraction if verification succeeds. If
verification fails, exit with an error to prevent execution of potentially
tampered binaries in CI.

In @.pre-commit-config.yaml:
- Around line 73-76: The `files` pattern in the pre-commit hook configuration is
too restrictive and only triggers on changes to `values.yaml`. To prevent README
drift when Chart.yaml or the helm-docs-template file is modified, expand the
files pattern to include these additional paths. Modify the `files:` field to
use a regex pattern that matches `values.yaml`, `Chart.yaml`, and the
`helm-docs-template/nemo-helm-readme.md.gotmpl` file so the hook runs whenever
any of these files change.

In `@k8s/helm/files/nccl-test/orchestrator.py`:
- Around line 160-170: The variable cs.state may be None, and the current code
attempts to access the .waiting, .terminated, and .running attributes without
checking if state exists first. Add a guard condition to verify that cs.state is
not None before attempting to access its nested attributes. This guard should
wrap all the conditional branches that access state.waiting, state.terminated,
and state.running to prevent AttributeError when state is None.

In `@k8s/helm/README.md`:
- Line 219: In the README.md file at line 219, the description for the
httpRoute.parentRefs configuration key contains an incorrect reference to
httpsRoute.enabled. Change the text "httpsRoute.enabled" to "httpRoute.enabled"
in the description to match the actual configuration key name used in the chart.

In `@k8s/helm/templates/httproute.yaml`:
- Around line 35-38: The port field in the backendRefs section is being rendered
using the tpl function which produces a string output, causing Kubernetes schema
validation failures. Modify the port line to ensure the value is rendered as an
integer by wrapping the tpl function call with the int conversion function, so
that port: {{ tpl .port $ }} becomes port: {{ int (tpl .port $) }}, ensuring the
port is always a numeric value as required by the Kubernetes HTTPRoute schema.

In `@k8s/helm/templates/ingress.yaml`:
- Around line 25-37: The template accesses the first element of
.Values.ingress.hosts using direct indexing without validating that the hosts
array exists and has at least one element. When ingress.defaultHost is set but
ingress.hosts is empty or missing, this causes a template render crash. Add a
guard condition to check that ingress.hosts is not empty (using a Helm condition
like if .Values.ingress.hosts) before attempting to access (index
.Values.ingress.hosts 0).paths. This ensures the paths range loop only executes
when the hosts array contains at least one element with a paths field.

In `@k8s/helm/templates/networking/kyverno-policy.yaml`:
- Around line 187-188: The NCCL_TOPO_FILE environment variable in the
kyverno-policy.yaml file is set to the literal placeholder string
NcclTopoFilePath instead of the actual mounted path. Change the value field from
NcclTopoFilePath to /etc/nccl/nccl-topo.xml to point to the correct location
where the NCCL topology ConfigMap is mounted, so NCCL can actually locate and
read the topology file.

In `@k8s/helm/templates/ngc-api-secret.yaml`:
- Around line 1-12: Add a validation check in the ngc-api-secret.yaml template
to fail fast when ngcAPIKey is empty and existingSecret is not provided, as
empty credentials will silently break authentication and image pull flows. Use a
Helm fail function or conditional to require ngcAPIKey when the auto-generation
path (when existingSecret is not set) is taken. Apply the same validation
pattern to k8s/helm/templates/imagepull-secret.yaml at line 4 for the
imagePullSecret path to ensure consistent protection against empty credentials
across both secret templates.

In `@k8s/helm/templates/postgres/postgres-secret.yaml`:
- Around line 10-11: The postgres-secret.yaml template at line 11 does not
validate that the PostgreSQL password is non-empty, allowing users to override
it to an empty string in their values. Add the required filter to the password
field value to enforce validation at template render time. Modify the password
line to use the required filter before the quote filter, providing a meaningful
error message that will cause the Helm template rendering to fail immediately if
the password value is empty or unset.

In `@k8s/helm/templates/proxy/envoy-service.yaml`:
- Around line 19-22: The admin port is currently always exposed through the
Service, which poses a security risk. Remove the unconditional publication of
the admin port (the block defining name: admin, port: envoyProxy.adminPort,
targetPort: admin, protocol: TCP) and instead wrap it in a conditional that
checks if envoyProxy.service.exposeAdminPort is enabled. Additionally, add a
default configuration in the values.yaml file setting
envoyProxy.service.exposeAdminPort to false to ensure the admin endpoint is not
exposed by default and only enabled when explicitly needed for controlled
in-cluster scraping scenarios.

---

Outside diff comments:
In `@services/core/jobs/README.md`:
- Around line 56-63: The Python example in the README uses the Field function
from pydantic in the MyFunctionalMicroserviceSettings class definition (lines
59-63) but does not import it. Add Field to the import statement by importing it
from the pydantic module alongside the existing BaseSettings import from
pydantic_settings to resolve the NameError that would occur at runtime.

---

Duplicate comments:
In `@k8s/helm/templates/api-env-secret-generator.yaml`:
- Around line 73-80: The api-env-keygen container and its pod are only using
security contexts from secrets.defaultEncryptionKey.generated.* paths, which
skips chart-wide security defaults. Merge the chart-wide security context values
(typically from .Values.podSecurityContext and .Values.securityContext or
similar chart-level paths) with the specific generated ones at both the pod
level (podSecurityContext) and container level (securityContext) to ensure the
hook respects global security policies. This requires updating both the
pod-level securityContext template at line 74 and the container-level
securityContext template at line 79 to include or merge with the appropriate
chart-wide defaults.
- Line 15: The automountServiceAccountToken field on line 15 is configured as
optional via a Helm value, allowing it to be set to false. However, lines 99-105
unconditionally read the service account token and CA files, which requires the
token to be mounted. Change line 15 to always set automountServiceAccountToken
to true (hardcode it as true instead of using the configurable Helm value) to
ensure the service account token is always available for the job to read the
required token and CA files.

In `@k8s/helm/templates/api-env-secret.yaml`:
- Around line 2-13: The encryption key from
Values.secrets.defaultEncryptionKey.value is rendered directly without
validation, allowing invalid or weak keys to pass through templating and only
fail later during actual encryption. Add validation logic within the first
branch (after the if condition that checks if the value exists) to verify that
the encryption key meets expected format and strength requirements, and fail the
template rendering early with a clear error message if validation fails. This
should occur before the line where the key is written to the Secret stringData.

---

Nitpick comments:
In `@k8s/helm/README.md`:
- Around line 1-25: Add a Prerequisites section at the top of the README.md file
before the existing content, listing any required knowledge or setup needed
before using the Helm chart. Then add a Next Steps section at the end of the
document with cross-links to related documentation content, such as installation
guides, configuration examples, or troubleshooting pages. These sections should
follow the standard documentation guidelines for navigation and adoption
framing.

In `@k8s/helm/templates/networking/kyverno-policy.yaml`:
- Around line 465-467: The variable `nodeGpuProduct` defined in the
kyverno-policy.yaml file is computed using jmesPath but is never referenced
elsewhere in the policy template, making it dead code. Remove the entire
`nodeGpuProduct` variable definition block (including the name, variable, and
jmesPath fields) from the OCI policy template to clean up the unused code.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 80764bee-43d3-4c57-82e9-59bbe0c7d6ac

📥 Commits

Reviewing files that changed from the base of the PR and between 02cae3e and 3fcb9a7.

📒 Files selected for processing (82)

.github/actions/changes/action.yaml
.github/workflows/ci.yaml
.gitignore
.pre-commit-config.yaml
AGENTS.md
k8s/helm/.helmignore
k8s/helm/Chart.yaml
k8s/helm/README.md
k8s/helm/ci/02-empty-imagepullsecret.yaml
k8s/helm/ci/03-global-security-context.yaml
k8s/helm/ci/04-security-context-override.yaml
k8s/helm/ci/05-enable-auth.yaml
k8s/helm/ci/06-kyverno-aws.yaml
k8s/helm/ci/07-kyverno-azure.yaml
k8s/helm/ci/08-kyverno-gcp.yaml
k8s/helm/ci/09-kyverno-oci.yaml
k8s/helm/ci/10-override-image-registry.yaml
k8s/helm/ci/11-setting-env.yaml
k8s/helm/ci/12-topology-spread-constraints.yaml
k8s/helm/ci/13-openshift.yaml
k8s/helm/ci/14-pdb-min-available.yaml
k8s/helm/ci/14-secretfromenv.yaml
k8s/helm/ci/15-pdb-max-unavailable.yaml
k8s/helm/ci/16-enable-ingress.yaml
k8s/helm/ci/17-ingress-default-host.yaml
k8s/helm/ci/18-enable-httproute.yaml
k8s/helm/ci/19-envoy-extra-args.yaml
k8s/helm/ci/20-explicit-default-encryption-key.yaml
k8s/helm/files/nccl-test/entrypoint.sh
k8s/helm/files/nccl-test/nccl-env.sh
k8s/helm/files/nccl-test/nccl_test.py
k8s/helm/files/nccl-test/orchestrator.py
k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl
k8s/helm/templates/NOTES.txt
k8s/helm/templates/_config-render.tpl
k8s/helm/templates/_helpers.tpl
k8s/helm/templates/api-env-secret-generator.yaml
k8s/helm/templates/api-env-secret-upgrade-check.yaml
k8s/helm/templates/api-env-secret.yaml
k8s/helm/templates/api/_helpers.tpl
k8s/helm/templates/api/api-deployment.yaml
k8s/helm/templates/api/api-hpa.yaml
k8s/helm/templates/api/api-pdb.yaml
k8s/helm/templates/api/api-service.yaml
k8s/helm/templates/api/api-serviceaccount.yaml
k8s/helm/templates/api/api-servicemonitor.yaml
k8s/helm/templates/core/_helpers.tpl
k8s/helm/templates/core/controller-deployment.yaml
k8s/helm/templates/core/controller-role.yaml
k8s/helm/templates/core/controller-service-headless.yaml
k8s/helm/templates/core/controller-serviceaccount.yaml
k8s/helm/templates/core/controller-servicemonitor.yaml
k8s/helm/templates/core/jobs-serviceaccount.yaml
k8s/helm/templates/core/shared-pvc.yaml
k8s/helm/templates/httproute.yaml
k8s/helm/templates/imagepull-secret.yaml
k8s/helm/templates/ingress.yaml
k8s/helm/templates/models-files-auth-secret.yaml
k8s/helm/templates/networking/kyverno-policy.yaml
k8s/helm/templates/networking/nccl-topology-configmap.yaml
k8s/helm/templates/ngc-api-secret.yaml
k8s/helm/templates/openshift-route.yaml
k8s/helm/templates/platform-configmap.yaml
k8s/helm/templates/platform-seed-job.yaml
k8s/helm/templates/postgres/postgres-secret.yaml
k8s/helm/templates/postgres/postgres-service.yaml
k8s/helm/templates/postgres/postgres-serviceaccount.yaml
k8s/helm/templates/postgres/postgres-statefulset.yaml
k8s/helm/templates/proxy/_helpers.tpl
k8s/helm/templates/proxy/envoy-configmap.yaml
k8s/helm/templates/proxy/envoy-deployment.yaml
k8s/helm/templates/proxy/envoy-hpa.yaml
k8s/helm/templates/proxy/envoy-service.yaml
k8s/helm/templates/proxy/envoy-serviceaccount.yaml
k8s/helm/templates/proxy/envoy-servicemonitor.yaml
k8s/helm/templates/tests/nccl-test.yaml
k8s/helm/values.yaml
pyproject.toml
services/core/jobs/README.md
services/guardrails/callouts/README.md
tools/lint/lint-helm.sh
web/packages/studio/scripts/feature-flag-matrix.ts

✅ Files skipped from review due to trivial changes (15)

services/guardrails/callouts/README.md
k8s/helm/ci/17-ingress-default-host.yaml
k8s/helm/ci/16-enable-ingress.yaml
AGENTS.md
k8s/helm/ci/20-explicit-default-encryption-key.yaml
k8s/helm/ci/15-pdb-max-unavailable.yaml
.gitignore
k8s/helm/ci/02-empty-imagepullsecret.yaml
k8s/helm/ci/06-kyverno-aws.yaml
k8s/helm/Chart.yaml
k8s/helm/.helmignore
k8s/helm/ci/10-override-image-registry.yaml
k8s/helm/ci/13-openshift.yaml
k8s/helm/ci/14-pdb-min-available.yaml
k8s/helm/ci/03-global-security-context.yaml

🚧 Files skipped from review as they are similar to previous changes (24)

k8s/helm/ci/19-envoy-extra-args.yaml
k8s/helm/ci/07-kyverno-azure.yaml
k8s/helm/files/nccl-test/nccl-env.sh
tools/lint/lint-helm.sh
k8s/helm/ci/05-enable-auth.yaml
k8s/helm/ci/09-kyverno-oci.yaml
.github/actions/changes/action.yaml
k8s/helm/ci/04-security-context-override.yaml
k8s/helm/ci/08-kyverno-gcp.yaml
k8s/helm/ci/14-secretfromenv.yaml
k8s/helm/ci/18-enable-httproute.yaml
k8s/helm/templates/proxy/_helpers.tpl
k8s/helm/templates/api/_helpers.tpl
k8s/helm/ci/12-topology-spread-constraints.yaml
k8s/helm/templates/_config-render.tpl
k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl
pyproject.toml
web/packages/studio/scripts/feature-flag-matrix.ts
k8s/helm/ci/11-setting-env.yaml
k8s/helm/templates/core/_helpers.tpl
k8s/helm/files/nccl-test/nccl_test.py
k8s/helm/files/nccl-test/entrypoint.sh
k8s/helm/values.yaml
k8s/helm/templates/_helpers.tpl

Signed-off-by: Brooke Storm <brookes@nvidia.com>

matthewgrossman

Two thoughts from exploring the PR w/ an agent:

The README has good guidance about restoring the key on upgrade, but NOTES.txt is what users see right after helm install. A one-liner like "Back up the <fullname>-api-env Secret — if lost, existing encrypted platform secrets become unrecoverable" could be nice
(moreso a question) The generator Job defaults to docker.io/library/python:3.12-slim. Is this "normal"/okay for our charts these days, as opposed to using a mirror? I know its configurable, so we should in our own ToT, but wondering if using the docker.io image versus something else is standard (versus using I guess our nmp-api image?)

svvarom

LGTM

matthewgrossman

ty!

crookedstorm · 2026-06-15T17:42:26Z

Two thoughts from exploring the PR w/ an agent:

The README has good guidance about restoring the key on upgrade, but NOTES.txt is what users see right after helm install. A one-liner like "Back up the <fullname>-api-env Secret — if lost, existing encrypted platform secrets become unrecoverable" could be nice

(moreso a question) The generator Job defaults to docker.io/library/python:3.12-slim. Is this "normal"/okay for our charts these days, as opposed to using a mirror? I know its configurable, so we should in our own ToT, but wondering if using the docker.io image versus something else is standard (versus using I guess our nmp-api image?)

We should never default to anything mirrored or internal anymore. All default containers should be public (or aspirationally public such as the ones this repo will create. I'll poke the NOTES.txt :)

Signed-off-by: Brooke Storm <brookes@nvidia.com>

crookedstorm requested review from a team as code owners June 12, 2026 21:11

github-actions Bot added the feat label Jun 12, 2026

svvarom approved these changes Jun 12, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread k8s/helm/ci/11-setting-env.yaml Outdated

Comment thread k8s/helm/templates/api-env-secret.yaml Outdated

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread k8s/helm/templates/_helpers.tpl

Comment thread k8s/helm/templates/api-env-secret-generator.yaml Outdated

Comment thread k8s/helm/templates/api-env-secret-generator.yaml Outdated

Comment thread k8s/helm/templates/api-env-secret.yaml Outdated

svvarom self-requested a review June 12, 2026 22:59

svvarom reviewed Jun 12, 2026

View reviewed changes

Comment thread k8s/helm/files/nccl-test/entrypoint.sh

mquong-ca reviewed Jun 12, 2026

View reviewed changes

Comment thread k8s/helm/templates/core/controller-role.yaml

crookedstorm added 5 commits June 15, 2026 08:25

feat: prepare helm chart for kubernetes deployment options

490d30f

Signed-off-by: Brooke Storm <brookes@nvidia.com>

chore: clean up some issues with linting

2186f7f

Signed-off-by: Brooke Storm <brookes@nvidia.com>

chore: adjust versions

4460444

Signed-off-by: Brooke Storm <brookes@nvidia.com>

fix: correct test template

942c9be

Signed-off-by: Brooke Storm <brookes@nvidia.com>

feat: implement auto-generated KEK secret upon install only

3fcb9a7

Signed-off-by: Brooke Storm <brookes@nvidia.com>

crookedstorm force-pushed the setup-helm/crookedstorm branch from 02cae3e to 3fcb9a7 Compare June 15, 2026 15:25

fix: fix some code rabbit issues

6e90c37

Signed-off-by: Brooke Storm <brookes@nvidia.com>

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

crookedstorm added 7 commits June 15, 2026 08:54

fix: update precommit a tad

b36f7e4

Signed-off-by: Brooke Storm <brookes@nvidia.com>

chore: spelling issue and ci fix

303c09a

Signed-off-by: Brooke Storm <brookes@nvidia.com>

fix: inherit global security context

a68d9bd

Signed-off-by: Brooke Storm <brookes@nvidia.com>

chore: cleanup some nonsense

8817a02

Signed-off-by: Brooke Storm <brookes@nvidia.com>

chore: python could-be-none stuff

ef892ee

Signed-off-by: Brooke Storm <brookes@nvidia.com>

fix: correct old security hole in envoy service

90d6938

Signed-off-by: Brooke Storm <brookes@nvidia.com>

fix: correct some guards against bad usage

0340de3

Signed-off-by: Brooke Storm <brookes@nvidia.com>

svvarom self-requested a review June 15, 2026 16:55

svvarom reviewed Jun 15, 2026

View reviewed changes

Comment thread k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl Outdated

fix: change all docs urls

b4bc90f

Signed-off-by: Brooke Storm <brookes@nvidia.com>

svvarom reviewed Jun 15, 2026

View reviewed changes

Comment thread k8s/helm/templates/ngc-api-secret.yaml

matthewgrossman reviewed Jun 15, 2026

View reviewed changes

svvarom approved these changes Jun 15, 2026

View reviewed changes

matthewgrossman approved these changes Jun 15, 2026

View reviewed changes

docs: add a backup suggestion to the NOTES.txt

78cf2f7

Signed-off-by: Brooke Storm <brookes@nvidia.com>

crookedstorm enabled auto-merge June 15, 2026 17:58

crookedstorm added this pull request to the merge queue Jun 15, 2026

Merged via the queue into main with commit 6b2af4b Jun 15, 2026
49 checks passed

crookedstorm deleted the setup-helm/crookedstorm branch June 15, 2026 18:15

Conversation

crookedstorm commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

matthewgrossman left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

svvarom left a comment

Choose a reason for hiding this comment

Uh oh!

matthewgrossman left a comment

Choose a reason for hiding this comment

Uh oh!

crookedstorm commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

crookedstorm commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

matthewgrossman left a comment •

edited

Loading