feat: prepare helm chart for kubernetes deployment options#319
Conversation
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds complete k8s/helm Helm chart with templates for API, core controller, PostgreSQL, Envoy proxy, ingress/routes, and Kyverno policies. Includes NCCL multi-node test orchestration, CI validation fixtures, and change-detection-gated linting/verification jobs. Repository paths and tooling updated to align with k8s/helm location. ChangesHelm chart implementation and CI integration
Suggested labels
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (21)
k8s/helm/files/nccl-test/orchestrator.py-547-555 (1)
547-555: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winTrack pods in
created_namesimmediately after creation.If Kyverno assertion or wait logic fails before append, created pods are not cleaned up.
Suggested fix
print(f"Creating pod {name0} on node {n0}") v1.create_namespaced_pod(namespace, body=pod0) + created_names.append(name0) if worker_net and wn_name and not injection_disabled: _wait_assert_kyverno_interconnect(v1, namespace, name0, wn_name, wn_req, wn_lim) p0 = _wait_pod_running(v1, namespace, name0, timeout_s=300) @@ - created_names.append(name0) @@ print(f"Creating pod {name} on node {hostname}") v1.create_namespaced_pod(namespace, body=pod) + created_names.append(name) if worker_net and wn_name and not injection_disabled: _wait_assert_kyverno_interconnect(v1, namespace, name, wn_name, wn_req, wn_lim) - created_names.append(name)Also applies to: 562-565
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/files/nccl-test/orchestrator.py` around lines 547 - 555, The code creates pods via v1.create_namespaced_pod but only appends the pod name to created_names after subsequent waits/assertions (_wait_assert_kyverno_interconnect and _wait_pod_running), so if those fail the pod is not tracked for cleanup; immediately append the pod identifier (e.g., name0) to created_names right after v1.create_namespaced_pod(...) and before calling _wait_assert_kyverno_interconnect/_wait_pod_running, and make the same change for the other symmetric creation block (the one that uses name1) so all created pods are recorded even if later waits fail..github/workflows/ci.yaml-102-103 (1)
102-103: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winDisable credential persistence in checkout steps for Helm jobs.
These jobs run repository-controlled scripts; leaving auth in
.git/configunnecessarily expands token exposure on PR runs.Suggested fix
- name: Checkout code uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3 + with: + persist-credentials: falseAlso applies to: 152-153
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml around lines 102 - 103, The checkout steps currently use actions/checkout (the "Checkout code" step with uses: actions/checkout@...) and leave credentials persisted in .git/config; update those checkout steps (including the similar occurrences around the later lines noted) to set persist-credentials: false so the runner does not write repository auth into .git/config for Helm/PR jobs, keeping the step name "Checkout code" and the uses: actions/checkout entries but adding persist-credentials: false to the step inputs.Source: Linters/SAST tools
k8s/helm/files/nccl-test/orchestrator.py-170-183 (1)
170-183: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winCleanup exceptions in
finallycan mask the real test outcome.If
_delete_workerstimes out, it can override the original return/error path fromconfig_test.Suggested fix
finally: - _delete_workers(v1, namespace, created_names) + try: + _delete_workers(v1, namespace, created_names) + except Exception as cleanup_error: + print(f"WARN: worker cleanup failed: {cleanup_error}", file=sys.stderr)Also applies to: 587-588
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/files/nccl-test/orchestrator.py` around lines 170 - 183, _fix: cleanup code in _delete_workers can raise (e.g. _wait_pod_absent timeout) and thereby mask the original failure from config_test; change _delete_workers to never let cleanup exceptions propagate: wrap the call to _wait_pod_absent (and the delete_namespaced_pod block) in try/except that catches ApiException and generic Exception, log warnings to stderr with context (pod name and exception), and continue rather than re-raising; apply the same defensive try/except pattern to the other cleanup site referenced (lines 587-588) so cleanup never hides the original test error..github/workflows/ci.yaml-112-117 (1)
112-117: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winVerify release asset integrity before executing downloaded binaries.
kubeconformandchart-verifierare downloaded and executed without checksum/signature validation. That is a supply-chain risk.Suggested fix pattern
curl -fsSL \ -o "${RUNNER_TEMP}/kubeconform.tar.gz" \ "https://github.com/yannh/kubeconform/releases/download/${KUBECONFORM_VERSION}/kubeconform-linux-amd64.tar.gz" + echo "${KUBECONFORM_SHA256} ${RUNNER_TEMP}/kubeconform.tar.gz" | sha256sum -c - tar -xzf "${RUNNER_TEMP}/kubeconform.tar.gz" -C "${RUNNER_TEMP}/kubeconform" kubeconformcurl -fsSL \ -o "${RUNNER_TEMP}/chart-verifier.tar.gz" \ "https://github.com/redhat-certification/chart-verifier/releases/download/${CHART_VERIFIER_VERSION}/chart-verifier-${CHART_VERIFIER_VERSION}.tgz" + echo "${CHART_VERIFIER_SHA256} ${RUNNER_TEMP}/chart-verifier.tar.gz" | sha256sum -c - tar -xzf "${RUNNER_TEMP}/chart-verifier.tar.gz" -C "${RUNNER_TEMP}/chart-verifier" chart-verifierAlso applies to: 162-167
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yaml around lines 112 - 117, The workflow downloads and executes kubeconform (and chart-verifier at the other block) without verifying release integrity; update the CI steps that fetch "${RUNNER_TEMP}/kubeconform.tar.gz" (KUBECONFORM_VERSION) and the chart-verifier download to also fetch the corresponding checksum or signature file from the release, verify it before extracting/executing (e.g., download .sha256 or .asc and validate with shasum -c or gpg --verify), and fail the job if verification fails so the subsequent tar, chmod and adding to "${GITHUB_PATH}" only run after a successful integrity/signature check.k8s/helm/files/nccl-test/entrypoint.sh-72-81 (1)
72-81: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winStrict IB mode is bypassed when
ibv_devinfofails.In strict mode, a failed
ibv_devinfocall should fail fast; currently it falls through silently.Suggested fix
if command -v ibv_devinfo >/dev/null 2>&1; then if ibv_devinfo > /tmp/ibv_precheck.out 2>&1; then if ! grep -q "PORT_ACTIVE" /tmp/ibv_precheck.out; then echo "ERROR: InfiniBand devices present but ibv_devinfo shows no PORT_ACTIVE (see network-operator README)." cat /tmp/ibv_precheck.out exit 1 fi echo "✓ ibv_devinfo reports PORT_ACTIVE" + else + echo "ERROR: NCCL_TEST_STRICT_IB_PORT_ACTIVE=true but ibv_devinfo failed" + cat /tmp/ibv_precheck.out 2>/dev/null || true + exit 1 fi fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/files/nccl-test/entrypoint.sh` around lines 72 - 81, The script currently ignores failures from ibv_devinfo; modify the ibv_devinfo check in entrypoint.sh so that if the ibv_devinfo invocation itself fails and strict IB mode is enabled (use the existing strict-mode flag/variable used elsewhere in the script, e.g., STRICT_IB or similar), the script should print an error including the ibv_devinfo output and exit non-zero; otherwise (non-strict mode) continue as before. Specifically, update the block that runs `ibv_devinfo > /tmp/ibv_precheck.out 2>&1` to test its exit status, and on failure when strict mode is true call `echo` (with context) and `exit 1`, while preserving the existing PORT_ACTIVE grep path when ibv_devinfo succeeds.k8s/helm/templates/networking/kyverno-policy.yaml-187-188 (1)
187-188: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
NCCL_TOPO_FILEpoints to a placeholder, not the mounted file.Line 188 should reference the actual mounted ConfigMap path, otherwise NCCL topology injection is ineffective.
Suggested fix
- name: "NCCL_TOPO_FILE" - value: NcclTopoFilePath + value: "/etc/nccl/nccl-topo.xml"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 187 - 188, The environment variable NCCL_TOPO_FILE currently contains the literal placeholder "NcclTopoFilePath" instead of pointing at the actual mounted ConfigMap file; change the value to the actual mount path or wire it to the ConfigMap key so the container can read the topology file. Locate the NCCL_TOPO_FILE entry in networking/kyverno-policy.yaml and either set value to the exact file path where the ConfigMap is mounted (for example the folder used by your volumeMount) or replace the value with a valueFrom configMapKeyRef that references the ConfigMap name and key that holds the topology file so the process reads the real file at runtime.k8s/helm/templates/networking/kyverno-policy.yaml-263-288 (1)
263-288: 🎯 Functional Correctness | 🟠 MajorGCP mutate
patchStrategicMergewrites Pod annotations at the wrong level—usemetadata.annotations.In
k8s/helm/templates/networking/kyverno-policy.yaml(lines 263-288), the PodpatchStrategicMergesetsannotations:at the patch root; for Pod resources these must be undermetadata.annotationsso Kyverno can apply them.Suggested fix
patchStrategicMerge: - annotations: - devices.gke.io/container.tcpxo-daemon: | + metadata: + annotations: + devices.gke.io/container.tcpxo-daemon: | - path: /dev/nvidia0 - path: /dev/nvidia1 - path: /dev/nvidia2 - path: /dev/nvidia3 - path: /dev/nvidia4 - path: /dev/nvidia5 - path: /dev/nvidia6 - path: /dev/nvidia7 - path: /dev/nvidiactl - path: /dev/nvidia-uvm - path: /dev/dmabuf_import_helper - networking.gke.io/default-interface: eth0 - networking.gke.io/interfaces: | + networking.gke.io/default-interface: eth0 + networking.gke.io/interfaces: | [ {"interfaceName":"eth0","network":"default"}, {"interfaceName":"eth1","network":"gpu-nic0"}, {"interfaceName":"eth2","network":"gpu-nic1"}, {"interfaceName":"eth3","network":"gpu-nic2"}, {"interfaceName":"eth4","network":"gpu-nic3"}, {"interfaceName":"eth5","network":"gpu-nic4"}, {"interfaceName":"eth6","network":"gpu-nic5"}, {"interfaceName":"eth7","network":"gpu-nic6"}, {"interfaceName":"eth8","network":"gpu-nic7"} ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 263 - 288, The Kyverno patch under patchStrategicMerge is placing annotations at the patch root instead of under metadata.annotations; update the patch used in the Kyverno policy (patchStrategicMerge) so that the existing annotations block (including devices.gke.io/container.tcpxo-daemon, networking.gke.io/default-interface and networking.gke.io/interfaces) is nested beneath metadata.annotations for the Pod template/resource; keep the annotation keys and values unchanged but move them into metadata.annotations so Kyverno will apply them correctly.k8s/helm/templates/tests/nccl-test.yaml-187-189 (1)
187-189: 🩺 Stability & Availability | 🟠 MajorPin Kubernetes Python client version in helm test.
k8s/helm/templates/tests/nccl-test.yamlinstallskubernetes>=28.0.0at test runtime, makinghelm testnon-deterministic (different runs can pull different kubernetes client versions). Pin an exact version (or bake it into the orchestrator image and remove the runtimepip install).Suggested change
- pip install --no-cache-dir 'kubernetes>=28.0.0' >/dev/null + pip install --no-cache-dir 'kubernetes==<pinned-version>' >/dev/null🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 187 - 189, The test YAML currently runs pip install 'kubernetes>=28.0.0' at runtime (the pip install line in nccl-test.yaml) which makes helm test nondeterministic; change that line to pin an exact client version (e.g. pip install --no-cache-dir 'kubernetes==28.0.0') or remove the runtime pip install entirely and ensure /scripts/orchestrator.py runs with a prebaked image that already contains the pinned kubernetes client; update the nccl-test.yaml pip install invocation or the orchestrator image accordingly so tests always use a fixed kubernetes client version.k8s/helm/templates/proxy/envoy-service.yaml-13-22 (1)
13-22: 🔒 Security & Privacy | 🟠 MajorGate Envoy admin Service exposure behind an explicit opt-in (prevents NodePort/LoadBalancer leakage).
k8s/helm/templates/proxy/envoy-service.yamlalways publishes the Envoyadminport via a Service whosetypeis configurable (.Values.envoyProxy.service.type). Envoy binds the admin listener to0.0.0.0, so withNodePort/LoadBalancerthe admin endpoint can become externally reachable.Proposed direction
ports: - name: proxy port: {{ .Values.envoyProxy.service.port }} targetPort: proxy protocol: TCP + {{- if .Values.envoyProxy.service.exposeAdminPort }} - name: admin port: {{ .Values.envoyProxy.adminPort }} targetPort: admin protocol: TCP + {{- end }}Add
envoyProxy.service.exposeAdminPortdefaulting tofalseand enable only where needed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/proxy/envoy-service.yaml` around lines 13 - 22, The Service currently always exposes the Envoy admin port (port name "admin"/template envoy-service.yaml) which can leak externally when .Values.envoyProxy.service.type is NodePort/LoadBalancer; add a new boolean value .Values.envoyProxy.service.exposeAdminPort (default false in values.yaml) and change the template to include the "admin" port entry only when that flag is true (keep the existing "proxy" port unconditional), ensuring you reference the env var names admin port (.Values.envoyProxy.adminPort) and service type (.Values.envoyProxy.service.type) so maintainers can opt-in to exposing admin.k8s/helm/templates/ingress.yaml-25-30 (1)
25-30: 🎯 Functional Correctness | 🟠 MajorGuard
.Values.ingress.defaultHostagainst empty.Values.ingress.hoststo prevent render-time Helm failure.
k8s/helm/templates/ingress.yamlindexes(index .Values.ingress.hosts 0).pathswhendefaultHostis set, without checking thatingress.hostshas at least one element. This can crashhelm templateif users overrideingress.hosts: [].Proposed fix
rules: {{- if .Values.ingress.defaultHost }} + {{- if eq (len .Values.ingress.hosts) 0 }} + {{- fail "ingress.hosts must contain at least one item when ingress.defaultHost is set" }} + {{- end }} - host: {{ .Values.ingress.defaultHost | quote }} http: paths: {{- range (index .Values.ingress.hosts 0).paths }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/ingress.yaml` around lines 25 - 30, The template accesses (index .Values.ingress.hosts 0).paths when .Values.ingress.defaultHost is present, which crashes if .Values.ingress.hosts is empty; update the ingress template to guard that .Values.ingress.hosts has at least one element before indexing (e.g., check with "and .Values.ingress.defaultHost (gt (len .Values.ingress.hosts) 0)" or equivalent conditional) and only iterate over (index .Values.ingress.hosts 0).paths when that condition is true so helm template won't fail when ingress.hosts is [].k8s/helm/templates/proxy/_helpers.tpl-21-21 (1)
21-21: 🎯 Functional Correctness | 🟠 MajorTruncate fullname before appending
-envoyinnmp-envoy.serviceAccountNameFile:
k8s/helm/templates/proxy/_helpers.tpl
Lines: 21-21{{- default (printf "%s-envoy" (include "nemo-platform.fullname" .)) .Values.envoyProxy.serviceAccount.name }}
nemo-platform.fullnameis truncated to 63 chars, butnmp-envoy.serviceAccountNameappends-envoywithout truncation, so the resultingmetadata.namecan exceed the DNS label 63-char limit and fail resource creation. Same un-truncated pattern also exists fornmp-api.apiServiceAccountNameandnmp-core.*ServiceAccountNamehelpers.Proposed fix
-{{- default (printf "%s-envoy" (include "nemo-platform.fullname" .)) .Values.envoyProxy.serviceAccount.name }} +{{- default (printf "%s-envoy" (include "nemo-platform.fullname" . | trunc 57)) .Values.envoyProxy.serviceAccount.name }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/proxy/_helpers.tpl` at line 21, Truncate the fullname before appending suffixes so the final serviceAccount name stays within the 63-char DNS label limit; replace the current expression with one that truncates the result of (include "nemo-platform.fullname" .) to 57 chars then appends "-envoy" (63 - len("-envoy") = 57), e.g. use the template truncation function on nemo-platform.fullname inside the printf; apply the same pattern to the nmp-api.apiServiceAccountName and all nmp-core.*ServiceAccountName helpers, adjusting the truncation length by subtracting the suffix length from 63 for each suffix.k8s/helm/templates/openshift-route.yaml-20-20 (1)
20-20: 🎯 Functional Correctness | 🟠 MajorDon’t coerce OpenShift Route
targetPorttoint
route.openshift.io/v1spec.port.targetPortisIntOrString(named or numeric), but the template forces| int, so a named port becomes0and breaks routing.Suggested fix
- targetPort: {{ tpl .Values.openshiftRoute.targetPort $ | int }} + targetPort: {{ tpl .Values.openshiftRoute.targetPort $ }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/openshift-route.yaml` at line 20, The template coerces .Values.openshiftRoute.targetPort to an int which breaks named ports; update the openshift-route.yaml so spec.port.targetPort uses the rendered tpl value directly (remove the | int coercion) and preserve the original type/string (e.g., use {{ tpl .Values.openshiftRoute.targetPort $ }} or quote it) so that .Values.openshiftRoute.targetPort can be a named port or a numeric port without being cast to 0.k8s/helm/values.yaml-22-25 (1)
22-25: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winSecret defaults disable chart-generated secrets by default.
Line 22 and Line 25 are non-empty, so
ngc-api-secret.yamlandimagepull-secret.yamldo not render, while workloads still reference those secret names. Fresh installs fail unless both secrets already exist. Default these to""or hard-fail with explicit validation when missing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/values.yaml` around lines 22 - 25, The values existingSecret and existingImagePullSecret are set to non-empty defaults which disable rendering of the chart templates ngc-api-secret.yaml and imagepull-secret.yaml while workloads still reference those secret names; change the defaults for existingSecret and existingImagePullSecret in values.yaml to "" (empty string) so the chart generates the secrets, or alternatively add explicit validation in the chart (e.g., in _helpers or a pre-install hook) that fails fast when those values are non-empty but the referenced secrets are missing; update references to the symbols existingSecret and existingImagePullSecret and the templates ngc-api-secret.yaml/imagepull-secret.yaml accordingly so fresh installs do not fail.k8s/helm/templates/_helpers.tpl-237-247 (1)
237-247: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winExternal DB path can render an invalid empty secret reference.
When
postgresql.enabled=falseand noexternalDatabase.existingSecret/uriSecretis set, Line 268 renderssecretKeyRef.nameempty. Fail early withrequiredinstead of emitting invalid manifests.Suggested fix
{{- define "nemo-common.postgresql.password" -}} {{- if not (and .Values.externalDatabase.uriSecret .Values.externalDatabase.uriSecret.name .Values.externalDatabase.uriSecret.key) }} +{{- if and (not .Values.postgresql.enabled) (not .Values.externalDatabase.existingSecret) -}} +{{- required "externalDatabase.existingSecret is required when postgresql.enabled=false and externalDatabase.uriSecret is not set" .Values.externalDatabase.existingSecret -}} +{{- end }} - name: POSTGRES_DB_PASSWORD valueFrom: secretKeyRef: name: {{ include "nemo-common.postgresql.secret-name" .}} key: {{ include "nemo-common.postgresql.password-key" .}} {{- end }} {{- end -}}Also applies to: 252-270
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/_helpers.tpl` around lines 237 - 247, The helper template "nemo-common.postgresql.secret-name" can return an empty string when postgresql.enabled is false and neither externalDatabase.existingSecret nor uriSecret (or the postgresql auth existingSecret) is set, which leads to invalid manifests (empty secretKeyRef.name); change the template to fail early using Helm's required function so it errors when no secret is provided (e.g., wrap the externalDatabase.existingSecret or uriSecret lookup with required and a clear message), and apply the same fix to the other helper occurrences referenced around lines 252-270 so any missing secret causes a clear required error instead of emitting an empty name.k8s/helm/templates/_helpers.tpl-183-185 (1)
183-185: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPostgreSQL service account name can exceed 63 chars.
Line 184 appends
-postgrestonemo-platform.fullnamewithout truncation. Long release names will generate invalid ServiceAccount names.Suggested fix
{{- define "nemo-common.postgresql.serviceAccountName" -}} {{- if .Values.postgresql.serviceAccount.create -}} -{{- default (printf "%s-postgres" (include "nemo-platform.fullname" .)) .Values.postgresql.serviceAccount.name }} +{{- default (include "nemo-common.postgresql.fullname" .) .Values.postgresql.serviceAccount.name }} {{- else -}} {{- default "default" .Values.postgresql.serviceAccount.name }} {{- end -}} {{- end -}}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/_helpers.tpl` around lines 183 - 185, The ServiceAccount name currently appends "-postgres" to (include "nemo-platform.fullname" .) without truncation, which can exceed the 63-char DNS limit; update the template producing the default name to truncate the fullname to 54 characters before appending "-postgres" (63 - len("-postgres") = 54) so the final name is <=63 chars, e.g. replace the default expression that uses (include "nemo-platform.fullname" .) with one that calls the truncation function (e.g. trunc/transpose appropriate Helm/Sprig trunc function) on that include and then printf "%s-postgres" against the truncated value, keeping the existing fallback to .Values.postgresql.serviceAccount.name.k8s/helm/templates/_helpers.tpl-294-297 (1)
294-297: 🩺 Stability & Availability | 🟠 MajorSort map keys before rendering env vars to keep output deterministic.
These
rangeblocks iterate over Go template maps ($mergedis built viadict+set, and other.Values.*.envfields arekindIs "map"), so env var ordering can flap between renders.{{- range $key, $val := $merged }} - name: {{ $key }} value: {{ $val | quote }} {{- end }}Sort keys before rendering (same issue also at
355-363,376-384,397-405,417-425).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/_helpers.tpl` around lines 294 - 297, The env var rendering iterates over the map $merged which makes output order nondeterministic; change to iterate over a sorted list of keys (use keys $merged piped to sortAlpha) and then index into $merged for each key so the - name/value blocks are emitted in deterministic alphabetical order; apply the same pattern for the other range blocks mentioned (around the occurrences at lines ~355-363, ~376-384, ~397-405, ~417-425) using the sorted keys and index function.k8s/helm/templates/api/api-servicemonitor.yaml-21-24 (1)
21-24: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winSelector mismatch: using full labels instead of selectorLabels.
Line 24 uses
nemo-platform.labelswhich includes chart version and managed-by. ServiceMonitor selector should usenemo-platform.selectorLabelsto match the Service selector (line 22 in api-service.yaml).🔧 Fix selector
selector: matchLabels: app.kubernetes.io/component: nmp-api - {{- include "nemo-platform.labels" . | nindent 6 }} + {{- include "nemo-platform.selectorLabels" . | nindent 6 }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/api-servicemonitor.yaml` around lines 21 - 24, The ServiceMonitor selector is using the full label set via the template helper nemo-platform.labels (inside matchLabels) which includes version/managed-by keys and therefore mismatches the Service's selector; change the selector in api-servicemonitor.yaml to use the selector-only helper nemo-platform.selectorLabels instead of nemo-platform.labels so matchLabels exactly matches the Service selector (update the template reference where matchLabels includes {{- include "nemo-platform.labels" . | nindent 6 }} to use {{- include "nemo-platform.selectorLabels" . | nindent 6 }}).k8s/helm/templates/api/api-hpa.yaml-21-37 (1)
21-37: 🎯 Functional Correctness | 🟠 MajorGuard against rendering an HPA with an empty
metricslist
The template only emits CPU/memory metrics whenapi.autoscaling.targetCPUUtilizationPercentage/...targetMemoryUtilizationPercentageare set; if both are unset whileapi.autoscaling.enabled=true,metrics:becomes empty and Kubernetes rejects the HPA. Current defaults are safe (api.autoscaling.enabled=false,targetCPUUtilizationPercentage=80), but add a guard/required(or ensure one default target is set) to prevent the misconfigured invalid manifest.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/api-hpa.yaml` around lines 21 - 37, The HPA template can render an empty metrics: block if neither .Values.api.autoscaling.targetCPUUtilizationPercentage nor .Values.api.autoscaling.targetMemoryUtilizationPercentage are set; update the api-hpa.yaml template to guard rendering of the entire metrics: section by checking that at least one of those values exists (or provide a safe default metric) and ensure .Values.api.autoscaling.enabled is considered when deciding to emit metrics; locate the metrics block using the symbols .Values.api.autoscaling.targetCPUUtilizationPercentage and .Values.api.autoscaling.targetMemoryUtilizationPercentage and change the template logic so metrics: is only emitted when it will contain at least one valid entry.k8s/helm/templates/api/api-deployment.yaml-95-99 (1)
95-99: 🩺 Stability & Availability | 🟠 MajorAdd fail-fast/documentation for the NGC_API_KEY secret dependency
k8s/helm/templates/api/api-deployment.yaml(lines 95-99) setsNGC_API_KEYviasecretKeyRefusing{{ .Values.existingSecret | default "ngc-api" }};k8s/helm/values.yamldefaultsexistingSecret: ngc-api. The chart provides no install-time warning guidance ink8s/helm/templates/NOTES.txtfor this dependency. Document the prerequisite (or add validation/fail-fast) so installs don’t crashloop on a missing Secret.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/api-deployment.yaml` around lines 95 - 99, The chart currently mounts NGC_API_KEY from a Secret named by .Values.existingSecret (default "ngc-api") but provides no install-time guidance or validation; add a NOTES.txt entry that documents the required Secret name and example create command, and add a simple template validation that uses the Helm lookup function to check for the Secret (using .Values.existingSecret) during render and emits a clear failure (using the fail function) or a prominent NOTE if missing; reference the NGC_API_KEY env var, the .Values.existingSecret value/default, and update k8s/helm/templates/NOTES.txt and add a new validation template (e.g., templates/_validate-secret.yaml) to implement the check.k8s/helm/templates/core/_helpers.tpl-37-42 (1)
37-42: 🔒 Security & Privacy | 🟠 MajorRemove
"default"ServiceAccount fallback whencreate=false(controller RBAC/security)
nmp-core.controllerServiceAccountNamefalls back to"default"when.Values.core.controller.serviceAccount.create=falseand.Values.core.controller.serviceAccount.nameis unset. That helper drives both the controller DeploymentserviceAccountNameand thecontroller-role.yamlRoleBindingsubjects[].name, so a misconfigured release can bind controller RBAC to the namespace’s default ServiceAccount.
- Same unsafe fallback pattern exists in
nmp-core.apiServiceAccountNameandnmp-core.jobsServiceAccountName(core/_helpers.tpl); handlecreate=falseconsistently (no"default"implicit name).nmp-core.database-migrations-servicenameappends-core-migrationstonemo-platform.fullnamewithout truncation; if used for a KubernetesService.metadata.name, it can exceed the 63-char DNS label limit.Suggested fix
{{- define "nmp-core.controllerServiceAccountName" -}} {{- if .Values.core.controller.serviceAccount.create }} {{- default (printf "%s-core-controller" (include "nemo-platform.fullname" .)) .Values.core.controller.serviceAccount.name }} {{- else }} -{{- default "default" .Values.core.controller.serviceAccount.name }} +{{- required "core.controller.serviceAccount.name must be set when core.controller.serviceAccount.create=false" .Values.core.controller.serviceAccount.name }} {{- end }} {{- end }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/core/_helpers.tpl` around lines 37 - 42, The helpers are unsafe: nmp-core.apiServiceAccountName (and similarly nmp-core.jobsServiceAccountName and nmp-core.controllerServiceAccountName) currently fall back to the literal "default" when .Values.*.serviceAccount.create is false and .Values.*.serviceAccount.name is unset; remove that implicit "default" fallback so that when create=false and no explicit name is provided the helper returns empty (or nil) and callers must not bind RBAC to the namespace default SA. Update those three helpers to: if .Values.*.serviceAccount.create is true return the provided name or the generated fullname-core, otherwise return empty/omit value. Also update nmp-core.database-migrations-servicename to generate the fullname with "-core-migrations" but ensure the result is truncated to a DNS-1123 label (max 63 chars) — use the chart's truncation utility (or substring) to cap length while preserving the suffix "-core-migrations".k8s/helm/templates/postgres/postgres-service.yaml-9-10 (1)
9-10: 🩺 Stability & Availability | 🟠 MajorMake the governing PostgreSQL Service headless for the StatefulSet.
postgres-statefulset.yamlsetsspec.serviceNameto the same fullname aspostgres-service.yaml’s Service, butpostgres-service.yamlrendersspec.type: ClusterIPand there is noclusterIP: Noneheadless Service underk8s/helm/templates/postgres, which can break StatefulSet pod DNS/stable identity.Suggested fix
spec: - type: ClusterIP + clusterIP: None ports:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/postgres/postgres-service.yaml` around lines 9 - 10, The Service rendered by postgres-service.yaml is currently a non-headless ClusterIP which breaks StatefulSet stable network identity because postgres-statefulset.yaml sets spec.serviceName to the Service fullname; change the Service to be headless by setting clusterIP: None (and remove/ignore any external type requirements) so it becomes a governing headless Service for the StatefulSet; update postgres-service.yaml to set clusterIP: None while keeping the same metadata name used by postgres-statefulset.yaml (reference spec.serviceName in postgres-statefulset.yaml and spec.type/clusterIP in postgres-service.yaml).
🟡 Minor comments (2)
services/guardrails/callouts/README.md-12-12 (1)
12-12: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winVague product documentation reference.
"product documentation" lacks a specific URL. Local developers may struggle to find the correct Guardrails configuration examples.
Consider adding a direct link or verifying the product docs contain the necessary examples for local setup.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/guardrails/callouts/README.md` at line 12, Update the README line "Create nemoguard configs using the Guardrails configuration examples in the product documentation." to include a direct link (or internal doc anchor) to the exact "Guardrails configuration examples" page or section and/or the specific example file names needed for local setup, and confirm the referenced product docs actually contain the example snippets required for creating nemoguard configs.k8s/helm/templates/api/api-servicemonitor.yaml-13-19 (1)
13-19: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winEmpty annotations block when both annotation sources are unset.
If both
api.annotationsandapi.serviceMonitor.annotationsare empty, line 13 rendersannotations:with no keys, producing invalid YAML.🔧 Wrap annotations in conditional
+ {{- if or .Values.api.annotations .Values.api.serviceMonitor.annotations }} annotations: {{- with .Values.api.annotations }} {{- toYaml . | nindent 4 }} {{- end }} {{- with .Values.api.serviceMonitor.annotations }} {{- toYaml . | nindent 4 }} {{- end }} + {{- end }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/api-servicemonitor.yaml` around lines 13 - 19, Wrap the annotations section in a conditional so "annotations:" is only rendered when either .Values.api.annotations or .Values.api.serviceMonitor.annotations is non-empty; replace the current block with an if checking "or .Values.api.annotations .Values.api.serviceMonitor.annotations" and keep the existing toYaml | nindent 4 lines for each with block inside that conditional so indentation and existing annotation sources remain unchanged.
🧹 Nitpick comments (3)
k8s/helm/templates/api/_helpers.tpl (2)
42-48: 🩺 Stability & Availability | ⚖️ Poor tradeoffRisk: Fragile nested YAML access.
Line 44 accesses
.files.default_storage_config.typeafter a single nil check on.files. Ifdefault_storage_configortypeis missing, the template will fail at render time.Consider adding defensive checks:
{{- if and (include "nemo-platform.calculatedConfig" . | fromYaml).files (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config -}} {{- eq (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config.type "local" -}}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/_helpers.tpl` around lines 42 - 48, The template nmp-core.localStorageEnabled is accessing nested fields on (include "nemo-platform.calculatedConfig" . | fromYaml) without checking for existence of default_storage_config or type; update the template to first assign the included/calculated config to a local variable (e.g., $cfg) and then use defensive checks like and $cfg.files and $cfg.files.default_storage_config before evaluating $cfg.files.default_storage_config.type == "local" so the template won't fail when those keys are missing.
53-57: 🩺 Stability & Availability | ⚖️ Poor tradeoffRisk: Missing null check on nested path.
Line 55 accesses
.files.default_storage_config.pathwithout verifyingdefault_storage_configexists. Template fails if structure is incomplete.Add safety check for
default_storage_config:{{- if and (include "nemo-platform.calculatedConfig" . | fromYaml).files (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config -}} {{ (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config.path | default "" }} {{- end -}}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api/_helpers.tpl` around lines 53 - 57, The template define "nmp-core.localStoragePath" accesses .files.default_storage_config.path without null-checking default_storage_config, causing failures when that key is missing; update the template to guard the access by adding an existence check (e.g., use an if with and to ensure (include "nemo-platform.calculatedConfig" . | fromYaml).files and (include "nemo-platform.calculatedConfig" . | fromYaml).files.default_storage_config are present) before reading .files.default_storage_config.path so the template returns the default "" when missing.k8s/helm/templates/core/_helpers.tpl (1)
30-32: 🎯 Functional CorrectnessHarden the migrations servicename helper (but it’s currently unused in this chart).
"nmp-core.database-migrations-servicename"is defined without truncation, but repo/template searches show it has no usages in this chart—so the installation/upgrade failure risk is currently unproven. Still, truncate it to keep the helper safe for DNS-1123 name length if it’s used later.Suggested fix
{{- define "nmp-core.database-migrations-servicename" }} -{{- printf "%s-core-migrations" ( include "nemo-platform.fullname" .) }} +{{- printf "%s-core-migrations" ( include "nemo-platform.fullname" . | trunc 47 | trimSuffix "-" ) }} {{- end }}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/core/_helpers.tpl` around lines 30 - 32, The helper "nmp-core.database-migrations-servicename" currently builds a service name from ( include "nemo-platform.fullname" .) without truncation, risking DNS-1123 length violations; update that helper to truncate the resulting "%s-core-migrations" to 63 characters (DNS-1123 max) and strip any trailing hyphen so the output is a valid DNS name—use the existing fullname include, then apply Helm truncation (e.g., trunc 63) and a trimSuffix "-" operation to produce the final service name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e134744b-485d-4b31-bc6c-311900c9fbe4
📒 Files selected for processing (79)
.github/actions/changes/action.yaml.github/workflows/ci.yaml.gitignore.pre-commit-config.yamlAGENTS.mdk8s/helm/.helmignorek8s/helm/Chart.yamlk8s/helm/README.mdk8s/helm/ci/01-preexisting-imagepullsecret.yamlk8s/helm/ci/02-empty-imagepullsecret.yamlk8s/helm/ci/03-global-security-context.yamlk8s/helm/ci/04-security-context-override.yamlk8s/helm/ci/05-enable-auth.yamlk8s/helm/ci/06-kyverno-aws.yamlk8s/helm/ci/07-kyverno-azure.yamlk8s/helm/ci/08-kyverno-gcp.yamlk8s/helm/ci/09-kyverno-oci.yamlk8s/helm/ci/10-override-image-registry.yamlk8s/helm/ci/11-setting-env.yamlk8s/helm/ci/12-topology-spread-constraints.yamlk8s/helm/ci/13-openshift.yamlk8s/helm/ci/14-pdb-min-available.yamlk8s/helm/ci/14-secretfromenv.yamlk8s/helm/ci/15-pdb-max-unavailable.yamlk8s/helm/ci/16-enable-ingress.yamlk8s/helm/ci/17-ingress-default-host.yamlk8s/helm/ci/18-enable-httproute.yamlk8s/helm/ci/19-envoy-extra-args.yamlk8s/helm/files/nccl-test/entrypoint.shk8s/helm/files/nccl-test/nccl-env.shk8s/helm/files/nccl-test/nccl_test.pyk8s/helm/files/nccl-test/orchestrator.pyk8s/helm/helm-docs-template/nemo-helm-readme.md.gotmplk8s/helm/templates/NOTES.txtk8s/helm/templates/_config-render.tplk8s/helm/templates/_helpers.tplk8s/helm/templates/api-env-secret.yamlk8s/helm/templates/api/_helpers.tplk8s/helm/templates/api/api-deployment.yamlk8s/helm/templates/api/api-hpa.yamlk8s/helm/templates/api/api-pdb.yamlk8s/helm/templates/api/api-service.yamlk8s/helm/templates/api/api-serviceaccount.yamlk8s/helm/templates/api/api-servicemonitor.yamlk8s/helm/templates/core/_helpers.tplk8s/helm/templates/core/controller-deployment.yamlk8s/helm/templates/core/controller-role.yamlk8s/helm/templates/core/controller-service-headless.yamlk8s/helm/templates/core/controller-serviceaccount.yamlk8s/helm/templates/core/controller-servicemonitor.yamlk8s/helm/templates/core/jobs-serviceaccount.yamlk8s/helm/templates/core/shared-pvc.yamlk8s/helm/templates/httproute.yamlk8s/helm/templates/imagepull-secret.yamlk8s/helm/templates/ingress.yamlk8s/helm/templates/models-files-auth-secret.yamlk8s/helm/templates/networking/kyverno-policy.yamlk8s/helm/templates/networking/nccl-topology-configmap.yamlk8s/helm/templates/ngc-api-secret.yamlk8s/helm/templates/openshift-route.yamlk8s/helm/templates/platform-configmap.yamlk8s/helm/templates/platform-seed-job.yamlk8s/helm/templates/postgres/postgres-secret.yamlk8s/helm/templates/postgres/postgres-service.yamlk8s/helm/templates/postgres/postgres-serviceaccount.yamlk8s/helm/templates/postgres/postgres-statefulset.yamlk8s/helm/templates/proxy/_helpers.tplk8s/helm/templates/proxy/envoy-configmap.yamlk8s/helm/templates/proxy/envoy-deployment.yamlk8s/helm/templates/proxy/envoy-hpa.yamlk8s/helm/templates/proxy/envoy-service.yamlk8s/helm/templates/proxy/envoy-serviceaccount.yamlk8s/helm/templates/proxy/envoy-servicemonitor.yamlk8s/helm/templates/tests/nccl-test.yamlk8s/helm/values.yamlservices/core/jobs/README.mdservices/guardrails/callouts/README.mdtools/lint/lint-helm.shweb/packages/studio/scripts/feature-flag-matrix.ts
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
k8s/helm/templates/tests/nccl-test.yaml (3)
37-39: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winNamespace is missing from cluster-scoped hook names.
$ncclTestBasefeeds theClusterRoleandClusterRoleBindingnames, but it is derived only from the release fullname. Two installs with the same release name in different namespaces will render identical cluster-scoped hook resources, so concurrenthelm testruns can fail withAlreadyExistsor Helm ownership conflicts. Add namespace-specific entropy here before deriving the cluster-scoped names.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 37 - 39, Cluster-scoped hook names derived by $ncclTestBase are namespace-agnostic and can collide across namespaces; include namespace-specific entropy when building $ncclTestBase (for example incorporate .Release.Namespace or a short hash of it) so the ClusterRole/ClusterRoleBinding names become unique per namespace (adjust the printf used to build $ncclTestBase and keep existing truncation logic to preserve length limits).
169-183: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick winThe test hook pods bypass the chart security-context contract.
Neither job threads the shared pod/container security-context settings into its pod spec. That leaves
helm testoutside the hardening knobs added elsewhere in this rollout, so restricted clusters can admit the release and then reject the test pods. Wire the same security-context helpers/values into both hook jobs.Also applies to: 287-300
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 169 - 183, The test hook PodSpec is missing the shared pod/container security-context settings so test pods bypass chart hardening; update the nccl-test hook templates (template.spec and the orchestrator container spec) to inject the same helpers/values used elsewhere by adding the chart's pod-level securityContext and the container-level securityContext into the PodSpec and the orchestrator container (e.g. call the existing pod security-context helper and the container security-context helper the chart uses, such as the common podSecurityContext and containerSecurityContext helpers/values) so both hook jobs honor the same locked-down settings.
184-189: 🩺 Stability & Availability | 🟠 MajorAvoid runtime
pip install kubernetes>=28.0.0inhelm test
k8s/helm/templates/tests/nccl-test.yamlinstallskubernetes>=28.0.0at runtime before runningpython -u /scripts/orchestrator.py, andk8s/helm/files/nccl-test/orchestrator.pyimports/uses thekubernetesclient. This makeshelm testdepend on outbound PyPI/egress,pip, and a nondeterministic dependency resolution. Bake a pinnedkubernetesversion intoncclTest.orchestrator.image(or otherwise include it in the image) and remove the runtimepip install.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/tests/nccl-test.yaml` around lines 184 - 189, The helm test currently runs a runtime pip install in nccl-test.yaml (the args block that executes "pip install --no-cache-dir 'kubernetes>=28.0.0'") before invoking python -u /scripts/orchestrator.py; remove that pip install line and ensure the container image referenced by ncclTest.orchestrator.image includes a pinned kubernetes client (install a specific version, e.g. kubernetes==X.Y.Z) baked into the image so orchestrator.py can import kubernetes without runtime pip, then update image build (Dockerfile or image pipeline) to install the pinned dependency and bump ncclTest.orchestrator.image to the new image tag.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@k8s/helm/templates/_helpers.tpl`:
- Around line 198-203: The helper
nemo-platform.defaultEncryptionKeyGeneratorServiceAccountName currently falls
back to "default" when create=false and name is empty; change it so that when
.Values.secrets.defaultEncryptionKey.generated.serviceAccount.create is false
you require an explicit
.Values.secrets.defaultEncryptionKey.generated.serviceAccount.name (use Helm's
required function with a clear message) instead of defaulting to "default"; keep
the existing behavior of using include
"nemo-platform.defaultEncryptionKeyGeneratorName" only when create is true and
the generated SA is created.
In `@k8s/helm/templates/api-env-secret-generator.yaml`:
- Line 15: The job must always have the service account token mounted; change
the template so automountServiceAccountToken is not driven by
secrets.defaultEncryptionKey.generated.serviceAccount.automount. Replace the
conditional value at automountServiceAccountToken with a hardcoded true (or a
default true via the template default function) so the job can always read the
service-account token and CA files used in the code paths that read those files
(the service-account token/CA reads referenced around the service-account token
and CA file handling).
- Around line 73-80: The hook only applies
.Values.secrets.defaultEncryptionKey.generated.{podSecurityContext,securityContext}
and must instead merge chart-wide defaults; update the api-env-keygen container
and pod securityContext blocks to merge the chart/global helpers with the
generated values (i.e., call the existing chart helper functions used by the
main workloads such as the global podSecurityContext and securityContext merge
helpers) so that
.Values.secrets.defaultEncryptionKey.generated.podSecurityContext and
.Values.secrets.defaultEncryptionKey.generated.securityContext are overlaid on
top of the chart-wide defaults rather than replacing them.
In `@k8s/helm/templates/api-env-secret.yaml`:
- Around line 2-12: Validate the provided defaultEncryptionKey before rendering:
check .Values.secrets.defaultEncryptionKey.value is present (use required or
fail) and matches base64 format and decodes to at least 32 bytes; if the check
fails, abort template rendering with a clear error message referencing the
secret name produced by include "nemo-platform.apiEnvSecretName" and the env var
include "nemo-platform.defaultEncryptionKeyEnvName" so the chart fails fast
instead of producing an invalid NMP_SECRETS_DEFAULT_ENCRYPTION_KEY that will
break at runtime.
---
Outside diff comments:
In `@k8s/helm/templates/tests/nccl-test.yaml`:
- Around line 37-39: Cluster-scoped hook names derived by $ncclTestBase are
namespace-agnostic and can collide across namespaces; include namespace-specific
entropy when building $ncclTestBase (for example incorporate .Release.Namespace
or a short hash of it) so the ClusterRole/ClusterRoleBinding names become unique
per namespace (adjust the printf used to build $ncclTestBase and keep existing
truncation logic to preserve length limits).
- Around line 169-183: The test hook PodSpec is missing the shared pod/container
security-context settings so test pods bypass chart hardening; update the
nccl-test hook templates (template.spec and the orchestrator container spec) to
inject the same helpers/values used elsewhere by adding the chart's pod-level
securityContext and the container-level securityContext into the PodSpec and the
orchestrator container (e.g. call the existing pod security-context helper and
the container security-context helper the chart uses, such as the common
podSecurityContext and containerSecurityContext helpers/values) so both hook
jobs honor the same locked-down settings.
- Around line 184-189: The helm test currently runs a runtime pip install in
nccl-test.yaml (the args block that executes "pip install --no-cache-dir
'kubernetes>=28.0.0'") before invoking python -u /scripts/orchestrator.py;
remove that pip install line and ensure the container image referenced by
ncclTest.orchestrator.image includes a pinned kubernetes client (install a
specific version, e.g. kubernetes==X.Y.Z) baked into the image so
orchestrator.py can import kubernetes without runtime pip, then update image
build (Dockerfile or image pipeline) to install the pinned dependency and bump
ncclTest.orchestrator.image to the new image tag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 81a1b96c-bc51-46f4-a48d-126f62eb6887
📒 Files selected for processing (12)
k8s/helm/README.mdk8s/helm/ci/14-secretfromenv.yamlk8s/helm/ci/20-explicit-default-encryption-key.yamlk8s/helm/helm-docs-template/nemo-helm-readme.md.gotmplk8s/helm/templates/_helpers.tplk8s/helm/templates/api-env-secret-generator.yamlk8s/helm/templates/api-env-secret-upgrade-check.yamlk8s/helm/templates/api-env-secret.yamlk8s/helm/templates/api/api-deployment.yamlk8s/helm/templates/platform-seed-job.yamlk8s/helm/templates/tests/nccl-test.yamlk8s/helm/values.yaml
✅ Files skipped from review due to trivial changes (2)
- k8s/helm/ci/20-explicit-default-encryption-key.yaml
- k8s/helm/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- k8s/helm/templates/platform-seed-job.yaml
- k8s/helm/templates/api/api-deployment.yaml
- k8s/helm/values.yaml
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
02cae3e to
3fcb9a7
Compare
Signed-off-by: Brooke Storm <brookes@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/core/jobs/README.md (1)
56-63:⚠️ Potential issue | 🟡 MinorAdd missing
Fieldimport in the Python example.The code uses
Field(...)on lines 59–63 but doesn't import it from pydantic, causing a runtime NameError.Suggested patch
from pydantic_settings import BaseSettings +from pydantic import Field🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/core/jobs/README.md` around lines 56 - 63, The Python example in the README uses the Field function from pydantic in the MyFunctionalMicroserviceSettings class definition (lines 59-63) but does not import it. Add Field to the import statement by importing it from the pydantic module alongside the existing BaseSettings import from pydantic_settings to resolve the NameError that would occur at runtime.Source: Coding guidelines
♻️ Duplicate comments (3)
k8s/helm/templates/api-env-secret-generator.yaml (2)
73-80:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMerge chart-wide security contexts into this hook.
Lines 73-80 apply only
secrets.defaultEncryptionKey.generated.*contexts. This skips chart/global security defaults and can break hardened clusters or weaken expected posture for hook pods.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api-env-secret-generator.yaml` around lines 73 - 80, The api-env-keygen container and its pod are only using security contexts from secrets.defaultEncryptionKey.generated.* paths, which skips chart-wide security defaults. Merge the chart-wide security context values (typically from .Values.podSecurityContext and .Values.securityContext or similar chart-level paths) with the specific generated ones at both the pod level (podSecurityContext) and container level (securityContext) to ensure the hook respects global security policies. This requires updating both the pod-level securityContext template at line 74 and the container-level securityContext template at line 79 to include or merge with the appropriate chart-wide defaults.
15-15:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not make service-account token mount optional for this job.
Line 15 allows
automountServiceAccountToken: false, but Lines 99-105 always read token/CA files. That makes the hook fail whenever automount is disabled.Also applies to: 99-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api-env-secret-generator.yaml` at line 15, The automountServiceAccountToken field on line 15 is configured as optional via a Helm value, allowing it to be set to false. However, lines 99-105 unconditionally read the service account token and CA files, which requires the token to be mounted. Change line 15 to always set automountServiceAccountToken to true (hardcode it as true instead of using the configurable Helm value) to ensure the service account token is always available for the job to read the required token and CA files.k8s/helm/templates/api-env-secret.yaml (1)
2-13:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate
secrets.defaultEncryptionKey.valuebefore rendering it.Line 12 writes the user-provided key directly. Invalid/short material still passes templating and fails later when encryption is used. Add fail-fast validation for expected format/strength before this branch renders.
Also applies to: 20-20
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/api-env-secret.yaml` around lines 2 - 13, The encryption key from Values.secrets.defaultEncryptionKey.value is rendered directly without validation, allowing invalid or weak keys to pass through templating and only fail later during actual encryption. Add validation logic within the first branch (after the if condition that checks if the value exists) to verify that the encryption key meets expected format and strength requirements, and fail the template rendering early with a clear error message if validation fails. This should occur before the line where the key is written to the Secret stringData.
🧹 Nitpick comments (2)
k8s/helm/templates/networking/kyverno-policy.yaml (1)
465-467: ⚡ Quick winUnused variable
nodeGpuProduct.
nodeGpuProductis computed but never referenced in the OCI policy template. Dead code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/templates/networking/kyverno-policy.yaml` around lines 465 - 467, The variable `nodeGpuProduct` defined in the kyverno-policy.yaml file is computed using jmesPath but is never referenced elsewhere in the policy template, making it dead code. Remove the entire `nodeGpuProduct` variable definition block (including the name, variable, and jmesPath fields) from the OCI policy template to clean up the unused code.k8s/helm/README.md (1)
1-25: ⚡ Quick winAdd
Prerequisitesat the top andNext Stepsat the end.This page is useful reference content, but it currently misses the required framing sections for navigation and adoption.
As per coding guidelines: "Always list prerequisites at the top of documentation pages before other content" and "Include 'Next Steps' section at the end with cross-links to related documentation content."
Also applies to: 325-325
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@k8s/helm/README.md` around lines 1 - 25, Add a Prerequisites section at the top of the README.md file before the existing content, listing any required knowledge or setup needed before using the Helm chart. Then add a Next Steps section at the end of the document with cross-links to related documentation content, such as installation guides, configuration examples, or troubleshooting pages. These sections should follow the standard documentation guidelines for navigation and adoption framing.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci.yaml:
- Around line 111-117: Add SHA-256 verification for both kubeconform and
chart-verifier downloads before extraction. For each binary, download the
corresponding SHA-256 checksum file from the GitHub release (using the
KUBECONFORM_VERSION and CHART_VERIFIER_VERSION variables respectively), verify
the downloaded tar.gz archive against the checksum using sha256sum with the -c
flag, and only proceed with tar extraction if verification succeeds. If
verification fails, exit with an error to prevent execution of potentially
tampered binaries in CI.
In @.pre-commit-config.yaml:
- Around line 73-76: The `files` pattern in the pre-commit hook configuration is
too restrictive and only triggers on changes to `values.yaml`. To prevent README
drift when Chart.yaml or the helm-docs-template file is modified, expand the
files pattern to include these additional paths. Modify the `files:` field to
use a regex pattern that matches `values.yaml`, `Chart.yaml`, and the
`helm-docs-template/nemo-helm-readme.md.gotmpl` file so the hook runs whenever
any of these files change.
In `@k8s/helm/files/nccl-test/orchestrator.py`:
- Around line 160-170: The variable cs.state may be None, and the current code
attempts to access the .waiting, .terminated, and .running attributes without
checking if state exists first. Add a guard condition to verify that cs.state is
not None before attempting to access its nested attributes. This guard should
wrap all the conditional branches that access state.waiting, state.terminated,
and state.running to prevent AttributeError when state is None.
In `@k8s/helm/README.md`:
- Line 219: In the README.md file at line 219, the description for the
httpRoute.parentRefs configuration key contains an incorrect reference to
httpsRoute.enabled. Change the text "httpsRoute.enabled" to "httpRoute.enabled"
in the description to match the actual configuration key name used in the chart.
In `@k8s/helm/templates/httproute.yaml`:
- Around line 35-38: The port field in the backendRefs section is being rendered
using the tpl function which produces a string output, causing Kubernetes schema
validation failures. Modify the port line to ensure the value is rendered as an
integer by wrapping the tpl function call with the int conversion function, so
that port: {{ tpl .port $ }} becomes port: {{ int (tpl .port $) }}, ensuring the
port is always a numeric value as required by the Kubernetes HTTPRoute schema.
In `@k8s/helm/templates/ingress.yaml`:
- Around line 25-37: The template accesses the first element of
.Values.ingress.hosts using direct indexing without validating that the hosts
array exists and has at least one element. When ingress.defaultHost is set but
ingress.hosts is empty or missing, this causes a template render crash. Add a
guard condition to check that ingress.hosts is not empty (using a Helm condition
like if .Values.ingress.hosts) before attempting to access (index
.Values.ingress.hosts 0).paths. This ensures the paths range loop only executes
when the hosts array contains at least one element with a paths field.
In `@k8s/helm/templates/networking/kyverno-policy.yaml`:
- Around line 187-188: The NCCL_TOPO_FILE environment variable in the
kyverno-policy.yaml file is set to the literal placeholder string
NcclTopoFilePath instead of the actual mounted path. Change the value field from
NcclTopoFilePath to /etc/nccl/nccl-topo.xml to point to the correct location
where the NCCL topology ConfigMap is mounted, so NCCL can actually locate and
read the topology file.
In `@k8s/helm/templates/ngc-api-secret.yaml`:
- Around line 1-12: Add a validation check in the ngc-api-secret.yaml template
to fail fast when ngcAPIKey is empty and existingSecret is not provided, as
empty credentials will silently break authentication and image pull flows. Use a
Helm fail function or conditional to require ngcAPIKey when the auto-generation
path (when existingSecret is not set) is taken. Apply the same validation
pattern to k8s/helm/templates/imagepull-secret.yaml at line 4 for the
imagePullSecret path to ensure consistent protection against empty credentials
across both secret templates.
In `@k8s/helm/templates/postgres/postgres-secret.yaml`:
- Around line 10-11: The postgres-secret.yaml template at line 11 does not
validate that the PostgreSQL password is non-empty, allowing users to override
it to an empty string in their values. Add the required filter to the password
field value to enforce validation at template render time. Modify the password
line to use the required filter before the quote filter, providing a meaningful
error message that will cause the Helm template rendering to fail immediately if
the password value is empty or unset.
In `@k8s/helm/templates/proxy/envoy-service.yaml`:
- Around line 19-22: The admin port is currently always exposed through the
Service, which poses a security risk. Remove the unconditional publication of
the admin port (the block defining name: admin, port: envoyProxy.adminPort,
targetPort: admin, protocol: TCP) and instead wrap it in a conditional that
checks if envoyProxy.service.exposeAdminPort is enabled. Additionally, add a
default configuration in the values.yaml file setting
envoyProxy.service.exposeAdminPort to false to ensure the admin endpoint is not
exposed by default and only enabled when explicitly needed for controlled
in-cluster scraping scenarios.
---
Outside diff comments:
In `@services/core/jobs/README.md`:
- Around line 56-63: The Python example in the README uses the Field function
from pydantic in the MyFunctionalMicroserviceSettings class definition (lines
59-63) but does not import it. Add Field to the import statement by importing it
from the pydantic module alongside the existing BaseSettings import from
pydantic_settings to resolve the NameError that would occur at runtime.
---
Duplicate comments:
In `@k8s/helm/templates/api-env-secret-generator.yaml`:
- Around line 73-80: The api-env-keygen container and its pod are only using
security contexts from secrets.defaultEncryptionKey.generated.* paths, which
skips chart-wide security defaults. Merge the chart-wide security context values
(typically from .Values.podSecurityContext and .Values.securityContext or
similar chart-level paths) with the specific generated ones at both the pod
level (podSecurityContext) and container level (securityContext) to ensure the
hook respects global security policies. This requires updating both the
pod-level securityContext template at line 74 and the container-level
securityContext template at line 79 to include or merge with the appropriate
chart-wide defaults.
- Line 15: The automountServiceAccountToken field on line 15 is configured as
optional via a Helm value, allowing it to be set to false. However, lines 99-105
unconditionally read the service account token and CA files, which requires the
token to be mounted. Change line 15 to always set automountServiceAccountToken
to true (hardcode it as true instead of using the configurable Helm value) to
ensure the service account token is always available for the job to read the
required token and CA files.
In `@k8s/helm/templates/api-env-secret.yaml`:
- Around line 2-13: The encryption key from
Values.secrets.defaultEncryptionKey.value is rendered directly without
validation, allowing invalid or weak keys to pass through templating and only
fail later during actual encryption. Add validation logic within the first
branch (after the if condition that checks if the value exists) to verify that
the encryption key meets expected format and strength requirements, and fail the
template rendering early with a clear error message if validation fails. This
should occur before the line where the key is written to the Secret stringData.
---
Nitpick comments:
In `@k8s/helm/README.md`:
- Around line 1-25: Add a Prerequisites section at the top of the README.md file
before the existing content, listing any required knowledge or setup needed
before using the Helm chart. Then add a Next Steps section at the end of the
document with cross-links to related documentation content, such as installation
guides, configuration examples, or troubleshooting pages. These sections should
follow the standard documentation guidelines for navigation and adoption
framing.
In `@k8s/helm/templates/networking/kyverno-policy.yaml`:
- Around line 465-467: The variable `nodeGpuProduct` defined in the
kyverno-policy.yaml file is computed using jmesPath but is never referenced
elsewhere in the policy template, making it dead code. Remove the entire
`nodeGpuProduct` variable definition block (including the name, variable, and
jmesPath fields) from the OCI policy template to clean up the unused code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 80764bee-43d3-4c57-82e9-59bbe0c7d6ac
📒 Files selected for processing (82)
.github/actions/changes/action.yaml.github/workflows/ci.yaml.gitignore.pre-commit-config.yamlAGENTS.mdk8s/helm/.helmignorek8s/helm/Chart.yamlk8s/helm/README.mdk8s/helm/ci/02-empty-imagepullsecret.yamlk8s/helm/ci/03-global-security-context.yamlk8s/helm/ci/04-security-context-override.yamlk8s/helm/ci/05-enable-auth.yamlk8s/helm/ci/06-kyverno-aws.yamlk8s/helm/ci/07-kyverno-azure.yamlk8s/helm/ci/08-kyverno-gcp.yamlk8s/helm/ci/09-kyverno-oci.yamlk8s/helm/ci/10-override-image-registry.yamlk8s/helm/ci/11-setting-env.yamlk8s/helm/ci/12-topology-spread-constraints.yamlk8s/helm/ci/13-openshift.yamlk8s/helm/ci/14-pdb-min-available.yamlk8s/helm/ci/14-secretfromenv.yamlk8s/helm/ci/15-pdb-max-unavailable.yamlk8s/helm/ci/16-enable-ingress.yamlk8s/helm/ci/17-ingress-default-host.yamlk8s/helm/ci/18-enable-httproute.yamlk8s/helm/ci/19-envoy-extra-args.yamlk8s/helm/ci/20-explicit-default-encryption-key.yamlk8s/helm/files/nccl-test/entrypoint.shk8s/helm/files/nccl-test/nccl-env.shk8s/helm/files/nccl-test/nccl_test.pyk8s/helm/files/nccl-test/orchestrator.pyk8s/helm/helm-docs-template/nemo-helm-readme.md.gotmplk8s/helm/templates/NOTES.txtk8s/helm/templates/_config-render.tplk8s/helm/templates/_helpers.tplk8s/helm/templates/api-env-secret-generator.yamlk8s/helm/templates/api-env-secret-upgrade-check.yamlk8s/helm/templates/api-env-secret.yamlk8s/helm/templates/api/_helpers.tplk8s/helm/templates/api/api-deployment.yamlk8s/helm/templates/api/api-hpa.yamlk8s/helm/templates/api/api-pdb.yamlk8s/helm/templates/api/api-service.yamlk8s/helm/templates/api/api-serviceaccount.yamlk8s/helm/templates/api/api-servicemonitor.yamlk8s/helm/templates/core/_helpers.tplk8s/helm/templates/core/controller-deployment.yamlk8s/helm/templates/core/controller-role.yamlk8s/helm/templates/core/controller-service-headless.yamlk8s/helm/templates/core/controller-serviceaccount.yamlk8s/helm/templates/core/controller-servicemonitor.yamlk8s/helm/templates/core/jobs-serviceaccount.yamlk8s/helm/templates/core/shared-pvc.yamlk8s/helm/templates/httproute.yamlk8s/helm/templates/imagepull-secret.yamlk8s/helm/templates/ingress.yamlk8s/helm/templates/models-files-auth-secret.yamlk8s/helm/templates/networking/kyverno-policy.yamlk8s/helm/templates/networking/nccl-topology-configmap.yamlk8s/helm/templates/ngc-api-secret.yamlk8s/helm/templates/openshift-route.yamlk8s/helm/templates/platform-configmap.yamlk8s/helm/templates/platform-seed-job.yamlk8s/helm/templates/postgres/postgres-secret.yamlk8s/helm/templates/postgres/postgres-service.yamlk8s/helm/templates/postgres/postgres-serviceaccount.yamlk8s/helm/templates/postgres/postgres-statefulset.yamlk8s/helm/templates/proxy/_helpers.tplk8s/helm/templates/proxy/envoy-configmap.yamlk8s/helm/templates/proxy/envoy-deployment.yamlk8s/helm/templates/proxy/envoy-hpa.yamlk8s/helm/templates/proxy/envoy-service.yamlk8s/helm/templates/proxy/envoy-serviceaccount.yamlk8s/helm/templates/proxy/envoy-servicemonitor.yamlk8s/helm/templates/tests/nccl-test.yamlk8s/helm/values.yamlpyproject.tomlservices/core/jobs/README.mdservices/guardrails/callouts/README.mdtools/lint/lint-helm.shweb/packages/studio/scripts/feature-flag-matrix.ts
✅ Files skipped from review due to trivial changes (15)
- services/guardrails/callouts/README.md
- k8s/helm/ci/17-ingress-default-host.yaml
- k8s/helm/ci/16-enable-ingress.yaml
- AGENTS.md
- k8s/helm/ci/20-explicit-default-encryption-key.yaml
- k8s/helm/ci/15-pdb-max-unavailable.yaml
- .gitignore
- k8s/helm/ci/02-empty-imagepullsecret.yaml
- k8s/helm/ci/06-kyverno-aws.yaml
- k8s/helm/Chart.yaml
- k8s/helm/.helmignore
- k8s/helm/ci/10-override-image-registry.yaml
- k8s/helm/ci/13-openshift.yaml
- k8s/helm/ci/14-pdb-min-available.yaml
- k8s/helm/ci/03-global-security-context.yaml
🚧 Files skipped from review as they are similar to previous changes (24)
- k8s/helm/ci/19-envoy-extra-args.yaml
- k8s/helm/ci/07-kyverno-azure.yaml
- k8s/helm/files/nccl-test/nccl-env.sh
- tools/lint/lint-helm.sh
- k8s/helm/ci/05-enable-auth.yaml
- k8s/helm/ci/09-kyverno-oci.yaml
- .github/actions/changes/action.yaml
- k8s/helm/ci/04-security-context-override.yaml
- k8s/helm/ci/08-kyverno-gcp.yaml
- k8s/helm/ci/14-secretfromenv.yaml
- k8s/helm/ci/18-enable-httproute.yaml
- k8s/helm/templates/proxy/_helpers.tpl
- k8s/helm/templates/api/_helpers.tpl
- k8s/helm/ci/12-topology-spread-constraints.yaml
- k8s/helm/templates/_config-render.tpl
- k8s/helm/helm-docs-template/nemo-helm-readme.md.gotmpl
- pyproject.toml
- web/packages/studio/scripts/feature-flag-matrix.ts
- k8s/helm/ci/11-setting-env.yaml
- k8s/helm/templates/core/_helpers.tpl
- k8s/helm/files/nccl-test/nccl_test.py
- k8s/helm/files/nccl-test/entrypoint.sh
- k8s/helm/values.yaml
- k8s/helm/templates/_helpers.tpl
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Signed-off-by: Brooke Storm <brookes@nvidia.com>
There was a problem hiding this comment.
Two thoughts from exploring the PR w/ an agent:
-
The README has good guidance about restoring the key on upgrade, but
NOTES.txtis what users see right afterhelm install. A one-liner like "Back up the<fullname>-api-envSecret — if lost, existing encrypted platform secrets become unrecoverable" could be nice -
(moreso a question) The generator Job defaults to
docker.io/library/python:3.12-slim. Is this "normal"/okay for our charts these days, as opposed to using a mirror? I know its configurable, so we should in our own ToT, but wondering if using the docker.io image versus something else is standard (versus using I guess our nmp-api image?)
We should never default to anything mirrored or internal anymore. All default containers should be public (or aspirationally public such as the ones this repo will create. I'll poke the NOTES.txt :) |
Signed-off-by: Brooke Storm <brookes@nvidia.com>
Summary by CodeRabbit