Skip to content

Add OpenShift Container Platform (OCP) as a service type#1380

Open
kaponco wants to merge 17 commits into
NVIDIA:mainfrom
kaponco:feat/ocp-helm
Open

Add OpenShift Container Platform (OCP) as a service type#1380
kaponco wants to merge 17 commits into
NVIDIA:mainfrom
kaponco:feat/ocp-helm

Conversation

@kaponco

@kaponco kaponco commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Add OpenShift Container Platform (OCP) as a first-class service type, deploying operators via OLM using the
existing Helm bundler pipeline.

Motivation / Context

AICR currently supports EKS, GKE and more cluster providers. OCP customers need GPU-accelerated recipes but
OCP deploys operators through OLM (Operator Lifecycle Manager) rather than direct Helm installs. This PR
models each OCP operator as two in-tree Helm chart phases — an OLM installer (Subscription + OperatorGroup)
and a CR configuration phase — with Chainsaw readiness gates ensuring operators are available before CRs are
applied.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

  • Only basic components are currently maintained, as agreed to secure the design. More will be added in next PRs
  • Cannot overlay ocp on top of other components as they use remote helm repos at the registry level
  • Template rendering for in-tree manifests is taken as is. The rendering engine resolves the template at bundle time. Therefor the $v. variables in the manifests.
  • Trying to have similar value files for OCP as the higher overlays where possible - so it will be easier to compare

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
  • Unit tests: All passed with race detector, coverage across packages ranging 64–100%
  • Lint: golangci-lint 0 issues, yamllint clean, license headers OK, AGENTS.md in sync, doc conventions pass, chart-version pins verified
  • E2E (chainsaw): 22/22 passed, 0 failed, 0 skipped
  • Vulnerability scan: 2 known low/medium findings in k8s.io/kubernetes v1.36.1 (pre-existing, not related to this change)

Everything is green — ready to commit.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@kaponco kaponco requested review from a team as code owners June 16, 2026 10:42
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds OpenShift Container Platform (ocp) support across service criteria, OpenAPI schemas, validator behavior, registry entries, overlays, Helm templates, readiness gates, health checks, tests, and documentation. It also adds a Helm quote template helper and updates related tests and docs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • mchmarny
  • ArangoGutierrez
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding OCP as a service type.
Description check ✅ Passed The description is aligned with the changes and explains the new OCP service type and OLM-based deployment approach.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/010-ocp-helm.md`:
- Line 71: Add language identifiers to all unlabeled fenced code blocks to
comply with markdownlint MD040. In docs/design/010-ocp-helm.md at lines 71-71,
add appropriate language tags (such as yaml, go, or text) after the opening
triple backticks for each fenced code block. Similarly, in
docs/integrator/openshift.md at lines 25-25, add language identifiers to all
unlabeled fenced code blocks. Each opening fence should specify the code
language (e.g., ```yaml instead of just ```).
- Line 5: The ADR status in the header line currently states "Proposed —
2026-06-07 (design-only; not implemented)" but this PR is shipping the actual
implementation. Update this status line to change "Proposed" to "Implemented"
(or similar status indicating the feature is complete), remove the
"(design-only; not implemented)" qualifier, and update the date to reflect when
the implementation was completed. This ensures the architecture documentation
accurately reflects the current state of the OCP support feature.

In `@pkg/recipe/doc.go`:
- Line 71: The doc block in pkg/recipe/doc.go has been updated with
CriteriaServiceOCP documentation at one location, but other service enumeration
lists within the same doc block (such as the initial service list and other
enumeration descriptions) still lack the corresponding ocp entry and
description. Update all service enumeration lists and descriptions throughout
the doc block to consistently include CriteriaServiceOCP with its "Red Hat
OpenShift Container Platform" description, ensuring the package documentation
remains internally consistent across all service listings.

In `@recipes/components/gpu-operator-ocp-olm/readiness.yaml`:
- Around line 34-50: The assertions in the readiness checks lack explicit
resource names in their metadata, causing them to match any
ClusterServiceVersion or Deployment in the gpu-operator namespace with the
matching status conditions. This can cause false positives if unrelated
resources satisfy the conditions first. Add metadata.name constraints to both
assertions: add a name field to the ClusterServiceVersion resource metadata
(around line 37) and add a name field to the Deployment resource metadata in the
operator-deployment-available test (around line 47) to ensure the assertions
target only the specific GPU operator resources, not any resource that happens
to match the status conditions.

In `@recipes/components/gpu-operator-ocp/values.yaml`:
- Around line 57-58: The dcgmExporter.config.data section in values.yaml
contains duplicate DCGM metric IDs that will cause initialization failure:
DCGM_FI_PROF_PCIE_TX_BYTES appears at both line 57 and line 96, and
DCGM_FI_PROF_PCIE_RX_BYTES appears at both line 58 and line 97. Remove the
duplicate definitions of these two metric IDs and keep only one canonical
definition per metric. Review the descriptions at both locations to determine
which version is correct (based on whether they are NVML-sourced or DCP metrics)
and retain only that single definition while removing the redundant entries.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b6b26084-e126-4918-8535-29d8ac6987cb

📥 Commits

Reviewing files that changed from the base of the PR and between 12b86df and d774791.

📒 Files selected for processing (40)
  • .claude/skills/creating-slide-decks/skeleton.html
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • cmd/gate/chainsaw-config.yaml
  • docs/contributor/recipe.md
  • docs/design/010-ocp-helm.md
  • docs/integrator/openshift.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/manifest/render.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/checks/gpu-operator-ocp-olm/readiness.yaml
  • recipes/checks/network-operator-ocp-olm/readiness.yaml
  • recipes/checks/nfd-ocp-olm/readiness.yaml
  • recipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
  • recipes/components/gpu-operator-ocp-olm/manifests/subscription.yaml
  • recipes/components/gpu-operator-ocp-olm/readiness.yaml
  • recipes/components/gpu-operator-ocp-olm/values.yaml
  • recipes/components/gpu-operator-ocp/manifests/clusterpolicy.yaml
  • recipes/components/gpu-operator-ocp/manifests/dcgm-exporter-configmap.yaml
  • recipes/components/gpu-operator-ocp/values.yaml
  • recipes/components/network-operator-ocp-olm/manifests/operatorgroup.yaml
  • recipes/components/network-operator-ocp-olm/manifests/subscription.yaml
  • recipes/components/network-operator-ocp-olm/readiness.yaml
  • recipes/components/network-operator-ocp-olm/values.yaml
  • recipes/components/network-operator-ocp/manifests/nicclusterpolicy.yaml
  • recipes/components/network-operator-ocp/values.yaml
  • recipes/components/nfd-ocp-olm/manifests/operatorgroup.yaml
  • recipes/components/nfd-ocp-olm/manifests/subscription.yaml
  • recipes/components/nfd-ocp-olm/readiness.yaml
  • recipes/components/nfd-ocp-olm/values.yaml
  • recipes/components/nfd-ocp/manifests/nodefeaturediscovery.yaml
  • recipes/components/nfd-ocp/values.yaml
  • recipes/overlays/ocp-inference.yaml
  • recipes/overlays/ocp-training.yaml
  • recipes/overlays/ocp.yaml
  • recipes/registry.yaml
  • validators/performance/nccl_all_reduce_bw_constraint.go

Comment thread docs/design/010-ocp-helm.md Outdated
Comment thread docs/design/010-ocp-helm.md Outdated
Comment thread pkg/recipe/doc.go
Comment thread recipes/components/gpu-operator-ocp-olm/readiness.yaml Outdated
Comment thread recipes/components/gpu-operator-ocp/values.yaml Outdated
@github-actions

Copy link
Copy Markdown
Contributor

@kaponco this PR now has merge conflicts with main. Please rebase to resolve them.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough ADR and the careful values-mirroring work — modeling OLM install + CR as two in-tree KindLocalHelm phases is a clean fit.

Architecturally this lands well inside scope: no custom deployer, OLM resources flow through the existing KindLocalHelm + readiness-hooks pipeline, and the ADR explicitly rejected the out-of-scope direct/kubectl apply deployer. That's the right call.

Two things block merge: (1) the entire new rendering path is untested — ADR-010 promises a tests/chainsaw/cli/bundle-ocp/ structure+content test that isn't in this PR, so 6 components, nested-value rendering, the new quote func, and hook ordering all ship unverified; and (2) ocp-inference inherits training-tuned base values (MIG Manager + GDRCopy enabled). The rest are mediums (component-catalog entries, duplicated readiness.yaml) and a couple of low nits, all inline.

I verified the bundle-time rendering model in pkg/manifest/render.go (it injects .Release.Service/.Release.Namespace and nests values under the component name), so the mixed $v + .Release + helm.sh/hook usage is correct, not a bug.

Comment thread recipes/overlays/ocp.yaml
Comment thread recipes/registry.yaml
Comment thread recipes/components/gpu-operator-ocp-olm/readiness.yaml
Comment thread recipes/components/gpu-operator-ocp/values.yaml
Comment thread recipes/components/gpu-operator-ocp/manifests/dcgm-exporter-configmap.yaml Outdated
Comment thread pkg/manifest/render.go
Comment thread recipes/overlays/ocp.yaml Outdated
@kaponco kaponco changed the title Openshift Container Platform (OCP) service Add OpenShift Container Platform (OCP) as a service type Jun 23, 2026
kaponco added 10 commits June 23, 2026 14:53
…ree Helm charts (NVIDIA#566)

Add OCP as a new service type with OLM-based operator deployment using
the existing KindLocalHelm bundler pipeline. Each OCP operator is modeled
as two in-tree Helm chart components: an OLM installer phase and a CR
configuration phase, with readiness gates between them.

- Add CriteriaServiceOCP constant with "openshift" alias to criteria parser
- Register six OCP components (OLM + CR pairs for NFD and GPU Operator)
- Create OCP overlays (base, training, inference) that disable base
  components not applicable to the platform
- Add Helm-templated manifests for OLM resources (OperatorGroup,
  Subscription) and operator CRs (NodeFeatureDiscovery, ClusterPolicy)
- Add Chainsaw readiness gates that assert CSV phase Succeeded before
  CR deployment proceeds
- Update OpenAPI spec, CLI docs, API docs, and issue templates with ocp
  service value
- Add exhaustive switch coverage for CriteriaServiceOCP in validators
- Update ADR-010 design document with implementation decisions and
  overlay merge constraints

Signed-off-by: Shai Kapon <skapon@redhat.com>
…k Operator to base overlay (NVIDIA#566)

Improve OCP recipe reliability and coverage by strengthening readiness gates, consolidating networking into the base overlay, and adding NIM Operator support for inference workloads.

- Add Deployment availability checks to all OLM readiness gates (GPU, Network, NFD) to prevent premature CR creation before operator pods are ready
- Register k8s-nim-operator-ocp-olm component and add ocp-inference-nim overlay for NIM-based inference on OpenShift
- Expand Network Operator NicClusterPolicy values with upgrade policy, health probes, RDMA shared device plugin config, and NV-IPAM settings
- Correct GPU Operator namespace from nvidia-gpu-operator to gpu-operator across values, readiness gates, and registry
- Update design doc and added openshift integration specification

Signed-off-by: Shai Kapon <skapon@redhat.com>
…IA#566)

Restructure OCP Helm component values to use flat top-level keys instead of nesting everything under a  wrapper. This aligns the values files with their upstream Helm chart counterparts for easier side-by-side comparison
- Flatten gpu-operator-ocp, network-operator-ocp, and nfd-ocp values from nested  structure to top-level keys matching upstream charts
- Expand ClusterPolicy, NicClusterPolicy, and NodeFeatureDiscovery CR templates to explicitly render each field instead of
- Add  template function to pkg/manifest/render.go for Helm compatibility in CR templates
- Add DCGM exporter ConfigMap manifest and comprehensive metrics config to gpu-operator-ocp (dcgm-exporter-configmap.yaml)
- Remove values-training.yaml — gdrcopy and migManager are now enabled in the base values, eliminating the training-specific override file
- Update ocp-training overlay to drop the componentRef that pointed at the removed values-training.yaml
- Register dcgm-exporter-configmap.yaml in the ocp overlay manifestFiles

Signed-off-by: Shai Kapon <skapon@redhat.com>
Signed-off-by: Shai Kapon <skapon@redhat.com>
  unsupported DCGM PCIe metrics

Signed-off-by: Shai Kapon <skapon@redhat.com>
…duleType (NVIDIA#566)

  - Add OPERATOR_NAMESPACE env var to OLM Subscription config so the GPU Operator resolves
    its install namespace at runtime instead of falling back to a hard-coded value
  - Add driver.kernelModuleType field to ClusterPolicy template (defaultsto auto)
  - Add clarifying comment on operatorGroup targetNamespace choice

Signed-off-by: Shai Kapon <skapon@redhat.com>
…#566)

- Update health-check.yaml for OCP components; OLM checks are no-ops, CR checks assert reconciliation
- Simplify OLM readiness gates to CSV Succeeded assertion only
- Add OCP components to component-catalog.md
- Use {{ .Release.Namespace }} in dcgm-exporter-configmap template
- Fix missing trailing newlines, add quote template function tests
- Add chainsaw bundle-ocp test

Signed-off-by: Shai Kapon <skapon@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/013-ocp-helm.md`:
- Line 1: The document title has an ADR identifier mismatch that should be
corrected to match the file name. Update the heading in the top-level markdown
title so the ADR number aligns with docs/design/013-ocp-helm.md, and keep the
rest of the title text unchanged.

In `@docs/user/cli-reference.md`:
- Line 842: The markdown blockquote at this spot has an extra blank line inside
the quoted section, triggering MD028. Update the blockquote in the CLI reference
so the quote content stays contiguous without an empty quoted line, and verify
the surrounding markdown renders cleanly with the same quote structure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 11920244-96c4-47f4-9ab6-ecaa71a830dd

📥 Commits

Reviewing files that changed from the base of the PR and between 53676ff and 30e11ba.

📒 Files selected for processing (20)
  • .github/ISSUE_TEMPLATE/bug_report.yml
  • api/aicr/v1/server.yaml
  • docs/contributor/recipe.md
  • docs/design/013-ocp-helm.md
  • docs/integrator/openshift.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • docs/user/component-catalog.md
  • pkg/manifest/render.go
  • pkg/manifest/render_test.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • recipes/checks/gpu-operator-ocp-olm/health-check.yaml
  • recipes/checks/gpu-operator-ocp/health-check.yaml
  • recipes/checks/network-operator-ocp-olm/health-check.yaml
  • recipes/checks/network-operator-ocp/health-check.yaml
  • recipes/checks/nfd-ocp-olm/health-check.yaml
  • recipes/checks/nfd-ocp/health-check.yaml
  • recipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
💤 Files with no reviewable changes (12)
  • recipes/components/gpu-operator-ocp-olm/manifests/operatorgroup.yaml
  • recipes/checks/nfd-ocp-olm/health-check.yaml
  • recipes/checks/gpu-operator-ocp-olm/health-check.yaml
  • pkg/manifest/render.go
  • recipes/checks/network-operator-ocp/health-check.yaml
  • recipes/checks/gpu-operator-ocp/health-check.yaml
  • recipes/checks/network-operator-ocp-olm/health-check.yaml
  • pkg/recipe/doc.go
  • recipes/checks/nfd-ocp/health-check.yaml
  • pkg/recipe/criteria_test.go
  • pkg/manifest/render_test.go
  • pkg/recipe/criteria.go

coderabbitai[bot]

This comment was marked as resolved.

Signed-off-by: Shai Kapon <skapon@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 2738-2743: The `findNumberedDir` helper is hiding the real failure
and contains an unreachable suffix check. Update the `os.ReadDir(outDir)` call
to handle and return its error instead of discarding it, and simplify the
directory-name match in the loop by removing the impossible `"-"+name+"/"`
branch so only the valid `strings.HasSuffix(e.Name(), "-"+name)` check remains.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e866f5c0-4432-4dee-941e-60372a19f392

📥 Commits

Reviewing files that changed from the base of the PR and between 30e11ba and 555c414.

📒 Files selected for processing (1)
  • pkg/bundler/bundler_test.go

Comment thread pkg/bundler/bundler_test.go Outdated
@kaponco kaponco requested a review from mchmarny June 24, 2026 10:14
@github-actions

Copy link
Copy Markdown
Contributor

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Recipe evidence check

Registry change: scoped to recipes that reference a changed component
entry in recipes/registry.yaml (not every leaf).

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

@mchmarny

Copy link
Copy Markdown
Member

Functionally the PR is in good shape now. Fix the linting errors (make sure make qualify passes locally). and while you there resolve the CodeRabbit issue #1380 (comment). After that we should be good to go. Thanks for you sticking with this, @kaponco

Signed-off-by: Shai Kapon <skapon@redhat.com>
mchmarny
mchmarny previously approved these changes Jun 24, 2026
@mchmarny mchmarny self-requested a review June 24, 2026 15:28
@mchmarny

mchmarny commented Jun 24, 2026

Copy link
Copy Markdown
Member

@kaponco the KWOK failures are related to the new OCP overlays (ocp, ocp-inference, ocp-training) across the Tier 1 deployer matrix. These recipes render OpenShift/OLM APIs (OperatorGroup, Subscription, OCP operator CRs), but the KWOK lane runs a plain Kind/KWOK cluster without those CRDs, so Helm/Argo CD/Flux all fail on API discovery rather than deployer behavior.

Recommended direction: either mark/exclude OCP overlays as not KWOK-testable, or add a separate OCP/OLM-aware validation lane. Installing fake CRDs in KWOK would only make sense if the intended coverage is render/apply smoke, not real operator readiness.

…indNumberedDir

Signed-off-by: Shai Kapon <skapon@redhat.com>
@mchmarny

Copy link
Copy Markdown
Member

@kaponco I noticed all of your commits on this PR are not cryptographically signed again. Please rview the CONTRIBUTING.md#nvidia-org-members-and-automation and amend these commits. It may also be a good idea to squash these 15 commits into logical groups.

@kaponco

kaponco commented Jun 25, 2026

Copy link
Copy Markdown
Author

@mchmarny yes 2 of the 15 I missed the -s flag. Anyway I ll squash. Can I squash everything to one ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants