feat(bundler): support dynamic toleration injection for pre-built OCI bundles#1471
feat(bundler): support dynamic toleration injection for pre-built OCI bundles#1471mohityadav8 wants to merge 14 commits into
Conversation
- Add NO_COLOR-aware color helpers (_ok, _fail, _warn_line) - Add _step_header/ok/fail/retry for per-component section framing - Show [N/M] progress counter and manual copy-paste command per step - Color-code pre-flight pass/fail and final deployment summary Closes NVIDIA#1362
Signed-off-by: Mohit Yadav <ymohit799057@gmail.com>
Signed-off-by: Mohit Yadav <ymohit799057@gmail.com>
- Re-indent attempt/while/done block inside for-loop body - Add env var caveat to Manual: step hint with printf -v quoting - Regenerate all 6 testdata goldens
When a toleration path is declared via --dynamic, skip baking in the CLI-supplied value during applyNodeSchedulingOverrides. Merging dynamic paths into policy.optOut causes the injection to be skipped, so cluster-values.yaml carries an empty placeholder instead of a baked-in toleration. Operators can then inject additional taints at install time without rebuilding the OCI bundle. Add dynamicPathSetFor() helper that resolves --dynamic override keys to component names via the registry and returns the declared path set. Closes NVIDIA#1371
|
Welcome to AICR, @mohityadav8! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
📝 WalkthroughWalkthroughThe bundler now filters enabled components inline, rejects bundles when none remain enabled, and removes the old filtering helper. It also resolves dynamic value paths from the component registry, adds them to scheduling opt-out, and keeps those paths unset during node scheduling overrides. Agentgateway handling now emits warnings instead of rewriting source ranges. Tests cover alias-based dynamic path resolution and toleration opt-out behavior. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 2477-2544: The new tests in TestDynamicPathSetFor and
TestDynamicTolerationPathExcludedFromBakeIn use spaces instead of the file’s
tab-based Go formatting, which will fail fmt-check. Reformat the added test
blocks with gofmt (or make tidy) so the indentation matches the surrounding
code, and keep the test logic in place around dynamicPathSetFor,
computeSchedulingPathPolicy, and applyNodeSchedulingOverrides.
In `@pkg/bundler/bundler.go`:
- Around line 1143-1146: The registry-load failure in
applyNodeSchedulingOverrides is being swallowed when
GetComponentRegistryFor(provider) returns an error, which can silently disable
dynamic opt-out behavior. Update the applyNodeSchedulingOverrides path in
bundler.go to keep returning nil on error but also add a slog.Debug log,
matching the existing GetComponentRegistryFor failure handling, so the failure
is visible without changing behavior.
- Around line 1109-1132: Remove the duplicated storage-class injection block in
bundler.go within the component handling logic: the second copy of the
sc/comp.GetStorageClassPaths()/component.ApplyMapOverrides sequence is redundant
and should be deleted, leaving only the original implementation. While editing,
ensure the remaining code in that area uses the existing tab-based Go formatting
so the bundler logic around getValueOverridesForComponent and ApplyMapOverrides
still passes fmt checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 229b45ed-3222-4948-adcd-3c26498a11b7
📒 Files selected for processing (9)
pkg/bundler/bundler.gopkg/bundler/bundler_test.gopkg/bundler/deployer/helm/templates/deploy.sh.tmplpkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.shpkg/bundler/deployer/helm/testdata/manifest_only/deploy.shpkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.shpkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.shpkg/bundler/deployer/helm/testdata/nodewright_present/deploy.shpkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
There was a problem hiding this comment.
Thanks for this — the core approach is clean: merging dynamic paths into policy.optOut and letting the existing filterPaths machinery skip injection is exactly the right seam, no churn to applyNodeSchedulingOverrides, and optOut is always initialized so the merge is panic-safe.
Blockers before this can merge:
gofmt failure (inline) pkg/bundler/bundler_test.go is space-indented; gofmt -l flags it, so make qualify fails.
One medium concern inline on dynamicPathSetFor (silent fail-open on bad --dynamic keys), plus two nits:
- "empty placeholder" (PR description + the comment at the
optOutmerge): the path ends up absent from values, not set to an empty placeholder — the test correctly asserts absence, but the wording overstates it. - Worth a one-line note in
docs/user/bundling.mdthat--dynamicpaths are excluded from scheduling bake-in, since that's the user-visible behavior this enables.
None of the substance is in question — once the rebase shrinks the diff and gofmt passes, this should be quick to re-review.
- Add slog.Warn when GetByOverrideKey returns nil in dynamicPathSetFor so gaps in upstream validation surface immediately rather than silently baking in a toleration the user expected to leave dynamic - Fix gofmt tab indentation in new test functions
|
@mchmarny |
|
|
- Delete duplicate space-indented storage-class block causing gofmt failure - Fix optOut merge comment: path is absent from values, not an empty placeholder - Add note to docs/user/bundling.md on --dynamic scheduling exclusion
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler.go`:
- Around line 228-252: The new filtering loops in bundler.go ignore cancellation
while iterating over recipeResult.ComponentRefs and DeploymentOrder, so add
ctx.Err() checks inside these loops and return early when the context is
canceled, consistent with extractComponentValues. Update the filtering logic in
the bundle-building path to stop processing once Make’s ctx is done, using the
existing ctx passed through the bundling flow and the same cancellation pattern
used elsewhere in the bundle work.
- Around line 238-244: The re-enable path in bundler logic still carries the
recipe-level enabled override into value generation, so a component restored by
`--set <component>:enabled=true` can still render with stale disabled state.
Update the enabled-ref handling in `bundler.go` where refs are added to
`enabledRefs`/`enabledSet` so that `enabled` is stripped from any ref that was
re-enabled before later value extraction, matching the existing cleanup done for
scalar `--set` values.
- Around line 1265-1270: The source-range check in the bundler helper that
iterates over items and calls netutil.IsAnySourceCIDR is treating malformed
CIDRs as scoped, so invalid values like not-a-cidr bypass the warning. Update
that logic to validate each entry first using net.ParseCIDR (or equivalent)
before calling IsAnySourceCIDR, and return false on parse failure; add the net
import if it is not already used. Keep the behavior in the same helper so the
warning is only suppressed for valid non-any source CIDRs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 29502f5a-8943-4072-9963-1667ef3cf757
📒 Files selected for processing (1)
pkg/bundler/bundler.go
| for _, ref := range recipeResult.ComponentRefs { | ||
| setEnabled, ok, overrideErr := b.getSetEnabledOverride(ref.Name, recipeResult.DataProvider()) | ||
| if overrideErr != nil { | ||
| return nil, overrideErr | ||
| } | ||
| if ok { | ||
| if !setEnabled { | ||
| slog.Info("skipping component disabled via --set", "component", ref.Name) | ||
| continue | ||
| } | ||
| // --set enabled=true overrides recipe-level disabled | ||
| } else if !ref.IsEnabled() { | ||
| slog.Info("skipping disabled component", "component", ref.Name) | ||
| continue | ||
| } | ||
| enabledRefs = append(enabledRefs, ref) | ||
| enabledSet[ref.Name] = struct{}{} | ||
| } | ||
|
|
||
| // Filter DeploymentOrder to match enabled components | ||
| filteredOrder := make([]string, 0, len(recipeResult.DeploymentOrder)) | ||
| for _, name := range recipeResult.DeploymentOrder { | ||
| if _, ok := enabledSet[name]; ok { | ||
| filteredOrder = append(filteredOrder, name) | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Honor cancellation in the new filtering loops.
These loops run under Make's ctx but continue after cancellation; add ctx.Err() checks like extractComponentValues does before later bundle work proceeds.
As per coding guidelines, Always check ctx.Done() in loops and long operations to respect context cancellation.
Proposed fix
for _, ref := range recipeResult.ComponentRefs {
+ if err := ctx.Err(); err != nil {
+ return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled during component filtering", err)
+ }
setEnabled, ok, overrideErr := b.getSetEnabledOverride(ref.Name, recipeResult.DataProvider())
if overrideErr != nil {
return nil, overrideErr
@@
// Filter DeploymentOrder to match enabled components
filteredOrder := make([]string, 0, len(recipeResult.DeploymentOrder))
for _, name := range recipeResult.DeploymentOrder {
+ if err := ctx.Err(); err != nil {
+ return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled during deployment order filtering", err)
+ }
if _, ok := enabledSet[name]; ok {
filteredOrder = append(filteredOrder, name)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for _, ref := range recipeResult.ComponentRefs { | |
| setEnabled, ok, overrideErr := b.getSetEnabledOverride(ref.Name, recipeResult.DataProvider()) | |
| if overrideErr != nil { | |
| return nil, overrideErr | |
| } | |
| if ok { | |
| if !setEnabled { | |
| slog.Info("skipping component disabled via --set", "component", ref.Name) | |
| continue | |
| } | |
| // --set enabled=true overrides recipe-level disabled | |
| } else if !ref.IsEnabled() { | |
| slog.Info("skipping disabled component", "component", ref.Name) | |
| continue | |
| } | |
| enabledRefs = append(enabledRefs, ref) | |
| enabledSet[ref.Name] = struct{}{} | |
| } | |
| // Filter DeploymentOrder to match enabled components | |
| filteredOrder := make([]string, 0, len(recipeResult.DeploymentOrder)) | |
| for _, name := range recipeResult.DeploymentOrder { | |
| if _, ok := enabledSet[name]; ok { | |
| filteredOrder = append(filteredOrder, name) | |
| } | |
| for _, ref := range recipeResult.ComponentRefs { | |
| if err := ctx.Err(); err != nil { | |
| return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled during component filtering", err) | |
| } | |
| setEnabled, ok, overrideErr := b.getSetEnabledOverride(ref.Name, recipeResult.DataProvider()) | |
| if overrideErr != nil { | |
| return nil, overrideErr | |
| } | |
| if ok { | |
| if !setEnabled { | |
| slog.Info("skipping component disabled via --set", "component", ref.Name) | |
| continue | |
| } | |
| // --set enabled=true overrides recipe-level disabled | |
| } else if !ref.IsEnabled() { | |
| slog.Info("skipping disabled component", "component", ref.Name) | |
| continue | |
| } | |
| enabledRefs = append(enabledRefs, ref) | |
| enabledSet[ref.Name] = struct{}{} | |
| } | |
| // Filter DeploymentOrder to match enabled components | |
| filteredOrder := make([]string, 0, len(recipeResult.DeploymentOrder)) | |
| for _, name := range recipeResult.DeploymentOrder { | |
| if err := ctx.Err(); err != nil { | |
| return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled during deployment order filtering", err) | |
| } | |
| if _, ok := enabledSet[name]; ok { | |
| filteredOrder = append(filteredOrder, name) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/bundler/bundler.go` around lines 228 - 252, The new filtering loops in
bundler.go ignore cancellation while iterating over recipeResult.ComponentRefs
and DeploymentOrder, so add ctx.Err() checks inside these loops and return early
when the context is canceled, consistent with extractComponentValues. Update the
filtering logic in the bundle-building path to stop processing once Make’s ctx
is done, using the existing ctx passed through the bundling flow and the same
cancellation pattern used elsewhere in the bundle work.
Source: Coding guidelines
| // --set enabled=true overrides recipe-level disabled | ||
| } else if !ref.IsEnabled() { | ||
| slog.Info("skipping disabled component", "component", ref.Name) | ||
| continue | ||
| } | ||
| enabledRefs = append(enabledRefs, ref) | ||
| enabledSet[ref.Name] = struct{}{} |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Strip enabled from re-enabled refs before value extraction.
Line 238 allows --set <component>:enabled=true to include a recipe-disabled component, but Line 243 appends the original ref with Overrides["enabled"]=false. Since only scalar --set strips enabled later, that stale recipe toggle can be rendered into chart values for the component the user re-enabled.
Proposed fix
} else if !ref.IsEnabled() {
slog.Info("skipping disabled component", "component", ref.Name)
continue
}
+ if len(ref.Overrides) > 0 {
+ if _, hasEnabled := ref.Overrides[config.ComponentEnabledKey]; hasEnabled {
+ cleaned := make(map[string]any, len(ref.Overrides)-1)
+ for k, v := range ref.Overrides {
+ if k == config.ComponentEnabledKey {
+ continue
+ }
+ cleaned[k] = v
+ }
+ ref.Overrides = cleaned
+ }
+ }
enabledRefs = append(enabledRefs, ref)
enabledSet[ref.Name] = struct{}{}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // --set enabled=true overrides recipe-level disabled | |
| } else if !ref.IsEnabled() { | |
| slog.Info("skipping disabled component", "component", ref.Name) | |
| continue | |
| } | |
| enabledRefs = append(enabledRefs, ref) | |
| enabledSet[ref.Name] = struct{}{} | |
| // --set enabled=true overrides recipe-level disabled | |
| } else if !ref.IsEnabled() { | |
| slog.Info("skipping disabled component", "component", ref.Name) | |
| continue | |
| } | |
| if len(ref.Overrides) > 0 { | |
| if _, hasEnabled := ref.Overrides[config.ComponentEnabledKey]; hasEnabled { | |
| cleaned := make(map[string]any, len(ref.Overrides)-1) | |
| for k, v := range ref.Overrides { | |
| if k == config.ComponentEnabledKey { | |
| continue | |
| } | |
| cleaned[k] = v | |
| } | |
| ref.Overrides = cleaned | |
| } | |
| } | |
| enabledRefs = append(enabledRefs, ref) | |
| enabledSet[ref.Name] = struct{}{} |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/bundler/bundler.go` around lines 238 - 244, The re-enable path in bundler
logic still carries the recipe-level enabled override into value generation, so
a component restored by `--set <component>:enabled=true` can still render with
stale disabled state. Update the enabled-ref handling in `bundler.go` where refs
are added to `enabledRefs`/`enabledSet` so that `enabled` is stripped from any
ref that was re-enabled before later value extraction, matching the existing
cleanup done for scalar `--set` values.
| for _, r := range items { | ||
| if !netutil.IsValidCIDR(r) { | ||
| return exposureInvalid, fmt.Sprintf("%q is not a valid CIDR", r) | ||
| } | ||
| if netutil.IsAnySourceCIDR(r) { | ||
| hasAnySource = true | ||
| return false | ||
| } | ||
| } | ||
| if hasAnySource { | ||
| return exposureOpen, "" | ||
| } | ||
| return exposureScoped, "" | ||
| return true |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Reject malformed CIDRs before treating source ranges as scoped.
netutil.IsAnySourceCIDR returns false for parse errors, so allowedSourceRanges: ["not-a-cidr"] reaches Line 1270 and suppresses the warning even though the value is invalid.
Proposed fix
for _, r := range items {
+ if _, _, err := net.ParseCIDR(strings.TrimSpace(r)); err != nil {
+ return false
+ }
if netutil.IsAnySourceCIDR(r) {
return false
}
}Add net to the imports if it is not already present.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for _, r := range items { | |
| if !netutil.IsValidCIDR(r) { | |
| return exposureInvalid, fmt.Sprintf("%q is not a valid CIDR", r) | |
| } | |
| if netutil.IsAnySourceCIDR(r) { | |
| hasAnySource = true | |
| return false | |
| } | |
| } | |
| if hasAnySource { | |
| return exposureOpen, "" | |
| } | |
| return exposureScoped, "" | |
| return true | |
| for _, r := range items { | |
| if _, _, err := net.ParseCIDR(strings.TrimSpace(r)); err != nil { | |
| return false | |
| } | |
| if netutil.IsAnySourceCIDR(r) { | |
| return false | |
| } | |
| } | |
| return true |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/bundler/bundler.go` around lines 1265 - 1270, The source-range check in
the bundler helper that iterates over items and calls netutil.IsAnySourceCIDR is
treating malformed CIDRs as scoped, so invalid values like not-a-cidr bypass the
warning. Update that logic to validate each entry first using net.ParseCIDR (or
equivalent) before calling IsAnySourceCIDR, and return false on parse failure;
add the net import if it is not already used. Keep the behavior in the same
helper so the warning is only suppressed for valid non-any source CIDRs.
Summary
Closes #1371
When AICR generates an OCI bundle, GPU node tolerations are baked in at
build time via
--accelerated-node-toleration. Once built, there was noway to inject additional tolerations at install time without rebuilding
the entire bundle — a blocker when customers add arbitrary taints to GPU
nodes (e.g.
reserved-by:NoScheduleduring application upgrades).What changed
pkg/bundler/bundler.goAdded
dynamicPathSetFor(componentName, provider)— resolves--dynamicoverride keys to component names via the registry andreturns the declared path set for that component.
In
extractComponentValues(), dynamic paths are merged intopolicy.optOutbeforeapplyNodeSchedulingOverridesruns. Theexisting
filterPaths()machinery already skips injection for opt-outpaths — so no changes to
applyNodeSchedulingOverridesitself wereneeded. The result:
cluster-values.yamlgets an empty placeholderinstead of a baked-in toleration value.
pkg/bundler/bundler_test.goTestDynamicPathSetFor— verifies the helper correctly resolvesoverride keys to component names and scopes paths per component.
TestDynamicTolerationPathExcludedFromBakeIn— verifies that when--dynamic gpuoperator:daemonsets.tolerationsis set alongside--accelerated-node-toleration, the toleration is NOT written intocomponent values.
How to use
daemonsets.tolerationswill be an empty placeholder incluster-values.yaml. At install time, inject the taint:Testing