Suggestion Description
Feature request: integrate GPU partitioning plans into AMD GPU Operator
Summary
- Introduce cluster-scoped PartitioningPlan and NodePartitioning custom resources to orchestrate AMD GPU partitioning using DCM profiles.
- Add controllers that safely drain, taint, label, verify, and restore nodes while coordinating rollouts and surfacing detailed status/conditions.
- Leverage existing AMD GPU Operator primitives (DCM ConfigMap, amd-dcm taint, dcm.amd.com/gpu-config-profile label, amd.com/* allocatable resources) to provide an end-to-end partition management workflow.
Background & Motivation
Today, Device Config Manager supports applying static partition profiles, but fleet operators still need to script the manual parts: selecting nodes, draining workloads, enforcing serial rollouts, waiting for device plugin health, and validating allocatable GPUs. Introducing PartitioningPlan/NodePartitioning controllers would automate that lifecycle so platform teams can request multi-node GPU partitioning with a single CR.
Proposed Solution
API additions
PartitioningPlan (cluster scope): High-level rollout CR, which links one or more DCM profiles to a set of nodes via label selectors. Status tracks phase, per-node summaries, and typed conditions for UIs/alerting.
NodePartitioning (cluster scope): Per-node work item owned by a plan. Responsible for orchestrating the partitioning for a single node.
PartitioningProfileSpec: Inline DCM profile metadata with optional expected allocatable resources to assert after reconciliation, e.g. expect amd.com/cpx_nps4: 63
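To make the "expected allocatable resources" assertion concrete, here is a minimal sketch in Python (the operator itself would not be written this way). The function name `verify_expected_resources` and the dict-based inputs are hypothetical and only illustrate the comparison; extended resource counts are assumed to be whole numbers encoded as strings.

```python
def verify_expected_resources(expected, allocatable):
    """Return a list of mismatch messages; an empty list means verification passed.

    Hypothetical helper: `expected` mirrors the profile's expectedResources,
    `allocatable` mirrors the node's status.allocatable map.
    """
    mismatches = []
    for name, want in expected.items():
        got = allocatable.get(name)
        if got is None:
            mismatches.append(f"{name}: expected {want}, resource not reported")
        elif int(got) != int(want):
            # Assumption: extended resources are integer counts.
            mismatches.append(f"{name}: expected {want}, got {got}")
    return mismatches


expected = {"amd.com/cpx_nps4": "63"}
partitioned_node = {"cpu": "192", "amd.com/cpx_nps4": "63"}
stale_node = {"cpu": "192", "amd.com/gpu": "8"}

print(verify_expected_resources(expected, partitioned_node))
print(verify_expected_resources(expected, stale_node))
```

Returning messages rather than a boolean would let the controller copy the reason straight into a Verified condition.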
Controller behavior
- PartitioningPlan controller
  - Discovers matching nodes, filters control plane hosts by default, and rejects ambiguous selector overlaps
  - Blocks conflicting ownership when another plan targets the same node
  - Projects desired NodePartitioning specs, computing deterministic hashes of the requested profile for drift detection
  - Aggregates per-node phase counts, emits user-facing conditions (PlanReady, RolloutProgressing/Completed/Degraded, Paused), and maintains a lightweight node status cache for dashboards
  - Cleans up stale child CRs when nodes fall out of scope, ensuring serialized ownership
- NodePartitioning controller
  - State machine drives taint/cordon, drain, DCM profile application, operator wait, verification, and cleanup
  - Dry-run mode to see the NodePartitioning resources that would get created, without impacting the driver
  - Each phase updates strongly-typed conditions (NodeCordoned, NodeTainted, DrainCompleted, ProfileApplied, OperatorReady, Verified) and only advances once prerequisites succeed
  - Verification asserts the DCM label and amd.com/* allocatable availability before untainting/uncordoning and marking success
  - Watches node events and allocatable resource deltas to retry automatically when the operator surfaces readiness
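The per-node state machine described above could be sketched roughly as follows. The condition names come from this proposal; the phase names, the `REQUIRED` mapping, and `next_phase` are hypothetical and shown only to illustrate "advance only once prerequisites succeed".

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names, in the order the proposal lists the steps.
    PENDING = "Pending"
    CORDONING = "Cordoning"
    DRAINING = "Draining"
    APPLYING = "Applying"
    WAITING_OPERATOR = "WaitingOperator"
    VERIFYING = "Verifying"
    CLEANUP = "Cleanup"
    SUCCEEDED = "Succeeded"


# Conditions that must be true before a phase may be left.
REQUIRED = {
    Phase.CORDONING: ("NodeCordoned", "NodeTainted"),
    Phase.DRAINING: ("DrainCompleted",),
    Phase.APPLYING: ("ProfileApplied",),
    Phase.WAITING_OPERATOR: ("OperatorReady",),
    Phase.VERIFYING: ("Verified",),
}
ORDER = list(Phase)


def next_phase(phase, conditions):
    """Advance one step if every prerequisite condition holds, else stay put."""
    if phase is Phase.SUCCEEDED:
        return phase
    if all(conditions.get(name, False) for name in REQUIRED.get(phase, ())):
        return ORDER[ORDER.index(phase) + 1]
    return phase


# A node that is cordoned but not yet tainted stays in the cordoning phase.
conditions = {"NodeCordoned": True}
print(next_phase(Phase.CORDONING, conditions))
conditions["NodeTainted"] = True
print(next_phase(Phase.CORDONING, conditions))
```

Because `next_phase` is a pure function of the current phase and conditions, a reconcile triggered by a node event or an allocatable change can simply call it again, which matches the retry behavior described in the list.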
Custom Resource Examples
PartitioningPlan (apiVersion: amd.com/v1alpha1)
apiVersion: amd.com/v1alpha1
kind: PartitioningPlan
metadata:
  name: mi300-partitioning
spec:
  dryRun: false
  rollout:
    maxParallel: 1 # run at most one partitioning process at a time
    maxUnavailable: 1 # out of the targeted nodes, at most one can be unavailable at any time
    excludeControlPlane: true # if false, potentially include control plane nodes, default to true
  rules:
    - description: "Partitioned MI300X nodes"
      selector:
        matchLabels:
          amd.com/gpu.product-name: AMD_Instinct_MI300X_OAM
      profile:
        expectedResources:
          amd.com/cpx_nps4: "63"
        dcmProfileName: "cpx"
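For the profile in the example above, the deterministic hash used for drift detection might be computed along these lines. This is a sketch under assumptions: canonical JSON plus SHA-256 is one common choice, not necessarily what the controller would use, and `profile_hash` and the 16-character truncation are hypothetical.

```python
import hashlib
import json


def profile_hash(profile):
    """Hash a profile dict so that key order does not affect the result."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


profile = {
    "dcmProfileName": "cpx",
    "expectedResources": {"amd.com/cpx_nps4": "63"},
}
# Same content, different key order: must hash identically.
reordered = {
    "expectedResources": {"amd.com/cpx_nps4": "63"},
    "dcmProfileName": "cpx",
}
# Hypothetical edit to the plan: a different profile name means drift.
changed = dict(profile, dcmProfileName="some-other-profile")

print(profile_hash(profile) == profile_hash(reordered))
print(profile_hash(profile) == profile_hash(changed))
```

Storing the hash on each NodePartitioning would let the plan controller detect a changed profile by comparing two short strings instead of diffing specs.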
Acceptance Criteria / Work Items
Operating System
No response
GPU
No response
ROCm Component
No response