Environment
- Slurm: Deployed on Kubernetes via Slinky Project
- Slurm Image: ghcr.io/slinkyproject/slurmd-pyxis:25.11-ubuntu24.04 (includes Enroot + Pyxis)
- nvidia-container-toolkit: v1.19.0 (bundled in the Slinky image)
Problem
I encountered the same nvidia-persistenced/socket: operation not permitted error as described in SlinkyProject/slurm-operator#99 (NVIDIA Hook Failure). While investigating a fix, I found that the issue can be resolved at the enroot hook level.
In 98-nvidia.sh, the current cli_args initialization is:
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
When nvidia-container-cli configure runs inside a Kubernetes pod, it attempts to bind-mount /var/run/nvidia-persistenced/socket into the container rootfs. In containerized (non-bare-metal) environments this mount is denied by filesystem/permission restrictions, causing GPU workloads submitted via srun --container-image=... to fail.
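One quick way to confirm this failure mode from inside the pod is to check whether the socket exists at all; a minimal sketch (the path is the one nvidia-container-cli targets, and in a Slinky NodeSet pod it is normally absent):

```shell
# Hedged sketch: check for the nvidia-persistenced socket that
# nvidia-container-cli tries to bind-mount into the container rootfs.
SOCKET_PATH="/var/run/nvidia-persistenced/socket"
if [ -S "$SOCKET_PATH" ]; then
    STATUS="present"
else
    STATUS="absent"
fi
echo "persistenced socket $STATUS at $SOCKET_PATH"
```

If the socket is absent, the hook's attempt to mount it can never succeed, independent of the toolkit version.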
A comment on slurm-operator#99 suggested that this was fixed in nvidia-container-toolkit#1593 (v1.18.2+). However, I have confirmed that the nvidia-container-toolkit version in my Slinky image is already v1.19.0, and the error still persists:
root@gpu-gpu-1:/tmp# nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.19.0
commit: ec7b4e2fa2caecad6d89be4a26029b831fe7503a
This indicates that upgrading nvidia-container-toolkit alone does not fully resolve the issue in containerized Kubernetes environments. The 98-nvidia.sh hook still unconditionally lets nvidia-container-cli attempt to mount the persistenced socket, which fails due to permission restrictions in the pod.
Fix
Adding --no-persistenced to the default cli_args array in 98-nvidia.sh resolves the issue:
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
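For reference, a minimal sketch of applying the one-line patch with sed; here it runs against a scratch copy of the original cli_args line, but on a node it would target /etc/enroot/hooks.d/98-nvidia.sh (the hook path shipped in the Slinky image):

```shell
# Hedged sketch: insert --no-persistenced into cli_args with sed.
# A scratch copy of the original line stands in for the real hook file.
HOOK="$(mktemp)"
cat > "$HOOK" <<'EOF'
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
EOF

# Add the flag immediately after --no-cgroups, matching the patched line above.
sed -i 's/"--no-cgroups"/"--no-cgroups" "--no-persistenced"/' "$HOOK"

PATCHED="$(grep cli_args "$HOOK")"
echo "$PATCHED"
rm -f "$HOOK"
```

In a Kubernetes deployment the same sed line could run from an init container or postStart hook so the patch survives pod restarts.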
Before — original 98-nvidia.sh, error on socket mount:
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi
nvidia-container-cli: mount error: mount operation failed: /run/enroot/overlay/run/nvidia-persistenced/socket: operation not permitted
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
After — patched 98-nvidia.sh with --no-persistenced, GPUs accessible:
root@gpu-gpu-1:/tmp# cat /etc/enroot/hooks.d/98-nvidia.sh | grep cli_args
cli_args=("--no-cgroups" "--no-persistenced" "--ldconfig=@$(command -v ldconfig.real || command -v ldconfig)")
root@gpu-gpu-1:/tmp# NVIDIA_VISIBLE_DEVICES=all enroot start nccl-tests+12.9.1-devel-ubuntu24.04-nccl2.29.2-1-2276a5e.sqsh nvidia-smi
==========
== CUDA ==
==========
CUDA Version 12.9.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Fri Mar 27 04:49:59 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
......
Rationale
- In containerized environments (e.g., Slinky NodeSet pods), nvidia-persistenced is not running and its socket does not exist, so attempting to mount it will always fail.
- Even on bare metal where nvidia-persistenced is running, its driver-persistence effect operates at the host kernel level. Enroot containers access GPUs through the driver directly and do not need the persistenced socket mounted to function correctly.
- This is consistent with --no-cgroups already being included in cli_args: both flags skip functionality that is managed externally by the container runtime or host environment.