Fix Github workflows issues #2636
base: main
Changes from all commits
ee0cca1
df6a81b
daa3bf3
d47399a
d2091f2
4dc9323
4171efe
b44ec74
23ea443
a0a528f
95c333f
cb3fa26
cc3c5b1
52f6cb2
4ebbe22
89d2985
b5f554e
30cd354
ba96b42
7ad602f
d3bfeeb
03488d9
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -12,14 +12,15 @@ jobs: | |||||
| name: 'Core' | ||||||
| runs-on: ubuntu-latest | ||||||
| container: | ||||||
| image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04 | ||||||
| image: nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04 | ||||||
| options: --user root | ||||||
| steps: | ||||||
| - name: 'Dependencies' | ||||||
| run: | | ||||||
| apt-get update | ||||||
| apt-get install -y git python3.9 pip cudnn9-cuda-12 | ||||||
| pip install cmake==3.21.0 pybind11[global] ninja | ||||||
| git config --global --add safe.directory '*' | ||||||
| - name: 'Checkout' | ||||||
| uses: actions/checkout@v3 | ||||||
| with: | ||||||
|
|
@@ -32,125 +33,97 @@ jobs: | |||||
| NVTE_FRAMEWORK: none | ||||||
| MAX_JOBS: 1 | ||||||
| SCCACHE_GHA_ENABLED: "true" | ||||||
| NVTE_CUDA_ARCHS: "100" | ||||||
| - name: 'Sanity check' | ||||||
| run: python3 -c "import transformer_engine" | ||||||
| working-directory: / | ||||||
| pytorch: | ||||||
| name: 'PyTorch' | ||||||
| runs-on: ubuntu-latest | ||||||
| container: | ||||||
| image: nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04 | ||||||
| options: --user root | ||||||
| steps: | ||||||
| - name: Move /var/lib/docker/ | ||||||
| shell: bash -euxo pipefail {0} | ||||||
| run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker" | ||||||
|
|
||||||
| - name: Maximize build space | ||||||
| uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794 | ||||||
| with: | ||||||
| root-reserve-mb: 5120 | ||||||
| temp-reserve-mb: 32 | ||||||
| swap-size-mb: 10240 | ||||||
| remove-dotnet: 'true' | ||||||
| remove-android: 'true' | ||||||
| remove-haskell: 'true' | ||||||
| remove-codeql: 'true' | ||||||
| build-mount-path: '/var/lib/docker/' | ||||||
|
|
||||||
| - name: Restore /var/lib/docker/ | ||||||
| shell: bash -euxo pipefail {0} | ||||||
| run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker" | ||||||
|
|
||||||
| - name: 'Dependencies' | ||||||
| run: | | ||||||
| apt-get update | ||||||
| apt-get install -y git python3.9 pip cudnn9-cuda-12 | ||||||
| pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript | ||||||
| pip install torch --index-url https://download.pytorch.org/whl/cu130 | ||||||
| git config --global --add safe.directory '*' | ||||||
| - name: 'Checkout' | ||||||
| uses: actions/checkout@v3 | ||||||
| with: | ||||||
| submodules: recursive | ||||||
|
|
||||||
| - name: Start named container | ||||||
| run: | | ||||||
| docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity | ||||||
|
|
||||||
| - name: 'Dependencies' | ||||||
| run: | | ||||||
| docker exec builder bash -c '\ | ||||||
| apt-get update && \ | ||||||
| apt-get install -y git python3.9 pip cudnn9-cuda-12 && \ | ||||||
| pip install cmake torch ninja pydantic importlib-metadata>=1.0 packaging pybind11 numpy einops onnxscript && \ | ||||||
| apt-get clean \ | ||||||
| ' | ||||||
|
|
||||||
| - name: ccache | ||||||
| uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad | ||||||
| - name: 'Build' | ||||||
| run: docker exec builder bash -c 'pip install --no-build-isolation . -v --no-deps' | ||||||
| run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v --no-deps | ||||||
| env: | ||||||
| NVTE_FRAMEWORK: pytorch | ||||||
| MAX_JOBS: 1 | ||||||
| SCCACHE_GHA_ENABLED: "true" | ||||||
| NVTE_CUDA_ARCHS: "100" | ||||||
| - name: 'Sanity check' | ||||||
| run: docker exec builder bash -c 'python3 tests/pytorch/test_sanity_import.py' | ||||||
| run: python3 tests/pytorch/test_sanity_import.py | ||||||
| jax: | ||||||
| name: 'JAX' | ||||||
| runs-on: ubuntu-latest | ||||||
| container: | ||||||
| image: ghcr.io/nvidia/jax:jax | ||||||
| image: nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04 | ||||||
| options: --user root | ||||||
| steps: | ||||||
| - name: 'Dependencies' | ||||||
| run: pip install cmake==3.21.0 pybind11[global] | ||||||
| run: | | ||||||
| apt-get update | ||||||
| apt-get install -y git python3.9 pip cudnn9-cuda-12 | ||||||
| pip install cmake==3.21.0 pybind11[global] ninja packaging | ||||||
| pip install jax[cuda13] flax[cuda13] | ||||||
|
Contributor
Verify
Contributor
Suggested change
|
||||||
| git config --global --add safe.directory '*' | ||||||
| - name: 'Checkout' | ||||||
| uses: actions/checkout@v3 | ||||||
| with: | ||||||
| submodules: recursive | ||||||
| - name: ccache | ||||||
| uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad | ||||||
| - name: 'Build' | ||||||
| run: | | ||||||
| NVTE_CCACHE_BIN=sccache NVTE_USE_CCACHE=1 pip install --no-build-isolation . -v | ||||||
| run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v | ||||||
| env: | ||||||
| NVTE_FRAMEWORK: jax | ||||||
| MAX_JOBS: 1 | ||||||
| SCCACHE_GHA_ENABLED: "true" | ||||||
| NVTE_CUDA_ARCHS: "100" | ||||||
| - name: 'Sanity check' | ||||||
| run: python3 tests/jax/test_sanity_import.py | ||||||
| all: | ||||||
| name: 'All' | ||||||
| runs-on: ubuntu-latest | ||||||
| container: | ||||||
| image: nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04 | ||||||
| options: --user root | ||||||
| steps: | ||||||
| - name: Move /var/lib/docker/ | ||||||
| shell: bash -euxo pipefail {0} | ||||||
| run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker" | ||||||
|
|
||||||
| - name: Maximize build space | ||||||
| uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794 | ||||||
| with: | ||||||
| root-reserve-mb: 5120 | ||||||
| temp-reserve-mb: 32 | ||||||
| swap-size-mb: 10240 | ||||||
| remove-dotnet: 'true' | ||||||
| remove-android: 'true' | ||||||
| remove-haskell: 'true' | ||||||
| remove-codeql: 'true' | ||||||
| build-mount-path: '/var/lib/docker/' | ||||||
|
|
||||||
| - name: Restore /var/lib/docker/ | ||||||
| shell: bash -euxo pipefail {0} | ||||||
| run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker" | ||||||
|
|
||||||
| - name: 'Dependencies' | ||||||
| run: | | ||||||
| apt-get update | ||||||
| apt-get install -y git python3.9 pip cudnn9-cuda-12 | ||||||
| pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript | ||||||
| pip install torch --index-url https://download.pytorch.org/whl/cu130 | ||||||
| pip install jax[cuda13] flax[cuda13] | ||||||
|
Contributor
Same concern as JAX job: verify
Contributor
Same issue as JAX job:
Suggested change
|
||||||
| git config --global --add safe.directory '*' | ||||||
| - name: 'Checkout' | ||||||
| uses: actions/checkout@v3 | ||||||
| with: | ||||||
| submodules: recursive | ||||||
|
|
||||||
| - name: Start named container | ||||||
| run: | | ||||||
| docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity | ||||||
|
|
||||||
| - name: 'Dependencies' | ||||||
| run: | | ||||||
| docker exec builder bash -c '\ | ||||||
| pip install cmake==3.21.0 pybind11[global] einops onnxscript && \ | ||||||
| pip install torch --no-cache-dir --index-url https://download.pytorch.org/whl/cu130 | ||||||
| ' | ||||||
| - name: ccache | ||||||
| uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad | ||||||
| - name: 'Build' | ||||||
| run: docker exec builder bash -c 'pip install --no-cache-dir --no-build-isolation . -v --no-deps' | ||||||
| run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v --no-deps | ||||||
| env: | ||||||
| NVTE_FRAMEWORK: all | ||||||
| MAX_JOBS: 1 | ||||||
| SCCACHE_GHA_ENABLED: "true" | ||||||
| NVTE_CUDA_ARCHS: "100" | ||||||
| - name: 'Sanity check' | ||||||
| run: docker exec builder bash -c 'python3 tests/pytorch/test_sanity_import.py && python3 tests/jax/test_sanity_import.py' | ||||||
| run: | | ||||||
| python3 tests/pytorch/test_sanity_import.py | ||||||
| python3 tests/jax/test_sanity_import.py | ||||||
Switched from ghcr.io/nvidia/jax:jax to base CUDA container - verify JAX[cuda12] install is compatible with CUDA 12.1 and includes all necessary dependencies
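A side note on the 'Dependencies' steps of the PyTorch and All jobs: they pass `importlib-metadata>=1.0` to `pip install` without quotes. In a POSIX shell an unquoted `>` starts an output redirection, so the command most likely receives only `importlib-metadata` and its output lands in a file named `=1.0`. The sketch below reproduces that parsing behaviour with a stand-in function; `show_args` and the temporary directory are hypothetical names used for illustration only and are not part of the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `pip install`: prints the arguments it
# receives, joined by single spaces.
show_args() { printf '%s\n' "$*"; }

# Work in a scratch directory so the stray file is easy to inspect.
workdir="$(mktemp -d)"
cd "$workdir"

# Unquoted: the shell parses `>=1.0` as "redirect stdout to a file named =1.0",
# so the command sees only the bare package name and prints nothing to the terminal.
show_args pydantic importlib-metadata>=1.0 packaging
unquoted_args="$(cat ./=1.0)"

# Quoted: the whole requirement specifier reaches the command intact.
quoted_args="$(show_args pydantic 'importlib-metadata>=1.0' packaging)"

echo "unquoted: ${unquoted_args}"
echo "quoted:   ${quoted_args}"
```

If the version constraint is meant to be enforced, quoting the specifier in the workflow (`'importlib-metadata>=1.0'`) would be one way to do it; whether the constraint matters for this build is for the PR author to judge.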