Fix Github workflows issues #2636

Open
pggPL wants to merge 22 commits into NVIDIA:main from pggPL:fix_github_workflows

Conversation

Collaborator

@pggPL pggPL commented Jan 30, 2026

Description

This PR fixes the following issues:

  • The nightly docs deployment fails because of incompatible packages. I tested it in my own fork, and the version changes fix the issue.
  • The build jobs are red because of OOM: the MAX_JOBS=1 environment variable was not propagated correctly inside the containers.
  • The PyTorch build job needed more disk space, so I changed its container to the JAX one and installed PyTorch manually, which takes much less space than any other option.
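
One way to catch the propagation problem early is to fail the step when the variable is missing inside the container. The sketch below is an assumption about how such a guard could look, not what this PR does; `require_env` is a hypothetical helper name.

```shell
# Hypothetical guard, not part of this PR: fail fast when a variable such as
# MAX_JOBS did not reach the environment the build actually runs in.
require_env() {
  name="$1"
  # Indirect lookup of the variable called $name (works in POSIX sh and bash).
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "error: $name is not set in this environment" >&2
    return 1
  fi
  echo "$name=$value"
}

# Usage inside the container step, before the build command:
#   require_env MAX_JOBS && pip install --no-build-isolation . -v
# If the build is started through a nested `docker run`, the variable has to be
# forwarded explicitly, for example with `docker run -e MAX_JOBS ...`.
```

Printing the value in the job log also makes it easy to confirm after the fact which limit a given build ran with.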

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL changed the title PR to debug github workflows fails PR to debug github workflows failures Jan 30, 2026
Contributor

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

This PR addresses CI/CD infrastructure issues by upgrading CUDA to 13.0, simplifying the build process, and updating GitHub Actions.

Key Changes:

  • Upgraded all jobs from CUDA 12.1.0/12.8.0 to CUDA 13.0.0 for consistency
  • Removed Docker-in-Docker complexity - builds now run directly in containers
  • Eliminated disk space maximization steps that were causing issues
  • Updated GitHub Pages deployment actions from v1/v2 to v3/v4 for compatibility
  • Added NVTE_CUDA_ARCHS: "100" (Blackwell architecture support)
  • Added git config --global --add safe.directory '*' to handle git security

Critical Issue:

  • The JAX installation uses jax[cuda13] and flax[cuda13] extras which don't follow JAX's naming convention. JAX uses extras like jax[cuda12_pip] or jax[cuda12_local], not cuda13. This will cause installation failures in the JAX and All jobs.

Confidence Score: 2/5

  • This PR has critical installation bugs that will cause JAX builds to fail
  • The invalid jax[cuda13] extra syntax will prevent JAX installation in two of four build jobs, causing immediate CI failures. While the overall approach is sound (simplifying Docker-in-Docker), the execution has a critical flaw.
  • .github/workflows/build.yml requires immediate attention for JAX installation syntax (lines 81, 111)

Important Files Changed

Filename | Overview
.github/workflows/deploy_nightly_docs.yml | Updated GitHub Actions to v3/v4 for compatibility, added workflow_dispatch trigger
.github/workflows/build.yml | Major refactor: upgraded CUDA to 13.0, simplified build approach by removing Docker-in-Docker, uses jax[cuda13] syntax that may be invalid

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions
    participant Core as Core Job
    participant PyTorch as PyTorch Job
    participant JAX as JAX Job
    participant All as All Job
    
    GH->>Core: Trigger build
    Core->>Core: Install deps in CUDA 13.0 container
    Core->>Core: Build with MAX_JOBS=1
    Core->>Core: Sanity check
    
    GH->>PyTorch: Trigger build
    PyTorch->>PyTorch: Install deps + PyTorch (CUDA 13.0)
    PyTorch->>PyTorch: Build with MAX_JOBS=1
    PyTorch->>PyTorch: Test import
    
    GH->>JAX: Trigger build
    JAX->>JAX: Install deps + jax[cuda13] ❌
    Note over JAX: Invalid extra syntax
    JAX--xJAX: Installation fails
    
    GH->>All: Trigger build
    All->>All: Install PyTorch + jax[cuda13] ❌
    Note over All: Invalid extra syntax
    All--xAll: Installation fails

Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

@pggPL pggPL changed the title PR to debug github workflows failures Fix Github workflows issues Jan 30, 2026
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

Comment on lines 59 to 61
  root-reserve-mb: 4096
  temp-reserve-mb: 32
- swap-size-mb: 10240
+ swap-size-mb: 4096
Contributor

Verify that reduced memory allocation (root: 5120→4096 MB, swap: 10240→4096 MB) is sufficient for PyTorch builds to avoid OOM issues.
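
A quick way to see what the runner actually ends up with is to print the memory and swap totals at the start of the build step. This is a minimal sketch, assuming a Linux runner that exposes `/proc/meminfo`; `meminfo_mb` is a hypothetical helper, not something in the workflow.

```shell
# meminfo_mb FIELD [FILE]: print FIELD (for example SwapTotal) in MiB, read from
# a meminfo-style file. FILE defaults to /proc/meminfo.
meminfo_mb() {
  awk -v field="$1:" '$1 == field { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}

# Usage in a workflow step (values depend on the runner):
#   echo "swap: $(meminfo_mb SwapTotal) MiB, ram: $(meminfo_mb MemTotal) MiB"
```

Comparing the reported swap with the configured `swap-size-mb` shows whether the setting took effect before an OOM has to reveal it.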

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

- name: 'Dependencies'
run: |
pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install torch --no-cache-dir
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 which was present before. This may install CPU-only PyTorch instead of the CUDA version needed for testing.

Suggested change
- pip install torch --no-cache-dir
+ pip install torch --no-cache-dir --index-url https://download.pytorch.org/whl/cu130
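
Which build pip actually picked can be checked in the job itself. The sketch below is an assumption about how such a check could look: wheels served from the `cu130` index carry a `+cu130` local version suffix, and the hypothetical helpers `cuda_tag` and `check_torch_build` compare that suffix with the expected one. A version without a suffix is not proof of a CPU-only build, so `torch.version.cuda` remains the more direct signal.

```shell
# cuda_tag VERSION: print the local-version suffix of a PyTorch version string
# (for example "cu130" for "2.9.0+cu130"); print nothing if there is none.
cuda_tag() {
  case "$1" in
    *+*) echo "${1#*+}" ;;
    *)   echo "" ;;
  esac
}

# check_torch_build VERSION WANT: fail unless VERSION carries the WANT suffix.
check_torch_build() {
  got="$(cuda_tag "$1")"
  if [ "$got" != "$2" ]; then
    echo "torch $1 has build tag '${got:-none}', expected '$2'" >&2
    return 1
  fi
}

# Usage in the workflow, after torch is installed (version shown is illustrative):
#   check_torch_build "$(python3 -c 'import torch; print(torch.__version__)')" cu130
#   python3 -c 'import torch; print(torch.version.cuda)'   # None for CPU-only builds
```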

- name: 'Dependencies'
run: |
pip install cmake==3.21.0 pybind11[global] einops onnxscript
pip install torch --no-cache-dir
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 which was present before. This may install CPU-only PyTorch instead of the CUDA version.

Suggested change
- pip install torch --no-cache-dir
+ pip install torch --no-cache-dir --index-url https://download.pytorch.org/whl/cu130

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 when installing PyTorch. Without this, pip will install the default PyTorch from PyPI, which may be CPU-only or have incompatible CUDA version. This was present in the base commit for the all job and is needed here too.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
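
A side note on the first `pip install` line of this step, offered as an observation rather than something raised in the review: in a `run:` block the shell treats the unquoted `>=1.0` as an output redirection, so pip never sees the version constraint. The sketch below reproduces the effect with `echo` standing in for `pip install`.

```shell
demo_dir="$(mktemp -d)"

# Unquoted: the shell parses `>=1.0` as "redirect stdout to a file named =1.0".
( cd "$demo_dir" && echo install importlib-metadata>=1.0 packaging )
ls "$demo_dir"            # =1.0
cat "$demo_dir/=1.0"      # install importlib-metadata packaging

# Quoted: the requirement reaches the command as a single argument.
quoted="$(echo install "importlib-metadata>=1.0" packaging)"
echo "$quoted"            # install importlib-metadata>=1.0 packaging
```

Quoting the requirement as `"importlib-metadata>=1.0"` in the workflow would avoid both the stray file and the silently dropped constraint.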

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 when installing PyTorch. The base commit explicitly used this index URL to ensure CUDA 13.0 support matching the JAX container (see commit 4cf2f12). Without it, the default PyPI version will be installed, which may be CPU-only.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. The JAX container has CUDA 13.0, and commit 4cf2f12 explicitly added this index URL to ensure compatibility. Without it, pip may install a CPU-only or incompatible CUDA version from PyPI.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match the CUDA 13.0 version in the JAX container. Without it, the default PyPI version will be installed, which may be CPU-only or have incompatible CUDA version.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match CUDA 13.0 in the JAX container. Without it, pip installs the default PyPI version (likely CPU-only or wrong CUDA version).

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was present in the base commit and is required to match CUDA 13.0 in the JAX container (see commit 4cf2f12).

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. The JAX container has CUDA 13.0, but without this flag pip installs the default PyPI version (likely CPU-only or wrong CUDA version). This was explicitly added in commit 4cf2f12 (#2308) for this exact reason.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

- name: 'Dependencies'
run: |
pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
pip install --no-cache-dir torch
Contributor

Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. Without this, pip installs the default PyPI version which may not match the JAX container's CUDA 13.0. This was present in the base commit for this exact job.

Suggested change
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

Comment on lines 30 to 34
run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v
env:
NVTE_FRAMEWORK: none
MAX_JOBS: 1
SCCACHE_GHA_ENABLED: "true"
SCCACHE_CACHE_SIZE: "5G"
Contributor

MAX_JOBS=1 removed from Core job. Check that the build completes successfully without this limit to prevent OOM issues.

pggPL added 2 commits February 3, 2026 04:55
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Comment on lines 42 to 43
image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
options: --user root
Contributor

Check that CUDA 12.1.0 (downgraded from 12.8.0) doesn't break PyTorch compatibility or cause runtime issues

Comment on lines 68 to 70
  container:
-   image: ghcr.io/nvidia/jax:jax
+   image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
    options: --user root
Contributor

Switched from ghcr.io/nvidia/jax:jax to base CUDA container - verify JAX[cuda12] install is compatible with CUDA 12.1 and includes all necessary dependencies

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

apt-get update
apt-get install -y git python3.9 pip cudnn9-cuda-12
pip install cmake==3.21.0 pybind11[global] ninja packaging
pip install jax
Contributor

pip install jax installs CPU-only JAX by default. Need jax[cuda12] to match CUDA 13.0 runtime

Suggested change
- pip install jax
+ pip install jax[cuda12]

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

apt-get update
apt-get install -y git python3.9 pip cudnn9-cuda-12
pip install cmake==3.21.0 pybind11[global] ninja packaging
pip install jax[cuda13] flax[cuda13]
Contributor

Verify jax[cuda13] is a valid extra. JAX typically uses extras like jax[cuda12_local] or jax[cuda12_pip] (see build_tools/wheel_utils/build_wheels.sh:66). Also note that transformer_engine/jax/pyproject.toml:6 specifies jax[cuda12]. Check JAX documentation to confirm cuda13 is the correct syntax for CUDA 13.0.
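
Whether `cuda13` is a valid extra depends on the JAX release that pip resolves, so it can be checked against the installed package metadata instead of guessed. This is a minimal sketch, assuming `python3` is on the PATH and the package is already installed; `list_extras` and `has_extra` are hypothetical helper names.

```shell
# list_extras DIST: print the extras declared by an installed distribution,
# one per line, using the standard-library importlib.metadata module.
list_extras() {
  python3 - "$1" <<'PY'
import sys
from importlib.metadata import PackageNotFoundError, metadata

try:
    extras = metadata(sys.argv[1]).get_all("Provides-Extra") or []
except PackageNotFoundError:
    sys.exit(f"{sys.argv[1]} is not installed")
print("\n".join(extras))
PY
}

# has_extra DIST EXTRA: succeed only if DIST is installed and declares EXTRA.
# Extra names may appear normalized in the metadata (for example hyphens
# instead of underscores), so spell EXTRA the way list_extras prints it.
has_extra() {
  list_extras "$1" 2>/dev/null | grep -qx -- "$2"
}

# Usage in the workflow, after `pip install jax`:
#   has_extra jax cuda13 || echo "this jax release does not declare a cuda13 extra"
```

As far as I know, pip only warns about an unknown extra and still installs the base package, so a check like this can catch a silent CPU-only install that the install step itself would not flag.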

apt-get install -y git python3.9 pip cudnn9-cuda-12
pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install jax[cuda13] flax[cuda13]
Contributor

Same concern as JAX job: verify jax[cuda13] is the correct syntax for CUDA 13.0 installation

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

apt-get update
apt-get install -y git python3.9 pip cudnn9-cuda-12
pip install cmake==3.21.0 pybind11[global] ninja packaging
pip install jax[cuda13] flax[cuda13]
Contributor

jax[cuda13] syntax is likely invalid. JAX typically uses extras like jax[cuda12_local] or jax[cuda12_pip] (see build_tools/wheel_utils/build_wheels.sh:66). Also, transformer_engine/jax/pyproject.toml:6 specifies jax[cuda12]. This will fail to install the CUDA-enabled version.

Suggested change
- pip install jax[cuda13] flax[cuda13]
+ pip install "jax[cuda12_pip]" "flax[cuda12_pip]"

apt-get install -y git python3.9 pip cudnn9-cuda-12
pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install jax[cuda13] flax[cuda13]
Contributor

Same issue as JAX job: jax[cuda13] and flax[cuda13] are invalid extras. Use jax[cuda12_pip] and flax[cuda12_pip] instead.

Suggested change
- pip install jax[cuda13] flax[cuda13]
+ pip install jax[cuda12_pip] flax[cuda12_pip]
