
Generalize MFU/FLOPs module across recipes with log_mfu training hook#1548

Open
gagank1 wants to merge 12 commits into main from gkaushik/mfu_experiment

Conversation

Collaborator

@gagank1 gagank1 commented Apr 4, 2026

Description

Adds an architecture-independent FLOPs counting and MFU tracking module (flops.py) that can be used by any native_te recipe. The module auto-detects model architecture (GQA vs MHA, SwiGLU vs standard FFN) from HuggingFace config dicts and supports analytical, simplified, and Hyena (Evo2) FLOPs formulas.

MFU tracking is integrated into all 11 training scripts across 4 recipes, gated behind a Hydra flag that defaults to log_mfu: false.

Some MFU results: https://linear.app/nvidia/issue/BIO-406/mfu-calculation-experiment#comment-50296297
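For intuition, the analytical formula described above roughly follows the standard 6N FLOPs-per-token convention (2 FLOPs per multiply-add, backward ~2x forward), extended with GQA and SwiGLU terms. The sketch below is an illustrative approximation under those assumptions, not the PR's flops.py implementation; all names here are hypothetical.

```python
# Illustrative only: rough per-token training FLOPs for one decoder layer.
# Not the PR's flops.py; function name and term grouping are assumptions.
def layer_train_flops_per_token(hidden, n_heads, n_kv_heads, intermediate, seq_len):
    head_dim = hidden // n_heads
    q_proj = 2 * hidden * (n_heads * head_dim)
    kv_proj = 2 * 2 * hidden * (n_kv_heads * head_dim)  # GQA shrinks K and V
    out_proj = 2 * hidden * hidden
    attn_mix = 2 * 2 * seq_len * hidden                 # QK^T scores plus AV mixing
    swiglu_ffn = 3 * 2 * hidden * intermediate          # gate, up, down matmuls
    fwd = q_proj + kv_proj + out_proj + attn_mix + swiglu_ffn
    return 3 * fwd                                      # forward + ~2x backward
```

With Llama-3.1-8B-like numbers (hidden=4096, 32 heads, 8 KV heads, intermediate=14336), the GQA K/V projection term is a quarter of its MHA equivalent, which is exactly the kind of difference the auto-detection has to account for.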

Usage

During training:

  # Any training script — just add log_mfu=true
  torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true

Output during training:

  [__main__][INFO] - MFU: 60.1% (93.19 TFLOPS/GPU)

CLI tool:

  # Show GPU and peak bf16 TFLOPS
  python flops.py gpu-info

  # Compute FLOPs for any model config (auto-detects architecture)
  python flops.py flops --config-path ./model_configs/lingua-1B
  python flops.py flops --config-path nvidia/esm2_t36_3B_UR50D   # from HF Hub

  # Estimate CP ring attention communication overhead
  python flops.py flops --config-path ./model_configs/lingua-1B --num-gpus 2 --cp-size 2 --seq-len 16384

  # Measure P2P GPU bandwidth (matches CP ring attention pattern)
  torchrun --nproc_per_node=2 flops.py bandwidth
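For intuition on what the CP communication estimate covers, one plausible accounting of ring-attention KV traffic is sketched below. The function name and formula are illustrative assumptions, not the module's estimate_cp_comm_bytes:

```python
def cp_ring_kv_bytes_per_gpu(batch, seq_len, n_kv_heads, head_dim,
                             n_layers, cp_size, bytes_per_el=2):
    """Rough bytes each GPU sends per forward pass of ring attention.

    Each rank circulates its local K and V shards to the other
    cp_size - 1 ranks, once per layer. bf16 => 2 bytes per element;
    the backward pass (dK/dV) roughly doubles this again.
    Illustrative sketch only, not the PR's estimator.
    """
    shard = batch * (seq_len // cp_size) * n_kv_heads * head_dim * bytes_per_el
    return 2 * shard * (cp_size - 1) * n_layers  # factor 2: K and V
```

Dividing such a per-step byte count by the measured P2P bandwidth (the `bandwidth` subcommand above) gives a first-order estimate of the communication time that ring attention must overlap with compute.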

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on a pull request with one of these commands:

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Added Model FLOPs Utilization (MFU) tracking to monitor per-step compute efficiency during training via log_mfu=true parameter.
    • Added FLOPs computation utilities with CLI tools for GPU peak performance detection, bandwidth measurement, and context-parallel communication overhead estimation.
  • Documentation

    • Added "MFU Tracking" guides with usage examples for all recipe READMEs.
    • Documented flops.py CLI commands for performance profiling and GPU analysis.
  • Tests

    • Added comprehensive test coverage for FLOPs calculations and MFU tracking functionality.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 self-assigned this Apr 4, 2026

copy-pr-bot bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough

This PR introduces Model FLOPs Utilization (MFU) tracking across multiple native training recipes (LLaMA3, ESM2, CodonFM, OpenGenome2). A new flops.py module provides architecture-independent FLOPs computation, peak TFLOPS detection, and communication overhead estimation. Training scripts are updated to conditionally log MFU metrics per optimizer step. Configuration and documentation are updated to support the new feature, and comprehensive test coverage is added.

Changes

Cohort / File(s) Summary
FLOPs Core Module
llama3_native_te/flops.py, esm2_native_te/flops.py, codonfm_native_te/flops.py, opengenome2_llama_native_te/flops.py
New ~714-line module providing FLOPs computation (analytical, simplified, Hyena variants), MFU tracking, GPU peak TFLOPS detection, CP ring-attention and DP communication overhead estimation, model config loading from HF/local, bandwidth measurement, and CLI commands (gpu-info, flops, cp-comm, bandwidth).
Training Script MFU Integration
llama3_native_te/train_ddp.py, llama3_native_te/train_fsdp2.py, llama3_native_te/train_fsdp2_cp.py, esm2_native_te/train_ddp.py, esm2_native_te/train_ddp_cp.py, esm2_native_te/train_fsdp2.py, esm2_native_te/train_fsdp2_cp.py, esm2_native_te/train_mfsdp.py, codonfm_native_te/train_fsdp2.py, opengenome2_llama_native_te/train_fsdp2.py, opengenome2_llama_native_te/train_fsdp2_cp.py
Adds optional MFU tracking gated by args.log_mfu: initializes MFUTracker from model config, measures per-step elapsed time via time.perf_counter(), computes and logs MFU percentage and per-GPU TFLOPS on main process each optimizer step. ~23--24 lines per file.
Configuration Defaults
llama3_native_te/hydra_config/defaults.yaml, esm2_native_te/hydra_config/defaults.yaml, codonfm_native_te/hydra_config/defaults.yaml, opengenome2_llama_native_te/hydra_config/defaults.yaml
Added log_mfu: false configuration parameter to enable MFU logging via CLI override.
FLOPs Test Coverage
llama3_native_te/tests/test_flops.py, esm2_native_te/tests/test_flops.py, codonfm_native_te/tests/test_flops.py, opengenome2_llama_native_te/tests/test_flops.py
New ~317--323 line test modules validating architecture auto-detection, analytical/simplified/Hyena FLOPs formulas, MFU calculations, communication estimation, and multi-GPU scaling with golden-value assertions.
Documentation
llama3_native_te/README.md, esm2_native_te/README.md, codonfm_native_te/README.md, opengenome2_llama_native_te/README.md
Added "MFU Tracking" section documenting log_mfu=true CLI usage, logged metrics (MFU %, TFLOPS/GPU, step time), and CLI examples for flops.py commands (gpu-info, FLOPs computation, CP communication, bandwidth measurement).
Performance Logger & Metadata
esm2_native_te/perf_logger.py
Added last_step_time instance attribute (initialized to None, updated in log_step) to track per-logged-step timing for MFU computation.
Model Configuration
llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
New Llama-3.1-8B config with architecture parameters (4096 hidden, 32 layers, 32 attention heads, 8 KV heads, 14336 intermediate, 131072 max positions), rope scaling, and transformers version metadata.
CI/CD Synchronization
ci/scripts/check_copied_files.py
Added source-to-destination mappings for flops.py and tests/test_flops.py from llama3_native_te to esm2_native_te, codonfm_native_te, and opengenome2_llama_native_te for consistency verification.

Sequence Diagram(s)

sequenceDiagram
    participant TrainScript as Training Script
    participant MFUTracker as MFUTracker
    participant FlopsModule as FLOPs Compute
    participant GPU as GPU

    TrainScript->>MFUTracker: __init__(config, batch_size, seq_len, ...)
    MFUTracker->>FlopsModule: from_hf_config(model_config)
    FlopsModule-->>MFUTracker: ModelFLOPsConfig
    MFUTracker->>FlopsModule: compute_flops_analytical/simplified()
    FlopsModule-->>MFUTracker: total_training_flops
    MFUTracker->>GPU: detect_gpu_peak_tflops()
    GPU-->>MFUTracker: peak_tflops_bf16

    loop Each Optimizer Step
        TrainScript->>TrainScript: measure step_time (perf_counter)
        TrainScript->>MFUTracker: compute_mfu(step_time)
        MFUTracker->>MFUTracker: mfu = (total_flops / step_time / peak_tflops)
        MFUTracker-->>TrainScript: {mfu_percent, tflops_per_gpu}
        TrainScript->>TrainScript: log metrics
    end
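The loop in the diagram can be sketched as follows. Class and field names mirror the diagram but are assumptions about the real MFUTracker interface; the actual class also derives flops_per_step from the HF config and auto-detects peak bf16 TFLOPS.

```python
import time

class MFUTrackerSketch:
    """Minimal stand-in for the PR's MFUTracker (names are assumptions)."""

    def __init__(self, flops_per_step, peak_tflops_bf16, num_gpus=1):
        self.flops_per_step = flops_per_step            # FLOPs per optimizer step
        self.num_gpus = num_gpus
        self.peak_flops = peak_tflops_bf16 * 1e12 * num_gpus

    def compute_mfu(self, step_time):
        achieved = self.flops_per_step / step_time      # FLOP/s across all GPUs
        return {
            "mfu": 100.0 * achieved / self.peak_flops,
            "tflops_per_gpu": achieved / self.num_gpus / 1e12,
        }

# Per-step use mirrors the diagram: time the optimizer step, then report.
# ~989 TFLOPS is the commonly cited H100 SXM dense bf16 peak.
tracker = MFUTrackerSketch(flops_per_step=1.5e15, peak_tflops_bf16=989)
start = time.perf_counter()
# ... forward / backward / optimizer.step() would run here ...
step_time = 2.5  # stand-in for time.perf_counter() - start
info = tracker.compute_mfu(step_time)
```

With these made-up numbers the tracker reports roughly 60% MFU at 600 TFLOPS/GPU, the same shape of output the training scripts log.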

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jstjohn
  • jwilber
  • cspades
  • dorotat-nv
  • trvachov
  • jomitchellnv
  • savitha-eng

Poem

🐰 Hops through training with metrics so bright,
FLOPs counted with analytical might,
MFU tracking each step of the way,
GPU peaks detected, efficiency on display! 🎯⚡

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 59.45%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title accurately captures the main addition of MFU/FLOPs utilities with the log_mfu training hook, which aligns with the primary objective of the changeset across multiple recipes.
  • Description check ✅ Passed: the PR description is comprehensive, providing detailed motivation, usage examples, and a change summary aligned with the template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
    hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
    hf_config_meta._attn_implementation = "eager"
Collaborator

This is going to be the slowest possible attention, can you use fa2 or sdpa?

Collaborator Author

This was from an older version of the PR; it no longer includes the HF comparison.

gagank1 and others added 11 commits April 6, 2026 17:59
- Add compare_mfu_validate.py: golden value tests comparing CP vs non-CP
  execution for both TE and HF models, validating loss, logits (cosine
  sim > 0.99), and gradients (cosine sim > 0.8) following the pattern
  from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to
  context_parallel buffers, causing each rank to auto-generate [0..S/CP-1]
  instead of correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with
  the actual training implementation (identical counts on meta device)
- Add exact FLOPs output (full integers with commas) alongside abbreviated
  values in both single-GPU and multi-GPU scripts
- Switch bandwidth measurement from all-reduce to all-gather for more
  accurate pure data movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Replaces all-gather with explicit send/recv between rank 0 and rank 1,
matching CP ring attention's actual communication pattern. Measures 6.6 GB/s
unidirectional on PCIe Gen 3 x8 (vs 3.2 GB/s all-gather, 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Adds subcommands for using MFU utilities outside of full training scripts:
  - gpu-info:   Print GPU name, detected peak TFLOPS, and known GPU table
  - flops:      Compute FLOPs from a model config (README + first-principles)
  - cp-comm:    Estimate CP ring attention communication volume
  - bandwidth:  Measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Add architecture-independent flops.py module (source: models/esm2/flops.py)
supporting Llama (GQA+SwiGLU), ESM2, CodonFM (MHA+GELU), and Evo2 (Hyena).

Key additions:
- ModelFLOPsConfig dataclass with from_hf_config() auto-detection
- Analytical, simplified (README), and Hyena FLOPs formulas
- MFUTracker class for training script integration
- Communication overhead estimation (CP ring attention, FSDP)
- CLI: gpu-info, flops, cp-comm, bandwidth

Training integration (gated by log_mfu: false in hydra config):
- 11 training scripts across 4 native_te recipes
- llama3_native_te (3), esm2_native_te (5), codonfm_native_te (1),
  opengenome2_llama_native_te (2)

Removes superseded Llama-specific compare_mfu*.py benchmark scripts.
Moves first_principles.md to models/esm2/ alongside the source module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Fix RTX 4090 (330→165) and RTX 3090 (142→71): were using sparse values
- Remove A5000, RTX 4090, RTX 3090 (not relevant for data center workloads)
- Add B300 and GB300 Blackwell Ultra at 2,500 TFLOPS dense BF16
- Add authoritative NVIDIA source URLs for all GPU specs as comments

Sources:
  B300/GB300: https://www.nvidia.com/en-us/data-center/gb300-nvl72/
  (360 PFLOPS BF16 sparse / 72 GPUs / 2 = 2,500 TFLOPS dense per GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Remove functions that were only used by the now-deleted compare_mfu*.py
standalone benchmark scripts:
- compute_flops_first_principles(), compute_flops_readme() (backward compat wrappers)
- measure_step_time(), split_for_cp_bshd(), count_flops_with_model(), cleanup_model()
- Unused imports: gc, nullcontext, FlopCounterMode

Retained: MFUTracker, from_hf_config, compute_flops_analytical,
compute_flops_simplified, compute_flops_hyena, CLI, and all utilities
needed by the training script log_mfu integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Move flops.py source from models/esm2/ to recipes/llama3_native_te/
  (it's a training utility, not a model utility)
- Add test_flops.py with 27 tests covering config auto-detection,
  analytical/simplified/hyena formulas, MFUTracker, and CP comm estimation
- Update check_copied_files.py: source is now llama3_native_te, copied to
  esm2, codonfm, opengenome2 recipes (both flops.py and test_flops.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Document the log_mfu=true flag and flops.py CLI utilities in all 4
native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 changed the title from "Recipes SOL Perf Baselines" to "Recipes MFU calculation utilities" Apr 13, 2026
@gagank1 gagank1 marked this pull request as ready for review April 13, 2026 20:35
@gagank1 gagank1 enabled auto-merge April 13, 2026 20:36
Collaborator Author

gagank1 commented Apr 13, 2026

@coderabbitai full review

Contributor

coderabbitai bot commented Apr 13, 2026

✅ Actions performed

Full review triggered.

Collaborator Author

gagank1 commented Apr 13, 2026

/ok to test 27255a1

@gagank1 gagank1 changed the title from "Recipes MFU calculation utilities" to "Generalize MFU/FLOPs module across recipes with log_mfu training hook" Apr 13, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)

108-120: ⚠️ Potential issue | 🟠 Major

Reset last_step_time on non-logging steps to prevent stale MFU inputs.

self.last_step_time is only updated inside the logging-frequency branch, so it can retain an old value. Since training code reads perf_logger.last_step_time each step, MFU can be computed with stale timing.

Proposed fix
         with torch.no_grad():
+            self.last_step_time = None
             # FSDP2's clip_grad_norm_ returns a DTensor; convert to local tensor for torchmetrics compatibility.
             if isinstance(grad_norm, DTensor):
                 grad_norm = grad_norm.to_local()
@@
             if step % self.logging_frequency == 0 and step > 0:
@@
                 step_time = elapsed_time / self.logging_frequency
                 self.last_step_time = step_time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py` around lines 108 -
120, The perf logger currently only updates self.last_step_time inside the
logging branch guarded by self.logging_frequency, causing stale values on
non-logging steps; update the step handler in perf_logger.py so that when a step
is not a logging step you explicitly reset self.last_step_time (e.g., set to 0.0
or None) to avoid stale MFU inputs, keeping the existing logic that updates
self.previous_step_time and last_step_time when (step % self.logging_frequency
== 0).
🧹 Nitpick comments (3)
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py (1)

354-359: Same minor note as train_fsdp2_cp.py.

The step_start_time reset on line 359 happens unconditionally. Same optional refactor applies if you want consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py` around
lines 354 - 359, The unconditional reset of step_start_time happens even when
mfu_tracker is None; update the block around mfu_tracker so step_start_time is
only reset when mfu_tracker is present (i.e., move the step_start_time =
time.perf_counter() inside the if mfu_tracker is not None branch, after calling
mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info)
to match the optional flow used elsewhere (see mfu_tracker, compute_mfu,
dist_config.is_main_process, logger.info).
bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py (1)

24-25: Consider using a more robust import approach.

The sys.path.insert(0, ...) pattern works but can be fragile. Since this test file will be copied across recipes, consider whether pytest's conftest.py or pyproject.toml configuration could provide a cleaner solution.

That said, this is a common pattern in recipe-style code and works correctly here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py` around lines 24
- 25, The test currently mutates sys.path via the sys.path.insert(0, ...) call
in test_flops.py which is fragile; instead move that path-setup into a shared
pytest configuration so tests don't each modify sys.path: create or update
pytest's conftest.py to add the parent directory to sys.path (or better,
configure package discovery via pyproject.toml/paths or install the package in
editable mode) and remove the sys.path.insert(0, ...) line from test_flops.py so
imports are resolved by pytest configuration rather than per-test mutation.
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py (1)

362-367: Minor inconsistency: step_start_time reset occurs unconditionally.

The timer reset on line 367 happens even when mfu_tracker is None. While this has negligible performance impact, it's slightly inconsistent—either gate the entire timing block or make it explicit that timing is always active.

Also note this script uses manual time.perf_counter() timing while ESM2 scripts use perf_logger.last_step_time. This difference is acceptable since different recipes may have different PerfLogger implementations, but it's worth documenting if intentional.

Optional: Gate timing reset with MFU tracker check
                if mfu_tracker is not None:
                    step_time = time.perf_counter() - step_start_time
                    mfu_info = mfu_tracker.compute_mfu(step_time)
                    if dist_config.is_main_process():
                        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
-                step_start_time = time.perf_counter()
+                    step_start_time = time.perf_counter()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py` around
lines 362 - 367, The timing reset is currently unconditional; to be consistent,
only reset step_start_time when mfu tracking is active. Inside the block that
checks mfu_tracker is not None (the one that calls mfu_tracker.compute_mfu(...)
and logs via logger.info when dist_config.is_main_process()), move the line that
sets step_start_time = time.perf_counter() into that same guarded block (or
explicitly comment why timing should remain unconditional). This references the
mfu_tracker variable, compute_mfu method, step_start_time variable,
dist_config.is_main_process(), and logger.info to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py`:
- Around line 24-41: The tests unsafely import the local flops.py via sys.path
and a plain "from flops import ...", causing module-cache collisions across
pytest sessions; modify the test to load the local flops.py under a unique
module name (e.g., using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.

In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 222-227: The MFU log omits the per-step duration even though
step_time is computed; update the logging call inside the mfu_tracker block
(where mfu_info = mfu_tracker.compute_mfu(step_time) and inside the
dist_config.is_main_process() check) to include step_time in the logger.info
message alongside mfu_info["mfu"] and mfu_info["tflops_per_gpu"] (e.g., add a
placeholder for step_time and pass step_time as an argument) so each MFU line
reports MFU, TFLOPS/GPU, and the per-step time.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py`:
- Around line 607-647: The CLI flag args.formula is unused: either honor it or
remove it; to fix, branch on args.formula (values "simplified" or "analytical")
before calling compute_flops_simplified and compute_flops_analytical so you only
compute and print the requested method. Specifically, use args.formula to decide
whether to call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
- Around line 453-483: In measure_bus_bandwidth, reject world sizes != 2 instead
of assuming peers by updating the initial guard (world_size) to raise an error
when world_size is not exactly 2; ensure you check world_size at the top of the
function and raise a ValueError (or similar) with a clear message referencing
the peer calculation using peer = 1 - rank so callers know only 2-process runs
are supported; keep the rest of the logic unchanged once the guard ensures rank
is 0 or 1.
- Around line 104-140: from_hf_config currently computes derived fields
(head_dim, has_lm_head) before kwargs.update(overrides), which allows overrides
like num_attention_heads or vocab_size to leave derived values stale; update the
function so overrides are applied before computing derived fields (or apply
kwargs.update(overrides) and then recompute head_dim = hidden_size //
num_attention_heads and has_lm_head = vocab_size > 0 afterwards), ensuring
derived values (head_dim, has_lm_head and any other fields dependent on
overrides) reflect the final values passed to ModelFLOPsConfig; keep references
to from_hf_config, kwargs.update(overrides), head_dim, has_lm_head,
num_attention_heads, vocab_size and ModelFLOPsConfig to locate the changes.
- Around line 69-75: detect_gpu_peak_tflops currently calls
torch.cuda.get_device_name(0) unguarded and will crash on CPU-only hosts; update
detect_gpu_peak_tflops to first check torch.cuda.is_available() and only call
torch.cuda.get_device_name(0) when CUDA is available, otherwise return (None,
"unknown") so callers like _cli_gpu_info and MFUTracker(..., peak_tflops=None)
safely receive a fallback; keep the GPU_PEAK_TFLOPS_BF16 lookup logic intact
when a device name is present.
- Around line 486-507: The load_model_config function currently hardcodes
trust_remote_code=True when calling AutoConfig.from_pretrained; change it to
accept explicit keyword-only parameters (e.g., *, trust_remote_code: bool =
False, revision: Optional[str] = None) so callers must opt-in to remote code
execution and can pin a revision, then pass those parameters into
AutoConfig.from_pretrained(config_path, trust_remote_code=trust_remote_code,
revision=revision) and update any call sites accordingly.

In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py`:
- Around line 151-157: The MFUTracker is being configured with per-micro-batch
FLOPs which underestimates utilization when gradient accumulation is used;
update the MFUTracker initialization (the MFUTracker(...) call) to account for
gradient accumulation by multiplying the effective batch_size (or otherwise
scaling FLOPs) by the gradient accumulation factor (use
args.gradient_accumulation_steps or args.grad_acc_steps as present in the
training args) so the tracker uses full-step work rather than a single
micro-batch, and apply the same correction to the related MFU/config
calculations around this block.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Around line 187-193: The MFUTracker is using args.dataset.micro_batch_size as
the FLOPs batch size which ignores gradient accumulation and underestimates MFU;
update the mfu_tracker initialization (symbol: MFUTracker / variable:
mfu_tracker) to use the effective batch size = args.dataset.micro_batch_size *
args.grad_acc_steps (or otherwise multiply by grad_acc_steps) when setting the
batch_size parameter, and apply the same change wherever MFUTracker is
constructed (also the other block around the second occurrence).

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py`:
- Around line 165-170: MFU is undercounted because MFUTracker is initialized
with a single micro-batch; multiply the batch work by gradient accumulation
steps so FLOPs/ MFU reflect a full optimizer step. Update the MFUTracker
initialization (mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.

---

Outside diff comments:
In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py`:
- Around line 108-120: The perf logger currently only updates
self.last_step_time inside the logging branch guarded by self.logging_frequency,
causing stale values on non-logging steps; update the step handler in
perf_logger.py so that when a step is not a logging step you explicitly reset
self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs,
keeping the existing logic that updates self.previous_step_time and
last_step_time when (step % self.logging_frequency == 0).

---

Nitpick comments:
In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py`:
- Around line 24-25: The test currently mutates sys.path via the
sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move
that path-setup into a shared pytest configuration so tests don't each modify
sys.path: create or update pytest's conftest.py to add the parent directory to
sys.path (or better, configure package discovery via pyproject.toml/paths or
install the package in editable mode) and remove the sys.path.insert(0, ...)
line from test_flops.py so imports are resolved by pytest configuration rather
than per-test mutation.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py`:
- Around line 362-367: The timing reset is currently unconditional; to be
consistent, only reset step_start_time when mfu tracking is active. Inside the
block that checks mfu_tracker is not None (the one that calls
mfu_tracker.compute_mfu(...) and logs via logger.info when
dist_config.is_main_process()), move the line that sets step_start_time =
time.perf_counter() into that same guarded block (or explicitly comment why
timing should remain unconditional). This references the mfu_tracker variable,
compute_mfu method, step_start_time variable, dist_config.is_main_process(), and
logger.info to locate the code to change.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py`:
- Around line 354-359: The unconditional reset of step_start_time happens even
when mfu_tracker is None; update the block around mfu_tracker so step_start_time
is only reset when mfu_tracker is present (i.e., move the step_start_time =
time.perf_counter() inside the if mfu_tracker is not None branch, after calling
mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info)
to match the optional flow used elsewhere (see mfu_tracker, compute_mfu,
dist_config.is_main_process, logger.info).
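The guarded flow both nitpicks describe can be sketched as a small helper — names here are hypothetical, not the training scripts' actual code:

```python
import time


def log_mfu_and_reset(mfu_tracker, step_start_time: float) -> float:
    """Compute MFU and reset the step timer only while tracking is active."""
    if mfu_tracker is not None:
        step_time = time.perf_counter() - step_start_time
        mfu_tracker.compute_mfu(step_time)
        step_start_time = time.perf_counter()  # reset inside the guard, not unconditionally
    return step_start_time
```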

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 158596f3-a749-4ec2-88e4-a5b4f9b08cc4

📥 Commits

Reviewing files that changed from the base of the PR and between 86c8329 and 27255a1.

📒 Files selected for processing (30)
  • bionemo-recipes/recipes/codonfm_native_te/README.md
  • bionemo-recipes/recipes/codonfm_native_te/flops.py
  • bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/esm2_native_te/README.md
  • bionemo-recipes/recipes/esm2_native_te/flops.py
  • bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/esm2_native_te/perf_logger.py
  • bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp.py
  • bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/esm2_native_te/train_mfsdp.py
  • bionemo-recipes/recipes/llama3_native_te/README.md
  • bionemo-recipes/recipes/llama3_native_te/flops.py
  • bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
  • bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/llama3_native_te/train_ddp.py
  • bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
  • bionemo-recipes/recipes/opengenome2_llama_native_te/flops.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_flops.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
  • ci/scripts/check_copied_files.py

Comment on lines +24 to +41
import sys
from pathlib import Path

import pytest


# Add parent directory so we can import flops
sys.path.insert(0, str(Path(__file__).parent.parent))

from flops import (
MFUTracker,
ModelFLOPsConfig,
compute_flops_analytical,
compute_flops_hyena,
compute_flops_simplified,
estimate_cp_comm_bytes,
from_hf_config,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the test file and its parent structure
find . -name "test_flops.py" -type f 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 329


🏁 Script executed:

# Check if flops.py exists in the parent directories of test files
find . -name "flops.py" -type f 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

# Search for other test files using the same sys.path pattern
rg "sys\.path\.insert.*parent" --type py 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 2354


🏁 Script executed:

# Verify what each test_flops.py imports from "flops"
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py ./bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py; do
  echo "=== $f ==="
  sed -n '1,45p' "$f" | tail -20
done

Repository: NVIDIA/bionemo-framework

Length of output: 1631


🏁 Script executed:

# Check how pytest is typically invoked - look for pytest config or CI configs
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject)" | head -10

Repository: NVIDIA/bionemo-framework

Length of output: 658


🏁 Script executed:

# Check pyproject.toml for pytest configuration
head -200 ./pyproject.toml 2>/dev/null | grep -A 20 "\[tool.pytest"

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# Check pytest.ini directly
cat ./pytest.ini 2>/dev/null

Repository: NVIDIA/bionemo-framework

Length of output: 433


🏁 Script executed:

# Check main pyproject.toml for pytest config
rg "\[tool\.pytest" -A 10 ./pyproject.toml 2>/dev/null

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# View the actual imports in llama3_native_te test_flops.py more completely
head -30 ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 1062


🏁 Script executed:

# Check if there's a conftest.py that might manipulate sys.path at the recipe level
find ./bionemo-recipes/recipes -name "conftest.py" -type f

Repository: NVIDIA/bionemo-framework

Length of output: 706


🏁 Script executed:

# Check the conftest.py files for the affected recipes
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py ./bionemo-recipes/recipes/llama3_native_te/tests/conftest.py ./bionemo-recipes/recipes/esm2_native_te/tests/conftest.py; do
  echo "=== $f ==="
  cat "$f" 2>/dev/null || echo "Not found"
  echo ""
done

Repository: NVIDIA/bionemo-framework

Length of output: 10467


🏁 Script executed:

# Test if flops modules would actually collide - simulate what happens when importing both
python3 << 'EOF'
import sys
from pathlib import Path
import importlib

# Simulate first recipe test importing flops
recipe1_dir = "./bionemo-recipes/recipes/llama3_native_te"
sys.path.insert(0, recipe1_dir)

# This would make "flops" from llama3 available
print("sys.path after recipe1:", sys.path[:3])

# Simulate second recipe test in SAME process (as pytest would do)
recipe2_dir = "./bionemo-recipes/recipes/codonfm_native_te"
# If we just append another dir, we'd have both in sys.path
sys.path.insert(0, recipe2_dir)
print("sys.path after recipe2:", sys.path[:3])

# When we try to import flops now, Python will use the first match in sys.path
# but if flops is already in sys.modules, it uses that cached version
print("\nThis demonstrates the caching risk:")
print("- First test imports flops from llama3_native_te")
print("- Module 'flops' gets cached in sys.modules")
print("- Second test also imports 'flops'")
print("- Python finds 'flops' in sys.modules and returns cached version")
print("- Test uses wrong recipe's flops.py!")
EOF

Repository: NVIDIA/bionemo-framework

Length of output: 557


🏁 Script executed:

# Check if the flops.py files differ between recipes
echo "=== Comparing llama3 vs codonfm flops.py size ==="
wc -l ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 246


🏁 Script executed:

# Check if they're identical or different
diff ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | head -50

Repository: NVIDIA/bionemo-framework

Length of output: 359


🏁 Script executed:

# Verify pytest test discovery behavior - check if tests are typically run together
find ./bionemo-recipes/recipes -name "test_*.py" -o -name "*_test.py" | wc -l

Repository: NVIDIA/bionemo-framework

Length of output: 70


🏁 Script executed:

# Check if CI configuration runs all recipe tests together
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "pytest.*recipes" 2>/dev/null | head -5

Repository: NVIDIA/bionemo-framework

Length of output: 50


🏁 Script executed:

# Check if conftest.py at a higher level might run tests across recipes
find ./bionemo-recipes/recipes -name "conftest.py" | head -5
ls -la ./bionemo-recipes/recipes/conftest.py 2>/dev/null || echo "No conftest.py at recipes level"
ls -la ./bionemo-recipes/conftest.py 2>/dev/null || echo "No conftest.py at bionemo-recipes level"

Repository: NVIDIA/bionemo-framework

Length of output: 453


🏁 Script executed:

# Verify the exact issue: show what happens when flops is imported twice with same name
python3 << 'EOF'
import sys
from pathlib import Path

# Simulate pytest running both tests in sequence in same process
print("=== Scenario: Both test_flops.py files run in same pytest process ===\n")

# Test 1: llama3_native_te/tests/test_flops.py
print("1. Test llama3_native_te imports flops:")
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/llama3_native_te")))
import flops as flops_v1
print(f"   Module: {flops_v1.__file__}")
print(f"   sys.modules['flops']: {sys.modules.get('flops', 'NOT SET')}")

# Simulate cleaning up sys.path for next test
sys.path.pop(0)
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/codonfm_native_te")))

# Test 2: codonfm_native_te/tests/test_flops.py
print("\n2. Test codonfm_native_te tries to import flops:")
# This would normally do: from flops import MFUTracker, ...
# But since 'flops' is already in sys.modules, Python returns the cached version!
import flops as flops_v2
print(f"   Module: {flops_v2.__file__}")
print(f"   Same module? {flops_v1 is flops_v2}")
print(f"\n   RESULT: Test got llama3's flops, not codonfm's!")
EOF

Repository: NVIDIA/bionemo-framework

Length of output: 431


🏁 Script executed:

# Verify if the flops.py files are actually different (not just comments)
diff -w ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | grep -v "^[0-9]" | head -30

Repository: NVIDIA/bionemo-framework

Length of output: 350


🏁 Script executed:

# Check test execution patterns in CI/config files
find . -name "*.sh" -o -name "*.yml" -o -name "*.yaml" | xargs grep -l "test_flops" 2>/dev/null | head -3

Repository: NVIDIA/bionemo-framework

Length of output: 50


Import the local flops.py under a unique module name.

This sys.path + from flops import ... pattern is unsafe when multiple recipe test files run in the same pytest process: the first imported flops module stays cached in sys.modules, so later test suites can end up testing the wrong recipe implementation. While the flops.py files are currently identical (copied from llama3_native_te), this pattern creates a fragile dependency on that equivalence. Please fix this in the source test template and re-sync the copies.

🔧 One safe pattern
-import sys
+import importlib.util
 from pathlib import Path

 import pytest


-# Add parent directory so we can import flops
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from flops import (
-    MFUTracker,
-    ModelFLOPsConfig,
-    compute_flops_analytical,
-    compute_flops_hyena,
-    compute_flops_simplified,
-    estimate_cp_comm_bytes,
-    from_hf_config,
-)
+FLOPS_PATH = Path(__file__).parent.parent / "flops.py"
+SPEC = importlib.util.spec_from_file_location(f"{__name__}_flops", FLOPS_PATH)
+assert SPEC is not None and SPEC.loader is not None
+flops = importlib.util.module_from_spec(SPEC)
+SPEC.loader.exec_module(flops)
+
+MFUTracker = flops.MFUTracker
+ModelFLOPsConfig = flops.ModelFLOPsConfig
+compute_flops_analytical = flops.compute_flops_analytical
+compute_flops_hyena = flops.compute_flops_hyena
+compute_flops_simplified = flops.compute_flops_simplified
+estimate_cp_comm_bytes = flops.estimate_cp_comm_bytes
+from_hf_config = flops.from_hf_config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py` around lines
24 - 41, The tests unsafely import the local flops.py via sys.path and a plain
"from flops import ...", causing module-cache collisions across pytest sessions;
modify the test to load the local flops.py under a unique module name (e.g.,
using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
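The sys.modules collision the review describes, and how per-file importlib loading avoids it, can be demonstrated directly; the temp file here stands in for a recipe's flops.py:

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for a recipe's flops.py.
src = Path(tempfile.mkdtemp()) / "flops.py"
src.write_text("VALUE = 1\n")


def load_isolated(path: Path, module_name: str):
    """Load a file under a unique module name, bypassing the global 'flops' cache."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    assert spec is not None and spec.loader is not None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Each recipe's tests get their own module object; a plain `import flops`
# would instead return whichever copy was imported first.
recipe_a = load_isolated(src, "recipe_a_flops")
recipe_b = load_isolated(src, "recipe_b_flops")
```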

Comment on lines +222 to +227
if mfu_tracker is not None:
step_time = time.perf_counter() - step_start_time
mfu_info = mfu_tracker.compute_mfu(step_time)
if dist_config.is_main_process():
logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
step_start_time = time.perf_counter()

⚠️ Potential issue | 🟡 Minor

Include step_time in the MFU log line for complete per-step telemetry.

You already compute step_time on Line 223, but the emitted message omits it.

Proposed fix
                     if mfu_tracker is not None:
                         step_time = time.perf_counter() - step_start_time
                         mfu_info = mfu_tracker.compute_mfu(step_time)
                         if dist_config.is_main_process():
-                            logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
+                            logger.info(
+                                "Step %d MFU: %.1f%% (%.2f TFLOPS/GPU, step_time=%.3fs)",
+                                step,
+                                mfu_info["mfu"],
+                                mfu_info["tflops_per_gpu"],
+                                step_time,
+                            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 222 -
227, The MFU log omits the per-step duration even though step_time is computed;
update the logging call inside the mfu_tracker block (where mfu_info =
mfu_tracker.compute_mfu(step_time) and inside the dist_config.is_main_process()
check) to include step_time in the logger.info message alongside mfu_info["mfu"]
and mfu_info["tflops_per_gpu"] (e.g., add a placeholder for step_time and pass
step_time as an argument) so each MFU line reports MFU, TFLOPS/GPU, and the
per-step time.

Comment on lines +69 to +75
def detect_gpu_peak_tflops():
"""Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
device_name = torch.cuda.get_device_name(0)
for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
if gpu_key.lower() in device_name.lower():
return tflops, device_name
return None, device_name

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "flops.py" | grep -E "(llama|flops)"

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

cd bionemo-recipes && find . -type f -name "flops.py" 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 221


🏁 Script executed:

cat -n ./recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 135


🏁 Script executed:

cat -n ./bionemo-recipes/recipes/llama3_native_te/flops.py

Repository: NVIDIA/bionemo-framework

Length of output: 32632


Guard CUDA-name lookup before touching torch.cuda.

The function calls torch.cuda.get_device_name(0) without checking whether CUDA is available, causing it to fail on CPU-only hosts instead of falling back to "unknown". _cli_gpu_info() and MFUTracker(..., peak_tflops=None) will crash in non-CUDA environments. Check torch.cuda.is_available() (per PyTorch documentation) before invoking CUDA device queries.

Minimal fix
 def detect_gpu_peak_tflops():
     """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
+    if not torch.cuda.is_available():
+        return None, "unknown"
     device_name = torch.cuda.get_device_name(0)
     for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
         if gpu_key.lower() in device_name.lower():
             return tflops, device_name
     return None, device_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 69 - 75,
detect_gpu_peak_tflops currently calls torch.cuda.get_device_name(0) unguarded
and will crash on CPU-only hosts; update detect_gpu_peak_tflops to first check
torch.cuda.is_available() and only call torch.cuda.get_device_name(0) when CUDA
is available, otherwise return (None, "unknown") so callers like _cli_gpu_info
and MFUTracker(..., peak_tflops=None) safely receive a fallback; keep the
GPU_PEAK_TFLOPS_BF16 lookup logic intact when a device name is present.

Comment on lines +104 to +140
def from_hf_config(config_dict, **overrides):
"""Create ModelFLOPsConfig from an HF-compatible config dict.

Auto-detects architecture:
- GQA vs MHA: from num_key_value_heads (absent = MHA)
- Gated MLP (3 proj) vs standard FFN (2 proj): from model_type
- LM head: from vocab_size > 0

Args:
config_dict: Dict with standard HF config keys (hidden_size, num_hidden_layers, etc.).
Works with config.json dicts, config.to_dict(), or MODEL_PRESETS dicts.
**overrides: Explicit overrides for any field (e.g., num_mlp_projections=3).
"""
h = config_dict["hidden_size"]
n_heads = config_dict["num_attention_heads"]
n_kv = config_dict.get("num_key_value_heads", n_heads)
vocab = config_dict.get("vocab_size", 0)
model_type = config_dict.get("model_type", "")

# Detect gated MLP (3 projections) vs standard FFN (2 projections).
# Llama/Mistral/Qwen use SwiGLU (gate + up + down = 3 projections).
# ESM2/CodonFM/Geneformer/BERT use standard FFN (up + down = 2 projections).
num_mlp_proj = 3 if model_type in GATED_MLP_MODEL_TYPES else 2

kwargs = {
"hidden_size": h,
"num_hidden_layers": config_dict["num_hidden_layers"],
"num_attention_heads": n_heads,
"num_kv_heads": n_kv,
"head_dim": h // n_heads,
"intermediate_size": config_dict["intermediate_size"],
"num_mlp_projections": num_mlp_proj,
"vocab_size": vocab,
"has_lm_head": vocab > 0,
}
kwargs.update(overrides)
return ModelFLOPsConfig(**kwargs)

⚠️ Potential issue | 🟠 Major

Recompute derived fields after applying overrides.

kwargs.update(overrides) can leave head_dim and has_lm_head inconsistent with the overridden values. For example, vocab_size=0 still keeps the precomputed has_lm_head unless the caller knows to override both, and overriding num_attention_heads leaves the old head_dim.

♻️ Suggested fix
 def from_hf_config(config_dict, **overrides):
@@
-    kwargs = {
-        "hidden_size": h,
-        "num_hidden_layers": config_dict["num_hidden_layers"],
-        "num_attention_heads": n_heads,
-        "num_kv_heads": n_kv,
-        "head_dim": h // n_heads,
-        "intermediate_size": config_dict["intermediate_size"],
-        "num_mlp_projections": num_mlp_proj,
-        "vocab_size": vocab,
-        "has_lm_head": vocab > 0,
-    }
-    kwargs.update(overrides)
-    return ModelFLOPsConfig(**kwargs)
+    kwargs = {
+        "hidden_size": h,
+        "num_hidden_layers": config_dict["num_hidden_layers"],
+        "num_attention_heads": n_heads,
+        "num_kv_heads": n_kv,
+        "intermediate_size": config_dict["intermediate_size"],
+        "num_mlp_projections": num_mlp_proj,
+        "vocab_size": vocab,
+    }
+    kwargs.update(overrides)
+    kwargs.setdefault("head_dim", kwargs["hidden_size"] // kwargs["num_attention_heads"])
+    kwargs.setdefault("has_lm_head", kwargs["vocab_size"] > 0)
+    return ModelFLOPsConfig(**kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 104 - 140,
from_hf_config currently computes derived fields (head_dim, has_lm_head) before
kwargs.update(overrides), which allows overrides like num_attention_heads or
vocab_size to leave derived values stale; update the function so overrides are
applied before computing derived fields (or apply kwargs.update(overrides) and
then recompute head_dim = hidden_size // num_attention_heads and has_lm_head =
vocab_size > 0 afterwards), ensuring derived values (head_dim, has_lm_head and
any other fields dependent on overrides) reflect the final values passed to
ModelFLOPsConfig; keep references to from_hf_config, kwargs.update(overrides),
head_dim, has_lm_head, num_attention_heads, vocab_size and ModelFLOPsConfig to
locate the changes.

Comment on lines +453 to +483
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
"""Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
if world_size <= 1:
return 0.0

rank = dist.get_rank()
tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
peer = 1 - rank

for _ in range(5):
if rank == 0:
dist.send(tensor, dst=peer)
dist.recv(tensor, src=peer)
else:
dist.recv(tensor, src=peer)
dist.send(tensor, dst=peer)
torch.cuda.synchronize()

dist.barrier()
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(num_iters):
if rank == 0:
dist.send(tensor, dst=peer)
else:
dist.recv(tensor, src=peer)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

data_bytes = tensor.nelement() * tensor.element_size()
return num_iters * data_bytes / elapsed / 1e9

⚠️ Potential issue | 🟠 Major

Reject world sizes other than 2 in the bandwidth helper.

peer = 1 - rank only works for ranks 0 and 1. If this gets launched with more than two processes, ranks >= 2 compute invalid peers and can hang the send/recv loop. Please fail fast here instead of assuming the caller followed the docstring.

🔒 Suggested guard
 def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
     """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
     if world_size <= 1:
         return 0.0
+    if world_size != 2:
+        raise ValueError("measure_bus_bandwidth currently supports exactly 2 ranks")

     rank = dist.get_rank()
     tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
-    peer = 1 - rank
+    peer = 1 if rank == 0 else 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 453 - 483, In
measure_bus_bandwidth, reject world sizes != 2 instead of assuming peers by
updating the initial guard (world_size) to raise an error when world_size is not
exactly 2; ensure you check world_size at the top of the function and raise a
ValueError (or similar) with a clear message referencing the peer calculation
using peer = 1 - rank so callers know only 2-process runs are supported; keep
the rest of the logic unchanged once the guard ensures rank is 0 or 1.

Comment on lines +486 to +507
def load_model_config(config_path):
"""Load model config dict from a local path or HuggingFace model ID.

Supports:
- Local directory: ./model_configs/lingua-1B (reads config.json inside)
- Local file: ./model_configs/lingua-1B/config.json
- HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
"""
import json
from pathlib import Path

path = Path(config_path)
if path.is_dir():
path = path / "config.json"
if path.exists():
return json.loads(path.read_text())

# Fall back to HuggingFace Hub
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
return hf_config.to_dict()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the file
find . -type f -name "flops.py" | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 285


🏁 Script executed:

# Also, let's search for the specific function
rg "def load_model_config" -A 25

Repository: NVIDIA/bionemo-framework

Length of output: 8820


🏁 Script executed:

# Check if there are any calls to load_model_config in the codebase
rg "load_model_config\(" --type py

Repository: NVIDIA/bionemo-framework

Length of output: 1391


🏁 Script executed:

# Verify the actual line numbers in the target file
sed -n '486,507p' bionemo-recipes/recipes/llama3_native_te/flops.py | cat -n

Repository: NVIDIA/bionemo-framework

Length of output: 947


🏁 Script executed:

# Check that the docstring and function signature comply with line length (119 chars)
head -n 510 bionemo-recipes/recipes/llama3_native_te/flops.py | tail -n 25 | cat -n

Repository: NVIDIA/bionemo-framework

Length of output: 1050


Remove hardcoded trust_remote_code=True and add explicit opt-in parameters.

AutoConfig.from_pretrained(config_path, trust_remote_code=True) unconditionally enables execution of arbitrary code from the Hugging Face Hub. Per Hugging Face documentation, trust_remote_code defaults to False and should only be enabled as an explicit opt-in for trusted sources. Add keyword-only parameters to make this configurable and allow revision pinning:

Suggested fix
-def load_model_config(config_path):
+def load_model_config(config_path, *, trust_remote_code=False, revision=None):
     """Load model config dict from a local path or HuggingFace model ID.
 
     Supports:
     - Local directory: ./model_configs/lingua-1B (reads config.json inside)
     - Local file: ./model_configs/lingua-1B/config.json
     - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
     """
@@ -494,5 +494,7 @@ def load_model_config(config_path):
     from transformers import AutoConfig
 
-    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+    hf_config = AutoConfig.from_pretrained(
+        config_path,
+        trust_remote_code=trust_remote_code,
+        revision=revision,
+    )
     return hf_config.to_dict()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 486 - 507,
The load_model_config function currently hardcodes trust_remote_code=True when
calling AutoConfig.from_pretrained; change it to accept explicit keyword-only
parameters (e.g., *, trust_remote_code: bool = False, revision: Optional[str] =
None) so callers must opt-in to remote code execution and can pin a revision,
then pass those parameters into AutoConfig.from_pretrained(config_path,
trust_remote_code=trust_remote_code, revision=revision) and update any call
sites accordingly.

Comment on lines +607 to +647
parser = argparse.ArgumentParser()
parser.add_argument("command")
parser.add_argument("--config-path", default="./model_configs/lingua-1B")
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--seq-len", type=int, default=4096)
parser.add_argument("--formula", default="analytical", choices=["analytical", "simplified"])
parser.add_argument("--num-gpus", type=int, default=1, help="Number of GPUs for per-GPU FLOPs and comm estimates")
parser.add_argument("--cp-size", type=int, default=1, help="Context parallelism size for comm overhead estimate")
parser.add_argument("--p2p-bw", type=float, default=6.0, help="P2P bandwidth in GB/s for comm time estimate")
args = parser.parse_args()

cfg_dict = load_model_config(args.config_path)
config = from_hf_config(cfg_dict)
b, s = args.batch_size, args.seq_len

print(
f"Config: H={config.hidden_size}, L={config.num_hidden_layers},"
f" n_heads={config.num_attention_heads}, n_kv={config.num_kv_heads},"
f" I={config.intermediate_size}, V={config.vocab_size}"
)
print(
f"MLP: {config.num_mlp_projections} projections"
f" ({'SwiGLU/GLU' if config.num_mlp_projections == 3 else 'standard FFN'})"
)
print(f"Batch: B={b}, S={s}, GPUs={args.num_gpus}, CP={args.cp_size}")
print()

simplified = compute_flops_simplified(b, s, config.hidden_size, config.num_hidden_layers, config.vocab_size)
analytical, breakdown, lm_head = compute_flops_analytical(config, b, s)

print(f"{'Method':<24} {'FLOPs/step':>14} {'Per-GPU':>14} {'Exact':>30}")
print("-" * 86)
for name, flops in [("Simplified (README)", simplified), ("Analytical", analytical)]:
per_gpu = flops // max(args.num_gpus, 1)
print(f"{name:<24} {format_flops(flops):>14} {format_flops(per_gpu):>14} {format_flops_exact(flops):>30}")

if simplified != analytical:
diff = analytical - simplified
print(f"\nDifference: {format_flops_exact(diff)} ({diff / simplified * 100:+.2f}%)")
else:
print("\nFormulas agree exactly for this config.")

⚠️ Potential issue | 🟡 Minor

--formula is currently a no-op.

The parser accepts --formula, but this code always computes and prints both simplified and analytical FLOPs and never branches on args.formula. Either honor the flag or remove it, otherwise the CLI advertises behavior it does not implement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 607 - 647,
The CLI flag args.formula is unused: either honor it or remove it; to fix,
branch on args.formula (values "simplified" or "analytical") before calling
compute_flops_simplified and compute_flops_analytical so you only compute and
print the requested method. Specifically, use args.formula to decide whether to
call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
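If the flag is kept, the branch can be small. A hypothetical sketch of the first option; the `select_rows` helper and the `both` default are assumptions for illustration, not current CLI behavior:

```python
def select_rows(formula: str, simplified: int, analytical: int) -> list:
    """Map the --formula choice to the (name, flops) rows the table prints."""
    table = {
        "simplified": [("Simplified (README)", simplified)],
        "analytical": [("Analytical", analytical)],
        "both": [("Simplified (README)", simplified), ("Analytical", analytical)],
    }
    if formula not in table:
        raise ValueError(f"Unknown formula: {formula!r}. Use 'simplified', 'analytical', or 'both'.")
    return table[formula]
```

The "Difference" line would then only print when both methods were requested.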

Comment on lines +151 to +157
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
Contributor

⚠️ Potential issue | 🟠 Major

DDP MFU is underestimated under gradient accumulation.

Line 206 measures full-step duration, but Line 153 configures FLOPs for a single micro-batch.
This biases MFU low by a factor of grad_acc_steps.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size},
         )

Also applies to: 205-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py` around lines 151 -
157, The MFUTracker is being configured with per-micro-batch FLOPs which
underestimates utilization when gradient accumulation is used; update the
MFUTracker initialization (the MFUTracker(...) call) to account for gradient
accumulation by multiplying the effective batch_size (or otherwise scaling
FLOPs) by the gradient accumulation factor (use args.gradient_accumulation_steps
or args.grad_acc_steps as present in the training args) so the tracker uses
full-step work rather than a single micro-batch, and apply the same correction
to the related MFU/config calculations around this block.
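The invariant behind this finding (and the identical ones in train_fsdp2.py and train_fsdp2_cp.py) fits in one line; `flops_batch_size` is a hypothetical helper, with argument names taken from the suggested diff:

```python
def flops_batch_size(micro_batch_size: int, grad_acc_steps: int) -> int:
    """Sequences one rank processes per optimizer step.

    The step timer spans all grad-acc micro-batches, so the FLOPs side of
    MFU must count them too; otherwise MFU is low by a factor of grad_acc_steps.
    """
    return micro_batch_size * max(grad_acc_steps, 1)
```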

Comment on lines +187 to +193
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)
Contributor

⚠️ Potential issue | 🟠 Major

MFU math misses gradient accumulation scaling.

Line 244 times a full optimizer step, but Line 189 sets FLOPs batch size to one micro-batch.
With grad_acc_steps > 1, MFU is underestimated.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
         )

Also applies to: 243-248

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py` around lines 187
- 193, The MFUTracker is using args.dataset.micro_batch_size as the FLOPs batch
size which ignores gradient accumulation and underestimates MFU; update the
mfu_tracker initialization (symbol: MFUTracker / variable: mfu_tracker) to use
the effective batch size = args.dataset.micro_batch_size * args.grad_acc_steps
(or otherwise multiply by grad_acc_steps) when setting the batch_size parameter,
and apply the same change wherever MFUTracker is constructed (also the other
block around the second occurrence).

Comment on lines +165 to +170
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
Contributor

⚠️ Potential issue | 🟠 Major

MFU is underreported when grad_acc_steps > 1.

Line 219 computes wall time for a full optimizer step, but Lines 167-170 initialize FLOPs with a single micro-batch.
That underestimates FLOPs/MFU by a factor of grad_acc_steps.

Suggested fix
         mfu_tracker = MFUTracker(
             config=from_hf_config(config.to_dict()),
-            batch_size=args.dataset.micro_batch_size,
+            batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
             seq_len=args.dataset.max_seq_length,
             num_gpus=dist_config.world_size,
             parallelism={"dp": dist_config.world_size},
         )

Also applies to: 218-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 165 -
170, MFU is undercounted because MFUTracker is initialized with a single
micro-batch; multiply the batch work by gradient accumulation steps so FLOPs/
MFU reflect a full optimizer step. Update the MFUTracker initialization
(mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.

@gagank1 gagank1 requested a review from pstjohn April 13, 2026 21:35
"bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py",
],
# FLOPs / MFU module - synced across recipes
"bionemo-recipes/recipes/llama3_native_te/flops.py": [
Collaborator

Do we have to copy these to each recipe? It's just to calculate FLOPs? Maybe we just put it in one place and point folks at it.

This should maybe live with the models/ definitions though? Is this just printing out total FLOPs for a given model size?

Collaborator Author

flops.py has two usage paths:

  1. FLOPs counting for a given model config, and comms bandwidth/overhead calculation, via the CLI
  2. MFU logging at each step from the recipes' train scripts, enabled with the log_mfu=true config

so it doesn't belong purely with the model defs.

Collaborator

@trvachov trvachov left a comment

Across all the added code, I think you shouldn't use batch_size in isolation as a variable name; or, at least, if you do, make it explicit that the function doesn't care whether it's the micro or global batch size. Otherwise, use micro_batch_size or global_batch_size, because without this it gets very hard to check whether the distributed FLOPs math is right.

The comm overhead calculations here have a number of issues, including assumptions about model precision and about FSDP2 communication patterns. Adding comm overhead calculations is a reasonable idea, but I think it should be a separate PR. Each PR should be one idea -> one PR.

There is a flops.py CLI, but I ran torchrun --nproc_per_node=2 flops.py bandwidth as written in the docstring and it just hangs. This was on 8x A100 80GB PCIe, pytorch:26.02-py3 container. Does this work for you?

Regarding overall code structure: indeed there is a lot of duplicated code, but that is also to some degree the philosophy of recipes. @pstjohn, can you help with some guidance here? Should flops.py really be folded into perf_logger.py to reduce duplicate files? I also think the FLOPs metrics should likely be recorded in W&B just like all of the other metrics we have.

else:
    raise ValueError(f"Unknown formula: {formula!r}. Use 'analytical', 'simplified', or 'hyena'.")

self.per_gpu_flops = self.total_flops // max(num_gpus, 1)
Collaborator

Why divide by num_gpus? You mentioned you tested this with CP and num_gpus > 2 and got reasonable numbers?

    16*H^2 for 2 MLP projections with I=4H.
    """
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)
Collaborator

Does this work with gradient accumulation?
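For reference, the quoted return line is linear in b (every term carries exactly one factor of b), so it composes with gradient accumulation as long as the caller passes the effective per-step batch. A self-contained reconstruction of the snippet:

```python
def compute_flops_simplified(batch_size, seq_len, hidden_size, num_layers, vocab_size):
    """Per-layer matmul FLOPs (24*b*s*h^2) plus attention-score FLOPs (4*b*s^2*h),
    times 3 for forward + backward, plus the LM-head term (6*b*s*h*V)."""
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)


# Linearity in b: doubling the batch doubles the FLOPs, so grad accumulation
# is handled by scaling batch_size before calling this.
assert compute_flops_simplified(2, 128, 64, 4, 1000) == 2 * compute_flops_simplified(1, 128, 64, 4, 1000)
```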

if mfu_tracker is not None and perf_logger.last_step_time is not None:
    mfu_info = mfu_tracker.compute_mfu(perf_logger.last_step_time)
    if dist_config.is_main_process():
        logger.info(
Collaborator

Does this take the logging_frequency variable into account? It seems like this would spam the log every step.
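A cheap way to respect the cadence is to gate the MFU line on the same counter; `should_log_mfu` and the bare `logging_frequency` argument here are illustrative names, not the PR's actual API:

```python
def should_log_mfu(step: int, logging_frequency: int) -> bool:
    """Emit the MFU line only on steps that hit the configured cadence;
    logging_frequency=1 reproduces every-step logging."""
    return logging_frequency >= 1 and step % logging_frequency == 0
```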

if dp_size > 1:
    # FSDP reduce-scatter estimate: ~2 * model_params * dtype_bytes * (dp-1)/dp
    model_params = _estimate_model_params(self.config)
    total += 2 * model_params * 2 * (dp_size - 1) // dp_size
Collaborator

Ok... a number of concerns with this bit of code:

  1. The second "2" --> is that supposed to be dtype_bytes = 2? That works for bf16/fp16, but we sometimes have weights in FP32.
  2. The comment says this is an FSDP reduce-scatter comm estimate, but isn't step 1 of FSDP2 an all-gather for each sharded module? I'm just not sure which distributed comm steps "~2 * model_params * dtype_bytes * (dp-1)/dp" is trying to model.
  3. Is the assumption here that you can sum up all the per-module FSDP2 comms into "model_params"? FSDP2 does per-module sharding, so I'm trying to understand how you estimate the comms without summing over the modules that are being sharded.

Collaborator Author

  1. yes this assumes bf16, fixing that to be generalized
  2. the comment is wrong, fixing that.
  3. yes, although it can be a slight overestimation since it assumes every param is sharded.

For module i with p_i parameters, the total bytes across all modules, for both directions:

sum_over_modules(2 * p_i * dtype_bytes * (dp-1)/dp)
= 2 * dtype_bytes * (dp-1)/dp * sum(p_i)
= 2 * dtype_bytes * (dp-1)/dp * total_params
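That closed form can be checked in a few lines; `fsdp_comm_bytes_per_step` is an illustrative helper (with dtype_bytes generalized as promised in point 1), not code from the PR:

```python
def fsdp_comm_bytes_per_step(total_params: int, dp_size: int, dtype_bytes: int = 2) -> int:
    """Per-rank all-gather (forward) + reduce-scatter (backward) traffic,
    assuming every parameter is sharded across dp_size ranks.

    Each collective moves total_params * dtype_bytes * (dp_size - 1) / dp_size
    bytes, hence the leading factor of 2 for the two directions.
    """
    if dp_size <= 1:
        return 0
    return 2 * total_params * dtype_bytes * (dp_size - 1) // dp_size
```

For a 1B-parameter bf16 model at dp=8 this gives 2 * 1e9 * 2 * 7/8 = 3.5e9 bytes, i.e. about 3.5 GB per optimizer step per rank.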

    Returns:
        Dict with comm_bytes, estimated_comm_time, comm_pct.
    """
    bw = measured_bw_gbps or 6.0  # Default ~PCIe Gen3 x8
Collaborator

Why PCIe Gen3? Isn't that >15 years old?

Also -- have you done the measurement on PCIe Gen3? I ask because, if this 6.0 is a "datasheet number", it might be very hard to reproduce in practice, especially through PyTorch, because of launch overhead. You might need a very specialized environment to measure max PCIe speeds.



def _cli_bandwidth():
    """Measure P2P bandwidth. Launch with: torchrun --nproc_per_node=2 flops.py bandwidth."""
Collaborator

Does this CLI command run for you? It just hangs indefinitely for me...

Collaborator Author

Yes, it runs in around 3 seconds for me. Are you seeing the same issue on another machine?

Collaborator

Yep, infinite hang. Log:

root@2eafaa0982e9:/workspace/bionemo-framework/bionemo-recipes/recipes/llama3_native_te# torchrun --nproc_per_node=2 flops.py bandwidth
W0415 13:48:45.772000 309 torch/distributed/run.py:851]
W0415 13:48:45.772000 309 torch/distributed/run.py:851] *****************************************
W0415 13:48:45.772000 309 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0415 13:48:45.772000 309 torch/distributed/run.py:851] *****************************************

Measuring P2P bandwidth between 2 GPUs...
  GPU 0: NVIDIA A100 80GB PCIe
  GPU 1: NVIDIA A100 80GB PCIe
[rank0]:[W415 13:48:48.106866174 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2.  In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank1]:[W415 13:48:48.112559519 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2.  In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank0]:[W415 13:48:48.352708490 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2.  In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank1]:[W415 13:48:48.353136693 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2.  In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
^CW0415 13:57:20.219000 309 torch/distributed/elastic/agent/server/api.py:739] Received 2 death signal, shutting down workers
W0415 13:57:20.220000 309 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 376 closing signal SIGINT
W0415 13:57:20.221000 309 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 377 closing signal SIGINT
^Z
[1]+  Stopped                 torchrun --nproc_per_node=2 flops.py bandwidth

@pstjohn
Collaborator

pstjohn commented Apr 15, 2026

A couple of general thoughts:

  1. The simplicity of our training scripts is incredibly important; even as it stands today I'm not happy with how much has crept in there. These need to be accessible and readable to folks coming to the repository for the first time, and to be basic scaffolds agents can build from. I really don't think a FLOPs calculator like this needs to touch every single training script in our recipes. At most it would go in the perf logger, but even there, I'd rather see this be something like https://huggingface.co/docs/accelerate/usage_guides/model_size_estimator, living as a JavaScript app in our docs, not necessarily something that runs alongside model training.
  2. Logging FLOPs/MFU directly is notoriously a performance footgun, especially if we end up re-calculating model FLOPs every iteration. Are we 100% sure we're not doing this?
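On point 2, the usual mitigation is to make the tracker a pure cache: the FLOPs count is fixed for a given (config, batch, seq_len), so it is computed once at construction and each per-step call is just a division. A minimal sketch of that pattern, not the PR's actual MFUTracker (the constructor arguments, including a caller-supplied peak_flops_per_gpu instead of GPU auto-detection, are assumptions):

```python
class CachedMFUTracker:
    """Compute model FLOPs once up front; per-step cost is two divisions,
    not a walk over the model config."""

    def __init__(self, flops_per_step: float, num_gpus: int, peak_flops_per_gpu: float):
        self.per_gpu_flops = flops_per_step / max(num_gpus, 1)  # cached once
        self.peak_flops_per_gpu = peak_flops_per_gpu

    def compute_mfu(self, step_time_s: float) -> dict:
        achieved = self.per_gpu_flops / step_time_s
        return {
            "tflops_per_gpu": achieved / 1e12,
            "mfu_pct": 100.0 * achieved / self.peak_flops_per_gpu,
        }
```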
