Generalize MFU/FLOPs module across recipes with log_mfu training hook#1548
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
📝 Walkthrough

This PR introduces Model FLOPs Utilization (MFU) tracking across multiple native training recipes (LLaMA3, ESM2, CodonFM, OpenGenome2).
Sequence Diagram

```mermaid
sequenceDiagram
    participant TrainScript as Training Script
    participant MFUTracker as MFUTracker
    participant FlopsModule as FLOPs Compute
    participant GPU as GPU
    TrainScript->>MFUTracker: __init__(config, batch_size, seq_len, ...)
    MFUTracker->>FlopsModule: from_hf_config(model_config)
    FlopsModule-->>MFUTracker: ModelFLOPsConfig
    MFUTracker->>FlopsModule: compute_flops_analytical/simplified()
    FlopsModule-->>MFUTracker: total_training_flops
    MFUTracker->>GPU: detect_gpu_peak_tflops()
    GPU-->>MFUTracker: peak_tflops_bf16
    loop Each Optimizer Step
        TrainScript->>TrainScript: measure step_time (perf_counter)
        TrainScript->>MFUTracker: compute_mfu(step_time)
        MFUTracker->>MFUTracker: mfu = (total_flops / step_time / peak_tflops)
        MFUTracker-->>TrainScript: {mfu_percent, tflops_per_gpu}
        TrainScript->>TrainScript: log metrics
    end
```
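The arithmetic inside the loop above reduces to a few divisions. A minimal, self-contained sketch (the function name and dict keys are illustrative, not the PR's actual `MFUTracker` API):

```python
def compute_mfu(step_flops: float, step_time_s: float, num_gpus: int,
                peak_tflops_per_gpu: float) -> dict:
    """MFU = achieved TFLOPS per GPU / peak TFLOPS per GPU, as a percentage."""
    achieved_tflops_per_gpu = step_flops / step_time_s / num_gpus / 1e12
    return {
        "mfu": 100.0 * achieved_tflops_per_gpu / peak_tflops_per_gpu,
        "tflops_per_gpu": achieved_tflops_per_gpu,
    }
```

For example, a step performing 1e15 FLOPs in 1 s on one GPU with a 1,000 TFLOPS peak runs at 100% MFU.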
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
    hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
    hf_config_meta._attn_implementation = "eager"
```
This is going to be the slowest possible attention, can you use fa2 or sdpa?
This was the old version of the PR; it no longer includes the HF comparison.
- Add compare_mfu_validate.py: golden value tests comparing CP vs non-CP execution for both TE and HF models, validating loss, logits (cosine sim > 0.99), and gradients (cosine sim > 0.8), following the pattern from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to context_parallel buffers, causing each rank to auto-generate [0..S/CP-1] instead of the correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with the actual training implementation (identical counts on meta device)
- Add exact FLOPs output (full integers with commas) alongside abbreviated values in both single-GPU and multi-GPU scripts
- Switch bandwidth measurement from all-reduce to all-gather for more accurate pure data movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Replaces all-gather with explicit send/recv between rank 0 and rank 1, matching CP ring attention's actual communication pattern. Measures 6.6 GB/s unidirectional on PCIe Gen 3 x8 (vs 3.2 GB/s all-gather, 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
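The measurement this commit describes can be sketched as follows. This is an illustrative reconstruction under stated assumptions (NCCL backend, a 256 MB buffer, exactly two ranks launched via torchrun), not the PR's actual benchmark code:

```python
import time


def bandwidth_gb_s(num_bytes: int, elapsed_s: float) -> float:
    """Unidirectional bandwidth from bytes moved and wall-clock time."""
    return num_bytes / elapsed_s / 1e9


def measure_send_recv(tensor_mb: int = 256, iters: int = 10) -> float:
    """Run under torchrun with exactly 2 ranks: rank 0 sends, rank 1 receives."""
    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")
    rank = dist.get_rank()
    peer = 1 - rank  # only valid for a 2-rank job
    buf = torch.empty(tensor_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            dist.send(buf, dst=peer)
        else:
            dist.recv(buf, src=peer)
    torch.cuda.synchronize()
    return bandwidth_gb_s(iters * buf.numel(), time.perf_counter() - start)
```

Only `bandwidth_gb_s` is pure computation; `measure_send_recv` needs a 2-GPU torchrun launch.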
Adds subcommands for using MFU utilities outside of full training scripts:
- gpu-info: Print GPU name, detected peak TFLOPS, and known GPU table
- flops: Compute FLOPs from a model config (README + first-principles)
- cp-comm: Estimate CP ring attention communication volume
- bandwidth: Measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
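The cp-comm estimate can be derived from first principles: in ring attention, each rank circulates its K/V chunk to the other cp_size - 1 ranks once per layer. The sketch below is my approximation; the real `estimate_cp_comm_bytes` signature and coefficients may differ:

```python
def estimate_cp_comm_bytes(
    seq_len: int,
    cp_size: int,
    num_kv_heads: int,
    head_dim: int,
    num_layers: int,
    dtype_bytes: int = 2,  # bf16
) -> int:
    """Approximate bytes each rank sends per forward pass in CP ring attention."""
    # K and V chunks held by one rank: 2 tensors of [seq_len/cp_size, num_kv_heads, head_dim]
    kv_chunk_bytes = 2 * (seq_len // cp_size) * num_kv_heads * head_dim * dtype_bytes
    # Each chunk makes (cp_size - 1) hops around the ring, once per layer
    return num_layers * (cp_size - 1) * kv_chunk_bytes
```

With cp_size = 1 the estimate is zero, since no ring communication happens.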
Add architecture-independent flops.py module (source: models/esm2/flops.py) supporting Llama (GQA+SwiGLU), ESM2, CodonFM (MHA+GELU), and Evo2 (Hyena).

Key additions:
- ModelFLOPsConfig dataclass with from_hf_config() auto-detection
- Analytical, simplified (README), and Hyena FLOPs formulas
- MFUTracker class for training script integration
- Communication overhead estimation (CP ring attention, FSDP)
- CLI: gpu-info, flops, cp-comm, bandwidth

Training integration (gated by log_mfu: false in hydra config):
- 11 training scripts across 4 native_te recipes
- llama3_native_te (3), esm2_native_te (5), codonfm_native_te (1), opengenome2_llama_native_te (2)

Removes superseded Llama-specific compare_mfu*.py benchmark scripts. Moves first_principles.md to models/esm2/ alongside the source module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
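As background on the "simplified (README)" formula family: such estimates usually reduce to the classic 6 × params × tokens rule for training FLOPs. The sketch below is that generic rule with a rough decoder parameter count; it is not necessarily identical to this module's `compute_flops_simplified`:

```python
def simplified_train_flops(batch_size: int, seq_len: int, hidden_size: int,
                           num_layers: int, vocab_size: int) -> int:
    """Rough training FLOPs per step: 6 * params * tokens (fwd ~2, bwd ~4)."""
    # Dense transformer block params: ~12 * h^2 per layer (attention + MLP),
    # plus the embedding / LM-head matrix.
    params = 12 * num_layers * hidden_size**2 + vocab_size * hidden_size
    tokens = batch_size * seq_len
    return 6 * params * tokens
```

The analytical formula in the module additionally accounts for attention score FLOPs, GQA, and SwiGLU widths, which this rule deliberately ignores.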
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Fix RTX 4090 (330→165) and RTX 3090 (142→71): were using sparse values
- Remove A5000, RTX 4090, RTX 3090 (not relevant for data center workloads)
- Add B300 and GB300 Blackwell Ultra at 2,500 TFLOPS dense BF16
- Add authoritative NVIDIA source URLs for all GPU specs as comments

Sources: B300/GB300: https://www.nvidia.com/en-us/data-center/gb300-nvl72/ (360 PFLOPS BF16 sparse / 72 GPUs / 2 = 2,500 TFLOPS dense per GPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
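The GPU table amounts to a substring lookup on the CUDA device name, plus the CPU-only guard requested later in the review. A sketch with a few illustrative dense-BF16 entries (the B300 figure comes from this commit; the A100/H100 values are common spec-sheet numbers, verify them before relying on them):

```python
# Illustrative dense BF16 peak TFLOPS; verify against NVIDIA spec sheets.
GPU_PEAK_TFLOPS_BF16 = {
    "A100": 312,
    "H100": 989,
    "B300": 2500,  # per this PR: 360 PFLOPS sparse / 72 GPUs / 2
}


def lookup_peak_tflops(device_name: str, table: dict = GPU_PEAK_TFLOPS_BF16):
    """Substring match against the known-GPU table; None when unknown."""
    for known, tflops in table.items():
        if known in device_name:
            return tflops
    return None


def detect_gpu_peak_tflops(table: dict = GPU_PEAK_TFLOPS_BF16):
    """Return (peak_tflops, device_name); (None, 'unknown') without CUDA."""
    try:
        import torch
    except ImportError:
        return None, "unknown"
    if not torch.cuda.is_available():
        return None, "unknown"
    name = torch.cuda.get_device_name(0)
    return lookup_peak_tflops(name, table), name
```

Splitting the lookup from the CUDA query keeps the table logic unit-testable on CPU-only hosts.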
Remove functions that were only used by the now-deleted compare_mfu*.py standalone benchmark scripts:
- compute_flops_first_principles(), compute_flops_readme() (backward compat wrappers)
- measure_step_time(), split_for_cp_bshd(), count_flops_with_model(), cleanup_model()
- Unused imports: gc, nullcontext, FlopCounterMode

Retained: MFUTracker, from_hf_config, compute_flops_analytical, compute_flops_simplified, compute_flops_hyena, CLI, and all utilities needed by the training script log_mfu integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
- Move flops.py source from models/esm2/ to recipes/llama3_native_te/ (it's a training utility, not a model utility)
- Add test_flops.py with 27 tests covering config auto-detection, analytical/simplified/hyena formulas, MFUTracker, and CP comm estimation
- Update check_copied_files.py: source is now llama3_native_te, copied to esm2, codonfm, opengenome2 recipes (both flops.py and test_flops.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Document the log_mfu=true flag and flops.py CLI utilities in all 4 native_te recipe READMEs: llama3, esm2, codonfm, opengenome2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@coderabbitai full review
✅ Actions performed: Full review triggered.
/ok to test 27255a1 |
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bionemo-recipes/recipes/esm2_native_te/perf_logger.py (1)
Lines 108-120: ⚠️ Potential issue | 🟠 Major

Reset `last_step_time` on non-logging steps to prevent stale MFU inputs.

`self.last_step_time` is only updated inside the logging-frequency branch, so it can retain an old value. Since training code reads `perf_logger.last_step_time` each step, MFU can be computed with stale timing.

Proposed fix
```diff
 with torch.no_grad():
+    self.last_step_time = None
     # FSDP2's clip_grad_norm_ returns a DTensor; convert to local tensor for torchmetrics compatibility.
     if isinstance(grad_norm, DTensor):
         grad_norm = grad_norm.to_local()
@@
     if step % self.logging_frequency == 0 and step > 0:
@@
         step_time = elapsed_time / self.logging_frequency
         self.last_step_time = step_time
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py` around lines 108 - 120, The perf logger currently only updates self.last_step_time inside the logging branch guarded by self.logging_frequency, causing stale values on non-logging steps; update the step handler in perf_logger.py so that when a step is not a logging step you explicitly reset self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs, keeping the existing logic that updates self.previous_step_time and last_step_time when (step % self.logging_frequency == 0).
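The fix this comment asks for can be sketched as a small state machine: `last_step_time` is populated only on logging steps and explicitly cleared otherwise, so MFU consumers never read a stale value. The class below is illustrative, not the recipe's actual `PerfLogger`:

```python
import time


class PerfLoggerSketch:
    """Illustrative timing logic: clear last_step_time on non-logging steps."""

    def __init__(self, logging_frequency: int = 10):
        self.logging_frequency = logging_frequency
        self.previous_time = time.perf_counter()
        self.last_step_time = None

    def log_step(self, step: int) -> None:
        if step % self.logging_frequency == 0 and step > 0:
            now = time.perf_counter()
            # Average wall time over the interval since the last logging step
            self.last_step_time = (now - self.previous_time) / self.logging_frequency
            self.previous_time = now
        else:
            self.last_step_time = None  # avoid feeding stale timings into MFU
```

Callers then skip the MFU computation whenever `last_step_time is None` instead of reusing an old interval.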
🧹 Nitpick comments (3)
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py (1)
Lines 354-359: Same minor note as `train_fsdp2_cp.py`.

The `step_start_time` reset on line 359 happens unconditionally. Same optional refactor applies if you want consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py` around lines 354 - 359, the unconditional reset of step_start_time happens even when mfu_tracker is None; update the block around mfu_tracker so step_start_time is only reset when mfu_tracker is present (i.e., move the step_start_time = time.perf_counter() inside the if mfu_tracker is not None branch, after calling mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info) to match the optional flow used elsewhere (see mfu_tracker, compute_mfu, dist_config.is_main_process, logger.info).

bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py (1)

Lines 24-25: Consider using a more robust import approach.

The `sys.path.insert(0, ...)` pattern works but can be fragile. Since this test file will be copied across recipes, consider whether pytest's `conftest.py` or `pyproject.toml` configuration could provide a cleaner solution. That said, this is a common pattern in recipe-style code and works correctly here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py` around lines 24 - 25, the test currently mutates sys.path via the sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move that path-setup into a shared pytest configuration so tests don't each modify sys.path: create or update pytest's conftest.py to add the parent directory to sys.path (or better, configure package discovery via pyproject.toml/paths or install the package in editable mode) and remove the sys.path.insert(0, ...) line from test_flops.py so imports are resolved by pytest configuration rather than per-test mutation.

bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py (1)

Lines 362-367: Minor inconsistency: `step_start_time` reset occurs unconditionally.

The timer reset on line 367 happens even when `mfu_tracker is None`. While this has negligible performance impact, it's slightly inconsistent; either gate the entire timing block or make it explicit that timing is always active.

Also note this script uses manual `time.perf_counter()` timing while ESM2 scripts use `perf_logger.last_step_time`. This difference is acceptable since different recipes may have different `PerfLogger` implementations, but it's worth documenting if intentional.

Optional: Gate timing reset with MFU tracker check
```diff
 if mfu_tracker is not None:
     step_time = time.perf_counter() - step_start_time
     mfu_info = mfu_tracker.compute_mfu(step_time)
     if dist_config.is_main_process():
         logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
-step_start_time = time.perf_counter()
+    step_start_time = time.perf_counter()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py` around lines 362 - 367, The timing reset is currently unconditional; to be consistent, only reset step_start_time when mfu tracking is active. Inside the block that checks mfu_tracker is not None (the one that calls mfu_tracker.compute_mfu(...) and logs via logger.info when dist_config.is_main_process()), move the line that sets step_start_time = time.perf_counter() into that same guarded block (or explicitly comment why timing should remain unconditional). This references the mfu_tracker variable, compute_mfu method, step_start_time variable, dist_config.is_main_process(), and logger.info to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py`:
- Around line 24-41: The tests unsafely import the local flops.py via sys.path
and a plain "from flops import ...", causing module-cache collisions across
pytest sessions; modify the test to load the local flops.py under a unique
module name (e.g., using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py`:
- Around line 222-227: The MFU log omits the per-step duration even though
step_time is computed; update the logging call inside the mfu_tracker block
(where mfu_info = mfu_tracker.compute_mfu(step_time) and inside the
dist_config.is_main_process() check) to include step_time in the logger.info
message alongside mfu_info["mfu"] and mfu_info["tflops_per_gpu"] (e.g., add a
placeholder for step_time and pass step_time as an argument) so each MFU line
reports MFU, TFLOPS/GPU, and the per-step time.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py`:
- Around line 607-647: The CLI flag args.formula is unused: either honor it or
remove it; to fix, branch on args.formula (values "simplified" or "analytical")
before calling compute_flops_simplified and compute_flops_analytical so you only
compute and print the requested method. Specifically, use args.formula to decide
whether to call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
- Around line 453-483: In measure_bus_bandwidth, reject world sizes != 2 instead
of assuming peers by updating the initial guard (world_size) to raise an error
when world_size is not exactly 2; ensure you check world_size at the top of the
function and raise a ValueError (or similar) with a clear message referencing
the peer calculation using peer = 1 - rank so callers know only 2-process runs
are supported; keep the rest of the logic unchanged once the guard ensures rank
is 0 or 1.
- Around line 104-140: from_hf_config currently computes derived fields
(head_dim, has_lm_head) before kwargs.update(overrides), which allows overrides
like num_attention_heads or vocab_size to leave derived values stale; update the
function so overrides are applied before computing derived fields (or apply
kwargs.update(overrides) and then recompute head_dim = hidden_size //
num_attention_heads and has_lm_head = vocab_size > 0 afterwards), ensuring
derived values (head_dim, has_lm_head and any other fields dependent on
overrides) reflect the final values passed to ModelFLOPsConfig; keep references
to from_hf_config, kwargs.update(overrides), head_dim, has_lm_head,
num_attention_heads, vocab_size and ModelFLOPsConfig to locate the changes.
- Around line 69-75: detect_gpu_peak_tflops currently calls
torch.cuda.get_device_name(0) unguarded and will crash on CPU-only hosts; update
detect_gpu_peak_tflops to first check torch.cuda.is_available() and only call
torch.cuda.get_device_name(0) when CUDA is available, otherwise return (None,
"unknown") so callers like _cli_gpu_info and MFUTracker(..., peak_tflops=None)
safely receive a fallback; keep the GPU_PEAK_TFLOPS_BF16 lookup logic intact
when a device name is present.
- Around line 486-507: The load_model_config function currently hardcodes
trust_remote_code=True when calling AutoConfig.from_pretrained; change it to
accept explicit keyword-only parameters (e.g., *, trust_remote_code: bool =
False, revision: Optional[str] = None) so callers must opt-in to remote code
execution and can pin a revision, then pass those parameters into
AutoConfig.from_pretrained(config_path, trust_remote_code=trust_remote_code,
revision=revision) and update any call sites accordingly.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py`:
- Around line 151-157: The MFUTracker is being configured with per-micro-batch
FLOPs which underestimates utilization when gradient accumulation is used;
update the MFUTracker initialization (the MFUTracker(...) call) to account for
gradient accumulation by multiplying the effective batch_size (or otherwise
scaling FLOPs) by the gradient accumulation factor (use
args.gradient_accumulation_steps or args.grad_acc_steps as present in the
training args) so the tracker uses full-step work rather than a single
micro-batch, and apply the same correction to the related MFU/config
calculations around this block.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py`:
- Around line 187-193: The MFUTracker is using args.dataset.micro_batch_size as
the FLOPs batch size which ignores gradient accumulation and underestimates MFU;
update the mfu_tracker initialization (symbol: MFUTracker / variable:
mfu_tracker) to use the effective batch size = args.dataset.micro_batch_size *
args.grad_acc_steps (or otherwise multiply by grad_acc_steps) when setting the
batch_size parameter, and apply the same change wherever MFUTracker is
constructed (also the other block around the second occurrence).
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py`:
- Around line 165-170: MFU is undercounted because MFUTracker is initialized
with a single micro-batch; multiply the batch work by gradient accumulation
steps so FLOPs/ MFU reflect a full optimizer step. Update the MFUTracker
initialization (mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
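The gradient-accumulation findings above all reduce to one correction: FLOPs per optimizer step must cover every micro-batch, not just one. A hedged sketch with illustrative names:

```python
def flops_per_optimizer_step(flops_per_micro_batch: float, grad_acc_steps: int) -> float:
    """One optimizer step performs grad_acc_steps micro-batches of work."""
    return flops_per_micro_batch * grad_acc_steps


def mfu_percent(flops_per_micro_batch: float, grad_acc_steps: int,
                step_time_s: float, num_gpus: int, peak_tflops_per_gpu: float) -> float:
    """MFU over a full optimizer step; omitting grad_acc_steps underestimates it."""
    total = flops_per_optimizer_step(flops_per_micro_batch, grad_acc_steps)
    return 100.0 * total / step_time_s / num_gpus / 1e12 / peak_tflops_per_gpu
```

Equivalently, the tracker can be constructed with an effective batch size of micro_batch_size × grad_acc_steps while timing the full optimizer step.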
---
Outside diff comments:
In `@bionemo-recipes/recipes/esm2_native_te/perf_logger.py`:
- Around line 108-120: The perf logger currently only updates
self.last_step_time inside the logging branch guarded by self.logging_frequency,
causing stale values on non-logging steps; update the step handler in
perf_logger.py so that when a step is not a logging step you explicitly reset
self.last_step_time (e.g., set to 0.0 or None) to avoid stale MFU inputs,
keeping the existing logic that updates self.previous_step_time and
last_step_time when (step % self.logging_frequency == 0).
---
Nitpick comments:
In `@bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py`:
- Around line 24-25: The test currently mutates sys.path via the
sys.path.insert(0, ...) call in test_flops.py which is fragile; instead move
that path-setup into a shared pytest configuration so tests don't each modify
sys.path: create or update pytest's conftest.py to add the parent directory to
sys.path (or better, configure package discovery via pyproject.toml/paths or
install the package in editable mode) and remove the sys.path.insert(0, ...)
line from test_flops.py so imports are resolved by pytest configuration rather
than per-test mutation.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py`:
- Around line 362-367: The timing reset is currently unconditional; to be
consistent, only reset step_start_time when mfu tracking is active. Inside the
block that checks mfu_tracker is not None (the one that calls
mfu_tracker.compute_mfu(...) and logs via logger.info when
dist_config.is_main_process()), move the line that sets step_start_time =
time.perf_counter() into that same guarded block (or explicitly comment why
timing should remain unconditional). This references the mfu_tracker variable,
compute_mfu method, step_start_time variable, dist_config.is_main_process(), and
logger.info to locate the code to change.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py`:
- Around line 354-359: The unconditional reset of step_start_time happens even
when mfu_tracker is None; update the block around mfu_tracker so step_start_time
is only reset when mfu_tracker is present (i.e., move the step_start_time =
time.perf_counter() inside the if mfu_tracker is not None branch, after calling
mfu_tracker.compute_mfu and optional dist_config.is_main_process()/logger.info)
to match the optional flow used elsewhere (see mfu_tracker, compute_mfu,
dist_config.is_main_process, logger.info).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 158596f3-a749-4ec2-88e4-a5b4f9b08cc4
📒 Files selected for processing (30)
bionemo-recipes/recipes/codonfm_native_te/README.md
bionemo-recipes/recipes/codonfm_native_te/flops.py
bionemo-recipes/recipes/codonfm_native_te/hydra_config/defaults.yaml
bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py
bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py
bionemo-recipes/recipes/esm2_native_te/README.md
bionemo-recipes/recipes/esm2_native_te/flops.py
bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
bionemo-recipes/recipes/esm2_native_te/perf_logger.py
bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py
bionemo-recipes/recipes/esm2_native_te/train_ddp.py
bionemo-recipes/recipes/esm2_native_te/train_ddp_cp.py
bionemo-recipes/recipes/esm2_native_te/train_fsdp2.py
bionemo-recipes/recipes/esm2_native_te/train_fsdp2_cp.py
bionemo-recipes/recipes/esm2_native_te/train_mfsdp.py
bionemo-recipes/recipes/llama3_native_te/README.md
bionemo-recipes/recipes/llama3_native_te/flops.py
bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
bionemo-recipes/recipes/llama3_native_te/model_configs/meta-llama/Llama-3.1-8B/config.json
bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
bionemo-recipes/recipes/llama3_native_te/train_ddp.py
bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py
bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
bionemo-recipes/recipes/opengenome2_llama_native_te/flops.py
bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_flops.py
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
ci/scripts/check_copied_files.py
```python
import sys
from pathlib import Path

import pytest

# Add parent directory so we can import flops
sys.path.insert(0, str(Path(__file__).parent.parent))

from flops import (
    MFUTracker,
    ModelFLOPsConfig,
    compute_flops_analytical,
    compute_flops_hyena,
    compute_flops_simplified,
    estimate_cp_comm_bytes,
    from_hf_config,
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate the test file and its parent structure
find . -name "test_flops.py" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/bionemo-framework
Length of output: 329

🏁 Script executed:

```shell
# Check if flops.py exists in the parent directories of test files
find . -name "flops.py" -type f 2>/dev/null | head -20
```

Length of output: 285

🏁 Script executed:

```shell
# Search for other test files using the same sys.path pattern
rg "sys\.path\.insert.*parent" --type py 2>/dev/null | head -20
```

Length of output: 2354
🏁 Script executed:

```shell
# Verify what each test_flops.py imports from "flops"
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py ./bionemo-recipes/recipes/esm2_native_te/tests/test_flops.py ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py; do
  echo "=== $f ==="
  sed -n '1,45p' "$f" | tail -20
done
```

Length of output: 1631

🏁 Script executed:

```shell
# Check how pytest is typically invoked - look for pytest config or CI configs
find . -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | grep -E "(pytest|pyproject)" | head -10
```

Length of output: 658

🏁 Script executed:

```shell
# Check pyproject.toml for pytest configuration
head -200 ./pyproject.toml 2>/dev/null | grep -A 20 "\[tool.pytest"
```

Length of output: 50
🏁 Script executed:

```shell
# Check pytest.ini directly
cat ./pytest.ini 2>/dev/null
```

Length of output: 433

🏁 Script executed:

```shell
# Check main pyproject.toml for pytest config
rg "\[tool\.pytest" -A 10 ./pyproject.toml 2>/dev/null
```

Length of output: 50

🏁 Script executed:

```shell
# View the actual imports in llama3_native_te test_flops.py more completely
head -30 ./bionemo-recipes/recipes/llama3_native_te/tests/test_flops.py
```

Length of output: 1062

🏁 Script executed:

```shell
# Check if there's a conftest.py that might manipulate sys.path at the recipe level
find ./bionemo-recipes/recipes -name "conftest.py" -type f
```

Length of output: 706
🏁 Script executed:

```shell
# Check the conftest.py files for the affected recipes
for f in ./bionemo-recipes/recipes/codonfm_native_te/tests/conftest.py ./bionemo-recipes/recipes/llama3_native_te/tests/conftest.py ./bionemo-recipes/recipes/esm2_native_te/tests/conftest.py; do
  echo "=== $f ==="
  cat "$f" 2>/dev/null || echo "Not found"
  echo ""
done
```

Length of output: 10467

🏁 Script executed:

```shell
# Test if flops modules would actually collide - simulate what happens when importing both
python3 << 'EOF'
import sys
from pathlib import Path
import importlib

# Simulate first recipe test importing flops
recipe1_dir = "./bionemo-recipes/recipes/llama3_native_te"
sys.path.insert(0, recipe1_dir)
# This would make "flops" from llama3 available
print("sys.path after recipe1:", sys.path[:3])

# Simulate second recipe test in SAME process (as pytest would do)
recipe2_dir = "./bionemo-recipes/recipes/codonfm_native_te"
# If we just append another dir, we'd have both in sys.path
sys.path.insert(0, recipe2_dir)
print("sys.path after recipe2:", sys.path[:3])

# When we try to import flops now, Python will use the first match in sys.path
# but if flops is already in sys.modules, it uses that cached version
print("\nThis demonstrates the caching risk:")
print("- First test imports flops from llama3_native_te")
print("- Module 'flops' gets cached in sys.modules")
print("- Second test also imports 'flops'")
print("- Python finds 'flops' in sys.modules and returns cached version")
print("- Test uses wrong recipe's flops.py!")
EOF
```

Length of output: 557

🏁 Script executed:

```shell
# Check if the flops.py files differ between recipes
echo "=== Comparing llama3 vs codonfm flops.py size ==="
wc -l ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py
```

Length of output: 246
🏁 Script executed:

```shell
# Check if they're identical or different
diff ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | head -50
```

Length of output: 359

🏁 Script executed:

```shell
# Verify pytest test discovery behavior - check if tests are typically run together
find ./bionemo-recipes/recipes -name "test_*.py" -o -name "*_test.py" | wc -l
```

Length of output: 70

🏁 Script executed:

```shell
# Check if CI configuration runs all recipe tests together
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "pytest.*recipes" 2>/dev/null | head -5
```

Length of output: 50

🏁 Script executed:

```shell
# Check if conftest.py at a higher level might run tests across recipes
find ./bionemo-recipes/recipes -name "conftest.py" | head -5
ls -la ./bionemo-recipes/recipes/conftest.py 2>/dev/null || echo "No conftest.py at recipes level"
ls -la ./bionemo-recipes/conftest.py 2>/dev/null || echo "No conftest.py at bionemo-recipes level"
```

Length of output: 453
🏁 Script executed:

```shell
# Verify the exact issue: show what happens when flops is imported twice with same name
python3 << 'EOF'
import sys
from pathlib import Path

# Simulate pytest running both tests in sequence in same process
print("=== Scenario: Both test_flops.py files run in same pytest process ===\n")

# Test 1: llama3_native_te/tests/test_flops.py
print("1. Test llama3_native_te imports flops:")
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/llama3_native_te")))
import flops as flops_v1
print(f"   Module: {flops_v1.__file__}")
print(f"   sys.modules['flops']: {sys.modules.get('flops', 'NOT SET')}")

# Simulate cleaning up sys.path for next test
sys.path.pop(0)
sys.path.insert(0, str(Path("./bionemo-recipes/recipes/codonfm_native_te")))

# Test 2: codonfm_native_te/tests/test_flops.py
print("\n2. Test codonfm_native_te tries to import flops:")
# This would normally do: from flops import MFUTracker, ...
# But since 'flops' is already in sys.modules, Python returns the cached version!
import flops as flops_v2
print(f"   Module: {flops_v2.__file__}")
print(f"   Same module? {flops_v1 is flops_v2}")
print(f"\n   RESULT: Test got llama3's flops, not codonfm's!")
EOF
```

Length of output: 431

🏁 Script executed:

```shell
# Verify if the flops.py files are actually different (not just comments)
diff -w ./bionemo-recipes/recipes/llama3_native_te/flops.py ./bionemo-recipes/recipes/codonfm_native_te/flops.py | grep -v "^[0-9]" | head -30
```

Length of output: 350

🏁 Script executed:

```shell
# Check test execution patterns in CI/config files
find . -name "*.sh" -o -name "*.yml" -o -name "*.yaml" | xargs grep -l "test_flops" 2>/dev/null | head -3
```

Length of output: 50
Import the local flops.py under a unique module name.
This `sys.path` + `from flops import ...` pattern is unsafe when multiple recipe test files run in the same pytest process: the first imported `flops` module stays cached in `sys.modules`, so later test suites can end up testing the wrong recipe implementation. While the flops.py files are currently identical (copied from llama3_native_te), this pattern creates a fragile dependency on that equivalence. Please fix this in the source test template and re-sync the copies.
🔧 One safe pattern

```diff
-import sys
+import importlib.util
 from pathlib import Path

 import pytest

-# Add parent directory so we can import flops
-sys.path.insert(0, str(Path(__file__).parent.parent))
-
-from flops import (
-    MFUTracker,
-    ModelFLOPsConfig,
-    compute_flops_analytical,
-    compute_flops_hyena,
-    compute_flops_simplified,
-    estimate_cp_comm_bytes,
-    from_hf_config,
-)
+FLOPS_PATH = Path(__file__).parent.parent / "flops.py"
+SPEC = importlib.util.spec_from_file_location(f"{__name__}_flops", FLOPS_PATH)
+assert SPEC is not None and SPEC.loader is not None
+flops = importlib.util.module_from_spec(SPEC)
+SPEC.loader.exec_module(flops)
+
+MFUTracker = flops.MFUTracker
+ModelFLOPsConfig = flops.ModelFLOPsConfig
+compute_flops_analytical = flops.compute_flops_analytical
+compute_flops_hyena = flops.compute_flops_hyena
+compute_flops_simplified = flops.compute_flops_simplified
+estimate_cp_comm_bytes = flops.estimate_cp_comm_bytes
+from_hf_config = flops.from_hf_config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/tests/test_flops.py` around lines
24 - 41, The tests unsafely import the local flops.py via sys.path and a plain
"from flops import ...", causing module-cache collisions across pytest sessions;
modify the test to load the local flops.py under a unique module name (e.g.,
using importlib.util.spec_from_file_location or
importlib.machinery.SourceFileLoader with a name derived from the test file path
or parent directory) and then import the symbols MFUTracker, ModelFLOPsConfig,
compute_flops_analytical, compute_flops_hyena, compute_flops_simplified,
estimate_cp_comm_bytes, and from_hf_config from that module, ensuring you do not
rely on sys.path insertion or the global "flops" entry in sys.modules so each
recipe test loads its own isolated module.
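The module-cache failure mode described above can be reproduced standalone without any recipe code; a hypothetical sketch (the `RECIPE` attribute is invented purely for illustration):

```python
import sys
import types

# Simulate recipe A's flops.py having been imported first in the pytest process.
mod_a = types.ModuleType("flops")
mod_a.RECIPE = "llama3"  # hypothetical marker attribute
sys.modules["flops"] = mod_a

# A later test suite for recipe B runs `import flops` expecting its own copy,
# but Python returns the cached module instead of reading B's file.
import flops

print(flops.RECIPE)  # prints "llama3": recipe B's flops.py was never loaded
```

Loading each copy under a unique module name, as the suggested fix does, sidesteps the shared `sys.modules["flops"]` key entirely.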
if mfu_tracker is not None:
    step_time = time.perf_counter() - step_start_time
    mfu_info = mfu_tracker.compute_mfu(step_time)
    if dist_config.is_main_process():
        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
step_start_time = time.perf_counter()
Include step_time in the MFU log line for complete per-step telemetry.
You already compute step_time on Line 223, but the emitted message omits it.
Proposed fix
 if mfu_tracker is not None:
     step_time = time.perf_counter() - step_start_time
     mfu_info = mfu_tracker.compute_mfu(step_time)
     if dist_config.is_main_process():
-        logger.info("MFU: %.1f%% (%.2f TFLOPS/GPU)", mfu_info["mfu"], mfu_info["tflops_per_gpu"])
+        logger.info(
+            "Step %d MFU: %.1f%% (%.2f TFLOPS/GPU, step_time=%.3fs)",
+            step,
+            mfu_info["mfu"],
+            mfu_info["tflops_per_gpu"],
+            step_time,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/codonfm_native_te/train_fsdp2.py` around lines 222 -
227, The MFU log omits the per-step duration even though step_time is computed;
update the logging call inside the mfu_tracker block (where mfu_info =
mfu_tracker.compute_mfu(step_time) and inside the dist_config.is_main_process()
check) to include step_time in the logger.info message alongside mfu_info["mfu"]
and mfu_info["tflops_per_gpu"] (e.g., add a placeholder for step_time and pass
step_time as an argument) so each MFU line reports MFU, TFLOPS/GPU, and the
per-step time.
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "flops.py" | grep -E "(llama|flops)"
🏁 Script executed:
cd bionemo-recipes && find . -type f -name "flops.py" 2>/dev/null | head -20
🏁 Script executed:
cat -n ./recipes/llama3_native_te/flops.py
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/llama3_native_te/flops.py
Guard CUDA-name lookup before touching torch.cuda.
The function calls torch.cuda.get_device_name(0) without checking whether CUDA is available, causing it to fail on CPU-only hosts instead of falling back to "unknown". _cli_gpu_info() and MFUTracker(..., peak_tflops=None) will crash in non-CUDA environments. Check torch.cuda.is_available() (per PyTorch documentation) before invoking CUDA device queries.
Minimal fix
 def detect_gpu_peak_tflops():
     """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
+    if not torch.cuda.is_available():
+        return None, "unknown"
     device_name = torch.cuda.get_device_name(0)
     for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
         if gpu_key.lower() in device_name.lower():
             return tflops, device_name
     return None, device_name
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def detect_gpu_peak_tflops():
    """Auto-detect GPU peak bf16 TFLOPS from device name via substring match."""
    if not torch.cuda.is_available():
        return None, "unknown"
    device_name = torch.cuda.get_device_name(0)
    for gpu_key, tflops in GPU_PEAK_TFLOPS_BF16.items():
        if gpu_key.lower() in device_name.lower():
            return tflops, device_name
    return None, device_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 69 - 75,
detect_gpu_peak_tflops currently calls torch.cuda.get_device_name(0) unguarded
and will crash on CPU-only hosts; update detect_gpu_peak_tflops to first check
torch.cuda.is_available() and only call torch.cuda.get_device_name(0) when CUDA
is available, otherwise return (None, "unknown") so callers like _cli_gpu_info
and MFUTracker(..., peak_tflops=None) safely receive a fallback; keep the
GPU_PEAK_TFLOPS_BF16 lookup logic intact when a device name is present.
def from_hf_config(config_dict, **overrides):
    """Create ModelFLOPsConfig from an HF-compatible config dict.

    Auto-detects architecture:
    - GQA vs MHA: from num_key_value_heads (absent = MHA)
    - Gated MLP (3 proj) vs standard FFN (2 proj): from model_type
    - LM head: from vocab_size > 0

    Args:
        config_dict: Dict with standard HF config keys (hidden_size, num_hidden_layers, etc.).
            Works with config.json dicts, config.to_dict(), or MODEL_PRESETS dicts.
        **overrides: Explicit overrides for any field (e.g., num_mlp_projections=3).
    """
    h = config_dict["hidden_size"]
    n_heads = config_dict["num_attention_heads"]
    n_kv = config_dict.get("num_key_value_heads", n_heads)
    vocab = config_dict.get("vocab_size", 0)
    model_type = config_dict.get("model_type", "")

    # Detect gated MLP (3 projections) vs standard FFN (2 projections).
    # Llama/Mistral/Qwen use SwiGLU (gate + up + down = 3 projections).
    # ESM2/CodonFM/Geneformer/BERT use standard FFN (up + down = 2 projections).
    num_mlp_proj = 3 if model_type in GATED_MLP_MODEL_TYPES else 2

    kwargs = {
        "hidden_size": h,
        "num_hidden_layers": config_dict["num_hidden_layers"],
        "num_attention_heads": n_heads,
        "num_kv_heads": n_kv,
        "head_dim": h // n_heads,
        "intermediate_size": config_dict["intermediate_size"],
        "num_mlp_projections": num_mlp_proj,
        "vocab_size": vocab,
        "has_lm_head": vocab > 0,
    }
    kwargs.update(overrides)
    return ModelFLOPsConfig(**kwargs)
Recompute derived fields after applying overrides.
kwargs.update(overrides) can leave head_dim and has_lm_head inconsistent with the overridden values. For example, vocab_size=0 still keeps the precomputed has_lm_head unless the caller knows to override both, and overriding num_attention_heads leaves the old head_dim.
♻️ Suggested fix
 def from_hf_config(config_dict, **overrides):
 @@
-    kwargs = {
-        "hidden_size": h,
-        "num_hidden_layers": config_dict["num_hidden_layers"],
-        "num_attention_heads": n_heads,
-        "num_kv_heads": n_kv,
-        "head_dim": h // n_heads,
-        "intermediate_size": config_dict["intermediate_size"],
-        "num_mlp_projections": num_mlp_proj,
-        "vocab_size": vocab,
-        "has_lm_head": vocab > 0,
-    }
-    kwargs.update(overrides)
-    return ModelFLOPsConfig(**kwargs)
+    kwargs = {
+        "hidden_size": h,
+        "num_hidden_layers": config_dict["num_hidden_layers"],
+        "num_attention_heads": n_heads,
+        "num_kv_heads": n_kv,
+        "intermediate_size": config_dict["intermediate_size"],
+        "num_mlp_projections": num_mlp_proj,
+        "vocab_size": vocab,
+    }
+    kwargs.update(overrides)
+    kwargs.setdefault("head_dim", kwargs["hidden_size"] // kwargs["num_attention_heads"])
+    kwargs.setdefault("has_lm_head", kwargs["vocab_size"] > 0)
+    return ModelFLOPsConfig(**kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 104 - 140,
from_hf_config currently computes derived fields (head_dim, has_lm_head) before
kwargs.update(overrides), which allows overrides like num_attention_heads or
vocab_size to leave derived values stale; update the function so overrides are
applied before computing derived fields (or apply kwargs.update(overrides) and
then recompute head_dim = hidden_size // num_attention_heads and has_lm_head =
vocab_size > 0 afterwards), ensuring derived values (head_dim, has_lm_head and
any other fields dependent on overrides) reflect the final values passed to
ModelFLOPsConfig; keep references to from_hf_config, kwargs.update(overrides),
head_dim, has_lm_head, num_attention_heads, vocab_size and ModelFLOPsConfig to
locate the changes.
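To see the staleness concretely, here is a minimal standalone sketch mirroring the kwargs/update pattern (all values invented):

```python
config_dict = {"hidden_size": 4096, "num_attention_heads": 32}

kwargs = {
    "hidden_size": config_dict["hidden_size"],
    "num_attention_heads": config_dict["num_attention_heads"],
    # derived field, computed before overrides are applied (as in the current code)
    "head_dim": config_dict["hidden_size"] // config_dict["num_attention_heads"],
}
kwargs.update({"num_attention_heads": 64})  # caller override

stale = kwargs["head_dim"]                                           # still 128
consistent = kwargs["hidden_size"] // kwargs["num_attention_heads"]  # 64
print(stale, consistent)  # prints: 128 64
```

Recomputing derived fields after the update (or using `setdefault` as in the suggested fix) keeps `head_dim` consistent with whatever the caller overrode.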
def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
    """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
    if world_size <= 1:
        return 0.0

    rank = dist.get_rank()
    tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
    peer = 1 - rank

    for _ in range(5):
        if rank == 0:
            dist.send(tensor, dst=peer)
            dist.recv(tensor, src=peer)
        else:
            dist.recv(tensor, src=peer)
            dist.send(tensor, dst=peer)
    torch.cuda.synchronize()

    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iters):
        if rank == 0:
            dist.send(tensor, dst=peer)
        else:
            dist.recv(tensor, src=peer)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    data_bytes = tensor.nelement() * tensor.element_size()
    return num_iters * data_bytes / elapsed / 1e9
Reject world sizes other than 2 in the bandwidth helper.
peer = 1 - rank only works for ranks 0 and 1. If this gets launched with more than two processes, ranks >= 2 compute invalid peers and can hang the send/recv loop. Please fail fast here instead of assuming the caller followed the docstring.
🔒 Suggested guard
 def measure_bus_bandwidth(device, world_size, num_iters=20, num_elements=10_000_000):
     """Measure unidirectional P2P bandwidth via send/recv (matches CP ring pattern)."""
     if world_size <= 1:
         return 0.0
+    if world_size != 2:
+        raise ValueError("measure_bus_bandwidth currently supports exactly 2 ranks")
     rank = dist.get_rank()
     tensor = torch.randn(num_elements, device=device, dtype=torch.bfloat16)
-    peer = 1 - rank
+    peer = 1 if rank == 0 else 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 453 - 483, In
measure_bus_bandwidth, reject world sizes != 2 instead of assuming peers by
updating the initial guard (world_size) to raise an error when world_size is not
exactly 2; ensure you check world_size at the top of the function and raise a
ValueError (or similar) with a clear message referencing the peer calculation
using peer = 1 - rank so callers know only 2-process runs are supported; keep
the rest of the logic unchanged once the guard ensures rank is 0 or 1.
def load_model_config(config_path):
    """Load model config dict from a local path or HuggingFace model ID.

    Supports:
    - Local directory: ./model_configs/lingua-1B (reads config.json inside)
    - Local file: ./model_configs/lingua-1B/config.json
    - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
    """
    import json
    from pathlib import Path

    path = Path(config_path)
    if path.is_dir():
        path = path / "config.json"
    if path.exists():
        return json.loads(path.read_text())

    # Fall back to HuggingFace Hub
    from transformers import AutoConfig

    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
    return hf_config.to_dict()
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the file
find . -type f -name "flops.py" | head -20
🏁 Script executed:
# Also, let's search for the specific function
rg "def load_model_config" -A 25
🏁 Script executed:
# Check if there are any calls to load_model_config in the codebase
rg "load_model_config\(" --type py
🏁 Script executed:
# Verify the actual line numbers in the target file
sed -n '486,507p' bionemo-recipes/recipes/llama3_native_te/flops.py | cat -n
🏁 Script executed:
# Check that the docstring and function signature comply with line length (119 chars)
head -n 510 bionemo-recipes/recipes/llama3_native_te/flops.py | tail -n 25 | cat -n
Remove hardcoded trust_remote_code=True and add explicit opt-in parameters.
AutoConfig.from_pretrained(config_path, trust_remote_code=True) unconditionally enables execution of arbitrary code from the Hugging Face Hub. Per Hugging Face documentation, trust_remote_code defaults to False and should only be enabled as an explicit opt-in for trusted sources. Add keyword-only parameters to make this configurable and allow revision pinning:
Suggested fix
-def load_model_config(config_path):
+def load_model_config(config_path, *, trust_remote_code=False, revision=None):
     """Load model config dict from a local path or HuggingFace model ID.

     Supports:
     - Local directory: ./model_configs/lingua-1B (reads config.json inside)
     - Local file: ./model_configs/lingua-1B/config.json
     - HF model ID: nvidia/esm2_t36_3B_UR50D (fetches from HuggingFace Hub)
     """
@@ -494,5 +494,7 @@ def load_model_config(config_path):
     from transformers import AutoConfig

-    hf_config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+    hf_config = AutoConfig.from_pretrained(
+        config_path,
+        trust_remote_code=trust_remote_code,
+        revision=revision,
+    )
     return hf_config.to_dict()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 486 - 507,
The load_model_config function currently hardcodes trust_remote_code=True when
calling AutoConfig.from_pretrained; change it to accept explicit keyword-only
parameters (e.g., *, trust_remote_code: bool = False, revision: Optional[str] =
None) so callers must opt-in to remote code execution and can pin a revision,
then pass those parameters into AutoConfig.from_pretrained(config_path,
trust_remote_code=trust_remote_code, revision=revision) and update any call
sites accordingly.
parser = argparse.ArgumentParser()
parser.add_argument("command")
parser.add_argument("--config-path", default="./model_configs/lingua-1B")
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--seq-len", type=int, default=4096)
parser.add_argument("--formula", default="analytical", choices=["analytical", "simplified"])
parser.add_argument("--num-gpus", type=int, default=1, help="Number of GPUs for per-GPU FLOPs and comm estimates")
parser.add_argument("--cp-size", type=int, default=1, help="Context parallelism size for comm overhead estimate")
parser.add_argument("--p2p-bw", type=float, default=6.0, help="P2P bandwidth in GB/s for comm time estimate")
args = parser.parse_args()

cfg_dict = load_model_config(args.config_path)
config = from_hf_config(cfg_dict)
b, s = args.batch_size, args.seq_len

print(
    f"Config: H={config.hidden_size}, L={config.num_hidden_layers},"
    f" n_heads={config.num_attention_heads}, n_kv={config.num_kv_heads},"
    f" I={config.intermediate_size}, V={config.vocab_size}"
)
print(
    f"MLP: {config.num_mlp_projections} projections"
    f" ({'SwiGLU/GLU' if config.num_mlp_projections == 3 else 'standard FFN'})"
)
print(f"Batch: B={b}, S={s}, GPUs={args.num_gpus}, CP={args.cp_size}")
print()

simplified = compute_flops_simplified(b, s, config.hidden_size, config.num_hidden_layers, config.vocab_size)
analytical, breakdown, lm_head = compute_flops_analytical(config, b, s)

print(f"{'Method':<24} {'FLOPs/step':>14} {'Per-GPU':>14} {'Exact':>30}")
print("-" * 86)
for name, flops in [("Simplified (README)", simplified), ("Analytical", analytical)]:
    per_gpu = flops // max(args.num_gpus, 1)
    print(f"{name:<24} {format_flops(flops):>14} {format_flops(per_gpu):>14} {format_flops_exact(flops):>30}")

if simplified != analytical:
    diff = analytical - simplified
    print(f"\nDifference: {format_flops_exact(diff)} ({diff / simplified * 100:+.2f}%)")
else:
    print("\nFormulas agree exactly for this config.")
--formula is currently a no-op.
The parser accepts --formula, but this code always computes and prints both simplified and analytical FLOPs and never branches on args.formula. Either honor the flag or remove it, otherwise the CLI advertises behavior it does not implement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/flops.py` around lines 607 - 647,
The CLI flag args.formula is unused: either honor it or remove it; to fix,
branch on args.formula (values "simplified" or "analytical") before calling
compute_flops_simplified and compute_flops_analytical so you only compute and
print the requested method. Specifically, use args.formula to decide whether to
call
compute_flops_simplified(b,s,config.hidden_size,config.num_hidden_layers,config.vocab_size)
or compute_flops_analytical(config,b,s) (and only compute the other if you need
a diff), adjust the printing loop to only print the selected result (and only
compute/print the Difference when both are intentionally computed), or
alternatively remove the --formula argument from the argparse setup if you want
both methods always shown.
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
DDP MFU is underestimated under gradient accumulation.
Line 206 measures full-step duration, but Line 153 configures FLOPs for a single micro-batch.
This biases MFU low by a factor of grad_acc_steps.
Suggested fix
 mfu_tracker = MFUTracker(
     config=from_hf_config(config.to_dict()),
-    batch_size=args.dataset.micro_batch_size,
+    batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
     seq_len=args.dataset.max_seq_length,
     num_gpus=dist_config.world_size,
     parallelism={"dp": dist_config.world_size},
 )

Also applies to: 205-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_ddp.py` around lines 151 -
157, The MFUTracker is being configured with per-micro-batch FLOPs which
underestimates utilization when gradient accumulation is used; update the
MFUTracker initialization (the MFUTracker(...) call) to account for gradient
accumulation by multiplying the effective batch_size (or otherwise scaling
FLOPs) by the gradient accumulation factor (use args.gradient_accumulation_steps
or args.grad_acc_steps as present in the training args) so the tracker uses
full-step work rather than a single micro-batch, and apply the same correction
to the related MFU/config calculations around this block.
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
)
MFU math misses gradient accumulation scaling.
Line 244 times a full optimizer step, but Line 189 sets FLOPs batch size to one micro-batch.
With grad_acc_steps > 1, MFU is underestimated.
Suggested fix
 mfu_tracker = MFUTracker(
     config=from_hf_config(config.to_dict()),
-    batch_size=args.dataset.micro_batch_size,
+    batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
     seq_len=args.dataset.max_seq_length,
     num_gpus=dist_config.world_size,
     parallelism={"dp": dist_config.world_size // args.cp_size, "cp": args.cp_size},
 )

Also applies to: 243-248
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2_cp.py` around lines 187
- 193, The MFUTracker is using args.dataset.micro_batch_size as the FLOPs batch
size which ignores gradient accumulation and underestimates MFU; update the
mfu_tracker initialization (symbol: MFUTracker / variable: mfu_tracker) to use
the effective batch size = args.dataset.micro_batch_size * args.grad_acc_steps
(or otherwise multiply by grad_acc_steps) when setting the batch_size parameter,
and apply the same change wherever MFUTracker is constructed (also the other
block around the second occurrence).
mfu_tracker = MFUTracker(
    config=from_hf_config(config.to_dict()),
    batch_size=args.dataset.micro_batch_size,
    seq_len=args.dataset.max_seq_length,
    num_gpus=dist_config.world_size,
    parallelism={"dp": dist_config.world_size},
)
MFU is underreported when grad_acc_steps > 1.
Line 219 computes wall time for a full optimizer step, but Lines 167-170 initialize FLOPs with a single micro-batch.
That underestimates FLOPs/MFU by a factor of grad_acc_steps.
Suggested fix
 mfu_tracker = MFUTracker(
     config=from_hf_config(config.to_dict()),
-    batch_size=args.dataset.micro_batch_size,
+    batch_size=args.dataset.micro_batch_size * args.grad_acc_steps,
     seq_len=args.dataset.max_seq_length,
     num_gpus=dist_config.world_size,
     parallelism={"dp": dist_config.world_size},
 )

Also applies to: 218-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` around lines 165 -
170, MFU is undercounted because MFUTracker is initialized with a single
micro-batch; multiply the batch work by gradient accumulation steps so FLOPs/
MFU reflect a full optimizer step. Update the MFUTracker initialization
(mfu_tracker / MFUTracker) to use an effective batch size =
args.dataset.micro_batch_size * grad_acc_steps (or pass grad_acc_steps
separately) and also ensure any FLOPs or wall-time computations (the code path
that computes wall time for a full optimizer step) scale FLOPs by grad_acc_steps
so the computed FLOPs and MFU reflect the total work per optimizer step.
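The three review threads above make the same point about gradient accumulation; the size of the bias is easy to see with made-up numbers (a hedged sketch, not the recipe's actual FLOPs):

```python
micro_batch_size = 2
grad_acc_steps = 8
flops_per_sequence = 1.5e12  # hypothetical per-sequence training FLOPs

# Work executed between optimizer steps (the span the step timer measures).
flops_per_opt_step = flops_per_sequence * micro_batch_size * grad_acc_steps

# What the tracker counts when initialized with only the micro-batch size.
flops_counted = flops_per_sequence * micro_batch_size

# MFU scales linearly with counted FLOPs, so it reads low by exactly this factor.
print(flops_per_opt_step / flops_counted)  # prints: 8.0
```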
| "bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py", | ||
| ], | ||
| # FLOPs / MFU module - synced across recipes | ||
| "bionemo-recipes/recipes/llama3_native_te/flops.py": [ |
do we have to copy these to each recipe? it's just to calculate flops? maybe we just put it in one and point folks at it.
This should maybe be in the models/ definition though? is this just printing out a total flops for a given model size?
flops.py has 2 usage paths:
- flops counting for a given model config and comms bandwidth/overhead calculation using CLI
- can be used in the recipes' train scripts with config `log_mfu=true` to log MFU with each step
so it doesn't go purely with the model defs.
trvachov
left a comment
Across all the added code, I think you shouldn't use batch_size in isolation as a variable name, or at least if you do make it explicit that in that function it doesn't matter whether it's micro or global batch. Otherwise, use micro_batch_size or global_batch_size, because it gets very confusing if the distributed flops math is right without this.
The comm overhead calculations here have a number of issues including assumptions about model precision and also about FSDP2 communication patterns. Adding comm overhead calculations is a reasonable idea but I think it should be a separate PR. Each PR should be one-idea -> one PR.
There is a flops.py CLI, but I ran torchrun --nproc_per_node=2 flops.py bandwidth as written in the docstring and it just hangs. This was on 8xA100 80GB PCIe, pytorch:26.02-py3 container. Does this work for you?
Regarding overall code structure --> indeed there is a lot of duplicated code, but that is also to some degree the philosophy of recipes. @pstjohn Can you help with some guidance here? Should flops.py really be in perf_logger.py to reduce duplicate files? Also I think the FLOPS metrics should likely be recorded in WANDB just like all of the other metrics we have.
else:
    raise ValueError(f"Unknown formula: {formula!r}. Use 'analytical', 'simplified', or 'hyena'.")

self.per_gpu_flops = self.total_flops // max(num_gpus, 1)
Why divide by num_gpus? You mentioned you tested this with CP, num_gpus>2 and got reasonable numbers?
    16*H^2 for 2 MLP projections with I=4H.
    """
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)
Does this work with gradient accumulation?
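On the gradient-accumulation question: every term in the returned expression carries the batch factor b, so the count is linear in batch_size and can in principle be scaled by the number of accumulated micro-batches (provided the caller does so). A quick standalone check, with arbitrary config numbers:

```python
def compute_flops_simplified(batch_size, seq_len, hidden_size, num_layers, vocab_size):
    # Formula as it appears in the PR's flops.py (simplified / README variant).
    b, s, h = batch_size, seq_len, hidden_size
    return (24 * b * s * h * h + 4 * b * s * s * h) * (3 * num_layers) + (6 * b * s * h * vocab_size)

one = compute_flops_simplified(1, 4096, 2048, 16, 32000)
two = compute_flops_simplified(2, 4096, 2048, 16, 32000)
print(two == 2 * one)  # prints: True (linear in batch size)
```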
if mfu_tracker is not None and perf_logger.last_step_time is not None:
    mfu_info = mfu_tracker.compute_mfu(perf_logger.last_step_time)
    if dist_config.is_main_process():
        logger.info(
Does this take into account the logging_frequency variable? It seems like this would spam the log every step.
if dp_size > 1:
    # FSDP reduce-scatter estimate: ~2 * model_params * dtype_bytes * (dp-1)/dp
    model_params = _estimate_model_params(self.config)
    total += 2 * model_params * 2 * (dp_size - 1) // dp_size
Ok...a number of concerns with this bit of code:
- The second "2" --> is that supposed to be dtype_bytes = 2? That works for bf16/fp16 but we sometimes have weights in FP32.
- The comment says there is an FSDP2 reduce-scatter comm estimate, but isn't step 1 of FSDP2 an all-gather for each sharded module? I am just not sure I understand the distributed comm steps that "~2 * model_params * dtype_bytes * (dp-1)/dp" is trying to model.
- Is the assumption here that you can sum up all the per-module FSDP2 comms into "model_params"? FSDP2 does per module sharding so I'm trying to understand how you get an estimate the comms without doing a sum over the modules that are being sharded.
- yes this assumes bf16, fixing that to be generalized
- the comment is wrong, fixing that.
- yes, although it can be a slight overestimation since it assumes every param is sharded.
For module i with p_i parameters, the total bytes across all modules for both directions:
sum_over_modules(2 * p_i * dtype_bytes * (dp-1)/dp)
= 2 * dtype_bytes * (dp-1)/dp * sum(p_i)
= 2 * dtype_bytes * (dp-1)/dp * total_params
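That distributivity step can be sanity-checked numerically; a small sketch with invented module sizes (exact arithmetic via Fraction to avoid float noise):

```python
from fractions import Fraction

dp_size, dtype_bytes = 8, 2
module_params = [1_000_000, 2_500_000, 4_000_000]  # invented per-module param counts

factor = Fraction(2 * dtype_bytes * (dp_size - 1), dp_size)
per_module_sum = sum(factor * p for p in module_params)
closed_form = factor * sum(module_params)

print(per_module_sum == closed_form)  # prints: True (the per-module sum collapses)
```

This only confirms the algebra, not the reviewer's other concerns about which collectives FSDP2 actually issues.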
    Returns:
        Dict with comm_bytes, estimated_comm_time, comm_pct.
    """
    bw = measured_bw_gbps or 6.0  # Default ~PCIe Gen3 x8
Why PCIe gen3? Isn't that >15 years old?
Also -- have you done the measurement on PCIe Gen3? I ask because, if this 6.0 is a "datasheet number", it might be very hard to measure especially through pytorch because of launch overhead. You might need a very specialized environment to measure max-PCIe speeds.
def _cli_bandwidth():
    """Measure P2P bandwidth. Launch with: torchrun --nproc_per_node=2 flops.py bandwidth."""
Does this CLI command run for you? It just hangs indefinitely for me...
yes it runs in around 3 seconds for me. are you having the same issue on another machine?
Yep, infinite hang. Log:
root@2eafaa0982e9:/workspace/bionemo-framework/bionemo-recipes/recipes/llama3_native_te# torchrun --nproc_per_node=2 flops.py bandwidth
W0415 13:48:45.772000 309 torch/distributed/run.py:851]
W0415 13:48:45.772000 309 torch/distributed/run.py:851] *****************************************
W0415 13:48:45.772000 309 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0415 13:48:45.772000 309 torch/distributed/run.py:851] *****************************************
Measuring P2P bandwidth between 2 GPUs...
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
[rank0]:[W415 13:48:48.106866174 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank1]:[W415 13:48:48.112559519 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank0]:[W415 13:48:48.352708490 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
[rank1]:[W415 13:48:48.353136693 ProcessGroupNCCL.cpp:4084] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator())
^CW0415 13:57:20.219000 309 torch/distributed/elastic/agent/server/api.py:739] Received 2 death signal, shutting down workers
W0415 13:57:20.220000 309 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 376 closing signal SIGINT
W0415 13:57:20.221000 309 torch/distributed/elastic/multiprocessing/api.py:1010] Sending process 377 closing signal SIGINT
^Z
[1]+ Stopped torchrun --nproc_per_node=2 flops.py bandwidth
A couple general thoughts:
Description
Adds an architecture-independent FLOPs counting and MFU tracking module (flops.py) that can be used by any native_te recipe. The module auto-detects model architecture (GQA vs MHA, SwiGLU vs standard FFN) from HuggingFace config dicts and supports analytical, simplified, and Hyena (Evo2) FLOPs formulas.
MFU tracking is integrated into all 11 training scripts across 4 recipes, gated behind a log_mfu: false Hydra flag.
Some MFU results: https://linear.app/nvidia/issue/BIO-406/mfu-calculation-experiment#comment-50296297
Usage
During training:
CLI tool:
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
Commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An /ok to test comment on the pull request is required to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
MFU tracking can be enabled in training scripts via the log_mfu=true parameter.
Documentation
flops.py CLI commands for performance profiling and GPU analysis.
Tests