fix(audio): lazy-load audio stages and fix tutorial notebook paths by mohammadaaftabv · Pull Request #1835 · NVIDIA-NeMo/Curator

mohammadaaftabv · 2026-04-20T08:47:45Z

Summary

Empty nemo_curator/stages/audio/__init__.py — removes eager imports of all 13 audio stages at package load time. Previously, importing any audio stage transitively pulled in every optional dependency (e.g. onnxruntime for SIGMOSFilterStage), causing ModuleNotFoundError in tutorials that only need a subset. This now follows the same pattern used by the image modality.
Fix tutorials/audio/readspeech/pipeline.py — the only internal consumer of the old top-level import. Updated to import AudioDataFilterStage from its specific subpackage (nemo_curator.stages.audio.advanced_pipelines).
Fix ALM tutorial notebook — use relative paths (../../../tests/fixtures/...) instead of relying on subprocess/git rev-parse for repo root discovery. Also corrects the API to use ALMManifestReader (composite stage) instead of ALMManifestReaderStage.
Fix FLEURS tutorial notebook — use simple relative path (./example_audio/fleurs) for data directory.
Standardize RayClient usage across all ALM and FLEURS tutorial scripts (main.py, pipeline.py, run.py) and notebooks (alm_tutorial.ipynb, fleurs_tutorial.ipynb). RayClient manages the Ray cluster lifecycle (start/stop, port allocation, dashboard) — both XennaExecutor and RayDataExecutor run on top of Ray, so the client is needed regardless of backend. Scripts use try/finally for cleanup; notebooks use separate cells.
Backend comparison in notebooks — both notebooks now run the pipeline with both Xenna and Ray Data backends, compare results (entries, windows, WER, audio duration) for correctness, and report wall-clock timing with speedup.
Add RayClient to benchmark scripts — alm_pipeline_benchmark.py and audio_fleurs_benchmark.py now start/stop RayClient for standalone invocation. When run via the nightly runner, the runner's existing RayClient is reused (detected via RAY_ADDRESS).
Update benchmark docs — ALM_BENCHMARK.md and AUDIO_PROFILING.md updated with both-backend results. No performance regression from RayClient changes.
Update READMEs — ALM and FLEURS READMEs updated to document RayClient usage and show full execution patterns (including RayClient) in composability examples.

Benchmark Results (10,000 entries, repeat-factor=2000)

Metric	Xenna	Ray Data
Builder windows	362,000	362,000
Filtered windows	50,000	50,000
Filtered duration	6,071,000s	6,071,000s
Execution time	95.52s	29.89s
Throughput (entries/sec)	104.69	334.56

Both backends produce identical results. No performance regression from RayClient changes.

Test Plan

pytest tests/stages/audio/alm/ -v — 48 passed
ALM main.py — xenna + ray_data, single file + directory input
FLEURS pipeline.py — xenna + ray_data
FLEURS run.py (Hydra) — xenna + ray_data
ALM benchmark script — xenna + ray_data, small + large scale (repeat-factor=2000)
FLEURS benchmark script — xenna + ray_data
Benchmark numbers verified: identical results across backends, no regression

copy-pr-bot · 2026-04-20T08:47:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-04-20T08:50:05Z

Greptile Summary

This PR fixes eager imports in nemo_curator/stages/audio/__init__.py (preventing ModuleNotFoundError when only a subset of audio stages is needed), corrects tutorial notebook paths and API usage, and standardizes RayClient lifecycle management across all ALM and FLEURS tutorial scripts and benchmark scripts.

P1: audio_fleurs_benchmark.py calls ray_client.start() outside the try block — if Ray fails to start, write_benchmark_results is never called and the failure record is silently dropped (same pattern already flagged for alm_pipeline_benchmark.py).
P2: utils.py's setup_executor forwards config to all executor types without guarding for Xenna-only parameters, potentially calling RayDataExecutor(config=...) or RayActorPoolExecutor(config=...) with arguments they may not accept.

Confidence Score: 4/5

Safe to merge with the P1 in the fleurs benchmark script addressed; core library change is correct and well-scoped.

One P1 finding in the benchmark script (RayClient.start() outside try) limits the ceiling to 4/5. The core fix to the audio init.py is correct and the tutorial path/API fixes are clean.

benchmarking/scripts/audio_fleurs_benchmark.py (P1: RayClient.start outside try), benchmarking/scripts/utils.py (P2: config forwarded to all executor types)

Important Files Changed

Filename	Overview
benchmarking/scripts/audio_fleurs_benchmark.py	Adds RayClient lifecycle and --execution-mode flag; RayClient.start() placed outside try block, same as already-flagged alm benchmark issue
benchmarking/scripts/alm_pipeline_benchmark.py	Adds RayClient lifecycle and --execution-mode flag; ray_client.start() outside try already flagged in a previous thread
benchmarking/scripts/utils.py	setup_executor updated to accept optional config dict; passes config to all executor types without guarding for Xenna-only parameter
nemo_curator/stages/audio/init.py	Emptied of all eager imports; fixes transitive optional dependency loading at import time, matching the image modality pattern
tutorials/audio/alm/main.py	Adds RayClient lifecycle and execution_mode support; correct try/finally pattern for tutorial context
tutorials/audio/fleurs/pipeline.py	Adds RayClient lifecycle; correctly guards config to Xenna only; uses relative path for data directory
tutorials/audio/alm/alm_tutorial.ipynb	New notebook; fixes fixture path to use relative ../../../tests/fixtures/... and uses ALMManifestReader correctly
tutorials/audio/alm/README.md	Adds RayClient documentation, execution mode guide, composability section; composability code example issues flagged in previous threads

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Benchmark Script Entry] --> B[RayClient instantiated]
    B --> C[ray_client.start called OUTSIDE try]
    C --> D{Start succeeds?}
    D -->|Yes| E[try block runs pipeline]
    D -->|No - exception| F[finally block SKIPPED]
    E --> G[finally: write_benchmark_results + ray_client.stop]
    F --> H[Results lost, cluster not cleaned up]

    style F fill:#ff6b6b,color:#fff
    style H fill:#ff6b6b,color:#fff

_{Reviews (8): Last reviewed commit: "feat(audio): add execution mode support,..." | Re-trigger Greptile}

mohammadaaftabv · 2026-04-20T08:51:56Z

/ok to test dd0986b

greptile-apps · 2026-04-20T08:54:09Z

+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
+from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
+from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
+
+pipeline = Pipeline(
+    name="custom-alm",
+    stages=[
+        ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
+        ALMDataBuilderStage(target_window_duration=120.0),
+        ALMDataOverlapStage(overlap_percentage=50),
+        # Add downstream stages: sharding, feature extraction, etc.
+    ],
+)
+```


Composability example uses wrong class and non-existent parameter

ALMManifestReaderStage is the low-level inner stage that receives a FileGroupTask via its process() method — it has no constructor argument at all. Passing manifest_paths=["data.jsonl"] will cause a TypeError at instantiation, and even if it didn't, the pipeline would stall waiting for a FileGroupTask that never arrives. The user-facing composite stage (corrected in the notebook in this very PR) is ALMManifestReader, which accepts manifest_path (singular, str | list[str]).

greptile-apps · 2026-04-20T11:00:32Z

+from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
+from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
+from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
+
+pipeline = Pipeline(
+    name="custom-alm",
+    stages=[
+        ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
+        ALMDataBuilderStage(target_window_duration=120.0),


Composability example still uses wrong class and parameter

The composability snippet on line 568 instantiates ALMManifestReaderStage(manifest_paths=["data.jsonl"]). ALMManifestReaderStage is the low-level inner stage that expects a FileGroupTask via process() — it accepts no constructor arguments. The notebook was corrected in this PR to use ALMManifestReader(manifest_path=MANIFEST_PATH) (the composite, user-facing stage), but this documentation block was not updated.

Suggested change

from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage

from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage

from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(

name="custom-alm",

stages=[

ALMManifestReaderStage(manifest_paths=["data.jsonl"]),

ALMDataBuilderStage(target_window_duration=120.0),

from nemo_curator.pipeline import Pipeline

from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReader

from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage

from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(

name="custom-alm",

stages=[

ALMManifestReader(manifest_path="data.jsonl"),

ALMDataBuilderStage(target_window_duration=120.0),

ALMDataOverlapStage(overlap_percentage=50),

# Add downstream stages: sharding, feature extraction, etc.

],

)

sarahyurick · 2026-04-20T17:02:37Z

+        }
+      ],
+      "source": [
+        "os.environ[\"RAY_MAX_LIMIT_FROM_API_SERVER\"] = \"100000\"\n",


You might be getting the error since you are setting RAY_MAX_LIMIT_FROM_API_SERVER after Curator and Ray have been imported. Maybe setting it in the very first cell of the notebook, before any other imports, might help?

I also noticed there isn't a RayClient in this notebook?

…and standardized sections - Add decision table, data availability, and system deps tables to top-level audio README - Create shared README_TEMPLATE.md for consistent tutorial documentation - Standardize fleurs/README.md: add pipeline flow diagram, full output schema, composability section, WER threshold justification, and troubleshooting table Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…hooting Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- Empty nemo_curator/stages/audio/__init__.py to prevent eager imports of optional deps (onnxruntime) at package load time, matching the image modality pattern. - Update readspeech/pipeline.py to import from specific subpackage. - Fix ALM notebook: use relative paths and correct ALMManifestReader API. - Fix FLEURS notebook: use relative paths for data directory. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- FLEURS: use os.path.abspath() for RAW_DATA_DIR so Ray workers resolve the same path as the notebook kernel. - ALM: add RAY_MAX_LIMIT_FROM_API_SERVER env var to avoid Xenna monitoring hitting the Ray API server limit. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…p notebook outputs - ALM README: use ALMManifestReader (composite stage) instead of ALMManifestReaderStage, fix manifest_paths -> manifest_path param - ALM notebook: move RAY_MAX_LIMIT_FROM_API_SERVER to first cell before imports (fixes RayStateApiException) - Strip notebook outputs to fix secrets-detector CI (base64 images flagged as high-entropy strings) Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Add RayClient().start() before XennaExecutor, matching the pattern from callhome_diar/run.py and single_speaker_filter/run.py. Addresses Sarah review comment about missing RayClient. Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…ate benchmarks Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

greptile-apps · 2026-04-23T12:54:49Z

Want your agent to iterate on Greptile's feedback? Try greploops.

greptile-apps · 2026-04-23T12:54:53Z

+    ray_client = RayClient()
+    ray_client.start()
+
    try:
        result_dict.update(run_alm_pipeline_benchmark(**run_args))
        success_code = 0 if result_dict["metrics"]["is_success"] else 1
    finally:
        write_benchmark_results(result_dict, args.benchmark_results_path)
+        ray_client.stop()


write_benchmark_results skipped when RayClient.start() fails

ray_client.start() is placed outside the try block, so if it raises a RuntimeError (e.g., Ray can't become responsive), the finally clause never executes and write_benchmark_results is never called. The result_dict already has is_success: False, but that record is lost — exactly the case where CI needs it most.

Move ray_client.start() inside the try block so the finally clause runs regardless:

ray_client = RayClient() try: ray_client.start() result_dict.update(run_alm_pipeline_benchmark(**run_args)) success_code = 0 if result_dict["metrics"]["is_success"] else 1 finally: write_benchmark_results(result_dict, args.benchmark_results_path) ray_client.stop()

sarahyurick · 2026-04-23T16:17:56Z

 from loguru import logger
 from utils import setup_executor, write_benchmark_results

+from nemo_curator.core.client import RayClient


We prefer not to have the benchmarking scripts use RayClient themselves: #1593

sarahyurick · 2026-04-23T16:19:27Z


-This creates a `.venv` with all base, dev, test, and audio dependencies resolved
-from the lockfile. If you don't have `uv`, you can fall back to pip:
+If you don't have `uv`, fall back to pip:


Let's not encourage falling back to pip. Users should install with uv, or use Docker.

sarahyurick · 2026-04-23T16:21:40Z

This notebook is rendering invalid for me.

sarahyurick · 2026-04-23T16:21:59Z

Same here, it says invalid notebook.

…benchmarks Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

greptile-apps · 2026-04-25T07:37:05Z

+    ray_client = RayClient()
+    ray_client.start()
+
    try:
        results.update(run_audio_fleurs_benchmark(**vars(args)))
    finally:
        write_benchmark_results(results, args.benchmark_results_path)
+        ray_client.stop()


write_benchmark_results lost when RayClient.start() fails

ray_client.start() is called outside the try block (same pattern flagged in alm_pipeline_benchmark.py). If it raises — e.g., Ray cannot become responsive — the finally clause is skipped, so write_benchmark_results is never called and the is_success: False record is silently lost.

Suggested change

ray_client = RayClient()

ray_client.start()

try:

results.update(run_audio_fleurs_benchmark(**vars(args)))

finally:

write_benchmark_results(results, args.benchmark_results_path)

ray_client.stop()

ray_client = RayClient()

try:

ray_client.start()

results.update(run_audio_fleurs_benchmark(**vars(args)))

finally:

write_benchmark_results(results, args.benchmark_results_path)

ray_client.stop()

mohammadaaftabv requested a review from a team as a code owner April 20, 2026 08:47

mohammadaaftabv requested review from ayushdg and removed request for a team April 20, 2026 08:47

copy-pr-bot Bot temporarily deployed to test April 20, 2026 08:52 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 20, 2026 08:52 Inactive

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to nemo-ci April 20, 2026 09:03 Inactive

mohammadaaftabv requested review from karpnv and sarahyurick April 20, 2026 10:32

mohammadaaftabv force-pushed the audio-tutorial-improvements branch from eb870af to 0fecc0c Compare April 20, 2026 10:57

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

sarahyurick requested changes Apr 20, 2026

View reviewed changes

mohammadaaftabv added 6 commits April 23, 2026 12:57

docs(audio): add tuning, performance, notebooks, quickstart, troubles…

db95870

…hooting Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

docs(audio): add matplotlib visualizations to FLEURS and ALM notebooks

a9ef43b

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

mohammadaaftabv force-pushed the audio-tutorial-improvements branch from 0fecc0c to 050eb31 Compare April 23, 2026 07:43

mohammadaaftabv added 2 commits April 23, 2026 13:20

feat(audio): standardize RayClient usage, add backend comparison, upd…

75dd851

…ate benchmarks Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

mohammadaaftabv requested review from praateekmahajan and rlratzel as code owners April 23, 2026 12:51

greptile-apps Bot reviewed Apr 23, 2026

View reviewed changes

sarahyurick requested changes Apr 23, 2026

View reviewed changes

feat(audio): add execution mode support, update profiling with fresh …

b868fbc

…benchmarks Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

mohammadaaftabv requested a review from sarahyurick April 25, 2026 07:35

greptile-apps Bot reviewed Apr 25, 2026

View reviewed changes

Conversation

mohammadaaftabv commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark Results (10,000 entries, repeat-factor=2000)

Test Plan

Uh oh!

copy-pr-bot Bot commented Apr 20, 2026

Uh oh!

greptile-apps Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

mohammadaaftabv commented Apr 20, 2026

Uh oh!

greptile-apps Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 23, 2026

Uh oh!

greptile-apps Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sarahyurick Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mohammadaaftabv commented Apr 20, 2026 •

edited

Loading

greptile-apps Bot commented Apr 20, 2026 •

edited

Loading