Skip to content

fix(audio): lazy-load audio stages and fix tutorial notebook paths#1835

Open
mohammadaaftabv wants to merge 9 commits intoNVIDIA-NeMo:mainfrom
mohammadaaftabv:audio-tutorial-improvements
Open

fix(audio): lazy-load audio stages and fix tutorial notebook paths#1835
mohammadaaftabv wants to merge 9 commits intoNVIDIA-NeMo:mainfrom
mohammadaaftabv:audio-tutorial-improvements

Conversation

@mohammadaaftabv
Copy link
Copy Markdown
Contributor

@mohammadaaftabv mohammadaaftabv commented Apr 20, 2026

Summary

  • Empty nemo_curator/stages/audio/__init__.py — removes eager imports of all 13 audio stages at package load time. Previously, importing any audio stage transitively pulled in every optional dependency (e.g. onnxruntime for SIGMOSFilterStage), causing ModuleNotFoundError in tutorials that only need a subset. This now follows the same pattern used by the image modality.

  • Fix tutorials/audio/readspeech/pipeline.py — the only internal consumer of the old top-level import. Updated to import AudioDataFilterStage from its specific subpackage (nemo_curator.stages.audio.advanced_pipelines).

  • Fix ALM tutorial notebook — use relative paths (../../../tests/fixtures/...) instead of relying on subprocess/git rev-parse for repo root discovery. Also corrects the API to use ALMManifestReader (composite stage) instead of ALMManifestReaderStage.

  • Fix FLEURS tutorial notebook — use simple relative path (./example_audio/fleurs) for data directory.

  • Standardize RayClient usage across all ALM and FLEURS tutorial scripts (main.py, pipeline.py, run.py) and notebooks (alm_tutorial.ipynb, fleurs_tutorial.ipynb). RayClient manages the Ray cluster lifecycle (start/stop, port allocation, dashboard) — both XennaExecutor and RayDataExecutor run on top of Ray, so the client is needed regardless of backend. Scripts use try/finally for cleanup; notebooks use separate cells.

  • Backend comparison in notebooks — both notebooks now run the pipeline with both Xenna and Ray Data backends, compare results (entries, windows, WER, audio duration) for correctness, and report wall-clock timing with speedup.

  • Add RayClient to benchmark scriptsalm_pipeline_benchmark.py and audio_fleurs_benchmark.py now start/stop RayClient for standalone invocation. When run via the nightly runner, the runner's existing RayClient is reused (detected via RAY_ADDRESS).

  • Update benchmark docsALM_BENCHMARK.md and AUDIO_PROFILING.md updated with both-backend results. No performance regression from RayClient changes.

  • Update READMEs — ALM and FLEURS READMEs updated to document RayClient usage and show full execution patterns (including RayClient) in composability examples.

Benchmark Results (10,000 entries, repeat-factor=2000)

Metric Xenna Ray Data
Builder windows 362,000 362,000
Filtered windows 50,000 50,000
Filtered duration 6,071,000s 6,071,000s
Execution time 95.52s 29.89s
Throughput (entries/sec) 104.69 334.56

Both backends produce identical results. No performance regression from RayClient changes.

Test Plan

  • pytest tests/stages/audio/alm/ -v — 48 passed
  • ALM main.py — xenna + ray_data, single file + directory input
  • FLEURS pipeline.py — xenna + ray_data
  • FLEURS run.py (Hydra) — xenna + ray_data
  • ALM benchmark script — xenna + ray_data, small + large scale (repeat-factor=2000)
  • FLEURS benchmark script — xenna + ray_data
  • Benchmark numbers verified: identical results across backends, no regression

@mohammadaaftabv mohammadaaftabv requested a review from a team as a code owner April 20, 2026 08:47
@mohammadaaftabv mohammadaaftabv requested review from ayushdg and removed request for a team April 20, 2026 08:47
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR fixes eager imports in nemo_curator/stages/audio/__init__.py (preventing ModuleNotFoundError when only a subset of audio stages is needed), corrects tutorial notebook paths and API usage, and standardizes RayClient lifecycle management across all ALM and FLEURS tutorial scripts and benchmark scripts.

  • P1: audio_fleurs_benchmark.py calls ray_client.start() outside the try block — if Ray fails to start, write_benchmark_results is never called and the failure record is silently dropped (same pattern already flagged for alm_pipeline_benchmark.py).
  • P2: utils.py's setup_executor forwards config to all executor types without guarding for Xenna-only parameters, potentially calling RayDataExecutor(config=...) or RayActorPoolExecutor(config=...) with arguments they may not accept.

Confidence Score: 4/5

Safe to merge with the P1 in the fleurs benchmark script addressed; core library change is correct and well-scoped.

One P1 finding in the benchmark script (RayClient.start() outside try) limits the ceiling to 4/5. The core fix to the audio init.py is correct and the tutorial path/API fixes are clean.

benchmarking/scripts/audio_fleurs_benchmark.py (P1: RayClient.start outside try), benchmarking/scripts/utils.py (P2: config forwarded to all executor types)

Important Files Changed

Filename Overview
benchmarking/scripts/audio_fleurs_benchmark.py Adds RayClient lifecycle and --execution-mode flag; RayClient.start() placed outside try block, same as already-flagged alm benchmark issue
benchmarking/scripts/alm_pipeline_benchmark.py Adds RayClient lifecycle and --execution-mode flag; ray_client.start() outside try already flagged in a previous thread
benchmarking/scripts/utils.py setup_executor updated to accept optional config dict; passes config to all executor types without guarding for Xenna-only parameter
nemo_curator/stages/audio/init.py Emptied of all eager imports; fixes transitive optional dependency loading at import time, matching the image modality pattern
tutorials/audio/alm/main.py Adds RayClient lifecycle and execution_mode support; correct try/finally pattern for tutorial context
tutorials/audio/fleurs/pipeline.py Adds RayClient lifecycle; correctly guards config to Xenna only; uses relative path for data directory
tutorials/audio/alm/alm_tutorial.ipynb New notebook; fixes fixture path to use relative ../../../tests/fixtures/... and uses ALMManifestReader correctly
tutorials/audio/alm/README.md Adds RayClient documentation, execution mode guide, composability section; composability code example issues flagged in previous threads

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Benchmark Script Entry] --> B[RayClient instantiated]
    B --> C[ray_client.start called OUTSIDE try]
    C --> D{Start succeeds?}
    D -->|Yes| E[try block runs pipeline]
    D -->|No - exception| F[finally block SKIPPED]
    E --> G[finally: write_benchmark_results + ray_client.stop]
    F --> H[Results lost, cluster not cleaned up]

    style F fill:#ff6b6b,color:#fff
    style H fill:#ff6b6b,color:#fff
Loading

Reviews (8): Last reviewed commit: "feat(audio): add execution mode support,..." | Re-trigger Greptile

@mohammadaaftabv
Copy link
Copy Markdown
Contributor Author

/ok to test dd0986b

Comment on lines +560 to +574
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(
name="custom-alm",
stages=[
ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
ALMDataBuilderStage(target_window_duration=120.0),
ALMDataOverlapStage(overlap_percentage=50),
# Add downstream stages: sharding, feature extraction, etc.
],
)
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Composability example uses wrong class and non-existent parameter

ALMManifestReaderStage is the low-level inner stage that receives a FileGroupTask via its process() method — it has no constructor argument at all. Passing manifest_paths=["data.jsonl"] will cause a TypeError at instantiation, and even if it didn't, the pipeline would stall waiting for a FileGroupTask that never arrives. The user-facing composite stage (corrected in the notebook in this very PR) is ALMManifestReader, which accepts manifest_path (singular, str | list[str]).

Comment thread tutorials/audio/alm/README.md Outdated
Comment on lines +561 to +569
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(
name="custom-alm",
stages=[
ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
ALMDataBuilderStage(target_window_duration=120.0),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Composability example still uses wrong class and parameter

The composability snippet on line 568 instantiates ALMManifestReaderStage(manifest_paths=["data.jsonl"]). ALMManifestReaderStage is the low-level inner stage that expects a FileGroupTask via process() — it accepts no constructor arguments. The notebook was corrected in this PR to use ALMManifestReader(manifest_path=MANIFEST_PATH) (the composite, user-facing stage), but this documentation block was not updated.

Suggested change
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
pipeline = Pipeline(
name="custom-alm",
stages=[
ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
ALMDataBuilderStage(target_window_duration=120.0),
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReader
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
pipeline = Pipeline(
name="custom-alm",
stages=[
ALMManifestReader(manifest_path="data.jsonl"),
ALMDataBuilderStage(target_window_duration=120.0),
ALMDataOverlapStage(overlap_percentage=50),
# Add downstream stages: sharding, feature extraction, etc.
],
)

Comment thread tutorials/audio/alm/alm_tutorial.ipynb Outdated
}
],
"source": [
"os.environ[\"RAY_MAX_LIMIT_FROM_API_SERVER\"] = \"100000\"\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be getting the error since you are setting RAY_MAX_LIMIT_FROM_API_SERVER after Curator and Ray have been imported. Maybe setting it in the very first cell of the notebook, before any other imports, might help?

I also noticed there isn't a RayClient in this notebook?

…and standardized sections

- Add decision table, data availability, and system deps tables to top-level audio README
- Create shared README_TEMPLATE.md for consistent tutorial documentation
- Standardize fleurs/README.md: add pipeline flow diagram, full output schema,
  composability section, WER threshold justification, and troubleshooting table

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…hooting

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- Empty nemo_curator/stages/audio/__init__.py to prevent eager imports
  of optional deps (onnxruntime) at package load time, matching the
  image modality pattern.
- Update readspeech/pipeline.py to import from specific subpackage.
- Fix ALM notebook: use relative paths and correct ALMManifestReader API.
- Fix FLEURS notebook: use relative paths for data directory.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
- FLEURS: use os.path.abspath() for RAW_DATA_DIR so Ray workers resolve
  the same path as the notebook kernel.
- ALM: add RAY_MAX_LIMIT_FROM_API_SERVER env var to avoid Xenna
  monitoring hitting the Ray API server limit.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…p notebook outputs

- ALM README: use ALMManifestReader (composite stage) instead of
  ALMManifestReaderStage, fix manifest_paths -> manifest_path param
- ALM notebook: move RAY_MAX_LIMIT_FROM_API_SERVER to first cell
  before imports (fixes RayStateApiException)
- Strip notebook outputs to fix secrets-detector CI (base64 images
  flagged as high-entropy strings)

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
@mohammadaaftabv mohammadaaftabv force-pushed the audio-tutorial-improvements branch from 0fecc0c to 050eb31 Compare April 23, 2026 07:43
Add RayClient().start() before XennaExecutor, matching the pattern
from callhome_diar/run.py and single_speaker_filter/run.py.
Addresses Sarah review comment about missing RayClient.

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
…ate benchmarks

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 23, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Comment on lines +251 to +259
ray_client = RayClient()
ray_client.start()

try:
result_dict.update(run_alm_pipeline_benchmark(**run_args))
success_code = 0 if result_dict["metrics"]["is_success"] else 1
finally:
write_benchmark_results(result_dict, args.benchmark_results_path)
ray_client.stop()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 write_benchmark_results skipped when RayClient.start() fails

ray_client.start() is placed outside the try block, so if it raises a RuntimeError (e.g., Ray can't become responsive), the finally clause never executes and write_benchmark_results is never called. The result_dict already has is_success: False, but that record is lost — exactly the case where CI needs it most.

Move ray_client.start() inside the try block so the finally clause runs regardless:

ray_client = RayClient()
try:
    ray_client.start()
    result_dict.update(run_alm_pipeline_benchmark(**run_args))
    success_code = 0 if result_dict["metrics"]["is_success"] else 1
finally:
    write_benchmark_results(result_dict, args.benchmark_results_path)
    ray_client.stop()

from loguru import logger
from utils import setup_executor, write_benchmark_results

from nemo_curator.core.client import RayClient
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer not to have the benchmarking scripts use RayClient themselves: #1593


This creates a `.venv` with all base, dev, test, and audio dependencies resolved
from the lockfile. If you don't have `uv`, you can fall back to pip:
If you don't have `uv`, fall back to pip:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not encourage falling back to pip. Users should install with uv, or use Docker.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This notebook is rendering invalid for me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, it says invalid notebook.

…benchmarks

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
Comment on lines +189 to +196
ray_client = RayClient()
ray_client.start()

try:
results.update(run_audio_fleurs_benchmark(**vars(args)))
finally:
write_benchmark_results(results, args.benchmark_results_path)
ray_client.stop()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 write_benchmark_results lost when RayClient.start() fails

ray_client.start() is called outside the try block (same pattern flagged in alm_pipeline_benchmark.py). If it raises — e.g., Ray cannot become responsive — the finally clause is skipped, so write_benchmark_results is never called and the is_success: False record is silently lost.

Suggested change
ray_client = RayClient()
ray_client.start()
try:
results.update(run_audio_fleurs_benchmark(**vars(args)))
finally:
write_benchmark_results(results, args.benchmark_results_path)
ray_client.stop()
ray_client = RayClient()
try:
ray_client.start()
results.update(run_audio_fleurs_benchmark(**vars(args)))
finally:
write_benchmark_results(results, args.benchmark_results_path)
ray_client.stop()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants