fix(audio): lazy-load audio stages and fix tutorial notebook paths #1835
mohammadaaftabv wants to merge 9 commits into NVIDIA-NeMo:main from
Conversation
Greptile Summary

This PR fixes eager imports in `nemo_curator/stages/audio/__init__.py` so optional audio dependencies are loaded lazily, and fixes tutorial notebook paths.

Confidence Score: 4/5. Safe to merge with the P1 in the fleurs benchmark script addressed; the core library change is correct and well-scoped. One P1 finding in the benchmark script (`RayClient.start()` outside `try`) limits the ceiling to 4/5. The core fix to the audio `__init__.py` is correct, and the tutorial path/API fixes are clean.

Important Files Changed:
- `benchmarking/scripts/audio_fleurs_benchmark.py` (P1: `RayClient.start()` outside `try`)
- `benchmarking/scripts/utils.py` (P2: config forwarded to all executor types)
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Benchmark Script Entry] --> B[RayClient instantiated]
B --> C[ray_client.start called OUTSIDE try]
C --> D{Start succeeds?}
D -->|Yes| E[try block runs pipeline]
D -->|No - exception| F[finally block SKIPPED]
E --> G[finally: write_benchmark_results + ray_client.stop]
F --> H[Results lost, cluster not cleaned up]
style F fill:#ff6b6b,color:#fff
style H fill:#ff6b6b,color:#fff
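The failure path in the flowchart can be reproduced with a minimal, self-contained sketch (function names here are illustrative, not the benchmark script's): when the failing start call sits before the `try`, the `finally` cleanup never runs; moving it inside the `try` guarantees cleanup.

```python
# Minimal reproduction of the control flow the flowchart describes
# (names are illustrative stand-ins for RayClient.start and the pipeline).
def flaky_start():
    raise RuntimeError("Ray can't become responsive")

cleanup_ran = []

def buggy():
    flaky_start()                    # raises before the try is entered...
    try:
        pass                         # ...so the pipeline never runs
    finally:
        cleanup_ran.append("buggy")  # ...and this cleanup is skipped

def fixed():
    try:
        flaky_start()                # raises inside the try...
    finally:
        cleanup_ran.append("fixed")  # ...so cleanup always runs

for fn in (buggy, fixed):
    try:
        fn()
    except RuntimeError:
        pass

print(cleanup_ran)  # only the fixed variant recorded its cleanup
```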
Reviews (8): Last reviewed commit: "feat(audio): add execution mode support,..."

/ok to test dd0986b
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(
    name="custom-alm",
    stages=[
        ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
        ALMDataBuilderStage(target_window_duration=120.0),
        ALMDataOverlapStage(overlap_percentage=50),
        # Add downstream stages: sharding, feature extraction, etc.
    ],
)
```
Composability example uses wrong class and non-existent parameter
ALMManifestReaderStage is the low-level inner stage that receives a FileGroupTask via its process() method — it has no constructor argument at all. Passing manifest_paths=["data.jsonl"] will cause a TypeError at instantiation, and even if it didn't, the pipeline would stall waiting for a FileGroupTask that never arrives. The user-facing composite stage (corrected in the notebook in this very PR) is ALMManifestReader, which accepts manifest_path (singular, str | list[str]).
eb870af to 0fecc0c
```python
from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage

pipeline = Pipeline(
    name="custom-alm",
    stages=[
        ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
        ALMDataBuilderStage(target_window_duration=120.0),
```
Composability example still uses wrong class and parameter
The composability snippet on line 568 instantiates ALMManifestReaderStage(manifest_paths=["data.jsonl"]). ALMManifestReaderStage is the low-level inner stage that expects a FileGroupTask via process() — it accepts no constructor arguments. The notebook was corrected in this PR to use ALMManifestReader(manifest_path=MANIFEST_PATH) (the composite, user-facing stage), but this documentation block was not updated.
Suggested change:

```diff
-from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReaderStage
-from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
-from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
-pipeline = Pipeline(
-    name="custom-alm",
-    stages=[
-        ALMManifestReaderStage(manifest_paths=["data.jsonl"]),
-        ALMDataBuilderStage(target_window_duration=120.0),
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.audio.alm.alm_manifest_reader import ALMManifestReader
+from nemo_curator.stages.audio.alm.alm_data_builder import ALMDataBuilderStage
+from nemo_curator.stages.audio.alm.alm_data_overlap import ALMDataOverlapStage
+pipeline = Pipeline(
+    name="custom-alm",
+    stages=[
+        ALMManifestReader(manifest_path="data.jsonl"),
+        ALMDataBuilderStage(target_window_duration=120.0),
+        ALMDataOverlapStage(overlap_percentage=50),
+        # Add downstream stages: sharding, feature extraction, etc.
+    ],
+)
```
```json
    }
  ],
  "source": [
    "os.environ[\"RAY_MAX_LIMIT_FROM_API_SERVER\"] = \"100000\"\n",
```
You might be getting the error because you are setting RAY_MAX_LIMIT_FROM_API_SERVER after Curator and Ray have been imported. Setting it in the very first cell of the notebook, before any other imports, might help.
I also noticed there isn't a RayClient in this notebook?
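A sketch of the suggested fix, assuming the usual notebook layout: put the environment variable in the very first cell, before anything imports `ray` directly or transitively, since the reviewer's comment implies Ray reads this variable at import/startup time.

```python
# First notebook cell: set Ray-related environment variables before any
# import that might transitively pull in ray.
import os

os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"] = "100000"

# Later cells can now import Curator / Ray safely; the limit is already set.
print(os.environ["RAY_MAX_LIMIT_FROM_API_SERVER"])
```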
…and standardized sections
- Add decision table, data availability, and system deps tables to top-level audio README
- Create shared README_TEMPLATE.md for consistent tutorial documentation
- Standardize fleurs/README.md: add pipeline flow diagram, full output schema, composability section, WER threshold justification, and troubleshooting table
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…hooting
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- Empty nemo_curator/stages/audio/__init__.py to prevent eager imports of optional deps (onnxruntime) at package load time, matching the image modality pattern.
- Update readspeech/pipeline.py to import from specific subpackage.
- Fix ALM notebook: use relative paths and correct ALMManifestReader API.
- Fix FLEURS notebook: use relative paths for data directory.
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

- FLEURS: use os.path.abspath() for RAW_DATA_DIR so Ray workers resolve the same path as the notebook kernel.
- ALM: add RAY_MAX_LIMIT_FROM_API_SERVER env var to avoid Xenna monitoring hitting the Ray API server limit.
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…p notebook outputs
- ALM README: use ALMManifestReader (composite stage) instead of ALMManifestReaderStage, fix manifest_paths -> manifest_path param
- ALM notebook: move RAY_MAX_LIMIT_FROM_API_SERVER to first cell before imports (fixes RayStateApiException)
- Strip notebook outputs to fix secrets-detector CI (base64 images flagged as high-entropy strings)
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
0fecc0c to 050eb31
Add RayClient().start() before XennaExecutor, matching the pattern from callhome_diar/run.py and single_speaker_filter/run.py. Addresses Sarah review comment about missing RayClient.
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>

…ate benchmarks
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
```python
ray_client = RayClient()
ray_client.start()

try:
    result_dict.update(run_alm_pipeline_benchmark(**run_args))
    success_code = 0 if result_dict["metrics"]["is_success"] else 1
finally:
    write_benchmark_results(result_dict, args.benchmark_results_path)
    ray_client.stop()
```
write_benchmark_results skipped when RayClient.start() fails
ray_client.start() is placed outside the try block, so if it raises a RuntimeError (e.g., Ray can't become responsive), the finally clause never executes and write_benchmark_results is never called. The result_dict already has is_success: False, but that record is lost — exactly the case where CI needs it most.
Move ray_client.start() inside the try block so the finally clause runs regardless:
```python
ray_client = RayClient()
try:
    ray_client.start()
    result_dict.update(run_alm_pipeline_benchmark(**run_args))
    success_code = 0 if result_dict["metrics"]["is_success"] else 1
finally:
    write_benchmark_results(result_dict, args.benchmark_results_path)
    ray_client.stop()
```

```python
from loguru import logger
from utils import setup_executor, write_benchmark_results

from nemo_curator.core.client import RayClient
```
We prefer not to have the benchmarking scripts use RayClient themselves: #1593
```diff
 This creates a `.venv` with all base, dev, test, and audio dependencies resolved
-from the lockfile. If you don't have `uv`, you can fall back to pip:
+from the lockfile. If you don't have `uv`, fall back to pip:
```
Let's not encourage falling back to pip. Users should install with uv, or use Docker.
This notebook is rendering invalid for me.
Same here, it says invalid notebook.
…benchmarks
Signed-off-by: aaftaabv@gmail.com <aaftaabv@gmail.com>
```python
ray_client = RayClient()
ray_client.start()

try:
    results.update(run_audio_fleurs_benchmark(**vars(args)))
finally:
    write_benchmark_results(results, args.benchmark_results_path)
    ray_client.stop()
```
write_benchmark_results lost when RayClient.start() fails
ray_client.start() is called outside the try block (same pattern flagged in alm_pipeline_benchmark.py). If it raises — e.g., Ray cannot become responsive — the finally clause is skipped, so write_benchmark_results is never called and the is_success: False record is silently lost.
Suggested change:

```diff
-ray_client = RayClient()
-ray_client.start()
-try:
-    results.update(run_audio_fleurs_benchmark(**vars(args)))
-finally:
-    write_benchmark_results(results, args.benchmark_results_path)
-    ray_client.stop()
+ray_client = RayClient()
+try:
+    ray_client.start()
+    results.update(run_audio_fleurs_benchmark(**vars(args)))
+finally:
+    write_benchmark_results(results, args.benchmark_results_path)
+    ray_client.stop()
```
Summary
- Empty `nemo_curator/stages/audio/__init__.py`: removes eager imports of all 13 audio stages at package load time. Previously, importing any audio stage transitively pulled in every optional dependency (e.g. `onnxruntime` for `SIGMOSFilterStage`), causing `ModuleNotFoundError` in tutorials that only need a subset. This now follows the same pattern used by the `image` modality.
- Fix `tutorials/audio/readspeech/pipeline.py`: the only internal consumer of the old top-level import. Updated to import `AudioDataFilterStage` from its specific subpackage (`nemo_curator.stages.audio.advanced_pipelines`).
- Fix ALM tutorial notebook: use relative paths (`../../../tests/fixtures/...`) instead of relying on `subprocess`/`git rev-parse` for repo root discovery. Also corrects the API to use `ALMManifestReader` (composite stage) instead of `ALMManifestReaderStage`.
- Fix FLEURS tutorial notebook: use a simple relative path (`./example_audio/fleurs`) for the data directory.
- Standardize `RayClient` usage across all ALM and FLEURS tutorial scripts (`main.py`, `pipeline.py`, `run.py`) and notebooks (`alm_tutorial.ipynb`, `fleurs_tutorial.ipynb`). `RayClient` manages the Ray cluster lifecycle (start/stop, port allocation, dashboard); both `XennaExecutor` and `RayDataExecutor` run on top of Ray, so the client is needed regardless of backend. Scripts use `try`/`finally` for cleanup; notebooks use separate cells.
- Backend comparison in notebooks: both notebooks now run the pipeline with both Xenna and Ray Data backends, compare results (entries, windows, WER, audio duration) for correctness, and report wall-clock timing with speedup.
- Add `RayClient` to benchmark scripts: `alm_pipeline_benchmark.py` and `audio_fleurs_benchmark.py` now start/stop `RayClient` for standalone invocation. When run via the nightly runner, the runner's existing `RayClient` is reused (detected via `RAY_ADDRESS`).
- Update benchmark docs: `ALM_BENCHMARK.md` and `AUDIO_PROFILING.md` updated with both-backend results. No performance regression from `RayClient` changes.
- Update READMEs: ALM and FLEURS READMEs updated to document `RayClient` usage and show full execution patterns (including `RayClient`) in composability examples.

Benchmark Results (10,000 entries, repeat-factor=2000)

Both backends produce identical results. No performance regression from `RayClient` changes.

Test Plan

- `pytest tests/stages/audio/alm/ -v`: 48 passed
- `main.py`: xenna + ray_data, single file + directory input
- `pipeline.py`: xenna + ray_data
- `run.py` (Hydra): xenna + ray_data
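A toy sketch of why emptying the package `__init__.py` avoids the `ModuleNotFoundError` described in the summary: with an eager `__init__.py`, merely importing the package fails when any optional dependency is missing; with an empty `__init__.py`, only the subpackage that actually needs the dependency fails. All package and module names below are illustrative stand-ins, not NeMo Curator's real layout.

```python
# Illustrative sketch (hypothetical package names): why an empty package
# __init__.py avoids eager optional-dependency imports.
import sys
import tempfile
import textwrap
from pathlib import Path

root = Path(tempfile.mkdtemp())
sys.path.insert(0, str(root))

# Package whose __init__.py eagerly imports every stage (old behavior).
eager = root / "eager_pkg"
eager.mkdir()
(eager / "__init__.py").write_text("from eager_pkg.sigmos import SigmosStage\n")
(eager / "sigmos.py").write_text(textwrap.dedent("""
    import onnxruntime_missing  # stand-in for an optional dep that is not installed
    class SigmosStage: ...
"""))

# Package with an empty __init__.py (new behavior): subpackages import lazily.
lazy = root / "lazy_pkg"
lazy.mkdir()
(lazy / "__init__.py").write_text("")
(lazy / "basic.py").write_text("class BasicStage: ...\n")

try:
    import eager_pkg  # noqa: F401  # fails: __init__ pulls the missing dep
except ModuleNotFoundError:
    print("eager package import failed")

from lazy_pkg.basic import BasicStage  # succeeds: no optional dep touched
print("lazy subpackage import succeeded:", BasicStage.__name__)
```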