
Add synthetic audio sanity check skill#1874

Open
fqian1107 wants to merge 2 commits into main from audio_curator_skills

Conversation

@fqian1107

Description

Usage

    # Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@fqian1107 fqian1107 requested review from a team as code owners April 27, 2026 11:55
@fqian1107 fqian1107 requested review from suiyoubi and removed request for a team April 27, 2026 11:55
@copy-pr-bot

copy-pr-bot Bot commented Apr 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR adds a MagpieTTSStage dataclass that synthesises audio from text using the Magpie TTS model, along with a Claude skill (curator-synth-audio-sanity) that drives an end-to-end pipeline sanity check using synthetic audio. The SKILL.md is well-structured and its class references (ManifestReader, ManifestWriterStage) match the real codebase.

Several P1 defects flagged in prior review threads remain unaddressed in the current code: speaker always resolves to Sofia (ClassVar never updated from the instance field), setup_on_node guard never fires (return value of from_pretrained is discarded so _tts_model stays None), and sf.write will raise FileNotFoundError because generated_audio_dir is never created in setup(). The conflicting onnxruntime / onnxruntime-gpu entries in pyproject.toml are also still present.

Confidence Score: 2/5

Not safe to merge — multiple P1 defects from prior review remain unresolved in the current code.

Four P1 findings flagged in previous review threads are still present in the code as submitted: speaker always hardcoded to Sofia, setup_on_node guard never short-circuits, missing makedirs before sf.write, and conflicting onnxruntime/onnxruntime-gpu in pyproject.toml. Multiple unresolved P1s pull the score below the 4/5 ceiling for P1-only PRs.

nemo_curator/stages/audio/inference/tts/magpie.py requires the most attention; pyproject.toml also needs the onnxruntime conflict resolved.

Important Files Changed

Filename Overview
nemo_curator/stages/audio/inference/tts/magpie.py New MagpieTTS stage; multiple P1 defects flagged in prior review: speaker field silently ignored (ClassVar never updated), setup_on_node guard never fires (return value discarded), generated_audio_dir not created before sf.write, and wrong return-type annotation on generate_audio_file — all still present in current code.
pyproject.toml Adds nemo_toolkit[tts] to audio_common (heavy dep bloat for all audio users), and onnxruntime alongside onnxruntime-gpu (conflicting packages, flagged P1 in prior review); merge conflict markers reported earlier are resolved in current file.
.claude/skills/curator-synth-audio-sanity/SKILL.md New Claude skill defining an 8-step end-to-end audio pipeline sanity-check workflow; class names (ManifestReader, ManifestWriterStage) and constructor args match the actual codebase; backend name "xenna" is valid; no blocking issues.
nemo_curator/stages/audio/inference/tts/__init__.py Empty __init__.py to make the tts directory a package; no issues.
uv.lock Auto-generated lock file updated to reflect new nemo-toolkit[tts] transitive dependencies (kornia, janome, pyopenjtalk, pypinyin, seaborn, pynini, etc.); no manual issues.

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline
    participant ManifestReader
    participant MagpieTTSStage
    participant DownstreamStages
    participant ManifestWriterStage

    User->>Pipeline: run()
    Pipeline->>ManifestReader: process(manifest.jsonl)
    ManifestReader-->>Pipeline: AudioTask(text, language)
    Pipeline->>MagpieTTSStage: setup_on_node() [downloads model]
    Pipeline->>MagpieTTSStage: setup() [loads model to GPU]
    Pipeline->>MagpieTTSStage: process(AudioTask)
    MagpieTTSStage->>MagpieTTSStage: generate_audio_file(transcript, language)
    Note over MagpieTTSStage: do_tts() with speaker_index=ClassVar[Sofia]<br/>(ignores instance speaker field)
    MagpieTTSStage->>MagpieTTSStage: sf.write(audio_filepath, ...)
    Note over MagpieTTSStage: FileNotFoundError if generated_audio_dir<br/>does not exist
    MagpieTTSStage-->>Pipeline: AudioTask(text, language, audio_filepath)
    Pipeline->>DownstreamStages: process(AudioTask)
    DownstreamStages-->>Pipeline: AudioTask (enriched)
    Pipeline->>ManifestWriterStage: write(result.jsonl)

Reviews (4): Last reviewed commit: "Rebase main"

Comment on lines +61 to +63
    speaker_map: ClassVar[dict[str, int]] = {"John": 0, "Sofia": 1, "Aria": 2, "Jason": 3, "Leo": 4}
    speaker: str = "Sofia"
    speaker_idx: ClassVar[int] = speaker_map["Sofia"]
Contributor


P1 speaker field is silently ignored — always uses Sofia

speaker_idx is a ClassVar set once at class-definition time to speaker_map["Sofia"] (value 1). Because it is a class variable, reassigning speaker at instantiation (e.g. MagpieTTSStage(..., speaker="John")) never changes speaker_idx. Every call to generate_audio_file therefore passes speaker_index=1 (Sofia) regardless of what the user configured, making the speaker field completely non-functional.

Fix by resolving the index dynamically from the instance field:

    speaker_map: ClassVar[dict[str, int]] = {"John": 0, "Sofia": 1, "Aria": 2, "Jason": 3, "Leo": 4}
    speaker: str = "Sofia"
    # Remove the ClassVar speaker_idx — derive it in generate_audio_file

    def generate_audio_file(self, transcript: str, language: str):
        speaker_index = self.speaker_map[self.speaker]
        audio, audio_len = self._tts_model.do_tts(
            transcript, language=language, apply_TN=False, speaker_index=speaker_index
        )
        return audio, audio_len
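The pitfall generalizes beyond this stage: a ClassVar default is evaluated once, when the class body runs, so per-instance field values can never influence it. A minimal stand-alone sketch (the Stage class below is illustrative, not the real MagpieTTSStage):

```python
from dataclasses import dataclass
from typing import ClassVar

@dataclass
class Stage:
    # Shared lookup table, evaluated once when the class body runs
    speaker_map: ClassVar[dict[str, int]] = {"John": 0, "Sofia": 1}
    speaker: str = "Sofia"
    # Frozen at class-definition time; instance values of `speaker` never reach it
    speaker_idx: ClassVar[int] = speaker_map["Sofia"]

s = Stage(speaker="John")
print(s.speaker_idx)             # 1 -- still Sofia's index
print(s.speaker_map[s.speaker])  # 0 -- dynamic lookup honours the instance field
```

Note that dataclasses deliberately exclude ClassVar fields from `__init__`, which is why the constructor argument can never update it; resolving the index at call time, as in the fix above, is the idiomatic escape.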

Comment on lines +72 to +84
    def setup_on_node(
        self, _node_info: NodeInfo | None = None, _worker_metadata: WorkerMetadata | None = None
    ) -> None:
        if self._tts_model:
            return
        try:
            kwargs: dict[str, Any] = {"model_name": self.model_name, "return_model_file": True}
            if self.cache_dir is not None:
                kwargs["cache_dir"] = self.cache_dir
            MagpieTTSModel.from_pretrained(**kwargs)
        except Exception as e:
            msg = f"Failed to download {self.model_name}"
            raise RuntimeError(msg) from e
Contributor


P1 setup_on_node guard never short-circuits

The early-return guard if self._tts_model: return checks an instance field that is never assigned inside setup_on_node — the return value of MagpieTTSModel.from_pretrained(...) is discarded. _tts_model therefore remains None after every invocation, so the guard never fires and the method attempts a fresh model-download call on every invocation. Either store the return value or drop the guard.

Suggested change

Before:

    def setup_on_node(
        self, _node_info: NodeInfo | None = None, _worker_metadata: WorkerMetadata | None = None
    ) -> None:
        if self._tts_model:
            return
        try:
            kwargs: dict[str, Any] = {"model_name": self.model_name, "return_model_file": True}
            if self.cache_dir is not None:
                kwargs["cache_dir"] = self.cache_dir
            MagpieTTSModel.from_pretrained(**kwargs)
        except Exception as e:
            msg = f"Failed to download {self.model_name}"
            raise RuntimeError(msg) from e

After:

    def setup_on_node(
        self, _node_info: NodeInfo | None = None, _worker_metadata: WorkerMetadata | None = None
    ) -> None:
        try:
            kwargs: dict[str, Any] = {"model_name": self.model_name, "return_model_file": True}
            if self.cache_dir is not None:
                kwargs["cache_dir"] = self.cache_dir
            MagpieTTSModel.from_pretrained(**kwargs)
        except Exception as e:
            msg = f"Failed to download {self.model_name}"
            raise RuntimeError(msg) from e

Comment on lines +109 to +116
    def generate_audio_file(self, transcript: str, language: str) -> str:
        """
        Generate an audio file from the transcript.
        """
        audio, audio_len = self._tts_model.do_tts(
            transcript, language=language, apply_TN=False, speaker_index=self.speaker_idx
        )
        return audio, audio_len
Contributor


P2 Incorrect return-type annotation on generate_audio_file

The annotation says -> str but the function returns a two-element tuple (audio, audio_len). Callers (including process) correctly unpack it as a tuple, so the runtime behaviour is fine, but the annotation is misleading.

Suggested change

Before:

    def generate_audio_file(self, transcript: str, language: str) -> str:
        """
        Generate an audio file from the transcript.
        """
        audio, audio_len = self._tts_model.do_tts(
            transcript, language=language, apply_TN=False, speaker_index=self.speaker_idx
        )
        return audio, audio_len

After:

    def generate_audio_file(self, transcript: str, language: str) -> tuple:
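For reference, the mismatch is invisible at runtime because Python never enforces return annotations; only a static checker (mypy, pyright) would flag it. A stand-alone illustration, with list/int stand-ins for the real model output:

```python
def generate_audio_file(transcript: str) -> str:  # annotation claims str
    # Stand-ins for the real TTS output: (waveform samples, sample count)
    audio, audio_len = [0.0, 0.1, 0.2], 3
    return audio, audio_len  # actually returns a 2-tuple; no runtime error

result = generate_audio_file("hello")
print(type(result).__name__)  # prints: tuple
```

A more precise annotation than bare `tuple` would be e.g. `tuple[list[float], int]` (or the appropriate array/tensor types), which lets type checkers verify how callers unpack the result.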

Comment thread pyproject.toml
Comment on lines +119 to +120
"onnxruntime>=1.20.1,<1.24", # Python bindings (split from GPU package in 1.21+)
"onnxruntime-gpu>=1.20.1,<1.24; platform_machine == 'x86_64'", # CUDA providers
Contributor


P1 Conflicting onnxruntime / onnxruntime-gpu entries will break installation

Both onnxruntime (CPU-only bindings) and onnxruntime-gpu (CUDA-enabled) install into the same onnxruntime Python namespace. Declaring both in the same extras group will cause pip/uv to flag a conflict or silently let one overwrite the other, potentially stripping CUDA support. The onnxruntime-gpu package already ships the Python bindings; the separate onnxruntime entry is redundant and harmful here.

If the goal is to guarantee the pure-Python onnxruntime module is installed even on non-x86_64 platforms, scope it with the inverse platform marker:

    "onnxruntime>=1.20.1,<1.24; platform_machine != 'x86_64'",          # CPU-only (non-x86)
    "onnxruntime-gpu>=1.20.1,<1.24; platform_machine == 'x86_64'",      # CUDA providers (x86)

Otherwise, drop the plain onnxruntime line entirely.

@fqian1107 fqian1107 force-pushed the audio_curator_skills branch from 052f542 to 17a4a08 Compare April 27, 2026 11:59
Comment on lines +86 to +97
    def setup(self, _: WorkerMetadata | None = None) -> None:
        if not self._tts_model:
            try:
                kwargs: dict[str, Any] = {"model_name": self.model_name}
                if self.cache_dir is not None:
                    kwargs["cache_dir"] = self.cache_dir
                self._tts_model = MagpieTTSModel.from_pretrained(**kwargs)
            except Exception as e:
                msg = f"Failed to load {self.model_name}"
                raise RuntimeError(msg) from e
            self._tts_model = self._tts_model.to(self._device)
        logger.info(f"[{self.name}] Initialized MagpieTTS on {self._device}")
Contributor


P1 generated_audio_dir never guaranteed to exist before sf.write

setup() loads and moves the model but never ensures generated_audio_dir exists. When process() later calls sf.write(audio_filepath, ...) (line 135), soundfile will raise FileNotFoundError if the user-supplied directory path has not been created beforehand. The class API gives no indication that callers must pre-create this directory, so this will silently break direct usage of the stage. Add a makedirs call at the end of setup():

Suggested change

Before:

    def setup(self, _: WorkerMetadata | None = None) -> None:
        if not self._tts_model:
            try:
                kwargs: dict[str, Any] = {"model_name": self.model_name}
                if self.cache_dir is not None:
                    kwargs["cache_dir"] = self.cache_dir
                self._tts_model = MagpieTTSModel.from_pretrained(**kwargs)
            except Exception as e:
                msg = f"Failed to load {self.model_name}"
                raise RuntimeError(msg) from e
            self._tts_model = self._tts_model.to(self._device)
        logger.info(f"[{self.name}] Initialized MagpieTTS on {self._device}")

After:

    def setup(self, _: WorkerMetadata | None = None) -> None:
        if not self._tts_model:
            try:
                kwargs: dict[str, Any] = {"model_name": self.model_name}
                if self.cache_dir is not None:
                    kwargs["cache_dir"] = self.cache_dir
                self._tts_model = MagpieTTSModel.from_pretrained(**kwargs)
            except Exception as e:
                msg = f"Failed to load {self.model_name}"
                raise RuntimeError(msg) from e
            self._tts_model = self._tts_model.to(self._device)
        os.makedirs(self.generated_audio_dir, exist_ok=True)
        logger.info(f"[{self.name}] Initialized MagpieTTS on {self._device}")
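The failure mode is not soundfile-specific: any write into a missing parent directory raises FileNotFoundError. A stdlib-only sketch of the problem and the idempotent fix (plain `open()` stands in for `sf.write`):

```python
import os
import tempfile

# Parent exists, but the output subdirectory does not -- the state setup() leaves us in
out_dir = os.path.join(tempfile.mkdtemp(), "generated_audio")
path = os.path.join(out_dir, "utt_0001.wav")

try:
    with open(path, "wb") as f:  # same failure sf.write would hit
        f.write(b"\x00")
except FileNotFoundError:
    print("write failed: parent directory missing")

os.makedirs(out_dir, exist_ok=True)  # idempotent, so safe even if setup() runs twice
with open(path, "wb") as f:
    f.write(b"\x00")
print(os.path.exists(path))  # prints: True
```

`exist_ok=True` makes the call a no-op when the directory already exists, which is why placing it unconditionally at the end of setup() is safe.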

Signed-off-by: Fan Qian <fqian@nvidia.com>
@fqian1107 fqian1107 force-pushed the audio_curator_skills branch from 17a4a08 to 7a0f7bb Compare April 28, 2026 13:07
Signed-off-by: Fan Qian <fqian@nvidia.com>