Add English NTT-SMOKE benchmark by vmendelev · Pull Request #1472 · NVIDIA-NeMo/Skills

vmendelev · 2026-06-02T11:14:17Z

Summary

Add the English-only NTT-SMOKE (ntt-smoke.en) benchmark for NemotronTranscribe smoke testing.

The benchmark covers clean/noisy ASR, short ASR, real long-form AppTek calls, non-speech hallucination, prompt robustness, Preference-ASR audio-instruction checks, ContextASR context biasing, and a small superficial text control. The default English build is 5,075 rows, with 75 AppTek long-form rows because each long row contains many reference words.

Notes

Long-form data comes from AppTek Call-Center Dialogues, not synthetic stitching.
Preference-ASR is English-only in this PR.
Multilingual NTT-SMOKE data is intentionally excluded for now.
Metrics include WER, macro WER, success rate at 5% row WER, substitutions, insertions, deletions, reference words, and correct words.
The report workflow supports numeric W&B tables plus row-level hypothesis analysis for the brief model-issue conclusion.
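
The metric names in the notes above can be related to each other with a short sketch. It assumes each row already carries its substitution, insertion, deletion, and reference-word counts. The `RowCounts` container, the `summarize` helper, the `<=` comparison at the 5% threshold, and the handling of empty references are illustrative assumptions, not the PR's implementation.

```python
from dataclasses import dataclass


@dataclass
class RowCounts:
    """Per-row edit counts against the reference (hypothetical container)."""
    substitutions: int
    insertions: int
    deletions: int
    reference_words: int

    @property
    def errors(self) -> int:
        return self.substitutions + self.insertions + self.deletions

    @property
    def wer(self) -> float:
        # Assumption: an empty reference scores 0.0 unless words were inserted.
        if self.reference_words == 0:
            return 0.0 if self.insertions == 0 else 1.0
        return self.errors / self.reference_words


def summarize(rows, threshold=0.05):
    """Corpus WER, macro WER, and success rate at a row-WER threshold."""
    if not rows:
        raise ValueError("need at least one row")
    total_ref = sum(r.reference_words for r in rows)
    total_err = sum(r.errors for r in rows)
    return {
        # Corpus-level WER weights every reference word equally.
        "wer": total_err / total_ref if total_ref else 0.0,
        # Macro WER weights every row equally, regardless of length.
        "macro_wer": sum(r.wer for r in rows) / len(rows),
        # Share of rows whose own WER is within the threshold.
        "success_rate": sum(r.wer <= threshold for r in rows) / len(rows),
        "reference_words": total_ref,
        "correct_words": sum(
            r.reference_words - r.substitutions - r.deletions for r in rows
        ),
    }


rows = [RowCounts(0, 0, 0, 10), RowCounts(1, 0, 1, 20), RowCounts(2, 1, 0, 100)]
print(summarize(rows))
```

Corpus WER and macro WER diverge when row lengths differ, which is one reason a benchmark mixing short utterances with long-form calls might report both.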

Validation

python -m pytest tests/test_ntt_smoke_prepare.py -q
python -m compileall nemo_skills/dataset/ntt-smoke tests/test_ntt_smoke_prepare.py

Baseline Evidence

W&B report from promoted eval skill: https://wandb.ai/nvidia/nemo-skills/runs/9gthwvp0
Active checkpoint, Nemotron Omni BF16, and Qwen ASR 1.7B outputs were scored on the same ntt-smoke.en manifest.

Summary by CodeRabbit

New Features
- Introduced NTT-SMOKE English evaluation benchmark with mixed-manifest support for ASR, context biasing, text-MCQ, and hallucination tasks.
- Added data preparation and evaluation capabilities for the new benchmark.
- Integrated metrics computation with WER thresholds, confidence intervals, and hallucination tracking.
Documentation
- Added comprehensive documentation for NTT-SMOKE dataset setup, configuration, and evaluation.
Tests
- Added test suite for data preparation and evaluation workflows.

vmendelev · 2026-06-02T11:16:00Z

This PR is for those who are not yet using Gym. No need to review, since I will create a copy in Gym.

coderabbitai · 2026-06-02T11:21:47Z

📝 Walkthrough

Walkthrough

This PR introduces the NTT-SMOKE benchmark suite for NemotronTranscribe evaluation. The implementation provides dataset preparation from existing NeMo-Skills datasets, task-specific evaluation routing (ASR, MCQ, context-biasing, hallucination), WER-based metrics with confidence intervals, and comprehensive testing.

Changes

NTT-SMOKE Benchmark Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Benchmark registration and documentation | `nemo_skills/dataset/ntt-smoke/MEMO.md`, `README.md`, `__init__.py`, `en/__init__.py` | Documentation defining suite purpose, metrics, and reproducibility steps; package config registers `ntt-smoke.en` benchmark with `SCORE_MODULE`, `EVAL_ARGS`, and `GENERATION_ARGS`. |
| Task-specific evaluation logic | `nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py` | `NTTSmokeEvaluator` routes evaluation by `task_type` (PreferenceASR, ContextASR, Text-MCQ, Hallucination, or standard audio), applying WER computation via `jiwer`, entity-level metrics, MCQ answer extraction, and success thresholding. Normalizers are loaded per-directory and cached. |
| Metrics computation and aggregation | `nemo_skills/dataset/ntt-smoke/ntt_smoke_metrics.py` | `NTTSmokeMetrics` extends `AudioMetrics` to score missing fields via evaluator routing, threshold WER into correctness, aggregate prompt-group and language-specific WER with CI95 confidence intervals, and compute per-mode aggregates using weighted sums and macro averages. |
| Dataset preparation pipeline | `nemo_skills/dataset/ntt-smoke/prepare.py` | Orchestrates manifest creation: deterministically samples and loads JSONL sources, augments with noisy audio at configured SNRs, constructs AppTek long-form rows, generates context-biasing/Text-MCQ/Preference-ASR examples with prompt variants, hallucination rows, and audio-instruction fallback, then balances and writes output. |
| Comprehensive test suite | `tests/test_ntt_smoke_prepare.py` | Tests manifest generation (subtask fields, long-row handling, preference variants), evaluator functions (entity scoring, WER counting, thresholding), and metrics (pass@1 aggregates, confidence intervals, ASR-only mode) using synthetic fixtures and stubs for external dependencies. |
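
The preparation row above mentions augmenting audio with noise at configured SNRs. The walkthrough does not say how the mixing is done, so the following is only a generic sketch of the standard scaling formula, with `mix_at_snr` as a hypothetical helper operating on plain lists of samples.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power hits `snr_db`, then add it.

    Hypothetical helper: returns (mixture, scaled_noise).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent; cannot scale to a target SNR")
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 * math.sin(1.7 * i + 0.5) for i in range(1000)]
mixture, scaled = mix_at_snr(speech, noise, snr_db=10.0)
measured = 10 * math.log10(mean_power(speech) / mean_power(scaled))
print(round(measured, 6))
```

Measuring the SNR back from the scaled noise, as in the last lines, is a cheap sanity check that the gain was derived correctly.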

Sequence Diagram: Evaluation and Metrics Flow

sequenceDiagram
  participant Model as Model Output
  participant Evaluator as NTTSmokeEvaluator
  participant Metrics as NTTSmokeMetrics
  participant Scorer as Score Aggregator
  
  Model->>Evaluator: eval_single(data_point)
  activate Evaluator
  alt task_type detected
    Evaluator->>Evaluator: Route by task_type
    Evaluator->>Evaluator: Compute WER/correctness/entities
  end
  Evaluator-->>Metrics: Return metrics dict
  deactivate Evaluator
  
  Metrics->>Metrics: update(predictions)
  Metrics->>Metrics: Score missing fields
  Metrics->>Metrics: Threshold WER to is_correct
  Metrics->>Metrics: Accumulate prompt/language metrics
  
  Metrics->>Metrics: get_metrics()
  Metrics->>Metrics: Compute CI95 for WER/hallucination
  Metrics-->>Scorer: Return aggregated metrics
  
  Scorer->>Scorer: compute_score()
  Scorer->>Scorer: Per-mode weighted aggregation
  Scorer-->>Model: Final benchmark score dict
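
The "Compute CI95" step in the diagram is not specified further in this thread. One common choice, shown here purely as an assumption, is a normal-approximation interval over per-row values (for example row WERs or 0/1 hallucination flags). The `ci95` helper name is hypothetical.

```python
import math
import statistics


def ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% interval.

    Assumes rows are independent; a bootstrap would be a reasonable
    alternative for small or skewed samples.
    """
    n = len(values)
    if n == 0:
        raise ValueError("no values")
    mean = statistics.fmean(values)
    if n == 1:
        return mean, 0.0  # no spread can be estimated from a single row
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


flags = [0, 0, 1, 1]  # e.g. per-row hallucination indicators
mean, half_width = ci95(flags)
print(f"{mean:.3f} +/- {half_width:.3f}")
```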

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add English NTT-SMOKE benchmark' directly and concisely describes the main change: the introduction of the NTT-SMOKE English benchmark with associated evaluation infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py`:
- Around line 264-275: The generic word-boundary fallback pattern "\b([A-J])\b"
in the patterns list causes single-letter words like "I" or "a" to be
mis-parsed; update the patterns array in the MCQ parsing block (the list used by
re.search in the loop) to remove that generic fallback and only include explicit
answer formats and the full-string single-letter pattern (keep patterns such as
r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b", r"\b([A-J])\)",
r"\b([A-J])\.", and r"^\s*([A-J])\s*$"), ensuring the explicit formats are
checked before the full-string match and continue using re.IGNORECASE.

In `@nemo_skills/dataset/ntt-smoke/prepare.py`:
- Around line 623-632: The MUSAN rows built via the _with_metadata call are
missing task_type, so hallucination scoring is not applied; update the call to
_with_metadata (the block that passes variant,
subtask="hallucination.nonspeech", origin_dataset="musan", etc.) to include
task_type="Hallucination" so generated rows explicitly mark the benchmark as
hallucination tasks and enable the correct evaluator/metrics behavior.
- Around line 431-438: The current logic builds by_mode via _load_source but
then silently ignores empty mode lists when computing common_ids, allowing
partial groups; update the function to explicitly detect missing/empty modes
after creating by_mode (check each mode in the modes list against by_mode and
ensure rows are truthy), and if any mode is missing or empty (e.g.,
any(missing_modes := [m for m in modes if not by_mode.get(m)])), fail closed by
returning [] (or raising) rather than proceeding to set.intersection; reference
the variables/functions modes, by_mode, _load_source, and common_ids to locate
where to add this guard.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa5a9b1c-f71d-4098-ad80-92e49008cc7f

📥 Commits

Reviewing files that changed from the base of the PR and between 8979a15 and 57dcda6.

📒 Files selected for processing (8)

nemo_skills/dataset/ntt-smoke/MEMO.md
nemo_skills/dataset/ntt-smoke/README.md
nemo_skills/dataset/ntt-smoke/__init__.py
nemo_skills/dataset/ntt-smoke/en/__init__.py
nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py
nemo_skills/dataset/ntt-smoke/ntt_smoke_metrics.py
nemo_skills/dataset/ntt-smoke/prepare.py
tests/test_ntt_smoke_prepare.py

coderabbitai · 2026-06-02T11:21:50Z

+    patterns = [
+        r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
+        r"\b([A-J])\)",
+        r"\b([A-J])\.",
+        r"^\s*([A-J])\s*$",
+        r"\b([A-J])\b",
+    ]
+    for pattern in patterns:
+        match = re.search(pattern, clean, flags=re.IGNORECASE)
+        if match:
+            return match.group(1).upper()
+    return ""


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the generic single-letter fallback from MCQ parsing.

With re.IGNORECASE, \b([A-J])\b treats ordinary words like I or a as answer choices. Because it is the last pattern tried, it fires whenever none of the explicit formats match, so an output such as I would go with B is scored as I and the real option is never considered. Restrict matching to explicit answer formats or a full-string single-letter reply.
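
The effect of the fallback can be reproduced with the pattern list from the hunk above. The `extract_choice` wrapper and its `patterns` parameter are illustrative names, not the function in `ntt_smoke_eval.py`.

```python
import re

# Patterns copied from the reviewed hunk; the last one is the generic fallback.
PATTERNS = [
    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
    r"\b([A-J])\)",
    r"\b([A-J])\.",
    r"^\s*([A-J])\s*$",
    r"\b([A-J])\b",
]


def extract_choice(text, patterns=PATTERNS):
    """Return the first captured option letter, upper-cased, or "" if none match."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return ""


without_fallback = PATTERNS[:-1]

# An explicit answer phrase is caught by the first pattern either way.
print(repr(extract_choice("The answer is B")))
# With no explicit phrase, the fallback grabs the pronoun "I".
print(repr(extract_choice("I would go with B")))
# Without the fallback the parser abstains instead of guessing.
print(repr(extract_choice("I would go with B", without_fallback)))
```

Dropping the fallback trades a wrong answer for an empty one on free-form replies, which is usually easier to spot in row-level analysis.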

Proposed fix

     patterns = [
-        r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
-        r"\b([A-J])\)",
-        r"\b([A-J])\.",
-        r"^\s*([A-J])\s*$",
-        r"\b([A-J])\b",
+        r"(?:answer|option|choice|choose|pick|select)\s*(?:is|:)?\s*[$\[]?([A-J])(?:[$\].]|\b)",
+        r"^\s*[$\[]?([A-J])[$\].]?\s*$",
     ]


coderabbitai · 2026-06-02T11:21:50Z

+    modes = ["contextless", "coarse", "fine"]
+    by_mode = {
+        mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
+        for mode in modes
+    }
+    common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
+    if not common_ids:
+        return []


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when any ContextASR mode is missing.

set.intersection(*[... if rows]) ignores empty manifests, so a missing contextless, coarse, or fine split silently produces partial groups instead of matched three-way comparisons. That makes the context-biasing benchmark inconsistent without any obvious failure signal.
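
A small reproduction of the difference between the lenient and fail-closed behaviour, using made-up rows. `common_ids_lenient` and `common_ids_strict` are hypothetical helper names, and the extra empty-input guards are additions for the sketch rather than part of the reviewed code.

```python
def common_ids_lenient(by_mode):
    """Mirror of the reviewed expression: empty manifests are skipped."""
    id_sets = [
        {row.get("uniq_id") for row in rows} for rows in by_mode.values() if rows
    ]
    # Guard added for the sketch: set.intersection() with no arguments raises TypeError.
    return set.intersection(*id_sets) if id_sets else set()


def common_ids_strict(by_mode):
    """Fail closed: any missing or empty mode yields no groups at all."""
    if not by_mode or any(not rows for rows in by_mode.values()):
        return set()
    return set.intersection(
        *({row.get("uniq_id") for row in rows} for rows in by_mode.values())
    )


by_mode = {
    "contextless": [{"uniq_id": "a"}, {"uniq_id": "b"}],
    "coarse": [{"uniq_id": "a"}],
    "fine": [],  # split missing on disk
}

print(common_ids_lenient(by_mode))  # "a" survives with only two of three modes
print(common_ids_strict(by_mode))   # empty: no three-way comparison is possible
```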

Proposed fix

     by_mode = {
         mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
         for mode in modes
     }
-    common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
+    if any(not rows for rows in by_mode.values()):
+        return []
+    common_ids = set.intersection(*(set(row.get("uniq_id") for row in rows) for rows in by_mode.values()))
     if not common_ids:
         return []


coderabbitai · 2026-06-02T11:21:50Z

+            _with_metadata(
+                row,
+                variant=variant,
+                subtask="hallucination.nonspeech",
+                origin_dataset="musan",
+                origin_split=row.get("category", "test"),
+                origin_manifest="musan/test.jsonl",
+                prompt=prompt,
+                prompt_variant=prompt_variant,
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set task_type="Hallucination" on generated MUSAN rows.

The evaluator and metrics only attach strict hallucination scoring when task_type == "Hallucination", but this builder currently relies on the upstream MUSAN manifest already carrying that field. If the source rows omit it, these benchmark entries are scored as plain ASR and the hallucination metrics vanish.
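
The dependency on the upstream field can be illustrated with a toy builder and router. Both functions below are hypothetical stand-ins (the real `_with_metadata` and evaluator are not shown in this thread), and the `audio_filepath` key is an assumed field name.

```python
def with_metadata(row, **fields):
    """Stand-in for `_with_metadata`: copy the source row and overlay explicit fields."""
    out = dict(row)
    out.update(fields)
    return out


def route(row):
    """Toy router: hallucination scoring only applies when task_type says so."""
    return "hallucination" if row.get("task_type") == "Hallucination" else "asr"


# A source row whose manifest does not carry task_type.
source_row = {"audio_filepath": "noise.wav", "category": "music"}

implicit = with_metadata(source_row, subtask="hallucination.nonspeech")
explicit = with_metadata(
    source_row, subtask="hallucination.nonspeech", task_type="Hallucination"
)

print(route(implicit))  # falls through to plain ASR scoring
print(route(explicit))  # routed to hallucination scoring
```

Setting the field in the builder makes the benchmark's behaviour independent of whatever the source manifest happens to contain.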

Proposed fix

         out.append(
             _with_metadata(
                 row,
                 variant=variant,
                 subtask="hallucination.nonspeech",
                 origin_dataset="musan",
                 origin_split=row.get("category", "test"),
                 origin_manifest="musan/test.jsonl",
+                task_type="Hallucination",
                 prompt=prompt,
                 prompt_variant=prompt_variant,
             )
         )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            _with_metadata(
-                row,
-                variant=variant,
-                subtask="hallucination.nonspeech",
-                origin_dataset="musan",
-                origin_split=row.get("category", "test"),
-                origin_manifest="musan/test.jsonl",
-                prompt=prompt,
-                prompt_variant=prompt_variant,
-            )
+            _with_metadata(
+                row,
+                variant=variant,
+                subtask="hallucination.nonspeech",
+                origin_dataset="musan",
+                origin_split=row.get("category", "test"),
+                origin_manifest="musan/test.jsonl",
+                task_type="Hallucination",
+                prompt=prompt,
+                prompt_variant=prompt_variant,
+            )


Kipok · 2026-06-09T18:05:37Z

@vmendelev could you please fix lints and dco? Also looks like tests fail, do we need to add some packages to reqs?

vmendelev · 2026-06-10T15:24:29Z

Don't pay attention to this PR. We won't commit it. I will push a separate PR into Gym directly.

Codex added 11 commits June 1, 2026 19:10

Add NTT-SMOKE speech eval benchmark

f667ae5

Tune NTT-SMOKE long-form defaults

05b4f84

Balance NTT-SMOKE long-form manifest rows

1385a51

Fix NTT-SMOKE multilingual scoring

3a3f7be

Revise NTT-SMOKE preference ASR coverage

7d3331f

Add NTT-SMOKE subset confidence intervals

3289e23

Document NTT-SMOKE baseline provenance

6969da9

Handle empty Preference-ASR normalized references

6c74ee2

Refine NTT-SMOKE English benchmark

9a1ee25

Score generation-only NTT-SMOKE outputs

50397c8

Polish NTT-SMOKE public docs

57dcda6


Add English NTT-SMOKE benchmark #1472

vmendelev wants to merge 11 commits into main from codex/ntt-smoke
