Skip to content

Add English NTT-SMOKE benchmark #1472

Open
vmendelev wants to merge 11 commits into main from codex/ntt-smoke

@vmendelev vmendelev commented Jun 2, 2026

Collaborator

Summary

Add the English-only NTT-SMOKE (ntt-smoke.en) benchmark for NemotronTranscribe smoke testing.

The benchmark covers clean/noisy ASR, short ASR, real long-form AppTek calls, non-speech hallucination, prompt robustness, Preference-ASR audio-instruction checks, ContextASR context biasing, and a small superficial text control. The default English build is 5,075 rows, with 75 AppTek long-form rows because each long row contains many reference words.

Notes

  • Long-form data comes from AppTek Call-Center Dialogues, not synthetic stitching.
  • Preference-ASR is English-only in this PR.
  • Multilingual NTT-SMOKE data is intentionally excluded for now.
  • Metrics include WER, macro WER, success rate at 5% row WER, substitutions, insertions, deletions, reference words, and correct words.
  • The report workflow supports numeric W&B tables plus row-level hypothesis analysis for the brief model-issue conclusion.
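The metric list above can be illustrated with a small, dependency-free sketch. It is not the PR's implementation (the walkthrough says the evaluator uses jiwer); `RowCounts`, `count_errors`, and `aggregate` are hypothetical names, and the success rule "row WER <= 0.05" is an assumed reading of "success rate at 5% row WER".

```python
from dataclasses import dataclass


@dataclass
class RowCounts:
    substitutions: int
    insertions: int
    deletions: int
    ref_words: int

    @property
    def errors(self) -> int:
        return self.substitutions + self.insertions + self.deletions

    @property
    def correct_words(self) -> int:
        # Reference words that were neither substituted nor deleted.
        return self.ref_words - self.substitutions - self.deletions

    @property
    def wer(self) -> float:
        # Assumption: an empty reference is treated as one word to avoid dividing by zero.
        return self.errors / max(self.ref_words, 1)


def count_errors(reference: str, hypothesis: str) -> RowCounts:
    """Word-level Levenshtein alignment that also tracks the error types."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_errors, substitutions, insertions, deletions).
    prev = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cur.append(prev[j - 1])
                continue
            c, s, n, d = prev[j - 1]
            sub = (c + 1, s + 1, n, d)
            c, s, n, d = cur[j - 1]
            ins = (c + 1, s, n + 1, d)
            c, s, n, d = prev[j]
            dele = (c + 1, s, n, d + 1)
            cur.append(min(sub, ins, dele))
        prev = cur
    _, s, n, d = prev[-1]
    return RowCounts(s, n, d, len(ref))


def aggregate(rows, threshold: float = 0.05) -> dict:
    n_rows = max(len(rows), 1)
    total_ref = sum(r.ref_words for r in rows)
    return {
        # Corpus WER: long rows weigh more because they carry more reference words.
        "wer": sum(r.errors for r in rows) / max(total_ref, 1),
        # Macro WER: every row counts equally.
        "macro_wer": sum(r.wer for r in rows) / n_rows,
        "success_rate_at_5pct": sum(r.wer <= threshold for r in rows) / n_rows,
        "substitutions": sum(r.substitutions for r in rows),
        "insertions": sum(r.insertions for r in rows),
        "deletions": sum(r.deletions for r in rows),
        "reference_words": total_ref,
        "correct_words": sum(r.correct_words for r in rows),
    }
```

The gap between corpus WER and macro WER is why a small number of long-form rows can still dominate the word-weighted number.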

Validation

  • python -m pytest tests/test_ntt_smoke_prepare.py -q
  • python -m compileall nemo_skills/dataset/ntt-smoke tests/test_ntt_smoke_prepare.py

Baseline Evidence

Summary by CodeRabbit

  • New Features

    • Introduced NTT-SMOKE English evaluation benchmark with mixed-manifest support for ASR, context biasing, text-MCQ, and hallucination tasks.
    • Added data preparation and evaluation capabilities for the new benchmark.
    • Integrated metrics computation with WER thresholds, confidence intervals, and hallucination tracking.
  • Documentation

    • Added comprehensive documentation for NTT-SMOKE dataset setup, configuration, and evaluation.
  • Tests

    • Added test suite for data preparation and evaluation workflows.

@vmendelev

Collaborator Author

This PR is for those who are not yet using Gym. No need to review, since I will create a copy in Gym.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Contributor

📝 Walkthrough

This PR introduces the NTT-SMOKE benchmark suite for NemotronTranscribe evaluation. The implementation provides dataset preparation from existing NeMo-Skills datasets, task-specific evaluation routing (ASR, MCQ, context-biasing, hallucination), WER-based metrics with confidence intervals, and comprehensive testing.
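As a rough illustration of the task-specific routing described here, the sketch below dispatches on `task_type` and falls back to plain ASR scoring. The handler names, the `generation` and `expected_answer` fields, and the scoring rules are assumptions made for the example; only the task-type labels and the `eval_single` name come from this walkthrough.

```python
def score_asr(row: dict) -> dict:
    # Placeholder: a real scorer would compute WER against the reference text.
    return {"scorer": "asr"}


def score_preference_asr(row: dict) -> dict:
    # Placeholder for the audio-instruction checks.
    return {"scorer": "preference_asr"}


def score_context_asr(row: dict) -> dict:
    # Placeholder for WER plus entity-level metrics.
    return {"scorer": "context_asr"}


def score_mcq(row: dict) -> dict:
    # Assumption: the parsed answer letter is compared case-insensitively.
    predicted = row.get("generation", "").strip().upper()
    expected = row.get("expected_answer", "").strip().upper()
    return {"scorer": "mcq", "is_correct": predicted == expected}


def score_hallucination(row: dict) -> dict:
    # Assumption: non-speech audio should produce an empty transcript.
    return {"scorer": "hallucination", "is_correct": not row.get("generation", "").strip()}


ROUTES = {
    "PreferenceASR": score_preference_asr,
    "ContextASR": score_context_asr,
    "Text-MCQ": score_mcq,
    "Hallucination": score_hallucination,
}


def eval_single(row: dict) -> dict:
    # Rows without a recognized task_type are treated as standard audio (ASR).
    handler = ROUTES.get(row.get("task_type"), score_asr)
    return handler(row)
```

A dictionary dispatch keeps the fallback explicit: any row whose `task_type` is missing or misspelled ends up in the ASR path, which is the failure mode one of the review comments in this thread is concerned with.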

Changes

NTT-SMOKE Benchmark Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Benchmark registration and documentation | `nemo_skills/dataset/ntt-smoke/MEMO.md`, `README.md`, `__init__.py`, `en/__init__.py` | Documentation defining suite purpose, metrics, and reproducibility steps; package config registers ntt-smoke.en benchmark with SCORE_MODULE, EVAL_ARGS, and GENERATION_ARGS. |
| Task-specific evaluation logic | `nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py` | NTTSmokeEvaluator routes evaluation by task_type (PreferenceASR, ContextASR, Text-MCQ, Hallucination, or standard audio), applying WER computation via jiwer, entity-level metrics, MCQ answer extraction, and success thresholding. Normalizers are loaded per-directory and cached. |
| Metrics computation and aggregation | `nemo_skills/dataset/ntt-smoke/ntt_smoke_metrics.py` | NTTSmokeMetrics extends AudioMetrics to score missing fields via evaluator routing, threshold WER into correctness, aggregate prompt-group and language-specific WER with CI95 confidence intervals, and compute per-mode aggregates using weighted sums and macro averages. |
| Dataset preparation pipeline | `nemo_skills/dataset/ntt-smoke/prepare.py` | Orchestrates manifest creation: deterministically samples and loads JSONL sources, augments with noisy audio at configured SNRs, constructs AppTek long-form rows, generates context-biasing/Text-MCQ/Preference-ASR examples with prompt variants, hallucination rows, and audio-instruction fallback, then balances and writes output. |
| Comprehensive test suite | `tests/test_ntt_smoke_prepare.py` | Tests manifest generation (subtask fields, long-row handling, preference variants), evaluator functions (entity scoring, WER counting, thresholding), and metrics (pass@1 aggregates, confidence intervals, ASR-only mode) using synthetic fixtures and stubs for external dependencies. |

Sequence Diagram: Evaluation and Metrics Flow

sequenceDiagram
  participant Model as Model Output
  participant Evaluator as NTTSmokeEvaluator
  participant Metrics as NTTSmokeMetrics
  participant Scorer as Score Aggregator
  
  Model->>Evaluator: eval_single(data_point)
  activate Evaluator
  alt task_type detected
    Evaluator->>Evaluator: Route by task_type
    Evaluator->>Evaluator: Compute WER/correctness/entities
  end
  Evaluator-->>Metrics: Return metrics dict
  deactivate Evaluator
  
  Metrics->>Metrics: update(predictions)
  Metrics->>Metrics: Score missing fields
  Metrics->>Metrics: Threshold WER to is_correct
  Metrics->>Metrics: Accumulate prompt/language metrics
  
  Metrics->>Metrics: get_metrics()
  Metrics->>Metrics: Compute CI95 for WER/hallucination
  Metrics-->>Scorer: Return aggregated metrics
  
  Scorer->>Scorer: compute_score()
  Scorer->>Scorer: Per-mode weighted aggregation
  Scorer-->>Model: Final benchmark score dict
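The diagram's "Compute CI95" step is not spelled out in this thread. One common choice, shown here purely as an assumption, is a normal-approximation interval around the mean of per-row values (row WERs, or 0/1 hallucination flags); `ci95` is a hypothetical helper name.

```python
import math
import statistics


def ci95(values):
    """Return (mean, lower, upper) using mean +/- 1.96 * standard error."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = statistics.fmean(values)
    if n == 1:
        # A single observation carries no spread information.
        return mean, mean, mean
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width
```

For small groups or heavily skewed row WERs, a bootstrap interval may be a better fit than the normal approximation; which one the PR uses is not stated here.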

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add English NTT-SMOKE benchmark' directly and concisely describes the main change: the introduction of the NTT-SMOKE English benchmark with associated evaluation infrastructure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py`:
- Around line 264-275: The generic word-boundary fallback pattern "\b([A-J])\b"
in the patterns list causes single-letter words like "I" or "a" to be
mis-parsed; update the patterns array in the MCQ parsing block (the list used by
re.search in the loop) to remove that generic fallback and only include explicit
answer formats and the full-string single-letter pattern (keep patterns such as
r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b", r"\b([A-J])\)",
r"\b([A-J])\.", and r"^\s*([A-J])\s*$"), ensuring the explicit formats are
checked before the full-string match and continue using re.IGNORECASE.

In `@nemo_skills/dataset/ntt-smoke/prepare.py`:
- Around line 623-632: The MUSAN rows built via the _with_metadata call are
missing task_type, so hallucination scoring is not applied; update the call to
_with_metadata (the block that passes variant,
subtask="hallucination.nonspeech", origin_dataset="musan", etc.) to include
task_type="Hallucination" so generated rows explicitly mark the benchmark as
hallucination tasks and enable the correct evaluator/metrics behavior.
- Around line 431-438: The current logic builds by_mode via _load_source but
then silently ignores empty mode lists when computing common_ids, allowing
partial groups; update the function to explicitly detect missing/empty modes
after creating by_mode (check each mode in the modes list against by_mode and
ensure rows are truthy), and if any mode is missing or empty (e.g.,
any(missing_modes := [m for m in modes if not by_mode.get(m)])), fail closed by
returning [] (or raising) rather than proceeding to set.intersection; reference
the variables/functions modes, by_mode, _load_source, and common_ids to locate
where to add this guard.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa5a9b1c-f71d-4098-ad80-92e49008cc7f

📥 Commits

Reviewing files that changed from the base of the PR and between 8979a15 and 57dcda6.

📒 Files selected for processing (8)
  • nemo_skills/dataset/ntt-smoke/MEMO.md
  • nemo_skills/dataset/ntt-smoke/README.md
  • nemo_skills/dataset/ntt-smoke/__init__.py
  • nemo_skills/dataset/ntt-smoke/en/__init__.py
  • nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py
  • nemo_skills/dataset/ntt-smoke/ntt_smoke_metrics.py
  • nemo_skills/dataset/ntt-smoke/prepare.py
  • tests/test_ntt_smoke_prepare.py

Comment on lines +264 to +275
patterns = [
    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
    r"\b([A-J])\)",
    r"\b([A-J])\.",
    r"^\s*([A-J])\s*$",
    r"\b([A-J])\b",
]
for pattern in patterns:
    match = re.search(pattern, clean, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
return ""


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the generic single-letter fallback from MCQ parsing.

With re.IGNORECASE, \b([A-J])\b treats ordinary words like "I" or "a" as answer choices, so a reply with no explicit answer format, such as "I would go with B", falls through to this fallback and is scored as I instead of B. (A reply like "I think the answer is B" is still caught by the first, explicit pattern.) Restrict matching to explicit answer formats or a full-string single-letter reply.

Proposed fix
 patterns = [
-    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
-    r"\b([A-J])\)",
-    r"\b([A-J])\.",
-    r"^\s*([A-J])\s*$",
-    r"\b([A-J])\b",
+    r"(?:answer|option|choice|choose|pick|select)\s*(?:is|:)?\s*[\(\[]?([A-J])(?:[\)\].]|\b)",
+    r"^\s*[\(\[]?([A-J])[\)\].]?\s*$",
 ]
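To make the failure mode concrete, here is a small standalone comparison of the two pattern lists quoted in this comment. `extract_choice` is a hypothetical wrapper around the quoted loop, and the sample replies are made up for illustration.

```python
import re

# Pattern list as quoted from ntt_smoke_eval.py (lines 264-275).
OLD_PATTERNS = [
    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
    r"\b([A-J])\)",
    r"\b([A-J])\.",
    r"^\s*([A-J])\s*$",
    r"\b([A-J])\b",  # generic fallback flagged in this comment
]

# Pattern list from the proposed fix.
NEW_PATTERNS = [
    r"(?:answer|option|choice|choose|pick|select)\s*(?:is|:)?\s*[\(\[]?([A-J])(?:[\)\].]|\b)",
    r"^\s*[\(\[]?([A-J])[\)\].]?\s*$",
]


def extract_choice(text: str, patterns) -> str:
    """Return the first matched answer letter in upper case, or '' if none."""
    clean = text.strip()
    for pattern in patterns:
        match = re.search(pattern, clean, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return ""


for reply in ["I would go with B", "I think the answer is B", "(c)"]:
    print(repr(reply), extract_choice(reply, OLD_PATTERNS), extract_choice(reply, NEW_PATTERNS))
```

With the old list, "I would go with B" is parsed as I because the pronoun is the first standalone letter in the A-J range. With the proposed list the same reply is left unparsed (empty string), which trades a wrong answer for a missing one; whether that is the right trade-off depends on how unparsed MCQ rows are scored.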

Comment on lines +431 to +438
modes = ["contextless", "coarse", "fine"]
by_mode = {
    mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
    for mode in modes
}
common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
if not common_ids:
    return []


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when any ContextASR mode is missing.

set.intersection(*[... if rows]) ignores empty manifests, so a missing contextless, coarse, or fine split silently produces partial groups instead of matched three-way comparisons. That makes the context-biasing benchmark inconsistent without any obvious failure signal.

Proposed fix
     by_mode = {
         mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
         for mode in modes
     }
-    common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
+    if any(not rows for rows in by_mode.values()):
+        return []
+    common_ids = set.intersection(*(set(row.get("uniq_id") for row in rows) for rows in by_mode.values()))
     if not common_ids:
         return []
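The difference between the current and the proposed behavior can be reproduced with plain dictionaries. In this sketch `common_ids_lenient` mirrors the quoted expression and `common_ids_strict` mirrors the proposed guard; both function names and the toy manifests are hypothetical.

```python
MODES = ["contextless", "coarse", "fine"]


def common_ids_lenient(by_mode: dict) -> set:
    # Mirrors the quoted code: empty manifests are filtered out before intersecting.
    # Note: if every manifest is empty, set.intersection() gets no arguments
    # and raises TypeError.
    id_sets = [{row.get("uniq_id") for row in rows} for rows in by_mode.values() if rows]
    return set.intersection(*id_sets)


def common_ids_strict(by_mode: dict, modes=MODES) -> set:
    # Fail closed: any missing or empty mode yields no matched groups at all.
    if any(not by_mode.get(mode) for mode in modes):
        return set()
    return set.intersection(*({row.get("uniq_id") for row in by_mode[mode]} for mode in modes))


by_mode = {
    "contextless": [{"uniq_id": "a"}, {"uniq_id": "b"}],
    "coarse": [{"uniq_id": "a"}],
    "fine": [],  # simulated missing split
}
print(common_ids_lenient(by_mode))  # ids shared by the two non-empty modes only
print(common_ids_strict(by_mode))   # empty set, because "fine" is missing
```

Returning an empty set keeps the proposed fix's behavior; raising an error instead would make a missing split visible at preparation time rather than as a silently smaller benchmark.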

Comment on lines +623 to +632
_with_metadata(
    row,
    variant=variant,
    subtask="hallucination.nonspeech",
    origin_dataset="musan",
    origin_split=row.get("category", "test"),
    origin_manifest="musan/test.jsonl",
    prompt=prompt,
    prompt_variant=prompt_variant,
)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set task_type="Hallucination" on generated MUSAN rows.

The evaluator and metrics only attach strict hallucination scoring when task_type == "Hallucination", but this builder currently relies on the upstream MUSAN manifest already carrying that field. If the source rows omit it, these benchmark entries are scored as plain ASR and the hallucination metrics vanish.

Proposed fix
         out.append(
             _with_metadata(
                 row,
                 variant=variant,
                 subtask="hallucination.nonspeech",
                 origin_dataset="musan",
                 origin_split=row.get("category", "test"),
                 origin_manifest="musan/test.jsonl",
+                task_type="Hallucination",
                 prompt=prompt,
                 prompt_variant=prompt_variant,
             )
         )
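A minimal sketch of why the explicit field matters, assuming `_with_metadata` simply copies the row and overlays the keyword arguments (its real body is not shown in this thread). `with_metadata` and `scoring_path` below are hypothetical stand-ins, and the upstream row is invented for the example.

```python
def with_metadata(row: dict, **metadata) -> dict:
    """Hypothetical stand-in for _with_metadata: copy the row, overlay metadata."""
    out = dict(row)
    out.update(metadata)
    return out


def scoring_path(row: dict) -> str:
    # Per the comment above, strict hallucination scoring is attached only
    # when task_type == "Hallucination"; everything else is scored as ASR.
    return "hallucination" if row.get("task_type") == "Hallucination" else "asr"


# Invented upstream MUSAN row that does not carry task_type.
upstream = {"audio_filepath": "example.wav", "category": "noise"}

without_fix = with_metadata(upstream, subtask="hallucination.nonspeech", origin_dataset="musan")
with_fix = with_metadata(
    upstream,
    subtask="hallucination.nonspeech",
    origin_dataset="musan",
    task_type="Hallucination",
)

print(scoring_path(without_fix))  # asr
print(scoring_path(with_fix))     # hallucination
```

Setting the field in the builder makes the benchmark independent of whatever the upstream manifest happens to contain.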

@Kipok

Kipok commented Jun 9, 2026

Collaborator

@vmendelev could you please fix the lint and DCO checks? Also, it looks like the tests fail; do we need to add some packages to the requirements?

@vmendelev

Collaborator Author

Don't pay attention to this PR. We won't commit it. I will push a separate one into Gym directly.
