Add English NTT-SMOKE benchmark #1472
This PR is for those who are not yet using Gym. No need to review, since I will create a copy in Gym.
📝 Walkthrough
This PR introduces the NTT-SMOKE benchmark suite for NemotronTranscribe evaluation. The implementation provides dataset preparation from existing NeMo-Skills datasets, task-specific evaluation routing (ASR, MCQ, context-biasing, hallucination), WER-based metrics with confidence intervals, and comprehensive testing.
Changes
NTT-SMOKE Benchmark Suite
Sequence Diagram: Evaluation and Metrics Flow
sequenceDiagram
participant Model as Model Output
participant Evaluator as NTTSmokeEvaluator
participant Metrics as NTTSmokeMetrics
participant Scorer as Score Aggregator
Model->>Evaluator: eval_single(data_point)
activate Evaluator
alt task_type detected
Evaluator->>Evaluator: Route by task_type
Evaluator->>Evaluator: Compute WER/correctness/entities
end
Evaluator-->>Metrics: Return metrics dict
deactivate Evaluator
Metrics->>Metrics: update(predictions)
Metrics->>Metrics: Score missing fields
Metrics->>Metrics: Threshold WER to is_correct
Metrics->>Metrics: Accumulate prompt/language metrics
Metrics->>Metrics: get_metrics()
Metrics->>Metrics: Compute CI95 for WER/hallucination
Metrics-->>Scorer: Return aggregated metrics
Scorer->>Scorer: compute_score()
Scorer->>Scorer: Per-mode weighted aggregation
Scorer-->>Model: Final benchmark score dict
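The two metric steps in the diagram, thresholding WER into is_correct and attaching a CI95, can be sketched roughly as follows. This is a minimal sketch, not the PR's code: the function names, the 0.1 threshold, and the normal-approximation interval over per-sample values are all assumptions, and the actual implementation may use a different threshold or a bootstrap.

```python
import math
from statistics import mean, stdev


def wer_to_is_correct(wer, threshold=0.1):
    """Hypothetical thresholding: a sample counts as correct if its WER is at or below the threshold."""
    return wer <= threshold


def ci95(values):
    """Normal-approximation 95% confidence interval for the mean of per-sample values.

    Assumption: values are independent per-sample scores (for example per-utterance WER).
    """
    if not values:
        raise ValueError("need at least one value")
    m = mean(values)
    if len(values) < 2:
        # A single sample has no spread to estimate, so return a degenerate interval.
        return (m, m)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return (m - half_width, m + half_width)


per_sample_wer = [0.1, 0.3]
low, high = ci95(per_sample_wer)
flags = [wer_to_is_correct(w) for w in per_sample_wer]
```

Note that the mean of per-sample WER is not the same quantity as corpus-level WER (total errors over total reference words), so an interval built this way describes the former only.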
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py`:
- Around line 264-275: The generic word-boundary fallback pattern "\b([A-J])\b"
in the patterns list causes single-letter words like "I" or "a" to be
mis-parsed; update the patterns array in the MCQ parsing block (the list used by
re.search in the loop) to remove that generic fallback and only include explicit
answer formats and the full-string single-letter pattern (keep patterns such as
r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b", r"\b([A-J])\)",
r"\b([A-J])\.", and r"^\s*([A-J])\s*$"), ensuring the explicit formats are
checked before the full-string match and continue using re.IGNORECASE.
In `@nemo_skills/dataset/ntt-smoke/prepare.py`:
- Around line 623-632: The MUSAN rows built via the _with_metadata call are
missing task_type, so hallucination scoring is not applied; update the call to
_with_metadata (the block that passes variant,
subtask="hallucination.nonspeech", origin_dataset="musan", etc.) to include
task_type="Hallucination" so generated rows explicitly mark the benchmark as
hallucination tasks and enable the correct evaluator/metrics behavior.
- Around line 431-438: The current logic builds by_mode via _load_source but
then silently ignores empty mode lists when computing common_ids, allowing
partial groups; update the function to explicitly detect missing/empty modes
after creating by_mode (check each mode in the modes list against by_mode and
ensure rows are truthy), and if any mode is missing or empty (e.g.,
any(missing_modes := [m for m in modes if not by_mode.get(m)])), fail closed by
returning [] (or raising) rather than proceeding to set.intersection; reference
the variables/functions modes, by_mode, _load_source, and common_ids to locate
where to add this guard.
📒 Files selected for processing (8)
- nemo_skills/dataset/ntt-smoke/MEMO.md
- nemo_skills/dataset/ntt-smoke/README.md
- nemo_skills/dataset/ntt-smoke/__init__.py
- nemo_skills/dataset/ntt-smoke/en/__init__.py
- nemo_skills/dataset/ntt-smoke/ntt_smoke_eval.py
- nemo_skills/dataset/ntt-smoke/ntt_smoke_metrics.py
- nemo_skills/dataset/ntt-smoke/prepare.py
- tests/test_ntt_smoke_prepare.py
patterns = [
    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
    r"\b([A-J])\)",
    r"\b([A-J])\.",
    r"^\s*([A-J])\s*$",
    r"\b([A-J])\b",
]
for pattern in patterns:
    match = re.search(pattern, clean, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
return ""
Drop the generic single-letter fallback from MCQ parsing.
With re.IGNORECASE, \b([A-J])\b treats ordinary words like I or a as answer choices. The explicit patterns are tried first, so I think the answer is B still parses as B, but a reply with no explicit marker, such as I would go with B, falls through to the fallback and is scored as I. Restrict matching to explicit answer formats or a full-string single-letter reply.
Proposed fix
 patterns = [
-    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
-    r"\b([A-J])\)",
-    r"\b([A-J])\.",
-    r"^\s*([A-J])\s*$",
-    r"\b([A-J])\b",
+    r"(?:answer|option|choice|choose|pick|select)\s*(?:is|:)?\s*[\(\[]?([A-J])(?:[\)\].]|\b)",
+    r"^\s*[\(\[]?([A-J])[\)\].]?\s*$",
 ]
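The effect of the generic fallback can be reproduced in isolation. The sketch below copies the pattern list quoted in the review; the helper name parse_choice and the sample replies are assumptions made for illustration, and the real function may pre-process its input before matching.

```python
import re

# Pattern list as quoted in the review, including the generic fallback.
LOOSE_PATTERNS = [
    r"(?:answer|option|choice)\s*(?:is|:)?\s*([A-J])\b",
    r"\b([A-J])\)",
    r"\b([A-J])\.",
    r"^\s*([A-J])\s*$",
    r"\b([A-J])\b",  # generic single-letter fallback under review
]

# Same list without the fallback (the minimal change described in the agent prompt).
STRICT_PATTERNS = LOOSE_PATTERNS[:-1]


def parse_choice(text, patterns):
    """Hypothetical helper: return the first captured option letter, or "" if none matches."""
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return ""


print(parse_choice("I would go with B", LOOSE_PATTERNS))   # I
print(parse_choice("I would go with B", STRICT_PATTERNS))  # (empty string)
print(parse_choice("The answer is B", STRICT_PATTERNS))    # B
```

Dropping the fallback trades a wrong letter for an empty result on free-form replies, which is usually easier to spot in aggregate metrics than a silently mis-attributed choice.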
modes = ["contextless", "coarse", "fine"]
by_mode = {
    mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
    for mode in modes
}
common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
if not common_ids:
    return []
Fail closed when any ContextASR mode is missing.
set.intersection(*[... if rows]) ignores empty manifests, so a missing contextless, coarse, or fine split silently produces partial groups instead of matched three-way comparisons. That makes the context-biasing benchmark inconsistent without any obvious failure signal.
Proposed fix
 by_mode = {
     mode: _load_source(source_root, "contextasr-bench", f"{mode}/test.jsonl")
     for mode in modes
 }
-common_ids = set.intersection(*[set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows])
+if any(not rows for rows in by_mode.values()):
+    return []
+common_ids = set.intersection(*(set(row.get("uniq_id") for row in rows) for rows in by_mode.values()))
 if not common_ids:
     return []
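The difference between the two behaviours is easy to see on toy data. In this sketch the function names and the in-memory manifests are hypothetical; only the intersection expression mirrors the code under review.

```python
def common_ids_lenient(by_mode):
    """Mirrors the reviewed expression: empty modes are dropped before intersecting."""
    id_sets = [set(row.get("uniq_id") for row in rows) for rows in by_mode.values() if rows]
    return set.intersection(*id_sets)


def common_ids_strict(by_mode, modes):
    """Fail closed: any missing or empty mode yields no common ids at all."""
    missing = [mode for mode in modes if not by_mode.get(mode)]
    if missing:
        return set()
    return set.intersection(*(set(row.get("uniq_id") for row in by_mode[mode]) for mode in modes))


modes = ["contextless", "coarse", "fine"]
by_mode = {
    "contextless": [{"uniq_id": "a"}, {"uniq_id": "b"}],
    "coarse": [{"uniq_id": "a"}, {"uniq_id": "b"}],
    "fine": [],  # simulated missing split
}

print(sorted(common_ids_lenient(by_mode)))        # ['a', 'b'] despite the missing "fine" split
print(sorted(common_ids_strict(by_mode, modes)))  # []
```

A second reason to guard first: if every mode is empty, the lenient version calls set.intersection() with no arguments, which raises TypeError instead of returning an empty result.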
_with_metadata(
    row,
    variant=variant,
    subtask="hallucination.nonspeech",
    origin_dataset="musan",
    origin_split=row.get("category", "test"),
    origin_manifest="musan/test.jsonl",
    prompt=prompt,
    prompt_variant=prompt_variant,
)
Set task_type="Hallucination" on generated MUSAN rows.
The evaluator and metrics only attach strict hallucination scoring when task_type == "Hallucination", but this builder currently relies on the upstream MUSAN manifest already carrying that field. If the source rows omit it, these benchmark entries are scored as plain ASR and the hallucination metrics vanish.
Proposed fix
 out.append(
     _with_metadata(
         row,
         variant=variant,
         subtask="hallucination.nonspeech",
         origin_dataset="musan",
         origin_split=row.get("category", "test"),
         origin_manifest="musan/test.jsonl",
+        task_type="Hallucination",
         prompt=prompt,
         prompt_variant=prompt_variant,
     )
 )
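Why the explicit field matters can be shown with a toy router. Everything here is hypothetical: with_metadata only stands in for the PR's _with_metadata (whose real signature is not shown on this page), and score is an assumed simplification of routing on task_type.

```python
def with_metadata(row, **metadata):
    """Hypothetical stand-in for _with_metadata: copy the row and overlay metadata fields."""
    out = dict(row)
    out.update(metadata)
    return out


def score(row, prediction):
    """Assumed routing: strict hallucination scoring only when task_type says so."""
    if row.get("task_type") == "Hallucination":
        # Any non-empty transcript on non-speech audio counts as a hallucination.
        return {"task": "hallucination", "hallucinated": bool(prediction.strip())}
    return {"task": "asr"}


# A source row that does not carry task_type, as the review warns may happen.
source_row = {"audio_filepath": "noise.wav", "text": ""}

implicit = with_metadata(source_row, subtask="hallucination.nonspeech")
explicit = with_metadata(source_row, subtask="hallucination.nonspeech", task_type="Hallucination")

print(score(implicit, "hello there"))  # routed as plain ASR
print(score(explicit, "hello there"))  # routed to hallucination scoring
```

Setting the field in the builder makes the benchmark independent of whatever the upstream manifest happens to contain.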
@vmendelev could you please fix lints and DCO? Also, it looks like the tests fail; do we need to add some packages to the requirements?
Don't pay attention to this PR. We won't commit it. I will push a separate one into Gym directly.
Summary
Add the English-only NTT-SMOKE (ntt-smoke.en) benchmark for NemotronTranscribe smoke testing.
The benchmark covers clean/noisy ASR, short ASR, real long-form AppTek calls, non-speech hallucination, prompt robustness, Preference-ASR audio-instruction checks, ContextASR context biasing, and a small superficial text control. The default English build is 5,075 rows, with 75 AppTek long-form rows because each long row contains many reference words.
Notes
Validation
python -m pytest tests/test_ntt_smoke_prepare.py -q
python -m compileall nemo_skills/dataset/ntt-smoke tests/test_ntt_smoke_prepare.py
Baseline Evidence
ntt-smoke.en manifest.
Summary by CodeRabbit
New Features
Documentation
Tests