Add IHEval (Instruction Hierarchy Evaluation) benchmark by bzantium · Pull Request #1464 · NVIDIA-NeMo/Skills

bzantium · 2026-05-27T09:22:25Z

Closes #1463.

Adds IHEval (Instruction Hierarchy Evaluation) — measures whether a model respects the system > user > tool instruction hierarchy. It is a benchmark group with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in aligned / conflict / reference settings.

What's included

nemo_skills/dataset/iheval/ — benchmark group (mirrors the mmau-pro pattern): group __init__.py (IS_BENCHMARK_GROUP + 9 sub-benchmarks + SCORE_MODULE), HF-based prepare.py, iheval_score.py group composite, and 9 sub-benchmark configs.
prepare.py downloads per-variant input_data.json from the zhihz0535/IHEval HF mirror via hf_hub_download; the generated test.jsonl is gitignored (not committed). Rows carry a pre-built OpenAI-style messages list (multi-turn + tool_calls / tool results), so generation uses ++prompt_format=openai.
Scoring lives in the standalone Apache-2.0 package bzantium/iheval; evaluation/evaluator/iheval.py is a thin lazy-import wrapper (same pattern as bfcl_eval), pinned in the Dockerfile. Rationale: upstream IHEval is CC BY-NC-ND with no installable package; the rule-following checkers derive from Google IFEval (Apache-2.0), the rest are independent re-implementations of the documented rule-based metrics.
IHEvalMetrics builds on the base pass@k machinery, so iheval:N yields pass@1[avg-of-N] (per-mode composites at the group level too), with per-setting (aligned/conflict/reference) and per-variant breakdowns.
Registered iheval_* eval types + the iheval metrics key; documented under instruction-following.

Usage

ns eval --benchmarks iheval ...            # all 9 sub-benchmarks
ns eval --benchmarks iheval.safety_hijack  # a single sub-benchmark

Summary by CodeRabbit

New Features
- Added IHEval benchmark support: dataset preparation, nine sub-benchmarks, per-task evaluation entrypoints, and group-level score aggregation and metrics.
Documentation
- Added IHEval docs describing benchmark structure, categories, settings, and resources.
Tests
- Added comprehensive tests validating metrics aggregation, per-setting/variant breakdowns, and group scoring behavior.
Chores
- Updated Docker build to install the IHEval scoring package.

coderabbitai · 2026-05-27T09:29:51Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9000092e-d2c8-491a-bd9b-ce369d4b2753

📥 Commits

Reviewing files that changed from the base of the PR and between 3a06d12 and 65b6241.

📒 Files selected for processing (2)

nemo_skills/evaluation/metrics/iheval_metrics.py
tests/test_iheval_metrics.py

📝 Walkthrough

Walkthrough

Integrates the IHEval benchmark group (9 sub-benchmarks): adds dataset preparation, group-level scoring, evaluator wrappers and registry entries, metrics implementation and registration, Docker pinning of the scorer package, docs, and tests.

Changes

IHEval Benchmark Integration

Layer / File(s)	Summary
Benchmark group structure and configuration `dockerfiles/Dockerfile.nemo-skills`, `docs/evaluation/instruction-following.md`, `nemo_skills/dataset/iheval/__init__.py`, `nemo_skills/dataset/iheval/*/__init__.py`	Defines benchmark group metadata and splits, `SCORE_MODULE`, per-sub-benchmark default generation/eval config modules, docs describing the benchmark, and pins the external `iheval` package in the Docker build.
Dataset preparation from HuggingFace source `nemo_skills/dataset/iheval/prepare.py`	Discovers variants in the `zhihz0535/IHEval` HF mirror, downloads `input_data.json`, builds OpenAI-style `messages` (multi-turn, tool calls, reference handling), and writes per-sub-benchmark `test.jsonl` files with CLI filtering support.
Group-level composite score computation `nemo_skills/dataset/iheval/iheval_score.py`	Computes group composites per aggregation mode: overall `symbolic_correct` mean across available subs, per-category averages, per-setting aligned/conflict/reference averages, `conflict_gap`, totals, and preserves all pass@k modes from subs.
Evaluator registry integration `nemo_skills/evaluation/evaluator/__init__.py`	Adds `eval_type` registry entries for `iheval_*` sub-benchmarks to route evaluation to the iheval evaluators that apply scorers to prediction JSONL files.
IHEval evaluator implementation `nemo_skills/evaluation/evaluator/iheval.py`	Provides lazy import and pip-install hint for `iheval`, implements `_score_in_place` that scores and rewrites JSONL, and exposes `eval_*` functions per task.
Metrics implementation and registration `nemo_skills/evaluation/metrics/iheval_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`	Adds `IHEvalMetrics` (extends pass@k machinery using `symbolic_correct` floats, tracks per-setting and per-variant correctness sums/totals, and augments `get_metrics()` with breakdown fields) and registers `"iheval"` in `METRICS_MAP`.
Comprehensive test suite for metrics and scoring `tests/test_iheval_metrics.py`, `tests/test_iheval_score.py`	Tests cover metric aggregation, per-setting/variant breakdowns, missing-key behavior, fractional correctness, avg-of-N pass@k modes, reset behavior, and group-level composite score correctness including category averages and conflict gap.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

Kipok
gwarmstrong

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding the IHEval benchmark group. It is specific, directly related to the changeset, and provides meaningful information.
Linked Issues check	✅ Passed	All core objectives from `#1463` are met: benchmark group structure with IS_BENCHMARK_GROUP and 9 sub-benchmarks, prepare.py for HF data download, iheval_score.py for group composites, standalone Apache-2.0 scoring wrapper, IHEvalMetrics implementation, and per-setting/variant breakdowns. Reviewer feedback on metrics consistency was addressed in commit.
Out of Scope Changes check	✅ Passed	All changes are in-scope: additions to nemo_skills/dataset/iheval/ for benchmark structure, prepare.py for data generation, evaluation/evaluator/iheval.py for scoring, metrics/iheval_metrics.py for pass@k computation, and supporting configuration files. Documentation and Dockerfile updates are directly supporting IHEval integration.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Line 27: Replace the non-descriptive link text "here" on line referencing the
original benchmark source with descriptive text like "IHEval benchmark source"
(or similar descriptive phrase) so the markdown reads: "Original benchmark
source is IHEval benchmark source" and make the link target the same URL
(https://github.com/ytyz1307zzh/IHEval); update the link text in the file
docs/evaluation/instruction-following.md accordingly to improve accessibility
and readability.

In `@nemo_skills/dataset/iheval/prepare.py`:
- Around line 72-73: Replace silent .get() usage for required fields so
malformed input fails fast: change expressions like instruction =
row.get("instruction", "") and similar row.get("answer") usages to direct index
access (e.g., instruction = row["instruction"], answer = row["answer"]) in the
prepare logic, and if you need a clearer error message wrap the access in a
short try/except that raises a ValueError with context (including the row id or
contents) so missing required keys fail loudly; apply the same change for the
other occurrences in the block around the later rows (the group currently using
.get() at the second block mentioned).
- Around line 192-195: The CLI defines a --split argument but never uses it;
either remove the parser.add_argument("--split", ...) line or wire the parsed
"split" value into the downstream logic that selects which dataset split to
prepare (e.g., pass args.split into the function that loads/iterates splits or
replace any hard-coded "test" usage). Locate the parser.add_argument("--split",
...) and ensure the parsed variable named split is consumed where the code
currently assumes "test" (or add validation to reject non-supported values) so
the user-provided flag is not a no-op.
- Around line 146-166: The current loop writes directly to output_file which can
be left truncated on failure; instead build all records first (e.g., accumulate
into a list using list_variants, load_input_data, build_messages,
_last_user_text and the same record shape and rows_written logic) or write to a
temporary file (e.g., output_file.with_suffix(".tmp")) and only move/rename it
to output_file on successful completion; ensure you preserve the id format
("iheval-{task_id}-{setting}-{variant}-{upstream_id}") and increment
rows_written the same way before performing an atomic replace of output_file.

In `@nemo_skills/evaluation/metrics/iheval_metrics.py`:
- Around line 32-33: Replace the silent defaulting of "symbolic_correct" with
explicit required access so missing evaluator output fails fast: in the
_get_score_dict method (and the other place using .get("symbolic_correct", 0.0))
change to direct dictionary indexing (e.g., prediction["symbolic_correct"]) so a
KeyError is raised if the key is absent, ensuring corrupted/missing evaluator
output is detected rather than aggregated as 0.0; add a brief contextual error
message or let the KeyError propagate to surface the problem to callers.

In `@tests/test_iheval_score.py`:
- Line 116: The test loop binds an unused variable "name" in "for name, sub in
metrics.items()" causing a Ruff B007 warning; update the loop in
tests/test_iheval_score.py to either iterate values directly using "for sub in
metrics.values()" or replace "name" with "_" as "for _, sub in metrics.items()"
so only the used "sub" variable remains.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b0a5168-31d4-4944-bb19-b4cf1b5db447

📥 Commits

Reviewing files that changed from the base of the PR and between b620e79 and 8d41804.

📒 Files selected for processing (20)

dockerfiles/Dockerfile.nemo-skills
docs/evaluation/instruction-following.md
nemo_skills/dataset/iheval/__init__.py
nemo_skills/dataset/iheval/iheval_score.py
nemo_skills/dataset/iheval/prepare.py
nemo_skills/dataset/iheval/rule_following_multi/__init__.py
nemo_skills/dataset/iheval/rule_following_single/__init__.py
nemo_skills/dataset/iheval/safety_extract/__init__.py
nemo_skills/dataset/iheval/safety_hijack/__init__.py
nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
nemo_skills/dataset/iheval/task_execution_translation/__init__.py
nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
nemo_skills/evaluation/evaluator/__init__.py
nemo_skills/evaluation/evaluator/iheval.py
nemo_skills/evaluation/metrics/iheval_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
tests/test_iheval_metrics.py
tests/test_iheval_score.py

coderabbitai · 2026-05-27T09:29:54Z

+- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
+- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
+- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
+- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use descriptive link text instead of “here”.

Line 27 uses non-descriptive link text, which hurts docs readability and accessibility tooling checks.

Suggested doc diff

-- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval). +- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).

- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 27-27: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/evaluation/instruction-following.md` at line 27, Replace the non-descriptive link text "here" on line referencing the original benchmark source with descriptive text like "IHEval benchmark source" (or similar descriptive phrase) so the markdown reads: "Original benchmark source is IHEval benchmark source" and make the link target the same URL (https://github.com/ytyz1307zzh/IHEval); update the link text in the file docs/evaluation/instruction-following.md accordingly to improve accessibility and readability.

coderabbitai · 2026-05-27T09:29:54Z

+    instruction = row.get("instruction", "")
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required row fields instead of silently defaulting.

Using .get() for expected keys can silently corrupt output (instruction="", answer=None). For required schema fields, use direct indexing so malformed upstream data fails loudly.

Suggested change

- instruction = row.get("instruction", "") + instruction = row["instruction"] @@ - answer = row.get("answer") + answer = row["answer"]

As per coding guidelines, "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".
Also applies to: 152-163

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/iheval/prepare.py` around lines 72 - 73, Replace silent .get() usage for required fields so malformed input fails fast: change expressions like instruction = row.get("instruction", "") and similar row.get("answer") usages to direct index access (e.g., instruction = row["instruction"], answer = row["answer"]) in the prepare logic, and if you need a clearer error message wrap the access in a short try/except that raises a ValueError with context (including the row id or contents) so missing required keys fail loudly; apply the same change for the other occurrences in the block around the later rows (the group currently using .get() at the second block mentioned).

coderabbitai · 2026-05-27T09:29:54Z

+    with output_file.open("w", encoding="utf-8") as fout:
+        for setting in SETTINGS:
+            for variant in list_variants(repo_files, category, task, setting):
+                data = load_input_data(category, task, setting, variant)
+                for i, row in enumerate(data):
+                    messages = build_messages(row, category_id, task_id, setting)
+                    answer = row.get("answer")
+                    upstream_id = row.get("id", i)
+                    record = {
+                        "id": f"iheval-{task_id}-{setting}-{variant}-{upstream_id}",
+                        "messages": messages,
+                        "question": _last_user_text(messages),
+                        "setting": setting,
+                        "variant": variant,
+                        "category": category_id,
+                        "task": task_id,
+                        "answer": answer,
+                        "expected_answer": answer,
+                    }
+                    fout.write(json.dumps(record, ensure_ascii=False) + "\n")
+                    rows_written += 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Write test.jsonl atomically after successful preparation.

The final output file is opened before all downloads/transforms finish. A mid-run failure can leave a truncated file. Build records first (or write to a temp file) and replace the target only on success.

Suggested change

def process_sub_benchmark(out_root, repo_files, split_name, spec): @@ - rows_written = 0 - with output_file.open("w", encoding="utf-8") as fout: - for setting in SETTINGS: - for variant in list_variants(repo_files, category, task, setting): - data = load_input_data(category, task, setting, variant) - for i, row in enumerate(data): - ... - fout.write(json.dumps(record, ensure_ascii=False) + "\n") - rows_written += 1 + records = [] + for setting in SETTINGS: + for variant in list_variants(repo_files, category, task, setting): + data = load_input_data(category, task, setting, variant) + for i, row in enumerate(data): + ... + records.append(record) + + tmp_file = out_dir / "test.jsonl.tmp" + with tmp_file.open("w", encoding="utf-8") as fout: + for record in records: + fout.write(json.dumps(record, ensure_ascii=False) + "\n") + tmp_file.replace(output_file) + rows_written = len(records)

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/iheval/prepare.py` around lines 146 - 166, The current loop writes directly to output_file which can be left truncated on failure; instead build all records first (e.g., accumulate into a list using list_variants, load_input_data, build_messages, _last_user_text and the same record shape and rows_written logic) or write to a temporary file (e.g., output_file.with_suffix(".tmp")) and only move/rename it to output_file on successful completion; ensure you preserve the id format ("iheval-{task_id}-{setting}-{variant}-{upstream_id}") and increment rows_written the same way before performing an atomic replace of output_file.

coderabbitai · 2026-05-27T09:29:54Z

+    parser.add_argument("--split", default="test", choices=("test",), help="Local split name.")
+    parser.add_argument(
+        "--only", nargs="*", default=None, help="Restrict to these sub-benchmark output dirs (e.g. safety_hijack)."
+    )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove or wire --split; it is currently ignored.

--split is user-facing but not used in execution, so it behaves like a no-op argument.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 197-197

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/dataset/iheval/prepare.py` around lines 192 - 195, The CLI defines a --split argument but never uses it; either remove the parser.add_argument("--split", ...) line or wire the parsed "split" value into the downstream logic that selects which dataset split to prepare (e.g., pass args.split into the function that loads/iterates splits or replace any hard-coded "test" usage). Locate the parser.add_argument("--split", ...) and ensure the parsed variable named split is consumed where the code currently assumes "test" (or add validation to reject non-supported values) so the user-provided flag is not a no-op.

coderabbitai · 2026-05-27T09:29:54Z

+    def _get_score_dict(self, prediction: dict) -> dict[str, float]:
+        return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when symbolic_correct is missing instead of silently defaulting to 0.0.

On Line 33 and Line 46, .get("symbolic_correct", 0.0) can mask bad evaluator output and skew aggregated scores. symbolic_correct is required for this metric path, so direct key access is safer.

Proposed fix

def _get_score_dict(self, prediction: dict) -> dict[str, float]: - return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))} + return {"symbolic_correct": float(prediction["symbolic_correct"])} @@ - scores = [float(p.get("symbolic_correct", 0.0)) for p in predictions] + scores = [float(p["symbolic_correct"]) for p in predictions]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 46-47

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/evaluation/metrics/iheval_metrics.py` around lines 32 - 33, Replace the silent defaulting of "symbolic_correct" with explicit required access so missing evaluator output fails fast: in the _get_score_dict method (and the other place using .get("symbolic_correct", 0.0)) change to direct dictionary indexing (e.g., prediction["symbolic_correct"]) so a KeyError is raised if the key is absent, ensuring corrupted/missing evaluator output is detected rather than aggregated as 0.0; add a brief contextual error message or let the KeyError propagate to surface the problem to callers.

coderabbitai · 2026-05-27T09:29:55Z

+    def test_pass_at_k_modes_are_aggregated(self):
+        # iheval:N -> each sub reports extra agg modes; the group composite must surface them all.
+        metrics = _full_metrics()
+        for name, sub in metrics.items():


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unused loop variable in test loop.

Line 116 binds name but never uses it. Iterate over values directly (or rename to _) to satisfy Ruff B007.

Proposed fix

- for name, sub in metrics.items(): + for sub in metrics.values(): base = sub["pass@1"]["symbolic_correct"] sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10} sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for name, sub in metrics.items():

for sub in metrics.values():

base = sub["pass@1"]["symbolic_correct"]

sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}

sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}

🧰 Tools

🪛 Ruff (0.15.14)

[warning] 116-116: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_iheval_score.py` at line 116, The test loop binds an unused variable "name" in "for name, sub in metrics.items()" causing a Ruff B007 warning; update the loop in tests/test_iheval_score.py to either iterate values directly using "for sub in metrics.values()" or replace "name" with "_" as "for _, sub in metrics.items()" so only the used "sub" variable remains.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

tests/test_iheval_score.py (1)

116-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unused loop variable.

Line 116 binds name but never uses it in the loop body. This violates the guideline to avoid unused parameters and triggers a Ruff B007 warning.

♻️ Proposed fix

-        for name, sub in metrics.items():
+        for sub in metrics.values():
             base = sub["pass@1"]["symbolic_correct"]
             sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
             sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_iheval_score.py` around lines 116 - 119, The for-loop over metrics
binds an unused variable name which triggers Ruff B007; change the loop to avoid
binding name by iterating values directly (use metrics.values()) or use a
discard variable (_) so only sub is bound; update the loop that currently reads
"for name, sub in metrics.items()" to either "for sub in metrics.values()" or
"for _, sub in metrics.items()" so name is not unused while keeping the body
that assigns to sub["pass@4"] and sub["pass@1[avg-of-4]"] unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Around line 19-27: Add runnable example commands and expected-result snippets
to the IHEval section in docs/evaluation/instruction-following.md: include
example ns eval invocations (e.g., run the whole group with --benchmarks iheval
and a single sub-benchmark like --benchmarks iheval.safety_hijack) and show
required flags (--model, --output_dir); document expected metrics such as
overall symbolic_correct, per-setting breakdowns for aligned / conflict /
reference, category/group averages and the conflict_gap (reference - conflict);
mention where the benchmark is defined (nemo_skills/dataset/iheval/__init__.py)
and that rule-based scoring comes from the bzantium/iheval package so readers
know prerequisites.

---

Duplicate comments:
In `@tests/test_iheval_score.py`:
- Around line 116-119: The for-loop over metrics binds an unused variable name
which triggers Ruff B007; change the loop to avoid binding name by iterating
values directly (use metrics.values()) or use a discard variable (_) so only sub
is bound; update the loop that currently reads "for name, sub in
metrics.items()" to either "for sub in metrics.values()" or "for _, sub in
metrics.items()" so name is not unused while keeping the body that assigns to
sub["pass@4"] and sub["pass@1[avg-of-4]"] unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07e6fd54-0165-4dbc-af8b-09c90af9e88c

📥 Commits

Reviewing files that changed from the base of the PR and between 8d41804 and 024716f.

📒 Files selected for processing (20)

dockerfiles/Dockerfile.nemo-skills
docs/evaluation/instruction-following.md
nemo_skills/dataset/iheval/__init__.py
nemo_skills/dataset/iheval/iheval_score.py
nemo_skills/dataset/iheval/prepare.py
nemo_skills/dataset/iheval/rule_following_multi/__init__.py
nemo_skills/dataset/iheval/rule_following_single/__init__.py
nemo_skills/dataset/iheval/safety_extract/__init__.py
nemo_skills/dataset/iheval/safety_hijack/__init__.py
nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
nemo_skills/dataset/iheval/task_execution_translation/__init__.py
nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
nemo_skills/evaluation/evaluator/__init__.py
nemo_skills/evaluation/evaluator/iheval.py
nemo_skills/evaluation/metrics/iheval_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
tests/test_iheval_metrics.py
tests/test_iheval_score.py

✅ Files skipped from review due to trivial changes (6)

nemo_skills/dataset/iheval/tool_use_slack_user/init.py
nemo_skills/dataset/iheval/rule_following_single/init.py
nemo_skills/dataset/iheval/task_execution_verb_extract/init.py
nemo_skills/dataset/iheval/init.py
nemo_skills/dataset/iheval/tool_use_webpage/init.py
nemo_skills/dataset/iheval/safety_hijack/init.py

🚧 Files skipped from review as they are similar to previous changes (11)

nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/dataset/iheval/task_execution_lang_detect/init.py
nemo_skills/dataset/iheval/task_execution_translation/init.py
nemo_skills/dataset/iheval/rule_following_multi/init.py
tests/test_iheval_metrics.py
nemo_skills/evaluation/evaluator/init.py
nemo_skills/evaluation/evaluator/iheval.py
nemo_skills/dataset/iheval/safety_extract/init.py
nemo_skills/evaluation/metrics/iheval_metrics.py
nemo_skills/dataset/iheval/prepare.py
nemo_skills/dataset/iheval/iheval_score.py

coderabbitai · 2026-05-27T09:34:22Z

+IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the
+system > user > tool instruction hierarchy. It is a **benchmark group** with 9
+sub-benchmarks across 4 categories (rule-following, task-execution, safety,
+tool-use), each in `aligned` / `conflict` / `reference` settings.
+
+- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
+- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
+- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
+- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add runnable IHEval examples and expected-result snippets in this new section.

This new benchmark section documents scope, but it’s missing example eval commands and expected results for tested models, which are required for benchmark docs completeness.

Suggested doc addition

### iheval IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the system > user > tool instruction hierarchy. It is a **benchmark group** with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in `aligned` / `conflict` / `reference` settings. - Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`). - Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed). - Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image). - Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval). + +Example: +```bash +ns eval --benchmarks iheval --model <model_name> --output_dir <out_dir> +ns eval --benchmarks iheval.safety_hijack --model <model_name> --output_dir <out_dir> +``` + +Expected results: +- Metrics include overall `symbolic_correct` and setting breakdowns for `aligned`, `conflict`, and `reference`. +- Group outputs include category-level averages and `conflict_gap` (`reference - conflict`).

As per coding guidelines, “When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models”.

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 27-27: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/evaluation/instruction-following.md` around lines 19 - 27, Add runnable example commands and expected-result snippets to the IHEval section in docs/evaluation/instruction-following.md: include example ns eval invocations (e.g., run the whole group with --benchmarks iheval and a single sub-benchmark like --benchmarks iheval.safety_hijack) and show required flags (--model, --output_dir); document expected metrics such as overall symbolic_correct, per-setting breakdowns for aligned / conflict / reference, category/group averages and the conflict_gap (reference - conflict); mention where the benchmark is defined (nemo_skills/dataset/iheval/__init__.py) and that rule-based scoring comes from the bzantium/iheval package so readers know prerequisites.

IHEval is a benchmark group with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in aligned / conflict / reference settings. - dataset/iheval: benchmark group + HF-based prepare.py (downloads per-variant input_data.json from the zhihz0535/IHEval mirror; test.jsonl is gitignored), iheval_score.py group composite, and 9 sub-benchmark configs. - Scoring lives in the standalone pip package bzantium/iheval (Apache-2.0); evaluator/iheval.py is a thin lazy-import wrapper (bfcl-style). Pinned in the Dockerfile. - IHEvalMetrics uses the base pass@k machinery so iheval:N yields pass@1[avg-of-N], plus per-setting / per-variant breakdowns. - Registered iheval_* eval types and the iheval metrics key; documented under instruction-following. Signed-off-by: bzantium <ryumin93@gmail.com>

Kipok · 2026-06-09T19:56:30Z

Codex review comment posted on behalf of @Kipok:

I found one metrics issue in this PR.

nemo_skills/evaluation/metrics/iheval_metrics.py:59 copies the same setting_* and variant_* breakdowns into every aggregation mode returned by get_metrics(). Those breakdowns are computed earlier from the average over all attempts for a sample, while pass@N.symbolic_correct is computed as best-of-N.

That makes pass@N internally inconsistent for eval_config.generation.num_random_seeds > 1. For example, with two attempts for the same item where one passes and one fails, pass@2.symbolic_correct reports 100.0, but pass@2.setting_aligned / variant_* report 50.0.

I think the fix should be either:

compute setting/variant breakdowns separately for each aggregation mode, mirroring the pass@k logic; or
only attach these breakdowns to aggregation modes where average-over-attempts is the intended meaning, and omit them from pass@N.

The previous implementation copied the same setting_* / variant_* values (computed as mean-over-attempts) into every aggregation mode, so pass@N reported best-of-N for symbolic_correct but mean-over-attempts for the breakdowns — internally inconsistent at N>1. Also, pass@1 used scores[0], which disagrees with base's probabilistic formula for binary scores. Now stores per-sample attempt scores and computes setting/variant per aggregation mode, mirroring base._compute_pass_at_k semantics (probabilistic for binary, max for fractional, mean for avg-of-k). Signed-off-by: bzantium <ryumin93@gmail.com>

bzantium · 2026-06-10T07:15:39Z

Thanks for catching this. Fixed in 65b6241.

update() now stores per-sample attempt scores with setting / variant, and get_metrics() computes the breakdown per aggregation mode via a _mode_aggregator(mode_key) that mirrors _compute_pass_at_k (probabilistic for binary pass@k, max for fractional, mean for pass@1[avg-of-k]). pass@1 deliberately falls through to the generic pass@k branch at k=1 so binary subs match base's correct/N rather than scores[0].

bzantium force-pushed the feat/iheval branch from 8d41804 to 024716f Compare May 27, 2026 09:25

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

bzantium force-pushed the feat/iheval branch from 024716f to 3a06d12 Compare May 27, 2026 12:17

	- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
	- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).

		def _get_score_dict(self, prediction: dict) -> dict[str, float]:
		return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}

Uh oh!

Conversation

bzantium commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's included

Usage

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Kipok commented Jun 9, 2026

Uh oh!

bzantium commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bzantium commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading