Skip to content

Add IHEval (Instruction Hierarchy Evaluation) benchmark#1464

Open
bzantium wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
bzantium:feat/iheval
Open

Add IHEval (Instruction Hierarchy Evaluation) benchmark#1464
bzantium wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
bzantium:feat/iheval

Conversation

@bzantium

@bzantium bzantium commented May 27, 2026

Copy link
Copy Markdown
Contributor

Closes #1463.

Adds IHEval (Instruction Hierarchy Evaluation) — measures whether a model respects the system > user > tool instruction hierarchy. It is a benchmark group with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in aligned / conflict / reference settings.

What's included

  • nemo_skills/dataset/iheval/ — benchmark group (mirrors the mmau-pro pattern): group __init__.py (IS_BENCHMARK_GROUP + 9 sub-benchmarks + SCORE_MODULE), HF-based prepare.py, iheval_score.py group composite, and 9 sub-benchmark configs.
  • prepare.py downloads per-variant input_data.json from the zhihz0535/IHEval HF mirror via hf_hub_download; the generated test.jsonl is gitignored (not committed). Rows carry a pre-built OpenAI-style messages list (multi-turn + tool_calls / tool results), so generation uses ++prompt_format=openai.
  • Scoring lives in the standalone Apache-2.0 package bzantium/iheval; evaluation/evaluator/iheval.py is a thin lazy-import wrapper (same pattern as bfcl_eval), pinned in the Dockerfile. Rationale: upstream IHEval is CC BY-NC-ND with no installable package; the rule-following checkers derive from Google IFEval (Apache-2.0), the rest are independent re-implementations of the documented rule-based metrics.
  • IHEvalMetrics builds on the base pass@k machinery, so iheval:N yields pass@1[avg-of-N] (per-mode composites at the group level too), with per-setting (aligned/conflict/reference) and per-variant breakdowns.
  • Registered iheval_* eval types + the iheval metrics key; documented under instruction-following.

Usage

ns eval --benchmarks iheval ...            # all 9 sub-benchmarks
ns eval --benchmarks iheval.safety_hijack  # a single sub-benchmark

Summary by CodeRabbit

  • New Features

    • Added IHEval benchmark support: dataset preparation, nine sub-benchmarks, per-task evaluation entrypoints, and group-level score aggregation and metrics.
  • Documentation

    • Added IHEval docs describing benchmark structure, categories, settings, and resources.
  • Tests

    • Added comprehensive tests validating metrics aggregation, per-setting/variant breakdowns, and group scoring behavior.
  • Chores

    • Updated Docker build to install the IHEval scoring package.

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9000092e-d2c8-491a-bd9b-ce369d4b2753

📥 Commits

Reviewing files that changed from the base of the PR and between 3a06d12 and 65b6241.

📒 Files selected for processing (2)
  • nemo_skills/evaluation/metrics/iheval_metrics.py
  • tests/test_iheval_metrics.py

📝 Walkthrough

Walkthrough

Integrates the IHEval benchmark group (9 sub-benchmarks): adds dataset preparation, group-level scoring, evaluator wrappers and registry entries, metrics implementation and registration, Docker pinning of the scorer package, docs, and tests.

Changes

IHEval Benchmark Integration

Layer / File(s) Summary
Benchmark group structure and configuration
dockerfiles/Dockerfile.nemo-skills, docs/evaluation/instruction-following.md, nemo_skills/dataset/iheval/__init__.py, nemo_skills/dataset/iheval/*/__init__.py
Defines benchmark group metadata and splits, SCORE_MODULE, per-sub-benchmark default generation/eval config modules, docs describing the benchmark, and pins the external iheval package in the Docker build.
Dataset preparation from HuggingFace source
nemo_skills/dataset/iheval/prepare.py
Discovers variants in the zhihz0535/IHEval HF mirror, downloads input_data.json, builds OpenAI-style messages (multi-turn, tool calls, reference handling), and writes per-sub-benchmark test.jsonl files with CLI filtering support.
Group-level composite score computation
nemo_skills/dataset/iheval/iheval_score.py
Computes group composites per aggregation mode: overall symbolic_correct mean across available subs, per-category averages, per-setting aligned/conflict/reference averages, conflict_gap, totals, and preserves all pass@k modes from subs.
Evaluator registry integration
nemo_skills/evaluation/evaluator/__init__.py
Adds eval_type registry entries for iheval_* sub-benchmarks to route evaluation to the iheval evaluators that apply scorers to prediction JSONL files.
IHEval evaluator implementation
nemo_skills/evaluation/evaluator/iheval.py
Provides lazy import and pip-install hint for iheval, implements _score_in_place that scores and rewrites JSONL, and exposes eval_* functions per task.
Metrics implementation and registration
nemo_skills/evaluation/metrics/iheval_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Adds IHEvalMetrics (extends pass@k machinery using symbolic_correct floats, tracks per-setting and per-variant correctness sums/totals, and augments get_metrics() with breakdown fields) and registers "iheval" in METRICS_MAP.
Comprehensive test suite for metrics and scoring
tests/test_iheval_metrics.py, tests/test_iheval_score.py
Tests cover metric aggregation, per-setting/variant breakdowns, missing-key behavior, fractional correctness, avg-of-N pass@k modes, reset behavior, and group-level composite score correctness including category averages and conflict gap.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Kipok
  • gwarmstrong
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding the IHEval benchmark group. It is specific, directly related to the changeset, and provides meaningful information.
Linked Issues check ✅ Passed All core objectives from #1463 are met: benchmark group structure with IS_BENCHMARK_GROUP and 9 sub-benchmarks, prepare.py for HF data download, iheval_score.py for group composites, standalone Apache-2.0 scoring wrapper, IHEvalMetrics implementation, and per-setting/variant breakdowns. Reviewer feedback on metrics consistency was addressed in commit.
Out of Scope Changes check ✅ Passed All changes are in-scope: additions to nemo_skills/dataset/iheval/ for benchmark structure, prepare.py for data generation, evaluation/evaluator/iheval.py for scoring, metrics/iheval_metrics.py for pass@k computation, and supporting configuration files. Documentation and Dockerfile updates are directly supporting IHEval integration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Line 27: Replace the non-descriptive link text "here" on line referencing the
original benchmark source with descriptive text like "IHEval benchmark source"
(or similar descriptive phrase) so the markdown reads: "Original benchmark
source is IHEval benchmark source" and make the link target the same URL
(https://github.com/ytyz1307zzh/IHEval); update the link text in the file
docs/evaluation/instruction-following.md accordingly to improve accessibility
and readability.

In `@nemo_skills/dataset/iheval/prepare.py`:
- Around line 72-73: Replace silent .get() usage for required fields so
malformed input fails fast: change expressions like instruction =
row.get("instruction", "") and similar row.get("answer") usages to direct index
access (e.g., instruction = row["instruction"], answer = row["answer"]) in the
prepare logic, and if you need a clearer error message wrap the access in a
short try/except that raises a ValueError with context (including the row id or
contents) so missing required keys fail loudly; apply the same change for the
other occurrences in the block around the later rows (the group currently using
.get() at the second block mentioned).
- Around line 192-195: The CLI defines a --split argument but never uses it;
either remove the parser.add_argument("--split", ...) line or wire the parsed
"split" value into the downstream logic that selects which dataset split to
prepare (e.g., pass args.split into the function that loads/iterates splits or
replace any hard-coded "test" usage). Locate the parser.add_argument("--split",
...) and ensure the parsed variable named split is consumed where the code
currently assumes "test" (or add validation to reject non-supported values) so
the user-provided flag is not a no-op.
- Around line 146-166: The current loop writes directly to output_file which can
be left truncated on failure; instead build all records first (e.g., accumulate
into a list using list_variants, load_input_data, build_messages,
_last_user_text and the same record shape and rows_written logic) or write to a
temporary file (e.g., output_file.with_suffix(".tmp")) and only move/rename it
to output_file on successful completion; ensure you preserve the id format
("iheval-{task_id}-{setting}-{variant}-{upstream_id}") and increment
rows_written the same way before performing an atomic replace of output_file.

In `@nemo_skills/evaluation/metrics/iheval_metrics.py`:
- Around line 32-33: Replace the silent defaulting of "symbolic_correct" with
explicit required access so missing evaluator output fails fast: in the
_get_score_dict method (and the other place using .get("symbolic_correct", 0.0))
change to direct dictionary indexing (e.g., prediction["symbolic_correct"]) so a
KeyError is raised if the key is absent, ensuring corrupted/missing evaluator
output is detected rather than aggregated as 0.0; add a brief contextual error
message or let the KeyError propagate to surface the problem to callers.

In `@tests/test_iheval_score.py`:
- Line 116: The test loop binds an unused variable "name" in "for name, sub in
metrics.items()" causing a Ruff B007 warning; update the loop in
tests/test_iheval_score.py to either iterate values directly using "for sub in
metrics.values()" or replace "name" with "_" as "for _, sub in metrics.items()"
so only the used "sub" variable remains.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b0a5168-31d4-4944-bb19-b4cf1b5db447

📥 Commits

Reviewing files that changed from the base of the PR and between b620e79 and 8d41804.

📒 Files selected for processing (20)
  • dockerfiles/Dockerfile.nemo-skills
  • docs/evaluation/instruction-following.md
  • nemo_skills/dataset/iheval/__init__.py
  • nemo_skills/dataset/iheval/iheval_score.py
  • nemo_skills/dataset/iheval/prepare.py
  • nemo_skills/dataset/iheval/rule_following_multi/__init__.py
  • nemo_skills/dataset/iheval/rule_following_single/__init__.py
  • nemo_skills/dataset/iheval/safety_extract/__init__.py
  • nemo_skills/dataset/iheval/safety_hijack/__init__.py
  • nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
  • nemo_skills/dataset/iheval/task_execution_translation/__init__.py
  • nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
  • nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
  • nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/iheval.py
  • nemo_skills/evaluation/metrics/iheval_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • tests/test_iheval_metrics.py
  • tests/test_iheval_score.py

- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use descriptive link text instead of “here”.

Line 27 uses non-descriptive link text, which hurts docs readability and accessibility tooling checks.

Suggested doc diff
-- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
+- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 27-27: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/evaluation/instruction-following.md` at line 27, Replace the
non-descriptive link text "here" on line referencing the original benchmark
source with descriptive text like "IHEval benchmark source" (or similar
descriptive phrase) so the markdown reads: "Original benchmark source is IHEval
benchmark source" and make the link target the same URL
(https://github.com/ytyz1307zzh/IHEval); update the link text in the file
docs/evaluation/instruction-following.md accordingly to improve accessibility
and readability.

Comment on lines +72 to +73
instruction = row.get("instruction", "")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on required row fields instead of silently defaulting.

Using .get() for expected keys can silently corrupt output (instruction="", answer=None). For required schema fields, use direct indexing so malformed upstream data fails loudly.

Suggested change
-    instruction = row.get("instruction", "")
+    instruction = row["instruction"]
@@
-                    answer = row.get("answer")
+                    answer = row["answer"]
As per coding guidelines, "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

Also applies to: 152-163

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/iheval/prepare.py` around lines 72 - 73, Replace silent
.get() usage for required fields so malformed input fails fast: change
expressions like instruction = row.get("instruction", "") and similar
row.get("answer") usages to direct index access (e.g., instruction =
row["instruction"], answer = row["answer"]) in the prepare logic, and if you
need a clearer error message wrap the access in a short try/except that raises a
ValueError with context (including the row id or contents) so missing required
keys fail loudly; apply the same change for the other occurrences in the block
around the later rows (the group currently using .get() at the second block
mentioned).

Comment on lines +146 to +166
with output_file.open("w", encoding="utf-8") as fout:
for setting in SETTINGS:
for variant in list_variants(repo_files, category, task, setting):
data = load_input_data(category, task, setting, variant)
for i, row in enumerate(data):
messages = build_messages(row, category_id, task_id, setting)
answer = row.get("answer")
upstream_id = row.get("id", i)
record = {
"id": f"iheval-{task_id}-{setting}-{variant}-{upstream_id}",
"messages": messages,
"question": _last_user_text(messages),
"setting": setting,
"variant": variant,
"category": category_id,
"task": task_id,
"answer": answer,
"expected_answer": answer,
}
fout.write(json.dumps(record, ensure_ascii=False) + "\n")
rows_written += 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Write test.jsonl atomically after successful preparation.

The final output file is opened before all downloads/transforms finish. A mid-run failure can leave a truncated file. Build records first (or write to a temp file) and replace the target only on success.

Suggested change
 def process_sub_benchmark(out_root, repo_files, split_name, spec):
@@
-    rows_written = 0
-    with output_file.open("w", encoding="utf-8") as fout:
-        for setting in SETTINGS:
-            for variant in list_variants(repo_files, category, task, setting):
-                data = load_input_data(category, task, setting, variant)
-                for i, row in enumerate(data):
-                    ...
-                    fout.write(json.dumps(record, ensure_ascii=False) + "\n")
-                    rows_written += 1
+    records = []
+    for setting in SETTINGS:
+        for variant in list_variants(repo_files, category, task, setting):
+            data = load_input_data(category, task, setting, variant)
+            for i, row in enumerate(data):
+                ...
+                records.append(record)
+
+    tmp_file = out_dir / "test.jsonl.tmp"
+    with tmp_file.open("w", encoding="utf-8") as fout:
+        for record in records:
+            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
+    tmp_file.replace(output_file)
+    rows_written = len(records)
As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/iheval/prepare.py` around lines 146 - 166, The current
loop writes directly to output_file which can be left truncated on failure;
instead build all records first (e.g., accumulate into a list using
list_variants, load_input_data, build_messages, _last_user_text and the same
record shape and rows_written logic) or write to a temporary file (e.g.,
output_file.with_suffix(".tmp")) and only move/rename it to output_file on
successful completion; ensure you preserve the id format
("iheval-{task_id}-{setting}-{variant}-{upstream_id}") and increment
rows_written the same way before performing an atomic replace of output_file.

Comment on lines +192 to +195
parser.add_argument("--split", default="test", choices=("test",), help="Local split name.")
parser.add_argument(
"--only", nargs="*", default=None, help="Restrict to these sub-benchmark output dirs (e.g. safety_hijack)."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove or wire --split; it is currently ignored.

--split is user-facing but not used in execution, so it behaves like a no-op argument.

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

Also applies to: 197-197

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/dataset/iheval/prepare.py` around lines 192 - 195, The CLI
defines a --split argument but never uses it; either remove the
parser.add_argument("--split", ...) line or wire the parsed "split" value into
the downstream logic that selects which dataset split to prepare (e.g., pass
args.split into the function that loads/iterates splits or replace any
hard-coded "test" usage). Locate the parser.add_argument("--split", ...) and
ensure the parsed variable named split is consumed where the code currently
assumes "test" (or add validation to reject non-supported values) so the
user-provided flag is not a no-op.

Comment on lines +32 to +33
def _get_score_dict(self, prediction: dict) -> dict[str, float]:
return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when symbolic_correct is missing instead of silently defaulting to 0.0.

On Line 33 and Line 46, .get("symbolic_correct", 0.0) can mask bad evaluator output and skew aggregated scores. symbolic_correct is required for this metric path, so direct key access is safer.

Proposed fix
 def _get_score_dict(self, prediction: dict) -> dict[str, float]:
-    return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}
+    return {"symbolic_correct": float(prediction["symbolic_correct"])}
@@
-    scores = [float(p.get("symbolic_correct", 0.0)) for p in predictions]
+    scores = [float(p["symbolic_correct"]) for p in predictions]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 46-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/evaluation/metrics/iheval_metrics.py` around lines 32 - 33,
Replace the silent defaulting of "symbolic_correct" with explicit required
access so missing evaluator output fails fast: in the _get_score_dict method
(and the other place using .get("symbolic_correct", 0.0)) change to direct
dictionary indexing (e.g., prediction["symbolic_correct"]) so a KeyError is
raised if the key is absent, ensuring corrupted/missing evaluator output is
detected rather than aggregated as 0.0; add a brief contextual error message or
let the KeyError propagate to surface the problem to callers.

def test_pass_at_k_modes_are_aggregated(self):
# iheval:N -> each sub reports extra agg modes; the group composite must surface them all.
metrics = _full_metrics()
for name, sub in metrics.items():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unused loop variable in test loop.

Line 116 binds name but never uses it. Iterate over values directly (or rename to _) to satisfy Ruff B007.

Proposed fix
-        for name, sub in metrics.items():
+        for sub in metrics.values():
             base = sub["pass@1"]["symbolic_correct"]
             sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
             sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for name, sub in metrics.items():
for sub in metrics.values():
base = sub["pass@1"]["symbolic_correct"]
sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}
🧰 Tools
🪛 Ruff (0.15.14)

[warning] 116-116: Loop control variable name not used within loop body

Rename unused name to _name

(B007)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_iheval_score.py` at line 116, The test loop binds an unused
variable "name" in "for name, sub in metrics.items()" causing a Ruff B007
warning; update the loop in tests/test_iheval_score.py to either iterate values
directly using "for sub in metrics.values()" or replace "name" with "_" as "for
_, sub in metrics.items()" so only the used "sub" variable remains.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/test_iheval_score.py (1)

116-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove unused loop variable.

Line 116 binds name but never uses it in the loop body. This violates the guideline to avoid unused parameters and triggers a Ruff B007 warning.

♻️ Proposed fix
-        for name, sub in metrics.items():
+        for sub in metrics.values():
             base = sub["pass@1"]["symbolic_correct"]
             sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
             sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_iheval_score.py` around lines 116 - 119, The for-loop over metrics
binds an unused variable name which triggers Ruff B007; change the loop to avoid
binding name by iterating values directly (use metrics.values()) or use a
discard variable (_) so only sub is bound; update the loop that currently reads
"for name, sub in metrics.items()" to either "for sub in metrics.values()" or
"for _, sub in metrics.items()" so name is not unused while keeping the body
that assigns to sub["pass@4"] and sub["pass@1[avg-of-4]"] unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Around line 19-27: Add runnable example commands and expected-result snippets
to the IHEval section in docs/evaluation/instruction-following.md: include
example ns eval invocations (e.g., run the whole group with --benchmarks iheval
and a single sub-benchmark like --benchmarks iheval.safety_hijack) and show
required flags (--model, --output_dir); document expected metrics such as
overall symbolic_correct, per-setting breakdowns for aligned / conflict /
reference, category/group averages and the conflict_gap (reference - conflict);
mention where the benchmark is defined (nemo_skills/dataset/iheval/__init__.py)
and that rule-based scoring comes from the bzantium/iheval package so readers
know prerequisites.

---

Duplicate comments:
In `@tests/test_iheval_score.py`:
- Around line 116-119: The for-loop over metrics binds an unused variable name
which triggers Ruff B007; change the loop to avoid binding name by iterating
values directly (use metrics.values()) or use a discard variable (_) so only sub
is bound; update the loop that currently reads "for name, sub in
metrics.items()" to either "for sub in metrics.values()" or "for _, sub in
metrics.items()" so name is not unused while keeping the body that assigns to
sub["pass@4"] and sub["pass@1[avg-of-4]"] unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07e6fd54-0165-4dbc-af8b-09c90af9e88c

📥 Commits

Reviewing files that changed from the base of the PR and between 8d41804 and 024716f.

📒 Files selected for processing (20)
  • dockerfiles/Dockerfile.nemo-skills
  • docs/evaluation/instruction-following.md
  • nemo_skills/dataset/iheval/__init__.py
  • nemo_skills/dataset/iheval/iheval_score.py
  • nemo_skills/dataset/iheval/prepare.py
  • nemo_skills/dataset/iheval/rule_following_multi/__init__.py
  • nemo_skills/dataset/iheval/rule_following_single/__init__.py
  • nemo_skills/dataset/iheval/safety_extract/__init__.py
  • nemo_skills/dataset/iheval/safety_hijack/__init__.py
  • nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
  • nemo_skills/dataset/iheval/task_execution_translation/__init__.py
  • nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
  • nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
  • nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/iheval.py
  • nemo_skills/evaluation/metrics/iheval_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • tests/test_iheval_metrics.py
  • tests/test_iheval_score.py
✅ Files skipped from review due to trivial changes (6)
  • nemo_skills/dataset/iheval/tool_use_slack_user/init.py
  • nemo_skills/dataset/iheval/rule_following_single/init.py
  • nemo_skills/dataset/iheval/task_execution_verb_extract/init.py
  • nemo_skills/dataset/iheval/init.py
  • nemo_skills/dataset/iheval/tool_use_webpage/init.py
  • nemo_skills/dataset/iheval/safety_hijack/init.py
🚧 Files skipped from review as they are similar to previous changes (11)
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/dataset/iheval/task_execution_lang_detect/init.py
  • nemo_skills/dataset/iheval/task_execution_translation/init.py
  • nemo_skills/dataset/iheval/rule_following_multi/init.py
  • tests/test_iheval_metrics.py
  • nemo_skills/evaluation/evaluator/init.py
  • nemo_skills/evaluation/evaluator/iheval.py
  • nemo_skills/dataset/iheval/safety_extract/init.py
  • nemo_skills/evaluation/metrics/iheval_metrics.py
  • nemo_skills/dataset/iheval/prepare.py
  • nemo_skills/dataset/iheval/iheval_score.py

Comment on lines +19 to +27
IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the
system > user > tool instruction hierarchy. It is a **benchmark group** with 9
sub-benchmarks across 4 categories (rule-following, task-execution, safety,
tool-use), each in `aligned` / `conflict` / `reference` settings.

- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add runnable IHEval examples and expected-result snippets in this new section.

This new benchmark section documents scope, but it’s missing example eval commands and expected results for tested models, which are required for benchmark docs completeness.

Suggested doc addition
 ### iheval
 
 IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the
 system > user > tool instruction hierarchy. It is a **benchmark group** with 9
 sub-benchmarks across 4 categories (rule-following, task-execution, safety,
 tool-use), each in `aligned` / `conflict` / `reference` settings.
 
 - Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
 - Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
 - Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
 - Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
+
+Example:
+```bash
+ns eval --benchmarks iheval --model <model_name> --output_dir <out_dir>
+ns eval --benchmarks iheval.safety_hijack --model <model_name> --output_dir <out_dir>
+```
+
+Expected results:
+- Metrics include overall `symbolic_correct` and setting breakdowns for `aligned`, `conflict`, and `reference`.
+- Group outputs include category-level averages and `conflict_gap` (`reference - conflict`).

As per coding guidelines, “When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models”.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 27-27: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/evaluation/instruction-following.md` around lines 19 - 27, Add runnable
example commands and expected-result snippets to the IHEval section in
docs/evaluation/instruction-following.md: include example ns eval invocations
(e.g., run the whole group with --benchmarks iheval and a single sub-benchmark
like --benchmarks iheval.safety_hijack) and show required flags (--model,
--output_dir); document expected metrics such as overall symbolic_correct,
per-setting breakdowns for aligned / conflict / reference, category/group
averages and the conflict_gap (reference - conflict); mention where the
benchmark is defined (nemo_skills/dataset/iheval/__init__.py) and that
rule-based scoring comes from the bzantium/iheval package so readers know
prerequisites.

IHEval is a benchmark group with 9 sub-benchmarks across 4 categories
(rule-following, task-execution, safety, tool-use), each in aligned /
conflict / reference settings.

- dataset/iheval: benchmark group + HF-based prepare.py (downloads per-variant
  input_data.json from the zhihz0535/IHEval mirror; test.jsonl is gitignored),
  iheval_score.py group composite, and 9 sub-benchmark configs.
- Scoring lives in the standalone pip package bzantium/iheval (Apache-2.0);
  evaluator/iheval.py is a thin lazy-import wrapper (bfcl-style). Pinned in the
  Dockerfile.
- IHEvalMetrics uses the base pass@k machinery so iheval:N yields
  pass@1[avg-of-N], plus per-setting / per-variant breakdowns.
- Registered iheval_* eval types and the iheval metrics key; documented under
  instruction-following.

Signed-off-by: bzantium <ryumin93@gmail.com>
@Kipok

Kipok commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Codex review comment posted on behalf of @Kipok:

I found one metrics issue in this PR.

nemo_skills/evaluation/metrics/iheval_metrics.py:59 copies the same setting_* and variant_* breakdowns into every aggregation mode returned by get_metrics(). Those breakdowns are computed earlier from the average over all attempts for a sample, while pass@N.symbolic_correct is computed as best-of-N.

That makes pass@N internally inconsistent for eval_config.generation.num_random_seeds > 1. For example, with two attempts for the same item where one passes and one fails, pass@2.symbolic_correct reports 100.0, but pass@2.setting_aligned / variant_* report 50.0.

I think the fix should be either:

  • compute setting/variant breakdowns separately for each aggregation mode, mirroring the pass@k logic; or
  • only attach these breakdowns to aggregation modes where average-over-attempts is the intended meaning, and omit them from pass@N.

The previous implementation copied the same setting_* / variant_* values
(computed as mean-over-attempts) into every aggregation mode, so pass@N
reported best-of-N for symbolic_correct but mean-over-attempts for the
breakdowns — internally inconsistent at N>1. Also, pass@1 used scores[0],
which disagrees with base's probabilistic formula for binary scores.

Now stores per-sample attempt scores and computes setting/variant per
aggregation mode, mirroring base._compute_pass_at_k semantics
(probabilistic for binary, max for fractional, mean for avg-of-k).

Signed-off-by: bzantium <ryumin93@gmail.com>
@bzantium

Copy link
Copy Markdown
Contributor Author

Thanks for catching this. Fixed in 65b6241.

update() now stores per-sample attempt scores with setting / variant, and get_metrics() computes the breakdown per aggregation mode via a _mode_aggregator(mode_key) that mirrors _compute_pass_at_k (probabilistic for binary pass@k, max for fractional, mean for pass@1[avg-of-k]). pass@1 deliberately falls through to the generic pass@k branch at k=1 so binary subs match base's correct/N rather than scores[0].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add IHEval (Instruction Hierarchy Evaluation) benchmark

2 participants