Add IHEval (Instruction Hierarchy Evaluation) benchmark #1464
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (2)
📝 Walkthrough
Integrates the IHEval benchmark group (9 sub-benchmarks): adds dataset preparation, group-level scoring, evaluator wrappers and registry entries, metrics implementation and registration, Docker pinning of the scorer package, docs, and tests.
Changes: IHEval Benchmark Integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Line 27: Replace the non-descriptive link text "here" on the line referencing
the original benchmark source with descriptive text such as "IHEval benchmark
source", so the markdown reads: "Original benchmark source is [IHEval benchmark
source](https://github.com/ytyz1307zzh/IHEval)". Keep the link target unchanged
and update the link text in docs/evaluation/instruction-following.md to improve
accessibility and readability.
In `@nemo_skills/dataset/iheval/prepare.py`:
- Around line 72-73: Replace silent .get() usage for required fields so
malformed input fails fast: change expressions like instruction =
row.get("instruction", "") and similar row.get("answer") usages to direct index
access (e.g., instruction = row["instruction"], answer = row["answer"]) in the
prepare logic, and if you need a clearer error message wrap the access in a
short try/except that raises a ValueError with context (including the row id or
contents) so missing required keys fail loudly; apply the same change for the
other occurrences in the block around the later rows (the group currently using
.get() at the second block mentioned).
- Around line 192-195: The CLI defines a --split argument but never uses it;
either remove the parser.add_argument("--split", ...) line or wire the parsed
"split" value into the downstream logic that selects which dataset split to
prepare (e.g., pass args.split into the function that loads/iterates splits or
replace any hard-coded "test" usage). Locate the parser.add_argument("--split",
...) and ensure the parsed variable named split is consumed where the code
currently assumes "test" (or add validation to reject non-supported values) so
the user-provided flag is not a no-op.
- Around line 146-166: The current loop writes directly to output_file which can
be left truncated on failure; instead build all records first (e.g., accumulate
into a list using list_variants, load_input_data, build_messages,
_last_user_text and the same record shape and rows_written logic) or write to a
temporary file (e.g., output_file.with_suffix(".tmp")) and only move/rename it
to output_file on successful completion; ensure you preserve the id format
("iheval-{task_id}-{setting}-{variant}-{upstream_id}") and increment
rows_written the same way before performing an atomic replace of output_file.
In `@nemo_skills/evaluation/metrics/iheval_metrics.py`:
- Around line 32-33: Replace the silent defaulting of "symbolic_correct" with
explicit required access so missing evaluator output fails fast: in the
_get_score_dict method (and the other place using .get("symbolic_correct", 0.0))
change to direct dictionary indexing (e.g., prediction["symbolic_correct"]) so a
KeyError is raised if the key is absent, ensuring corrupted/missing evaluator
output is detected rather than aggregated as 0.0; add a brief contextual error
message or let the KeyError propagate to surface the problem to callers.
In `@tests/test_iheval_score.py`:
- Line 116: The test loop binds an unused variable "name" in "for name, sub in
metrics.items()" causing a Ruff B007 warning; update the loop in
tests/test_iheval_score.py to either iterate values directly using "for sub in
metrics.values()" or replace "name" with "_" as "for _, sub in metrics.items()"
so only the used "sub" variable remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4b0a5168-31d4-4944-bb19-b4cf1b5db447
📒 Files selected for processing (20)
- dockerfiles/Dockerfile.nemo-skills
- docs/evaluation/instruction-following.md
- nemo_skills/dataset/iheval/__init__.py
- nemo_skills/dataset/iheval/iheval_score.py
- nemo_skills/dataset/iheval/prepare.py
- nemo_skills/dataset/iheval/rule_following_multi/__init__.py
- nemo_skills/dataset/iheval/rule_following_single/__init__.py
- nemo_skills/dataset/iheval/safety_extract/__init__.py
- nemo_skills/dataset/iheval/safety_hijack/__init__.py
- nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
- nemo_skills/dataset/iheval/task_execution_translation/__init__.py
- nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
- nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
- nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/iheval.py
- nemo_skills/evaluation/metrics/iheval_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- tests/test_iheval_metrics.py
- tests/test_iheval_score.py
- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
Use descriptive link text instead of “here”.
Line 27 uses non-descriptive link text, which hurts docs readability and accessibility tooling checks.
Suggested doc diff
-- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
+- Original IHEval benchmark source is [ytyz1307zzh/IHEval](https://github.com/ytyz1307zzh/IHEval).
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 27-27: Link text should be descriptive
(MD059, descriptive-link-text)
instruction = row.get("instruction", "")
Fail fast on required row fields instead of silently defaulting.
Using .get() for expected keys can silently corrupt output (instruction="", answer=None). For required schema fields, use direct indexing so malformed upstream data fails loudly.
Suggested change
- instruction = row.get("instruction", "")
+ instruction = row["instruction"]
@@
- answer = row.get("answer")
+ answer = row["answer"]
Also applies to: 152-163
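The fail-fast access suggested above can be sketched as a small helper. This is a minimal illustration, not code from the PR: `require_field` and the sample rows are hypothetical names introduced only for this example.

```python
def require_field(row: dict, key: str, row_index: int):
    """Return row[key]; raise ValueError with context if the key is absent.

    Hypothetical helper, not part of prepare.py.
    """
    try:
        return row[key]
    except KeyError as exc:
        raise ValueError(
            f"Row {row_index} is missing required field {key!r}; "
            f"available keys: {sorted(row)}"
        ) from exc


# Made-up rows: one well-formed, one missing the required "instruction" key.
good_row = {"instruction": "Translate to French.", "answer": "Bonjour"}
bad_row = {"answer": "Bonjour"}

instruction = require_field(good_row, "instruction", row_index=0)

try:
    require_field(bad_row, "instruction", row_index=1)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

With `row.get("instruction", "")` the second row would silently yield an empty instruction; here it stops preparation with a message naming the row and the missing key.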
with output_file.open("w", encoding="utf-8") as fout:
    for setting in SETTINGS:
        for variant in list_variants(repo_files, category, task, setting):
            data = load_input_data(category, task, setting, variant)
            for i, row in enumerate(data):
                messages = build_messages(row, category_id, task_id, setting)
                answer = row.get("answer")
                upstream_id = row.get("id", i)
                record = {
                    "id": f"iheval-{task_id}-{setting}-{variant}-{upstream_id}",
                    "messages": messages,
                    "question": _last_user_text(messages),
                    "setting": setting,
                    "variant": variant,
                    "category": category_id,
                    "task": task_id,
                    "answer": answer,
                    "expected_answer": answer,
                }
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                rows_written += 1
Write test.jsonl atomically after successful preparation.
The final output file is opened before all downloads/transforms finish. A mid-run failure can leave a truncated file. Build records first (or write to a temp file) and replace the target only on success.
Suggested change
 def process_sub_benchmark(out_root, repo_files, split_name, spec):
@@
-    rows_written = 0
-    with output_file.open("w", encoding="utf-8") as fout:
-        for setting in SETTINGS:
-            for variant in list_variants(repo_files, category, task, setting):
-                data = load_input_data(category, task, setting, variant)
-                for i, row in enumerate(data):
-                    ...
-                    fout.write(json.dumps(record, ensure_ascii=False) + "\n")
-                    rows_written += 1
+    records = []
+    for setting in SETTINGS:
+        for variant in list_variants(repo_files, category, task, setting):
+            data = load_input_data(category, task, setting, variant)
+            for i, row in enumerate(data):
+                ...
+                records.append(record)
+
+    tmp_file = out_dir / "test.jsonl.tmp"
+    with tmp_file.open("w", encoding="utf-8") as fout:
+        for record in records:
+            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
+    tmp_file.replace(output_file)
+    rows_written = len(records)
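A self-contained sketch of the write-then-replace pattern discussed above, using only the standard library. `write_jsonl_atomically` and the demo record are hypothetical and are not the PR's `process_sub_benchmark`. The temporary file is created in the target's directory so that `os.replace` stays on one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomically(records, output_file: Path) -> int:
    """Write records to a temp file, then swap it into place on success.

    Hypothetical helper; returns the number of rows written.
    """
    output_file.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    rows_written = 0
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for record in records:
                fout.write(json.dumps(record, ensure_ascii=False) + "\n")
                rows_written += 1
        # Only reached when every record was written without error.
        os.replace(tmp_name, output_file)
    except BaseException:
        # Leave any existing output untouched and drop the partial temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return rows_written


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "test.jsonl"
    demo_records = [{"id": "iheval-demo-aligned-default-0", "answer": "ok"}]  # made-up record
    rows_written = write_jsonl_atomically(demo_records, target)
    first_record = json.loads(target.read_text(encoding="utf-8").splitlines()[0])
```

If `records` is a generator that raises midway (for example, a failed download), the previous `test.jsonl` is left as it was instead of being truncated.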
parser.add_argument("--split", default="test", choices=("test",), help="Local split name.")
parser.add_argument(
    "--only", nargs="*", default=None, help="Restrict to these sub-benchmark output dirs (e.g. safety_hijack)."
)
Remove or wire --split; it is currently ignored.
--split is user-facing but not used in execution, so it behaves like a no-op argument.
As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
Also applies to: 197-197
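One way to make `--split` meaningful is to let the parsed value determine the output file name, as in this rough sketch. `build_parser` and `output_filename` are hypothetical names, and the real script may wire the value differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the two flags quoted above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Prepare IHEval data (sketch).")
    parser.add_argument("--split", default="test", choices=("test",), help="Local split name.")
    parser.add_argument("--only", nargs="*", default=None, help="Restrict to these sub-benchmark output dirs.")
    return parser


def output_filename(split: str) -> str:
    """Derive the output file name from the split instead of hard-coding 'test'."""
    return f"{split}.jsonl"


args = build_parser().parse_args(["--split", "test", "--only", "safety_hijack"])
target_name = output_filename(args.split)
```

Because `choices=("test",)` is kept, argparse already rejects unsupported splits with a non-zero exit, so the flag either takes effect or fails loudly.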
def _get_score_dict(self, prediction: dict) -> dict[str, float]:
    return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}
Fail fast when symbolic_correct is missing instead of silently defaulting to 0.0.
On Line 33 and Line 46, .get("symbolic_correct", 0.0) can mask bad evaluator output and skew aggregated scores. symbolic_correct is required for this metric path, so direct key access is safer.
Proposed fix
 def _get_score_dict(self, prediction: dict) -> dict[str, float]:
-    return {"symbolic_correct": float(prediction.get("symbolic_correct", 0.0))}
+    return {"symbolic_correct": float(prediction["symbolic_correct"])}
@@
-    scores = [float(p.get("symbolic_correct", 0.0)) for p in predictions]
+    scores = [float(p["symbolic_correct"]) for p in predictions]
As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
Also applies to: 46-47
def test_pass_at_k_modes_are_aggregated(self):
    # iheval:N -> each sub reports extra agg modes; the group composite must surface them all.
    metrics = _full_metrics()
    for name, sub in metrics.items():
Remove unused loop variable in test loop.
Line 116 binds name but never uses it. Iterate over values directly (or rename to _) to satisfy Ruff B007.
Proposed fix
-    for name, sub in metrics.items():
+    for sub in metrics.values():
         base = sub["pass@1"]["symbolic_correct"]
         sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
         sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}
🧰 Tools
🪛 Ruff (0.15.14)
[warning] 116-116: Loop control variable name not used within loop body
Rename unused name to _name
(B007)
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/test_iheval_score.py (1)
116-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Remove unused loop variable.
Line 116 binds `name` but never uses it in the loop body. This violates the guideline to avoid unused parameters and triggers a Ruff B007 warning.
♻️ Proposed fix
-    for name, sub in metrics.items():
+    for sub in metrics.values():
         base = sub["pass@1"]["symbolic_correct"]
         sub["pass@4"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base + 10}
         sub["pass@1[avg-of-4]"] = {"num_entries": sub["pass@1"]["num_entries"], "symbolic_correct": base - 5}
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/evaluation/instruction-following.md`:
- Around line 19-27: Add runnable example commands and expected-result snippets
to the IHEval section in docs/evaluation/instruction-following.md: include
example ns eval invocations (e.g., run the whole group with --benchmarks iheval
and a single sub-benchmark like --benchmarks iheval.safety_hijack) and show
required flags (--model, --output_dir); document expected metrics such as
overall symbolic_correct, per-setting breakdowns for aligned / conflict /
reference, category/group averages and the conflict_gap (reference - conflict);
mention where the benchmark is defined (nemo_skills/dataset/iheval/__init__.py)
and that rule-based scoring comes from the bzantium/iheval package so readers
know prerequisites.
---
Duplicate comments:
In `@tests/test_iheval_score.py`:
- Around line 116-119: The for-loop over metrics binds an unused variable name
which triggers Ruff B007; change the loop to avoid binding name by iterating
values directly (use metrics.values()) or use a discard variable (_) so only sub
is bound; update the loop that currently reads "for name, sub in
metrics.items()" to either "for sub in metrics.values()" or "for _, sub in
metrics.items()" so name is not unused while keeping the body that assigns to
sub["pass@4"] and sub["pass@1[avg-of-4]"] unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 07e6fd54-0165-4dbc-af8b-09c90af9e88c
✅ Files skipped from review due to trivial changes (6)
- nemo_skills/dataset/iheval/tool_use_slack_user/__init__.py
- nemo_skills/dataset/iheval/rule_following_single/__init__.py
- nemo_skills/dataset/iheval/task_execution_verb_extract/__init__.py
- nemo_skills/dataset/iheval/__init__.py
- nemo_skills/dataset/iheval/tool_use_webpage/__init__.py
- nemo_skills/dataset/iheval/safety_hijack/__init__.py
🚧 Files skipped from review as they are similar to previous changes (11)
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/dataset/iheval/task_execution_lang_detect/__init__.py
- nemo_skills/dataset/iheval/task_execution_translation/__init__.py
- nemo_skills/dataset/iheval/rule_following_multi/__init__.py
- tests/test_iheval_metrics.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/iheval.py
- nemo_skills/dataset/iheval/safety_extract/__init__.py
- nemo_skills/evaluation/metrics/iheval_metrics.py
- nemo_skills/dataset/iheval/prepare.py
- nemo_skills/dataset/iheval/iheval_score.py
IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the
system > user > tool instruction hierarchy. It is a **benchmark group** with 9
sub-benchmarks across 4 categories (rule-following, task-execution, safety,
tool-use), each in `aligned` / `conflict` / `reference` settings.

- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
Add runnable IHEval examples and expected-result snippets in this new section.
This new benchmark section documents scope, but it’s missing example eval commands and expected results for tested models, which are required for benchmark docs completeness.
Suggested doc addition
### iheval
IHEval (Instruction Hierarchy Evaluation) measures whether a model respects the
system > user > tool instruction hierarchy. It is a **benchmark group** with 9
sub-benchmarks across 4 categories (rule-following, task-execution, safety,
tool-use), each in `aligned` / `conflict` / `reference` settings.
- Benchmark group is defined in [`nemo_skills/dataset/iheval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/iheval/__init__.py); run all sub-benchmarks with `--benchmarks iheval` (or a single one, e.g. `iheval.safety_hijack`).
- Data is downloaded at prepare time from the [`zhihz0535/IHEval`](https://huggingface.co/datasets/zhihz0535/IHEval) HuggingFace mirror (not committed).
- Rule-based scoring lives in the standalone [`bzantium/iheval`](https://github.com/bzantium/iheval) package — install with `pip install git+https://github.com/bzantium/iheval.git` (already baked into the nemo-skills Docker image).
- Original benchmark source is [here](https://github.com/ytyz1307zzh/IHEval).
+
+Example:
+```bash
+ns eval --benchmarks iheval --model <model_name> --output_dir <out_dir>
+ns eval --benchmarks iheval.safety_hijack --model <model_name> --output_dir <out_dir>
+```
+
+Expected results:
+- Metrics include overall `symbolic_correct` and setting breakdowns for `aligned`, `conflict`, and `reference`.
+- Group outputs include category-level averages and `conflict_gap` (`reference - conflict`).
As per coding guidelines, “When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models”.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 27-27: Link text should be descriptive
(MD059, descriptive-link-text)
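To make the metrics named in this comment concrete, here is a small sketch that averages `symbolic_correct` per setting and derives `conflict_gap` as `reference - conflict`. The function name `setting_breakdown` and the prediction records are made up for illustration. The actual `IHEvalMetrics` output keys may differ.

```python
from collections import defaultdict
from statistics import mean


def setting_breakdown(predictions: list[dict]) -> dict[str, float]:
    """Average symbolic_correct per setting and add the reference-minus-conflict gap.

    Hypothetical helper, not the PR's IHEvalMetrics.
    """
    by_setting = defaultdict(list)
    for prediction in predictions:
        # Direct indexing: a missing key should fail loudly rather than count as 0.
        by_setting[prediction["setting"]].append(float(prediction["symbolic_correct"]))
    summary = {setting: mean(scores) for setting, scores in by_setting.items()}
    summary["conflict_gap"] = summary["reference"] - summary["conflict"]
    return summary


# Made-up predictions, one dict per evaluated row.
predictions = [
    {"setting": "aligned", "symbolic_correct": 1.0},
    {"setting": "conflict", "symbolic_correct": 1.0},
    {"setting": "conflict", "symbolic_correct": 0.0},
    {"setting": "reference", "symbolic_correct": 1.0},
    {"setting": "reference", "symbolic_correct": 1.0},
]

summary = setting_breakdown(predictions)
```

A positive gap means the model scores lower when a lower-priority instruction conflicts with a higher-priority one than it does in the reference setting.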
IHEval is a benchmark group with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in aligned / conflict / reference settings.
- dataset/iheval: benchmark group + HF-based prepare.py (downloads per-variant input_data.json from the zhihz0535/IHEval mirror; test.jsonl is gitignored), iheval_score.py group composite, and 9 sub-benchmark configs.
- Scoring lives in the standalone pip package bzantium/iheval (Apache-2.0); evaluator/iheval.py is a thin lazy-import wrapper (bfcl-style). Pinned in the Dockerfile.
- IHEvalMetrics uses the base pass@k machinery so iheval:N yields pass@1[avg-of-N], plus per-setting / per-variant breakdowns.
- Registered iheval_* eval types and the iheval metrics key; documented under instruction-following.
Signed-off-by: bzantium <ryumin93@gmail.com>
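The "thin lazy-import wrapper" idea from the commit message can be sketched as follows. This is an assumption-laden illustration: `load_scorer` is a hypothetical name, the default module name `iheval` is assumed rather than confirmed, and the real `evaluator/iheval.py` may look different.

```python
import importlib


def load_scorer(module_name: str = "iheval"):
    """Import the scoring package only when an evaluation actually needs it.

    The module name is an assumption for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name!r}. Install the scorer with "
            "`pip install git+https://github.com/bzantium/iheval.git`."
        ) from exc


# A stdlib module stands in for the scorer so the sketch runs without extra installs.
stand_in = load_scorer("json")

try:
    load_scorer("definitely_not_installed_scorer_stub")
    install_hint = None
except ImportError as err:
    install_hint = str(err)
```

Deferring the import keeps the rest of the package usable when the optional scorer is absent, and the error tells the user how to install it.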
Codex review comment posted on behalf of @Kipok: I found one metrics issue in this PR.
That makes I think the fix should be either:
|
The previous implementation copied the same setting_* / variant_* values (computed as mean-over-attempts) into every aggregation mode, so pass@N reported best-of-N for symbolic_correct but mean-over-attempts for the breakdowns, which is internally inconsistent at N>1.
Also, pass@1 used scores[0], which disagrees with base's probabilistic formula for binary scores.
Now stores per-sample attempt scores and computes setting/variant per aggregation mode, mirroring base._compute_pass_at_k semantics (probabilistic for binary, max for fractional, mean for avg-of-k).
Signed-off-by: bzantium <ryumin93@gmail.com>
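A rough sketch of the per-mode breakdown that the commit message describes. It is not the NeMo-Skills implementation: `pass_at_k`, `avg_of_k`, `per_setting` and the sample layout are hypothetical. The binary case uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k), which is assumed to be what "probabilistic" refers to.

```python
from math import comb
from statistics import mean


def pass_at_k(scores: list[float], k: int) -> float:
    """pass@k for one sample: estimator for binary scores, max of first k otherwise."""
    n = len(scores)
    if all(score in (0.0, 1.0) for score in scores):
        correct = int(sum(scores))
        # math.comb returns 0 when k exceeds the number of incorrect attempts.
        return 1.0 - comb(n - correct, k) / comb(n, k)
    return max(scores[:k])


def avg_of_k(scores: list[float], k: int) -> float:
    """Mean score over the first k attempts."""
    return mean(scores[:k])


def per_setting(samples: list[dict], aggregate, k: int) -> dict[str, float]:
    """Apply one aggregation mode per sample, then average within each setting."""
    grouped: dict[str, list[float]] = {}
    for sample in samples:
        grouped.setdefault(sample["setting"], []).append(aggregate(sample["scores"], k))
    return {setting: mean(values) for setting, values in grouped.items()}


# Made-up attempt scores: four attempts per sample.
samples = [
    {"setting": "aligned", "scores": [1.0, 0.0, 0.0, 0.0]},
    {"setting": "conflict", "scores": [0.0, 0.0, 0.0, 0.0]},
    {"setting": "conflict", "scores": [1.0, 1.0, 0.0, 0.0]},
]

pass_at_4 = per_setting(samples, pass_at_k, k=4)
pass_at_1 = per_setting(samples, pass_at_k, k=1)
avg_of_4 = per_setting(samples, avg_of_k, k=4)
```

Because each breakdown is recomputed with the same aggregation function as the headline number, the pass@4 breakdown reflects best-of-4 while the avg-of-4 breakdown reflects the mean, rather than both reusing mean-over-attempts values.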
Thanks for catching this. Fixed in 65b6241.
|
Closes #1463.
Adds IHEval (Instruction Hierarchy Evaluation), which measures whether a model respects the system > user > tool instruction hierarchy. It is a benchmark group with 9 sub-benchmarks across 4 categories (rule-following, task-execution, safety, tool-use), each in `aligned` / `conflict` / `reference` settings.
What's included
- `nemo_skills/dataset/iheval/`: benchmark group (mirrors the `mmau-pro` pattern): group `__init__.py` (`IS_BENCHMARK_GROUP` + 9 sub-benchmarks + `SCORE_MODULE`), HF-based `prepare.py`, `iheval_score.py` group composite, and 9 sub-benchmark configs.
- `prepare.py` downloads per-variant `input_data.json` from the `zhihz0535/IHEval` HF mirror via `hf_hub_download`; the generated `test.jsonl` is gitignored (not committed). Rows carry a pre-built OpenAI-style `messages` list (multi-turn + tool_calls / tool results), so generation uses `++prompt_format=openai`.
- Scoring lives in the standalone pip package `bzantium/iheval`; `evaluation/evaluator/iheval.py` is a thin lazy-import wrapper (same pattern as `bfcl_eval`), pinned in the Dockerfile. Rationale: upstream IHEval is CC BY-NC-ND with no installable package; the rule-following checkers derive from Google IFEval (Apache-2.0), and the rest are independent re-implementations of the documented rule-based metrics.
- `IHEvalMetrics` builds on the base pass@k machinery, so `iheval:N` yields `pass@1[avg-of-N]` (per-mode composites at the group level too), with per-`setting` (aligned/conflict/reference) and per-`variant` breakdowns.
- Registered `iheval_*` eval types + the `iheval` metrics key; documented under instruction-following.
Usage