refactor(evaluation): consolidate judge workflows behind a base class by memadi-nv · Pull Request #173 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-05-29T04:42:23Z

Summary

Extracts a shared _BaseJudgeWorkflow ABC to eliminate the duplicated scaffolding across the four LLM-as-judge workflows. Pure refactor of the judges added in #158 — no behavior change, no public-API change.

What changed

New engine/evaluation/judge_base.py (+196) — _BaseJudgeWorkflow ABC + shared JudgeResult. Holds the logic deleted from all 4 subclasses: column_config(), _flatten_judgment(), postprocess(), standalone evaluate().
4 judges → thin subclasses — detection, type_fidelity, attribute_fidelity, relational_consistency. Each keeps only what's unique (schema, prompt, helpers, class attrs + 4 abstract hooks). ~120–150 lines of scaffolding removed per judge.
4 test files — one-line rebind each (_flatten_judgment = XxxJudgeWorkflow._flatten_judgment) since the helper now lives on the base.

Relationship to #158

#158 ("Add LLM-as-a-judge for replace evaluation") landed on main while this branch was open; both added the four judge files independently. main has been merged in and the add/add conflicts resolved in favor of the base-class versions. The diff reads as the refactor delta relative to #158's judges.

Also includes

skills/anonymizer/SKILL.md — documents the LLM-as-judge evaluate step (rule, usage tips, opt-in --evaluate flag in the output template).
.github/dependabot.yml — fixes an invalid open-pull-requests-limit on grouped ecosystems (carried in from chore: Use uv and pip ecosystem for dependabot #170; unrelated drive-by fix).
Greptile feedback — removed a redundant .copy() in judge_base.evaluate() (prepare() already returns a fresh frame) and reordered a stranded import in test_detection_judge.py.

Non-changes (verified)

Orchestrator ReplacementWorkflow._run_merged_judges untouched; still drives the four judges via prepare() / column_config() / postprocess().
Public API (Anonymizer.evaluate(...) + the four judge classes) unchanged.
Per-judge prompts, schemas, and intermediate column names byte-identical.

Type of Change

Testing

make test passes locally (822 tests)
make check passes locally (format + lint clean)
Added/updated tests for changes

Related Issues

Closes #98

Signed-off-by: memadi <memadi@nvidia.com>

* evaluate independent of the mode-EvaluateConfig Signed-off-by: memadi <memadi@nvidia.com> * address feedback Signed-off-by: memadi <memadi@nvidia.com> --------- Signed-off-by: memadi <memadi@nvidia.com>

Signed-off-by: memadi <memadi@nvidia.com>

…ed evaluate model-selection Signed-off-by: memadi <memadi@nvidia.com>

Signed-off-by: memadi <memadi@nvidia.com>

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps · 2026-05-29T04:51:44Z

Greptile Summary

This PR extracts a _BaseJudgeWorkflow ABC into engine/evaluation/judge_base.py, consolidating ~80% scaffolding that was copy-pasted across the four LLM-as-judge workflows. The shared _flatten_judgment classmethod is correctly parameterized via cls.SCHEMA and cls.VERDICT_FIELD, the evaluate() path faithfully reproduces the old split_rows logic by manually stamping ROW_ORDER_COL, and the orchestrator's prepare / column_config / postprocess contract is unchanged.

judge_base.py introduces _BaseJudgeWorkflow (ABC) and a unified JudgeResult dataclass; four class-level ClassVar slots plus four abstract hooks replace per-judge duplication.
All four judge subclasses are thinned to schema/prompt/column wiring only; judge-specific Result dataclasses, _flatten_judgment functions, column_config, postprocess, and evaluate bodies are removed.
Four test files rebind _flatten_judgment to the classmethod on each concrete class — functionally correct since the accessor returns a bound method with cls already fixed.

Confidence Score: 4/5

Safe to merge with awareness of two API surface changes already flagged in prior review rounds: the four judge-specific Result dataclasses are gone and the entities_column keyword on DetectionJudgeWorkflow.evaluate() is removed; no callers within this repo reference either.

The refactoring is behaviorally equivalent for both the orchestrator and standalone evaluate() paths. Within-repo grep confirms no code imports the removed Result dataclasses or passes entities_column to evaluate(). The remaining open question is whether any out-of-repo consumers depend on those symbols.

judge_base.py — the _extract_invalid abstract method types parsed as BaseModel but implementations access schema-specific attributes; harmless at runtime but would fail a strict type-check pass.

Important Files Changed

Filename	Overview
src/anonymizer/engine/evaluation/judge_base.py	New base class: _BaseJudgeWorkflow ABC with shared _flatten_judgment (parameterized via cls.SCHEMA/cls.VERDICT_FIELD), postprocess, and evaluate. Logic correct; _extract_invalid types parsed as BaseModel but implementations access schema-specific attributes.
src/anonymizer/engine/evaluation/detection_judge.py	Converted to thin subclass; all four abstract methods correctly wired. DetectionJudgeResult removed — no external reference found in the codebase.
src/anonymizer/engine/evaluation/replace/relational_consistency_judge.py	Passthrough threshold (len < 2) preserved correctly; VERDICT_FIELD=all_consistent correctly distinguishes this judge from the three all_valid judges.
src/anonymizer/engine/evaluation/replace/attribute_fidelity_judge.py	_extract_invalid filters parsed.entities by not e.passes, matching the old standalone _flatten_judgment exactly.
src/anonymizer/engine/evaluation/replace/type_fidelity_judge.py	Straightforward conversion; _extract_invalid correctly delegates to parsed.invalid_replacements.
tests/engine/evaluation/test_detection_judge.py	One-line rebind correctly points tests at the base-class classmethod bound to the right schema.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant SubclassJudge as XxxJudgeWorkflow (subclass)
    participant Base as _BaseJudgeWorkflow (base)
    participant Adapter as NddAdapter

    Note over SubclassJudge,Base: Orchestrator path (merged judges)
    Caller->>SubclassJudge: prepare(dataframe)
    SubclassJudge-->>Caller: df with intermediate cols
    Caller->>Base: column_config(selected_models)
    Base-->>Caller: LLMStructuredColumnConfig
    Caller->>Adapter: run_workflow(prepared, columns)
    Adapter-->>Caller: run_result
    Caller->>Base: postprocess(dataframe)
    Base->>SubclassJudge: _passthrough_mask(out)
    SubclassJudge-->>Base: boolean Series
    Base->>SubclassJudge: _flatten_judgment via cls.SCHEMA
    Base->>SubclassJudge: _extract_invalid(parsed)
    Base-->>Caller: df with VALID_COL + INVALID_COL

    Note over SubclassJudge,Base: Standalone evaluate() path
    Caller->>Base: evaluate(dataframe, model_configs, selected_models)
    Base->>SubclassJudge: prepare(dataframe)
    Base->>SubclassJudge: _passthrough_mask(working_df)
    Base->>Adapter: run_workflow(with_content)
    Adapter-->>Base: run_result
    Base-->>Caller: JudgeResult(dataframe, failed_records)

_{Reviews (5): Last reviewed commit: "address greptile feedback" | Re-trigger Greptile}

Signed-off-by: memadi <memadi@nvidia.com>

open-pull-requests-limit cannot be set on updates entries that belong to a multi-ecosystem group; Dependabot requires it on the group itself. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv and others added 29 commits May 11, 2026 17:27

add entity detection validation

856d9b7

Signed-off-by: memadi <memadi@nvidia.com>

add type fidelity metric

8a8636e

Signed-off-by: memadi <memadi@nvidia.com>

add relational consistency metric

6c3ae51

Signed-off-by: memadi <memadi@nvidia.com>

add attribute fidelity metric

7946920

Signed-off-by: memadi <memadi@nvidia.com>

update prompts

8776b53

Signed-off-by: memadi <memadi@nvidia.com>

disp;ay replacement map

dfad43b

Signed-off-by: memadi <memadi@nvidia.com>

update metric display

96c297a

Signed-off-by: memadi <memadi@nvidia.com>

more specific prompt

186d311

Signed-off-by: memadi <memadi@nvidia.com>

change judge models for sparce error

63592ad

Signed-off-by: memadi <memadi@nvidia.com>

nit-update namings in metric

8997e2c

Signed-off-by: memadi <memadi@nvidia.com>

add a toggle for replace evaluation

f203d96

Signed-off-by: memadi <memadi@nvidia.com>

format-nit

514b1e5

Signed-off-by: memadi <memadi@nvidia.com>

run evaluate judges in parallel

3e38013

Signed-off-by: memadi <memadi@nvidia.com>

seperate evaluate_replace from preview

f6c7d22

Signed-off-by: memadi <memadi@nvidia.com>

merge conflicts

52c2f3a

Signed-off-by: memadi <memadi@nvidia.com>

nit-format

133f069

Signed-off-by: memadi <memadi@nvidia.com>

nit

c04aa0d

Signed-off-by: memadi <memadi@nvidia.com>

address greptile feedback

74735b2

Signed-off-by: memadi <memadi@nvidia.com>

nit

3ad5929

Signed-off-by: memadi <memadi@nvidia.com>

feat: Make anonymizer evaluation mode-independent (#168)

b17b95d

* evaluate independent of the mode-EvaluateConfig Signed-off-by: memadi <memadi@nvidia.com> * address feedback Signed-off-by: memadi <memadi@nvidia.com> --------- Signed-off-by: memadi <memadi@nvidia.com>

address feedback-fixed empty-section render in detection judge.

e6859a2

Signed-off-by: memadi <memadi@nvidia.com>

address feedback-preserve rows when LLM drops some during judge workflow

e28ce8d

Signed-off-by: memadi <memadi@nvidia.com>

address feedback-refactor: move judge modules under engine/evaluation/

ab54a95

Signed-off-by: memadi <memadi@nvidia.com>

address feedback-refactor(config): split judge aliases into a dedicat…

5c63ee2

…ed evaluate model-selection Signed-off-by: memadi <memadi@nvidia.com>

add evaluation as an optional step to replace notebooks

187c3a5

Signed-off-by: memadi <memadi@nvidia.com>

Update src/anonymizer/interface/display.py

da6c682

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

address Lipika's feedback

6a759f6

Signed-off-by: memadi <memadi@nvidia.com>

nit-format

9c1379c

Signed-off-by: memadi <memadi@nvidia.com>

refactor(evaluation): consolidate judge workflows behind a base class

495d8ab

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from lipikaramaswamy May 29, 2026 04:42

memadi-nv requested a review from a team as a code owner May 29, 2026 04:42

add judge base

4119738

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/anonymizer/engine/evaluation/judge_base.py

Comment thread tests/engine/evaluation/test_detection_judge.py Outdated

memadi-nv changed the base branch from memadi/feature/evaluate-replace to main May 29, 2026 04:55

memadi-nv and others added 2 commits June 1, 2026 10:11

fix merge conflicts

532c5c5

Signed-off-by: memadi <memadi@nvidia.com>

fix(dependabot): move open-pull-requests-limit to the python group

f11b19b

open-pull-requests-limit cannot be set on updates entries that belong to a multi-ecosystem group; Dependabot requires it on the group itself. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

memadi-nv requested a review from a team as a code owner June 1, 2026 18:26

memadi-nv added 2 commits June 1, 2026 12:11

Update SKILL according to evaluate; replace addition

e32aff4

Signed-off-by: memadi <memadi@nvidia.com>

address greptile feedback

6d58161

Signed-off-by: memadi <memadi@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(evaluation): consolidate judge workflows behind a base class#173

refactor(evaluation): consolidate judge workflows behind a base class#173
memadi-nv wants to merge 34 commits into
mainfrom
memadi/feature/judge-base-class

memadi-nv commented May 29, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

memadi-nv commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Relationship to #158

Also includes

Non-changes (verified)

Type of Change

Testing

Related Issues

Uh oh!

greptile-apps Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

memadi-nv commented May 29, 2026 •

edited

Loading

greptile-apps Bot commented May 29, 2026 •

edited

Loading