Skip to content

refactor(evaluation): consolidate judge workflows behind a base class#173

Open
memadi-nv wants to merge 34 commits into
mainfrom
memadi/feature/judge-base-class
Open

refactor(evaluation): consolidate judge workflows behind a base class#173
memadi-nv wants to merge 34 commits into
mainfrom
memadi/feature/judge-base-class

Conversation

@memadi-nv

@memadi-nv memadi-nv commented May 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Extracts a shared _BaseJudgeWorkflow ABC to eliminate the duplicated scaffolding across the four LLM-as-judge workflows. Pure refactor of the judges added in #158 — no behavior change, no public-API change.

What changed

  • New engine/evaluation/judge_base.py (+196)_BaseJudgeWorkflow ABC + shared JudgeResult. Holds the logic deleted from all 4 subclasses: column_config(), _flatten_judgment(), postprocess(), standalone evaluate().
  • 4 judges → thin subclassesdetection, type_fidelity, attribute_fidelity, relational_consistency. Each keeps only what's unique (schema, prompt, helpers, class attrs + 4 abstract hooks). ~120–150 lines of scaffolding removed per judge.
  • 4 test files — one-line rebind each (_flatten_judgment = XxxJudgeWorkflow._flatten_judgment) since the helper now lives on the base.

Relationship to #158

#158 ("Add LLM-as-a-judge for replace evaluation") landed on main while this branch was open; both added the four judge files independently. main has been merged in and the add/add conflicts resolved in favor of the base-class versions. The diff reads as the refactor delta relative to #158's judges.

Also includes

  • skills/anonymizer/SKILL.md — documents the LLM-as-judge evaluate step (rule, usage tips, opt-in --evaluate flag in the output template).
  • .github/dependabot.yml — fixes an invalid open-pull-requests-limit on grouped ecosystems (carried in from chore: Use uv and pip ecosystem for dependabot #170; unrelated drive-by fix).
  • Greptile feedback — removed a redundant .copy() in judge_base.evaluate() (prepare() already returns a fresh frame) and reordered a stranded import in test_detection_judge.py.

Non-changes (verified)

  • Orchestrator ReplacementWorkflow._run_merged_judges untouched; still drives the four judges via prepare() / column_config() / postprocess().
  • Public API (Anonymizer.evaluate(...) + the four judge classes) unchanged.
  • Per-judge prompts, schemas, and intermediate column names byte-identical.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally (822 tests)
  • make check passes locally (format + lint clean)
  • Added/updated tests for changes

Related Issues

Closes #98

memadi-nv and others added 29 commits May 11, 2026 17:27
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
* evaluate independent of the mode-EvaluateConfig

Signed-off-by: memadi <memadi@nvidia.com>

* address feedback

Signed-off-by: memadi <memadi@nvidia.com>

---------

Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
…ed evaluate model-selection

Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from lipikaramaswamy May 29, 2026 04:42
@memadi-nv memadi-nv requested a review from a team as a code owner May 29, 2026 04:42
Signed-off-by: memadi <memadi@nvidia.com>
@greptile-apps

greptile-apps Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR extracts a _BaseJudgeWorkflow ABC into engine/evaluation/judge_base.py, consolidating ~80% scaffolding that was copy-pasted across the four LLM-as-judge workflows. The shared _flatten_judgment classmethod is correctly parameterized via cls.SCHEMA and cls.VERDICT_FIELD, the evaluate() path faithfully reproduces the old split_rows logic by manually stamping ROW_ORDER_COL, and the orchestrator's prepare / column_config / postprocess contract is unchanged.

  • judge_base.py introduces _BaseJudgeWorkflow (ABC) and a unified JudgeResult dataclass; four class-level ClassVar slots plus four abstract hooks replace per-judge duplication.
  • All four judge subclasses are thinned to schema/prompt/column wiring only; judge-specific Result dataclasses, _flatten_judgment functions, column_config, postprocess, and evaluate bodies are removed.
  • Four test files rebind _flatten_judgment to the classmethod on each concrete class — functionally correct since the accessor returns a bound method with cls already fixed.

Confidence Score: 4/5

Safe to merge with awareness of two API surface changes already flagged in prior review rounds: the four judge-specific Result dataclasses are gone and the entities_column keyword on DetectionJudgeWorkflow.evaluate() is removed; no callers within this repo reference either.

The refactoring is behaviorally equivalent for both the orchestrator and standalone evaluate() paths. Within-repo grep confirms no code imports the removed Result dataclasses or passes entities_column to evaluate(). The remaining open question is whether any out-of-repo consumers depend on those symbols.

judge_base.py — the _extract_invalid abstract method types parsed as BaseModel but implementations access schema-specific attributes; harmless at runtime but would fail a strict type-check pass.

Important Files Changed

Filename Overview
src/anonymizer/engine/evaluation/judge_base.py New base class: _BaseJudgeWorkflow ABC with shared _flatten_judgment (parameterized via cls.SCHEMA/cls.VERDICT_FIELD), postprocess, and evaluate. Logic correct; _extract_invalid types parsed as BaseModel but implementations access schema-specific attributes.
src/anonymizer/engine/evaluation/detection_judge.py Converted to thin subclass; all four abstract methods correctly wired. DetectionJudgeResult removed — no external reference found in the codebase.
src/anonymizer/engine/evaluation/replace/relational_consistency_judge.py Passthrough threshold (len < 2) preserved correctly; VERDICT_FIELD=all_consistent correctly distinguishes this judge from the three all_valid judges.
src/anonymizer/engine/evaluation/replace/attribute_fidelity_judge.py _extract_invalid filters parsed.entities by not e.passes, matching the old standalone _flatten_judgment exactly.
src/anonymizer/engine/evaluation/replace/type_fidelity_judge.py Straightforward conversion; _extract_invalid correctly delegates to parsed.invalid_replacements.
tests/engine/evaluation/test_detection_judge.py One-line rebind correctly points tests at the base-class classmethod bound to the right schema.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant SubclassJudge as XxxJudgeWorkflow (subclass)
    participant Base as _BaseJudgeWorkflow (base)
    participant Adapter as NddAdapter

    Note over SubclassJudge,Base: Orchestrator path (merged judges)
    Caller->>SubclassJudge: prepare(dataframe)
    SubclassJudge-->>Caller: df with intermediate cols
    Caller->>Base: column_config(selected_models)
    Base-->>Caller: LLMStructuredColumnConfig
    Caller->>Adapter: run_workflow(prepared, columns)
    Adapter-->>Caller: run_result
    Caller->>Base: postprocess(dataframe)
    Base->>SubclassJudge: _passthrough_mask(out)
    SubclassJudge-->>Base: boolean Series
    Base->>SubclassJudge: _flatten_judgment via cls.SCHEMA
    Base->>SubclassJudge: _extract_invalid(parsed)
    Base-->>Caller: df with VALID_COL + INVALID_COL

    Note over SubclassJudge,Base: Standalone evaluate() path
    Caller->>Base: evaluate(dataframe, model_configs, selected_models)
    Base->>SubclassJudge: prepare(dataframe)
    Base->>SubclassJudge: _passthrough_mask(working_df)
    Base->>Adapter: run_workflow(with_content)
    Adapter-->>Base: run_result
    Base-->>Caller: JudgeResult(dataframe, failed_records)
Loading

Reviews (5): Last reviewed commit: "address greptile feedback" | Re-trigger Greptile

Comment thread src/anonymizer/engine/evaluation/judge_base.py
Comment thread tests/engine/evaluation/test_detection_judge.py Outdated
@memadi-nv memadi-nv changed the base branch from memadi/feature/evaluate-replace to main May 29, 2026 04:55
memadi-nv and others added 2 commits June 1, 2026 10:11
Signed-off-by: memadi <memadi@nvidia.com>
open-pull-requests-limit cannot be set on updates entries that belong to
a multi-ecosystem group; Dependabot requires it on the group itself.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@memadi-nv memadi-nv requested a review from a team as a code owner June 1, 2026 18:26
memadi-nv added 2 commits June 1, 2026 12:11
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: LLM-as-a-judge for REPLACE evaluation

1 participant