refactor(evaluation): consolidate judge workflows behind a base class#173
memadi-nv wants to merge 34 commits into
Signed-off-by: memadi <memadi@nvidia.com>
* evaluate independent of the mode-EvaluateConfig
* address feedback
Signed-off-by: memadi <memadi@nvidia.com>
…ed evaluate model-selection Signed-off-by: memadi <memadi@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Summary
This PR extracts a

Confidence Score: 4/5
Safe to merge with awareness of two API surface changes already flagged in prior review rounds: the four judge-specific Result dataclasses are gone and the entities_column keyword on DetectionJudgeWorkflow.evaluate() is removed; no callers within this repo reference either.

The refactoring is behaviorally equivalent for both the orchestrator and standalone evaluate() paths. Within-repo grep confirms no code imports the removed Result dataclasses or passes entities_column to evaluate(). The remaining open question is whether any out-of-repo consumers depend on those symbols.

judge_base.py: the _extract_invalid abstract method types parsed as BaseModel but implementations access schema-specific attributes; harmless at runtime but would fail a strict type-check pass.
Sequence Diagram
sequenceDiagram
participant Caller
participant SubclassJudge as XxxJudgeWorkflow (subclass)
participant Base as _BaseJudgeWorkflow (base)
participant Adapter as NddAdapter
Note over SubclassJudge,Base: Orchestrator path (merged judges)
Caller->>SubclassJudge: prepare(dataframe)
SubclassJudge-->>Caller: df with intermediate cols
Caller->>Base: column_config(selected_models)
Base-->>Caller: LLMStructuredColumnConfig
Caller->>Adapter: run_workflow(prepared, columns)
Adapter-->>Caller: run_result
Caller->>Base: postprocess(dataframe)
Base->>SubclassJudge: _passthrough_mask(out)
SubclassJudge-->>Base: boolean Series
Base->>SubclassJudge: _flatten_judgment via cls.SCHEMA
Base->>SubclassJudge: _extract_invalid(parsed)
Base-->>Caller: df with VALID_COL + INVALID_COL
Note over SubclassJudge,Base: Standalone evaluate() path
Caller->>Base: evaluate(dataframe, model_configs, selected_models)
Base->>SubclassJudge: prepare(dataframe)
Base->>SubclassJudge: _passthrough_mask(working_df)
Base->>Adapter: run_workflow(with_content)
Adapter-->>Base: run_result
Base-->>Caller: JudgeResult(dataframe, failed_records)
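The orchestrator path in the diagram above follows a template-method shape: the base class owns postprocess() and calls back into subclass hooks. The sketch below is a minimal, dependency-free illustration of that shape, not the code in judge_base.py. BaseJudge, DetectionJudge, the column names, and the rule that passthrough rows count as valid are all assumptions made for the example, and plain lists of dicts stand in for DataFrames.

```python
from abc import ABC, abstractmethod
from typing import Any

Row = dict[str, Any]  # plain dicts stand in for DataFrame rows


class BaseJudge(ABC):
    """Hypothetical stand-in for _BaseJudgeWorkflow; names are illustrative."""

    RAW_COL = "raw_judgment"       # assumed name of the adapter's output column
    VALID_COL = "valid"            # assumed output column names
    INVALID_COL = "invalid_items"

    @abstractmethod
    def prepare(self, rows: list[Row]) -> list[Row]:
        """Add the intermediate columns the judge needs."""

    @abstractmethod
    def _passthrough_mask(self, rows: list[Row]) -> list[bool]:
        """Return True for rows that bypass the judge."""

    @abstractmethod
    def _extract_invalid(self, parsed: dict[str, Any]) -> list[str]:
        """Pull the failing items out of one parsed judgment."""

    def postprocess(self, rows: list[Row]) -> list[Row]:
        # Shared logic: subclasses only supply the two hooks used here.
        mask = self._passthrough_mask(rows)
        out = []
        for row, skip in zip(rows, mask):
            # Assumption: passthrough rows are treated as valid with no invalid items.
            invalid = [] if skip else self._extract_invalid(row[self.RAW_COL])
            out.append({**row, self.VALID_COL: not invalid, self.INVALID_COL: invalid})
        return out


class DetectionJudge(BaseJudge):
    """Hypothetical subclass: keeps only what is unique to this judge."""

    def prepare(self, rows: list[Row]) -> list[Row]:
        return [{**r, "entity_count": len(r["entities"])} for r in rows]

    def _passthrough_mask(self, rows: list[Row]) -> list[bool]:
        # Rows with nothing to judge skip the LLM call.
        return [r["entity_count"] == 0 for r in rows]

    def _extract_invalid(self, parsed: dict[str, Any]) -> list[str]:
        return [e["text"] for e in parsed["entities"] if not e["ok"]]


judge = DetectionJudge()
prepared = judge.prepare([{"entities": ["Alice", "Paris"]}, {"entities": []}])

# Stand-in for the structured LLM output the adapter would attach.
prepared[0]["raw_judgment"] = {
    "entities": [{"text": "Alice", "ok": True}, {"text": "Paris", "ok": False}]
}

result = judge.postprocess(prepared)
print([(r["valid"], r["invalid_items"]) for r in result])
```

Keeping postprocess() on the base means a new judge only has to implement the hooks, which is where the per-judge line savings described in this PR come from.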
Reviews (5): Last reviewed commit: "address greptile feedback"
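On the type-check note in the review above: one possible way to let a strict checker see schema-specific attributes is to make the base class generic over its schema type. This is a sketch of that option only, not a change made in this PR. Schema is a dataclass stand-in for pydantic's BaseModel to keep the example dependency-free, and DetectionJudgment, missed_entities, BaseJudge, and DetectionJudge are hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar


@dataclass
class Schema:
    """Stand-in for pydantic.BaseModel (assumption, to avoid the dependency)."""


@dataclass
class DetectionJudgment(Schema):
    """Hypothetical judge schema with one schema-specific field."""

    missed_entities: list[str]


S = TypeVar("S", bound=Schema)


class BaseJudge(ABC, Generic[S]):
    """The hook is typed against the subclass's schema, not the bare base type."""

    @abstractmethod
    def _extract_invalid(self, parsed: S) -> list[str]:
        """Return the failing items from one parsed judgment."""


class DetectionJudge(BaseJudge[DetectionJudgment]):
    def _extract_invalid(self, parsed: DetectionJudgment) -> list[str]:
        # The checker now knows `parsed` has `missed_entities`.
        return list(parsed.missed_entities)


invalid = DetectionJudge()._extract_invalid(
    DetectionJudgment(missed_entities=["Paris"])
)
print(invalid)
```

Runtime behavior is the same as with a BaseModel-typed parameter; the only difference is what a static checker can verify.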
open-pull-requests-limit cannot be set on updates entries that belong to a multi-ecosystem group; Dependabot requires it on the group itself.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Extracts a shared _BaseJudgeWorkflow ABC to eliminate the duplicated scaffolding across the four LLM-as-judge workflows. Pure refactor of the judges added in #158: no behavior change, no public-API change.

What changed
* engine/evaluation/judge_base.py (+196): _BaseJudgeWorkflow ABC + shared JudgeResult. Holds the logic deleted from all 4 subclasses: column_config(), _flatten_judgment(), postprocess(), standalone evaluate().
* detection, type_fidelity, attribute_fidelity, relational_consistency. Each keeps only what's unique (schema, prompt, helpers, class attrs + 4 abstract hooks). ~120–150 lines of scaffolding removed per judge.
* … _flatten_judgment = XxxJudgeWorkflow._flatten_judgment) since the helper now lives on the base.

Relationship to #158
#158 ("Add LLM-as-a-judge for replace evaluation") landed on main while this branch was open; both added the four judge files independently. main has been merged in and the add/add conflicts resolved in favor of the base-class versions. The diff reads as the refactor delta relative to #158's judges.

Also includes
* skills/anonymizer/SKILL.md: documents the LLM-as-judge evaluate step (rule, usage tips, opt-in --evaluate flag in the output template).
* .github/dependabot.yml: fixes an invalid open-pull-requests-limit on grouped ecosystems (carried in from "chore: Use uv and pip ecosystem for dependabot" #170; unrelated drive-by fix).
* … .copy() in judge_base.evaluate() (prepare() already returns a fresh frame) and reordered a stranded import in test_detection_judge.py.

Non-changes (verified)
* ReplacementWorkflow._run_merged_judges untouched; still drives the four judges via prepare() / column_config() / postprocess().
* … Anonymizer.evaluate(...) + the four judge classes) unchanged.

Type of Change
Testing
* make test passes locally (822 tests)
* make check passes locally (format + lint clean)

Related Issues
Closes #98