feat: rewrite evaluation improvements #186
Signed-off-by: memadi <memadi@nvidia.com>
Greptile Summary
This PR separates rewrite evaluation from
Confidence Score: 5/5
Safe to merge: the core logic changes are well-contained, and the concerns from the previous round of review have all been addressed in this implementation. All issues identified in the previous review thread (silent score discard in
src/anonymizer/interface/display.py: minor label grammar ("Rewrite Need Review") and inner-loop variable shadowing of
Important Files Changed
| `detection_invalid_entities` | After `evaluate()` | Flagged detections with value, label, and one-sentence reasoning. |
| `judge_evaluation` | After `evaluate()` | Dict with `privacy`, `quality`, and `style` rubric scores and reasoning. |

Use `preview.trace_dataframe` for the full pipeline trace (domain, disposition, QA pairs, repair iterations, judge evaluation).
Since `judge_evaluation` is explicitly rendered in `display_record()`, I think it belongs in the same public-facing tier as `utility_score`, `leakage_mass`, and `detection_valid`, none of which carry the underscore prefix. Hence I removed the internal-use marker, i.e. the leading `_`.
replace_attribute_fidelity_judge: gpt-oss-120b

# --- Rewrite evaluation ---
rewrite_judge: nemotron-30b-thinking
The rewrite judge is one holistic call because the dimensions are entangled. Unlike the dimensions scored by the replace judges, privacy, quality, and style in a rewrite are hard to evaluate in isolation.
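To make the "one holistic call" idea concrete, here is a minimal sketch of validating a single judge response that covers all three dimensions at once. The JSON shape (a `score` and `reasoning` field per dimension) and the helper name `parse_judge_evaluation` are assumptions for illustration; the PR only states that `judge_evaluation` holds `privacy`, `quality`, and `style` rubric scores and reasoning with `low`/`medium`/`high` values.

```python
import json
from typing import Literal, TypedDict, get_args

# Categorical levels named in the PR; higher is better for every dimension.
Level = Literal["low", "medium", "high"]
DIMENSIONS = ("privacy", "quality", "style")


class DimensionVerdict(TypedDict):
    # Field names are assumed for this sketch, not taken from the PR.
    score: str
    reasoning: str


def parse_judge_evaluation(raw: str) -> dict[str, DimensionVerdict]:
    """Validate one holistic judge response covering all three dimensions."""
    payload = json.loads(raw)
    allowed = set(get_args(Level))
    parsed: dict[str, DimensionVerdict] = {}
    for dim in DIMENSIONS:
        entry = payload[dim]  # raises KeyError if the judge skipped a dimension
        score = str(entry["score"]).strip().lower()
        if score not in allowed:
            raise ValueError(f"{dim}: unexpected score {entry['score']!r}")
        parsed[dim] = {"score": score, "reasoning": str(entry.get("reasoning", ""))}
    return parsed
```

Rejecting anything outside the three levels keeps a stray numeric score from slipping through into the stored result.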
PR Summary — Rewrite Evaluation
Overview
Separates rewrite evaluation from `run()`/`preview()` and exposes it as an explicit `Anonymizer.evaluate()` call, consistent with replace evaluation. Adds detection validity scoring to rewrite evaluation, replaces numeric judge scores with categorical rubrics, and renames the fluency dimension.

Core Changes
- `Anonymizer.evaluate()` now supports rewrite results: `AnonymizerResult` and `PreviewResult` carry `rewrite_config` so `evaluate()` can dispatch without the user restating the mode. `evaluate()` on a rewrite result validates only the evaluate model aliases it actually uses (`check_rewrite=False`), not the full rewrite pipeline roles.
- Final judge moved out of `run()`/`preview()`: the repair loop (leakage, utility, `needs_human_review`) still runs as part of every `run()` call. The holistic quality judge runs only when `evaluate()` is called explicitly.
- Detection validity added to rewrite evaluation: `detection_valid` is a `float | None` in rewrite mode, representing the fraction of detected entities that passed the judge. Passthrough rows (no detected entities) receive `1.0` (trivially valid) with an `INFO` log. If the score cannot be computed, `detection_valid` is `None`, logged as a `WARNING`, and displayed as Unavailable.
- `rewrite_judge` model role moved to evaluate config: no longer part of rewrite generation config; now under `selected_models.evaluate.rewrite_judge`, consistent with the other judge roles.
- Categorical judge scores replace numeric `1–10` rubrics: the final rewrite judge now returns `low`/`medium`/`high` per dimension, stored as structured JSON under `judge_evaluation`. Higher is better for all three dimensions, rendered as colour-coded badges in `display_record()`.
- `naturalness` renamed to `style`: more precisely reflects what the rubric measures (fluency, grammar, readability).
- `judge_evaluation` column made public: renamed from `_judge_evaluation`; appears in `display_record()` output and user-facing docs.
- `_render_scores_section` receives `is_rewrite` as an explicit parameter: replaces the previous column-name scan that could misfire on user-defined column names.

Display
- `display_record()` for rewrite results renders `detection_valid` alongside utility and leakage scores.
- `detection_valid = None`/`NaN` renders as Unavailable in grey rather than a numeric score.
- Judge scores render as `high`/`medium`/`low` badges.

The following shows what rewrite evaluation results look like. Note that when the `Detection Validity` score is below 1.0, the dropdown displays the flagged invalid entities:

Docs and Notebooks
- `evaluation.md`: new Rewrite Evaluation section; detection validity special-value tables for both modes, including no-entities and unavailable scenarios with their log messages.
- `rewrite.md`: output columns table distinguishes `run()` vs `evaluate()` columns; new "Evaluating rewrite output" section.
- `anonymizer.evaluate(result)` and `evaluated.display_record(0)` as separate cells; Python sources synced.

Tests
- `_standard_side_effect` trimmed to 2 results (pipeline + evaluate); `stub_judge_df` scoped to judge-produced columns only.
- `_detection_valid_fraction` tests covering all branches: `valid=True`, fraction computation, parse failure, `total=0`, `valid=None`.
- `detection_valid=None`, `NaN`, and absent column.
- `evaluate()` on a rewrite result calls `validate_model_alias_references` with `check_rewrite=False`.

Type of Change

Testing
- `make test` passes locally
- `make check` passes locally (format + lint + typecheck + lock-check)

Documentation
- `make docs-build` passes locally

Related Issues
Closes #106
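
For readers skimming the description, the detection validity semantics listed under Core Changes and Display can be sketched as below. This is an illustrative stand-in, not the PR's `_detection_valid_fraction`: the function names, the list-of-booleans input, the log wording, and the two-decimal formatting are all assumptions. Only the special values (`1.0` for rows with no detected entities, `None` when the score cannot be computed, and Unavailable for `None`/`NaN`) come from the description.

```python
import logging
import math
from collections.abc import Sequence
from typing import Optional

logger = logging.getLogger("rewrite_eval_sketch")  # hypothetical logger name


def detection_valid_sketch(verdicts: Optional[Sequence[bool]]) -> Optional[float]:
    """Fraction of detected entities that passed the judge.

    `verdicts` holds one boolean per detected entity; None stands in for
    "the judge output could not be parsed" (an assumption of this sketch).
    """
    if verdicts is None:
        logger.warning("detection validity unavailable: judge output not usable")
        return None
    if len(verdicts) == 0:
        # Passthrough row: nothing was detected, so nothing can be invalid.
        logger.info("no detected entities; treating row as trivially valid")
        return 1.0
    return sum(verdicts) / len(verdicts)


def render_detection_valid(score: Optional[float]) -> str:
    """Text shown for the score; None and NaN both read as Unavailable."""
    if score is None or math.isnan(score):
        return "Unavailable"
    return f"{score:.2f}"  # numeric formatting is an assumption
```

With four detections of which one is flagged, `detection_valid_sketch([True, True, False, True])` yields `0.75`, which is the case where the display would list the flagged entity.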