Skip to content

feat: rewrite evaluation improvements #186

Open
memadi-nv wants to merge 17 commits into
mainfrom
memadi/feature/update-rewrite-evaluation
Open

feat: rewrite evaluation improvements #186
memadi-nv wants to merge 17 commits into
mainfrom
memadi/feature/update-rewrite-evaluation

Conversation

@memadi-nv

@memadi-nv memadi-nv commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR Summary — Rewrite Evaluation

Overview

Separates rewrite evaluation from run() / preview() and exposes it as an explicit Anonymizer.evaluate() call, consistent with replace evaluation. Adds detection validity scoring to rewrite evaluation, replaces numeric judge scores with categorical rubrics, and renames the fluency dimension.


Core Changes

  • Anonymizer.evaluate() now supports rewrite resultsAnonymizerResult and PreviewResult carry rewrite_config so evaluate() can dispatch without the user restating the mode. evaluate() on a rewrite result validates only the evaluate model aliases it actually uses (check_rewrite=False), not the full rewrite pipeline roles.

  • Final judge moved out of run() / preview() — the repair loop (leakage, utility, needs_human_review) still runs as part of every run() call. The holistic quality judge runs only when evaluate() is called explicitly.

  • Detection validity added to rewrite evaluationdetection_valid is a float | None in rewrite mode, representing the fraction of detected entities that passed the judge. Passthrough rows (no detected entities) receive 1.0 (trivially valid) with an INFO log. If the score cannot be computed, detection_valid is None, logged as a WARNING , and displayed as Unavailable.

  • rewrite_judge model role moved to evaluate config — no longer part of rewrite generation config; now under selected_models.evaluate.rewrite_judge, consistent with the other judge roles.

  • Categorical judge scores replace numeric 1–10 rubrics — the final rewrite judge now returns low / medium / high per dimension, stored as structured JSON under judge_evaluation. Higher is better for all three dimensions, rendered as colour-coded badges in display_record().

  • naturalness renamed to style — more precisely reflects what the rubric measures (fluency, grammar, readability).

  • judge_evaluation column made public — renamed from _judge_evaluation; appears in display_record() output and user-facing docs.

  • _render_scores_section receives is_rewrite as an explicit parameter — replaces the previous column-name scan that could misfire on user-defined column names.


Display

  • display_record() for rewrite results renders detection_valid alongside utility and leakage scores.
  • detection_valid = None / NaN renders as Unavailable in grey rather than a numeric score.
  • Categorical judge rubrics (privacy, quality, style) render with high / medium / low badges.

The following shows what rewrite evaluation results look like. Note that when the Detection Validity score is below 1.0, the dropdown displays the flagged invalid entities:

Screenshot 2026-06-12 at 10 42 56

Docs and Notebooks

  • evaluation.md — new Rewrite Evaluation section; detection validity special-value tables for both modes including no-entities and unavailable scenarios with their log messages.
  • rewrite.md — output columns table distinguishes run() vs evaluate() columns; new "Evaluating rewrite output" section.
  • Notebooks 04 and 05 — added optional "🔬 Evaluate" section with anonymizer.evaluate(result) and evaluated.display_record(0) as separate cells; Python sources synced.

Tests

  • _standard_side_effect trimmed to 2 results (pipeline + evaluate); stub_judge_df scoped to judge-produced columns only.
  • New _detection_valid_fraction tests covering all branches: valid=True, fraction computation, parse failure, total=0, valid=None.
  • New display tests for detection_valid=None, NaN, and absent column.
  • New test asserting evaluate() on a rewrite result calls validate_model_alias_references with check_rewrite=False.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

Related Issues

Closes #106

Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner June 9, 2026 23:48
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR separates rewrite evaluation from run() / preview() by exposing it as an explicit Anonymizer.evaluate() call, consistent with replace mode. It also replaces numeric 1–10 judge rubrics with categorical low / medium / high scores, renames the naturalness dimension to style, moves rewrite_judge from RewriteModelSelection to EvaluateModelSelection, and adds a detection_valid fraction to rewrite evaluation output.

  • Anonymizer.evaluate() on rewrite resultsrewrite_config is stored on AnonymizerResult / PreviewResult so evaluate() dispatches without requiring the user to restate the mode; a new check_rewrite_evaluate flag in validate_model_alias_references validates only the evaluate-side aliases.
  • needs_human_review decoupled from the judge — the repair-loop flag is now computed inline from objective metrics after the loop exhausts, and COL_NEEDS_HUMAN_REVIEW is never touched by the holistic judge or the new evaluate() path.
  • detection_valid as a 0–1 fraction — rewrite evaluate surfaces the share of detected entities that passed the judge (vs. a boolean in replace mode); passthrough rows get 1.0; parse failures surface None / "Unavailable" in display_record().

Confidence Score: 5/5

Safe to merge — the core logic changes are well-contained and the previous round of review concerns have all been addressed in this implementation.

All issues identified in the previous review thread (silent score discard in _extract_judge_scores, telemetry AttributeError on removed judge field, COL_NEEDS_HUMAN_REVIEW overwrite in _join_judge_columns, missing rewrite_config on result types, FinalJudgeWorkflow signature mismatch, passthrough-row handling in evaluate(), and rewrite_judge alias validation) are each addressed with targeted fixes backed by new tests. The only remaining findings are a grammatical label typo and a variable-shadowing code-clarity concern in display.py, neither of which affects runtime behaviour.

src/anonymizer/interface/display.py — minor label grammar ("Rewrite Need Review") and inner-loop variable shadowing of label; no other files require special attention.

Important Files Changed

Filename Overview
src/anonymizer/interface/anonymizer.py evaluate() dispatch extended to handle rewrite results; judge_model removed from telemetry collection; rewrite_config forwarded through run()/preview() results.
src/anonymizer/engine/rewrite/rewrite_workflow.py evaluate() method added to RewriteWorkflow; passthrough rows correctly assigned detection_valid=1.0; needs_human_review computation moved inline after the repair loop; _detection_valid_fraction handles all edge cases including parse failure and total=0.
src/anonymizer/engine/rewrite/final_judge.py FinalJudgeWorkflow now takes EvaluateModelSelection; NATURALNESS_RUBRIC renamed STYLE_RUBRIC; numeric 1-10 scores replaced with categorical low/medium/high; HumanReviewParams and _determine_needs_human_review removed.
src/anonymizer/engine/ndd/model_loader.py check_rewrite_evaluate flag added; rewrite_judge validation now correctly gated inside check_evaluate block with (check_rewrite or check_rewrite_evaluate) guard.
src/anonymizer/interface/display.py _render_scores_section refactored with is_rewrite parameter; detection_valid row added; _extract_judge_scores fixed for categorical string scores; minor label variable shadowing and grammar issue ("Rewrite Need Review" should be "Rewrite Needs Review").
src/anonymizer/interface/results.py rewrite_config: PrivacyGoal
src/anonymizer/engine/evaluation/detection_judge.py _safe_parse helper added to DetectionJudgeWorkflow.prepare() to gracefully handle malformed entity rows; passthrough row logging added.
src/anonymizer/config/models.py judge removed from RewriteModelSelection; rewrite_judge added to EvaluateModelSelection.

Sequence Diagram

sequenceDiagram
    participant User
    participant Anonymizer
    participant RewriteWorkflow
    participant DetectionJudgeWorkflow
    participant FinalJudgeWorkflow

    User->>Anonymizer: run(config, data)
    Anonymizer->>RewriteWorkflow: run(df, selected_models.rewrite, ...)
    RewriteWorkflow-->>RewriteWorkflow: pipeline + eval-repair loop
    RewriteWorkflow-->>RewriteWorkflow: compute needs_human_review
    RewriteWorkflow-->>Anonymizer: RewriteResult (no judge columns)
    Anonymizer-->>User: AnonymizerResult (rewrite_config stored)

    User->>Anonymizer: evaluate(result)
    Anonymizer->>Anonymizer: "validate_model_alias_references(check_rewrite_evaluate=True)"
    Anonymizer->>RewriteWorkflow: evaluate(internal_df, selected_models.evaluate, ...)
    RewriteWorkflow-->>RewriteWorkflow: split entity rows / passthrough rows
    RewriteWorkflow->>DetectionJudgeWorkflow: evaluate(entity_rows)
    DetectionJudgeWorkflow-->>RewriteWorkflow: detection_valid (bool) + invalid_entities
    RewriteWorkflow-->>RewriteWorkflow: convert bool to 0-1 fraction (_detection_valid_fraction)
    RewriteWorkflow->>FinalJudgeWorkflow: columns(selected_models.evaluate, privacy_goal)
    FinalJudgeWorkflow-->>RewriteWorkflow: LLMJudgeColumnConfig (privacy/quality/style)
    RewriteWorkflow-->>RewriteWorkflow: "run holistic judge -> judge_evaluation"
    RewriteWorkflow-->>Anonymizer: RewriteResult (detection_valid + judge_evaluation)
    Anonymizer-->>User: AnonymizerResult (all columns present)
Loading

Reviews (9): Last reviewed commit: "merge main" | Re-trigger Greptile

Comment thread plans/rewrite-evaluation/plan.md Outdated
Comment thread plans/rewrite-evaluation/plan.md Outdated
Comment thread plans/rewrite-evaluation/plan.md Outdated
Comment thread plans/rewrite-evaluation/plan.md Outdated
Comment thread plans/rewrite-evaluation/plan.md Outdated
memadi-nv added 2 commits June 9, 2026 16:59
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread plans/rewrite-evaluation/plan.md Outdated
memadi-nv added 7 commits June 9, 2026 18:05
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner June 11, 2026 19:28
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread src/anonymizer/engine/ndd/model_loader.py Outdated
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv changed the title docs: rewrite evaluation improvements plan feat: rewrite evaluation improvements Jun 12, 2026
Comment thread docs/concepts/rewrite.md
| `detection_invalid_entities` | After `evaluate()` | Flagged detections with value, label, and one-sentence reasoning. |
| `judge_evaluation` | After `evaluate()` | Dict with `privacy`, `quality`, and `style` rubric scores and reasoning. |

Use `preview.trace_dataframe` for the full pipeline trace (domain, disposition, QA pairs, repair iterations, judge evaluation).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since judge_evaluation is explicitly rendered in display_record(), I think it belongs in the same public-facing tier as utility_score, leakage_mass, and detection_valid — all without the underscore prefix- hence removed the internal use, aka _

replace_attribute_fidelity_judge: gpt-oss-120b

# --- Rewrite evaluation ---
rewrite_judge: nemotron-30b-thinking

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rewrite judge is one holistic call because the dimensions are entangled. Privacy, quality, and style in a rewrite are hard to evaluate in isolation in contrast with replace judges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(rewrite): formal judge rubric selection and score distribution analysis

1 participant