feat: rewrite evaluation improvements by memadi-nv · Pull Request #186 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-06-09T23:48:50Z

PR Summary — Rewrite Evaluation

Overview

Separates rewrite evaluation from run() / preview() and exposes it as an explicit Anonymizer.evaluate() call, consistent with replace evaluation. Adds detection validity scoring to rewrite evaluation, replaces numeric judge scores with categorical rubrics, and renames the fluency dimension.

Core Changes

Anonymizer.evaluate() now supports rewrite results — AnonymizerResult and PreviewResult carry rewrite_config so evaluate() can dispatch without the user restating the mode. evaluate() on a rewrite result validates only the evaluate model aliases it actually uses (check_rewrite=False), not the full rewrite pipeline roles.
Final judge moved out of run() / preview() — the repair loop (leakage, utility, needs_human_review) still runs as part of every run() call. The holistic quality judge runs only when evaluate() is called explicitly.
Detection validity added to rewrite evaluation — detection_valid is a float | None in rewrite mode, representing the fraction of detected entities that passed the judge. Passthrough rows (no detected entities) receive 1.0 (trivially valid) with an INFO log. If the score cannot be computed, detection_valid is None, logged as a WARNING , and displayed as Unavailable.
rewrite_judge model role moved to evaluate config — no longer part of rewrite generation config; now under selected_models.evaluate.rewrite_judge, consistent with the other judge roles.
Categorical judge scores replace numeric 1–10 rubrics — the final rewrite judge now returns low / medium / high per dimension, stored as structured JSON under judge_evaluation. Higher is better for all three dimensions, rendered as colour-coded badges in display_record().
naturalness renamed to style — more precisely reflects what the rubric measures (fluency, grammar, readability).
judge_evaluation column made public — renamed from _judge_evaluation; appears in display_record() output and user-facing docs.
_render_scores_section receives is_rewrite as an explicit parameter — replaces the previous column-name scan that could misfire on user-defined column names.

Display

display_record() for rewrite results renders detection_valid alongside utility and leakage scores.
detection_valid = None / NaN renders as Unavailable in grey rather than a numeric score.
Categorical judge rubrics (privacy, quality, style) render with high / medium / low badges.

The following shows what rewrite evaluation results look like. Note that when the Detection Validity score is below 1.0, the dropdown displays the flagged invalid entities:

Docs and Notebooks

evaluation.md — new Rewrite Evaluation section; detection validity special-value tables for both modes including no-entities and unavailable scenarios with their log messages.
rewrite.md — output columns table distinguishes run() vs evaluate() columns; new "Evaluating rewrite output" section.
Notebooks 04 and 05 — added optional "🔬 Evaluate" section with anonymizer.evaluate(result) and evaluated.display_record(0) as separate cells; Python sources synced.

Tests

_standard_side_effect trimmed to 2 results (pipeline + evaluate); stub_judge_df scoped to judge-produced columns only.
New _detection_valid_fraction tests covering all branches: valid=True, fraction computation, parse failure, total=0, valid=None.
New display tests for detection_valid=None, NaN, and absent column.
New test asserting evaluate() on a rewrite result calls validate_model_alias_references with check_rewrite=False.

Type of Change

Testing

make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Added/updated tests for changes

Documentation

If docs changed: make docs-build passes locally

Related Issues

Closes #106

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps · 2026-06-09T23:52:42Z

Greptile Summary

This PR separates rewrite evaluation from run() / preview() by exposing it as an explicit Anonymizer.evaluate() call, consistent with replace mode. It also replaces numeric 1–10 judge rubrics with categorical low / medium / high scores, renames the naturalness dimension to style, moves rewrite_judge from RewriteModelSelection to EvaluateModelSelection, and adds a detection_valid fraction to rewrite evaluation output.

Anonymizer.evaluate() on rewrite results — rewrite_config is stored on AnonymizerResult / PreviewResult so evaluate() dispatches without requiring the user to restate the mode; a new check_rewrite_evaluate flag in validate_model_alias_references validates only the evaluate-side aliases.
needs_human_review decoupled from the judge — the repair-loop flag is now computed inline from objective metrics after the loop exhausts, and COL_NEEDS_HUMAN_REVIEW is never touched by the holistic judge or the new evaluate() path.
detection_valid as a 0–1 fraction — rewrite evaluate surfaces the share of detected entities that passed the judge (vs. a boolean in replace mode); passthrough rows get 1.0; parse failures surface None / "Unavailable" in display_record().

Confidence Score: 5/5

Safe to merge — the core logic changes are well-contained and the previous round of review concerns have all been addressed in this implementation.

All issues identified in the previous review thread (silent score discard in _extract_judge_scores, telemetry AttributeError on removed judge field, COL_NEEDS_HUMAN_REVIEW overwrite in _join_judge_columns, missing rewrite_config on result types, FinalJudgeWorkflow signature mismatch, passthrough-row handling in evaluate(), and rewrite_judge alias validation) are each addressed with targeted fixes backed by new tests. The only remaining findings are a grammatical label typo and a variable-shadowing code-clarity concern in display.py, neither of which affects runtime behaviour.

src/anonymizer/interface/display.py — minor label grammar ("Rewrite Need Review") and inner-loop variable shadowing of label; no other files require special attention.

Important Files Changed

Filename	Overview
src/anonymizer/interface/anonymizer.py	evaluate() dispatch extended to handle rewrite results; judge_model removed from telemetry collection; rewrite_config forwarded through run()/preview() results.
src/anonymizer/engine/rewrite/rewrite_workflow.py	evaluate() method added to RewriteWorkflow; passthrough rows correctly assigned detection_valid=1.0; needs_human_review computation moved inline after the repair loop; _detection_valid_fraction handles all edge cases including parse failure and total=0.
src/anonymizer/engine/rewrite/final_judge.py	FinalJudgeWorkflow now takes EvaluateModelSelection; NATURALNESS_RUBRIC renamed STYLE_RUBRIC; numeric 1-10 scores replaced with categorical low/medium/high; HumanReviewParams and _determine_needs_human_review removed.
src/anonymizer/engine/ndd/model_loader.py	check_rewrite_evaluate flag added; rewrite_judge validation now correctly gated inside check_evaluate block with (check_rewrite or check_rewrite_evaluate) guard.
src/anonymizer/interface/display.py	_render_scores_section refactored with is_rewrite parameter; detection_valid row added; _extract_judge_scores fixed for categorical string scores; minor label variable shadowing and grammar issue ("Rewrite Need Review" should be "Rewrite Needs Review").
src/anonymizer/interface/results.py	rewrite_config: PrivacyGoal
src/anonymizer/engine/evaluation/detection_judge.py	_safe_parse helper added to DetectionJudgeWorkflow.prepare() to gracefully handle malformed entity rows; passthrough row logging added.
src/anonymizer/config/models.py	judge removed from RewriteModelSelection; rewrite_judge added to EvaluateModelSelection.

Sequence Diagram

sequenceDiagram
    participant User
    participant Anonymizer
    participant RewriteWorkflow
    participant DetectionJudgeWorkflow
    participant FinalJudgeWorkflow

    User->>Anonymizer: run(config, data)
    Anonymizer->>RewriteWorkflow: run(df, selected_models.rewrite, ...)
    RewriteWorkflow-->>RewriteWorkflow: pipeline + eval-repair loop
    RewriteWorkflow-->>RewriteWorkflow: compute needs_human_review
    RewriteWorkflow-->>Anonymizer: RewriteResult (no judge columns)
    Anonymizer-->>User: AnonymizerResult (rewrite_config stored)

    User->>Anonymizer: evaluate(result)
    Anonymizer->>Anonymizer: "validate_model_alias_references(check_rewrite_evaluate=True)"
    Anonymizer->>RewriteWorkflow: evaluate(internal_df, selected_models.evaluate, ...)
    RewriteWorkflow-->>RewriteWorkflow: split entity rows / passthrough rows
    RewriteWorkflow->>DetectionJudgeWorkflow: evaluate(entity_rows)
    DetectionJudgeWorkflow-->>RewriteWorkflow: detection_valid (bool) + invalid_entities
    RewriteWorkflow-->>RewriteWorkflow: convert bool to 0-1 fraction (_detection_valid_fraction)
    RewriteWorkflow->>FinalJudgeWorkflow: columns(selected_models.evaluate, privacy_goal)
    FinalJudgeWorkflow-->>RewriteWorkflow: LLMJudgeColumnConfig (privacy/quality/style)
    RewriteWorkflow-->>RewriteWorkflow: "run holistic judge -> judge_evaluation"
    RewriteWorkflow-->>Anonymizer: RewriteResult (detection_valid + judge_evaluation)
    Anonymizer-->>User: AnonymizerResult (all columns present)

_{Reviews (9): Last reviewed commit: "merge main" | Re-trigger Greptile}

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv · 2026-06-12T18:20:01Z

+| `detection_invalid_entities` | After `evaluate()` | Flagged detections with value, label, and one-sentence reasoning. |
+| `judge_evaluation` | After `evaluate()` | Dict with `privacy`, `quality`, and `style` rubric scores and reasoning. |

-Use `preview.trace_dataframe` for the full pipeline trace (domain, disposition, QA pairs, repair iterations, judge evaluation).


Since judge_evaluation is explicitly rendered in display_record(), I think it belongs in the same public-facing tier as utility_score, leakage_mass, and detection_valid — all without the underscore prefix- hence removed the internal use, aka _

memadi-nv · 2026-06-12T18:33:12Z

  replace_attribute_fidelity_judge: gpt-oss-120b
+
+  # --- Rewrite evaluation ---
+  rewrite_judge: nemotron-30b-thinking


The rewrite judge is one holistic call because the dimensions are entangled. Privacy, quality, and style in a rewrite are hard to evaluate in isolation in contrast with replace judges.

add a plan for updating anonymizer rewrite evaluation

b23de7c

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from a team as a code owner June 9, 2026 23:48

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread plans/rewrite-evaluation/plan.md Outdated

Comment thread plans/rewrite-evaluation/plan.md Outdated

Comment thread plans/rewrite-evaluation/plan.md Outdated

Comment thread plans/rewrite-evaluation/plan.md Outdated

Comment thread plans/rewrite-evaluation/plan.md Outdated

memadi-nv added 2 commits June 9, 2026 16:59

nit

3308479

Signed-off-by: memadi <memadi@nvidia.com>

address gliner feedback

df3f4f2

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from lipikaramaswamy June 10, 2026 00:55

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread plans/rewrite-evaluation/plan.md Outdated

memadi-nv added 7 commits June 9, 2026 18:05

address greptile feedback-part 2

24b4185

Signed-off-by: memadi <memadi@nvidia.com>

nit

0a6f0bf

Signed-off-by: memadi <memadi@nvidia.com>

implement the plan

d347463

Signed-off-by: memadi <memadi@nvidia.com>

merge main

c158670

Signed-off-by: memadi <memadi@nvidia.com>

update evaluation-rewrite

59a7c93

Signed-off-by: memadi <memadi@nvidia.com>

nit

763c3a6

Signed-off-by: memadi <memadi@nvidia.com>

make format

fb37234

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from a team as a code owner June 11, 2026 19:28

memadi-nv added 6 commits June 11, 2026 16:42

update notebooks according to the change

28cb50f

Signed-off-by: memadi <memadi@nvidia.com>

remove plan

8a1a013

Signed-off-by: memadi <memadi@nvidia.com>

make judge_evaluation a user facing value

52e7cbb

Signed-off-by: memadi <memadi@nvidia.com>

nit

7853fbd

Signed-off-by: memadi <memadi@nvidia.com>

fix some bugs-add some tests

e078d39

Signed-off-by: memadi <memadi@nvidia.com>

update docs according to the changes

63d8450

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread src/anonymizer/engine/ndd/model_loader.py Outdated

merge main

eb5367e

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv changed the title ~~docs: rewrite evaluation improvements plan~~ feat: rewrite evaluation improvements Jun 12, 2026

memadi-nv commented Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: rewrite evaluation improvements #186

feat: rewrite evaluation improvements #186
memadi-nv wants to merge 17 commits into
mainfrom
memadi/feature/update-rewrite-evaluation

memadi-nv commented Jun 9, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 9, 2026 •

edited

Loading

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

memadi-nv Jun 12, 2026

Uh oh!

memadi-nv Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

memadi-nv commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary — Rewrite Evaluation

Overview

Core Changes

Display

Docs and Notebooks

Tests

Type of Change

Testing

Documentation

Related Issues

Uh oh!

greptile-apps Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

memadi-nv Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

memadi-nv Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

memadi-nv commented Jun 9, 2026 •

edited

Loading

greptile-apps Bot commented Jun 9, 2026 •

edited

Loading