
Better token count metrics#1108

Merged
samsja merged 21 commits into main from sebastian/prompt-metrics-2026-04-03
Apr 18, 2026

Conversation

@snimu
Contributor

@snimu commented Apr 3, 2026

Description

Change the token metrics:

| Field | Description |
| --- | --- |
| `input_tokens` | Unchanged. Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. |
| `output_tokens` | Unchanged. Sum of completion tokens across all turns. |
| `final_input_tokens` | New. Non-completion tokens in the final turn's context (system prompts, user messages, tool results, etc.). |
| `final_output_tokens` | New. Completion tokens in the final turn's context. Equals `output_tokens` for single-turn rollouts. |

In a single-turn rollout, `input_tokens == final_input_tokens` and `output_tokens == final_output_tokens`. In a multi-turn rollout, `input_tokens > final_input_tokens` because earlier turns' prompts are counted again.

The `final_*` metrics assume a single, continuously extended trajectory. Non-linear trajectories (multi-agent, context summarization, history rewriting) are not accounted for.
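The relationships above can be sketched as follows. This is a hypothetical helper, not the PR's actual implementation: it assumes each turn is represented as a `(prompt_tokens, completion_tokens)` pair and that the trajectory is linear, so every earlier completion is contained in the final turn's prompt.

```python
def token_metrics(turns: list[tuple[int, int]]) -> dict[str, int]:
    """Illustrative sketch of the four metrics, assuming per-turn
    (prompt_tokens, completion_tokens) pairs and a linear trajectory."""
    # Totals: every prompt is counted each time it is sent, so shared
    # context is recounted on every turn.
    input_tokens = sum(p for p, _ in turns)
    output_tokens = sum(c for _, c in turns)

    # Final context = last prompt + last completion. On a linear trajectory
    # the last prompt already contains all earlier completions, so the
    # completion tokens in the final context equal the total output tokens.
    last_prompt, last_completion = turns[-1]
    final_output_tokens = output_tokens
    final_input_tokens = last_prompt + last_completion - final_output_tokens

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "final_input_tokens": final_input_tokens,
        "final_output_tokens": final_output_tokens,
    }
```

For a single turn the metrics coincide; with multiple turns, `input_tokens` grows faster than `final_input_tokens` because earlier prompts are recounted.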

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Adds new token_usage fields and changes how token usage is computed/aggregated from trajectories, which can alter reported metrics and downstream consumers expecting the old schema.

Overview
Adds new per-rollout token usage metrics (final_input_tokens, final_output_tokens) derived from the rollout trajectory’s last step, alongside existing input_tokens/output_tokens totals.

Plumbs these fields through result serialization and aggregation: state_to_output now enriches token_usage, GenerateOutputsBuilder/print_usage compute averages for the new metrics when available, and TUI/eval displays render the additional fields. Updates types/docs to include the new TokenUsage shape, refactors token metrics into a keyed base class, and adds focused tests for trajectory-based context token computation and metric classes.

Reviewed by Cursor Bugbot for commit 00c7d29. Bugbot is set up for automated code reviews on this repo.

snimu and others added 4 commits April 3, 2026 16:22
Track (prompt_tokens, completion_tokens) per turn in StateUsageTracker
and compute branch-aware context metrics:
- cumulative_prefill_tokens: total prefill work (renamed from input_tokens)
- cumulative_decode_tokens: total decode work (renamed from output_tokens)
- longest_context_completion_tokens: model output in longest branch
- longest_context_non_completion_tokens: environment input in longest branch

Branching is detected via mark_branch() called from RLM's
summarize_turns. Without summarization, there's one branch and the
metrics reflect the full conversation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New metrics computed from the trajectory at rollout end:
- longest_context_completion_tokens: model-generated tokens in context
- longest_context_non_completion_tokens: non-model tokens in context

Context metrics detect summarization automatically by counting
assistant messages in the last trajectory step's prompt — dropped
turns are simply absent. No env-specific hooks needed.

Rename display names (backward-compatible with old saved data):
- input_tokens → cumulative_prefill_tokens
- output_tokens → cumulative_decode_tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace assistant-counting heuristic with the same message-prefix
matching approach used by best-effort TITO (PR #955). This auto-detects
branching, context dropping, and history rewriting from trajectory data
alone — no trajectory_id filtering or env modifications needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Falsy or fallback skips valid zero token values
    • Replaced or operator with explicit if value is None checks in InputTokensMetric, OutputTokensMetric, and _make_tokens_row to properly handle zero token values.
  • ✅ Fixed: Documentation not updated for TokenUsage changes
    • Updated docs/reference.md to document the new TokenUsage fields (cumulative_prefill_tokens, cumulative_decode_tokens, longest_context_completion_tokens, longest_context_non_completion_tokens) and explain the naming convention change.
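The first fix above addresses a classic falsy-fallback pitfall. A minimal sketch of the bug (using the field names from this stage of the PR): `0.0` is falsy in Python, so an `or` fallback silently discards a legitimate zero token count, while an explicit `is None` check only falls back when the key is truly absent.

```python
# A rollout whose new-style field is a legitimate 0.0, with a stale legacy value.
usage = {"cumulative_prefill_tokens": 0.0, "input_tokens": 123.0}

# Buggy: 0.0 is falsy, so `or` jumps to the legacy field.
wrong = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens")

# Fixed: fall back only when the new key is actually missing.
value = usage.get("cumulative_prefill_tokens")
if value is None:
    value = usage.get("input_tokens")
```

Here `wrong` is `123.0` while `value` is the correct `0.0`.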


Or push these changes by commenting:

@cursor push af0195d735
Preview (af0195d735)
diff --git a/docs/reference.md b/docs/reference.md
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -210,6 +210,14 @@
     env_version: str | None
     env_commit: str | None
 
+class TokenUsage(TypedDict, total=False):
+    input_tokens: float  # legacy name for cumulative_prefill_tokens
+    output_tokens: float  # legacy name for cumulative_decode_tokens
+    cumulative_prefill_tokens: float  # total prefill tokens across all turns
+    cumulative_decode_tokens: float  # total decode tokens across all turns
+    longest_context_completion_tokens: float  # completion tokens in longest context branch
+    longest_context_non_completion_tokens: float  # non-completion tokens in longest context branch
+
 class GenerateMetadata(TypedDict):
     env_id: str
     env_args: dict
@@ -237,6 +245,8 @@
 
 `version_info` captures the verifiers framework version/commit and the environment package version/commit at generation time. Populated automatically by `GenerateOutputsBuilder`.
 
+`usage` aggregates token usage across all rollouts. All fields in `TokenUsage` are optional. The new naming convention (`cumulative_prefill_tokens`, `cumulative_decode_tokens`) is preferred; legacy field names (`input_tokens`, `output_tokens`) are supported for backward compatibility.
+
 ### RolloutScore / RolloutScores
 
 ```python

diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -354,13 +354,15 @@
 
     def _make_tokens_row(self, usage: TokenUsage) -> Table | None:
         """Create a tokens row with prefill/decode and context values."""
+        prefill = usage.get("cumulative_prefill_tokens")
+        if prefill is None:
+            prefill = usage.get("input_tokens", 0.0)
+        decode = usage.get("cumulative_decode_tokens")
+        if decode is None:
+            decode = usage.get("output_tokens", 0.0)
         kv: dict[str, object] = {
-            "prefill": format_numeric(
-                usage.get("cumulative_prefill_tokens") or usage.get("input_tokens", 0.0)
-            ),
-            "decode": format_numeric(
-                usage.get("cumulative_decode_tokens") or usage.get("output_tokens", 0.0)
-            ),
+            "prefill": format_numeric(prefill),
+            "decode": format_numeric(decode),
         }
         ctx_non_completion = usage.get("longest_context_non_completion_tokens")
         ctx_completion = usage.get("longest_context_completion_tokens")
@@ -980,14 +982,14 @@
             else:
                 usage = env_state.usage
             if usage is not None:
-                prefill_tokens = format_numeric(
-                    usage.get("cumulative_prefill_tokens")
-                    or usage.get("input_tokens", 0.0)
-                )
-                decode_tokens = format_numeric(
-                    usage.get("cumulative_decode_tokens")
-                    or usage.get("output_tokens", 0.0)
-                )
+                prefill = usage.get("cumulative_prefill_tokens")
+                if prefill is None:
+                    prefill = usage.get("input_tokens", 0.0)
+                decode = usage.get("cumulative_decode_tokens")
+                if decode is None:
+                    decode = usage.get("output_tokens", 0.0)
+                prefill_tokens = format_numeric(prefill)
+                decode_tokens = format_numeric(decode)
 
             # error rate with color coding
             error_rate = env_state.error_rate

diff --git a/verifiers/utils/metric_utils.py b/verifiers/utils/metric_utils.py
--- a/verifiers/utils/metric_utils.py
+++ b/verifiers/utils/metric_utils.py
@@ -74,7 +74,9 @@
     def extract(self, output: RolloutOutput) -> float | None:
         usage = output.get("token_usage")
         if isinstance(usage, dict):
-            value = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens")
+            value = usage.get("cumulative_prefill_tokens")
+            if value is None:
+                value = usage.get("input_tokens")
             if value is not None:
                 return float(value)
         return None
@@ -86,7 +88,9 @@
     def extract(self, output: RolloutOutput) -> float | None:
         usage = output.get("token_usage")
         if isinstance(usage, dict):
-            value = usage.get("cumulative_decode_tokens") or usage.get("output_tokens")
+            value = usage.get("cumulative_decode_tokens")
+            if value is None:
+                value = usage.get("output_tokens")
             if value is not None:
                 return float(value)
         return None


snimu and others added 2 commits April 3, 2026 22:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `if val is None` instead of `or` for new-key/old-key fallback so
that a legitimate 0.0 value is not treated as missing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu changed the title from "Sebastian/prompt metrics 2026 04 03" to "Better token count metrics" on Apr 3, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 2 commits April 3, 2026 23:17
Previously the locally computed usage dict only had prefill/decode,
so longest_context_* was always None in the common path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu marked this pull request as draft April 3, 2026 21:46
snimu and others added 5 commits April 6, 2026 16:22
Replace the complex prefix-matching context token computation with a
linear-rollout assumption (last trajectory step). Rename cumulative
tracker keys from input_tokens/output_tokens to prefill_tokens/decode_tokens.
The new input_tokens/output_tokens now represent non-completion and
completion tokens in context respectively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu marked this pull request as ready for review April 6, 2026 15:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Responses where .usage is None are now skipped when finding the last
step and summing completion tokens, matching the guard used by the
legacy fallback in state_to_output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 2 commits April 6, 2026 18:39
The old-key fallback is unnecessary here since StateUsageTracker and
_coerce_token_usage already normalize to the new key names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@willccbb
Member

willccbb commented Apr 7, 2026

Not sure I agree with the renaming; let's keep input_tokens + output_tokens as-is, since these are user-facing metrics for inference usage.

…vel to final_*

Cumulative metrics (prefill_tokens/decode_tokens) revert to
input_tokens/output_tokens to match main. Context-level metrics
(input_tokens/output_tokens) become final_input_tokens/final_output_tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: All TokenUsage fields made optional unnecessarily
    • Replaced total=False with NotRequired annotations for only the new optional fields (final_input_tokens, final_output_tokens) while keeping input_tokens and output_tokens as required fields.


Or push these changes by commenting:

@cursor push 79808092d3
Preview (79808092d3)
diff --git a/verifiers/types.py b/verifiers/types.py
--- a/verifiers/types.py
+++ b/verifiers/types.py
@@ -203,11 +203,11 @@
     routed_experts: list[list[list[int]]] | None  # [seq_len, layers, topk]
 
 
-class TokenUsage(TypedDict, total=False):
+class TokenUsage(TypedDict):
     input_tokens: float
     output_tokens: float
-    final_input_tokens: float
-    final_output_tokens: float
+    final_input_tokens: NotRequired[float]
+    final_output_tokens: NotRequired[float]
 
 
 class VersionInfo(TypedDict):


Reviewed by Cursor Bugbot for commit f1b88c2.

snimu and others added 2 commits April 8, 2026 12:16
… required

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu requested a review from willccbb April 18, 2026 12:48
@samsja merged commit abf7708 into main Apr 18, 2026
6 checks passed


3 participants