
Better token count metrics#1108

Merged
samsja merged 21 commits into main from sebastian/prompt-metrics-2026-04-03
Apr 18, 2026

Conversation

@snimu
Contributor

@snimu commented Apr 3, 2026

Description

Change the token metrics:

| Field | Description |
| --- | --- |
| `input_tokens` | Unchanged. Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. |
| `output_tokens` | Unchanged. Sum of completion tokens across all turns. |
| `final_input_tokens` | New. Non-completion tokens in the final turn's context (system prompts, user messages, tool results, etc.). |
| `final_output_tokens` | New. Completion tokens in the final turn's context. Equals `output_tokens` for single-turn rollouts. |

In a single-turn rollout, `input_tokens == final_input_tokens` and `output_tokens == final_output_tokens`. In a multi-turn rollout, `input_tokens > final_input_tokens` because earlier turns' prompts are counted again.

The `final_*` metrics assume a single, continuously extended trajectory. Non-linear trajectories (multi-agent, context summarization, history rewriting) are not accounted for.
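The relationships above can be sketched as follows. This is a hypothetical helper, not the PR's actual implementation: it assumes each turn is represented as a `(prompt_tokens, completion_tokens)` pair and that the trajectory is linear, so every earlier completion is contained in the final turn's prompt.

```python
def token_metrics(turns: list[tuple[int, int]]) -> dict[str, int]:
    """Illustrative sketch of the four metrics, assuming per-turn
    (prompt_tokens, completion_tokens) pairs and a linear trajectory."""
    # Totals: every prompt is counted each time it is sent, so shared
    # context is recounted on every turn.
    input_tokens = sum(p for p, _ in turns)
    output_tokens = sum(c for _, c in turns)

    # Final context = last prompt + last completion. On a linear trajectory
    # the last prompt already contains all earlier completions, so the
    # completion tokens in the final context equal the total output tokens.
    last_prompt, last_completion = turns[-1]
    final_output_tokens = output_tokens
    final_input_tokens = last_prompt + last_completion - final_output_tokens

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "final_input_tokens": final_input_tokens,
        "final_output_tokens": final_output_tokens,
    }
```

For a single turn the metrics coincide; with multiple turns, `input_tokens` grows faster than `final_input_tokens` because earlier prompts are recounted.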

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Adds new token_usage fields and changes how token usage is computed/aggregated from trajectories, which can alter reported metrics and downstream consumers expecting the old schema.

Overview
Adds new per-rollout token usage metrics (final_input_tokens, final_output_tokens) derived from the rollout trajectory’s last step, alongside existing input_tokens/output_tokens totals.

Plumbs these fields through result serialization and aggregation: state_to_output now enriches token_usage, GenerateOutputsBuilder/print_usage compute averages for the new metrics when available, and TUI/eval displays render the additional fields. Updates types/docs to include the new TokenUsage shape, refactors token metrics into a keyed base class, and adds focused tests for trajectory-based context token computation and metric classes.

Reviewed by Cursor Bugbot for commit 00c7d29. Bugbot is set up for automated code reviews on this repo.

snimu and others added 4 commits April 3, 2026 16:22
Track (prompt_tokens, completion_tokens) per turn in StateUsageTracker
and compute branch-aware context metrics:
- cumulative_prefill_tokens: total prefill work (renamed from input_tokens)
- cumulative_decode_tokens: total decode work (renamed from output_tokens)
- longest_context_completion_tokens: model output in longest branch
- longest_context_non_completion_tokens: environment input in longest branch

Branching is detected via mark_branch() called from RLM's
summarize_turns. Without summarization, there's one branch and the
metrics reflect the full conversation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New metrics computed from the trajectory at rollout end:
- longest_context_completion_tokens: model-generated tokens in context
- longest_context_non_completion_tokens: non-model tokens in context

Context metrics detect summarization automatically by counting
assistant messages in the last trajectory step's prompt — dropped
turns are simply absent. No env-specific hooks needed.

Rename display names (backward-compatible with old saved data):
- input_tokens → cumulative_prefill_tokens
- output_tokens → cumulative_decode_tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace assistant-counting heuristic with the same message-prefix
matching approach used by best-effort TITO (PR #955). This auto-detects
branching, context dropping, and history rewriting from trajectory data
alone — no trajectory_id filtering or env modifications needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Falsy or fallback skips valid zero token values
    • Replaced or operator with explicit if value is None checks in InputTokensMetric, OutputTokensMetric, and _make_tokens_row to properly handle zero token values.
  • ✅ Fixed: Documentation not updated for TokenUsage changes
    • Updated docs/reference.md to document the new TokenUsage fields (cumulative_prefill_tokens, cumulative_decode_tokens, longest_context_completion_tokens, longest_context_non_completion_tokens) and explain the naming convention change.
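The first fix above addresses a classic falsy-fallback pitfall. A minimal sketch of the bug (using the field names from this stage of the PR): `0.0` is falsy in Python, so an `or` fallback silently discards a legitimate zero token count, while an explicit `is None` check only falls back when the key is truly absent.

```python
# A rollout whose new-style field is a legitimate 0.0, with a stale legacy value.
usage = {"cumulative_prefill_tokens": 0.0, "input_tokens": 123.0}

# Buggy: 0.0 is falsy, so `or` jumps to the legacy field.
wrong = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens")

# Fixed: fall back only when the new key is actually missing.
value = usage.get("cumulative_prefill_tokens")
if value is None:
    value = usage.get("input_tokens")
```

Here `wrong` is `123.0` while `value` is the correct `0.0`.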


Or push these changes by commenting:

@cursor push af0195d735
Preview (af0195d735)
diff --git a/docs/reference.md b/docs/reference.md
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -210,6 +210,14 @@
     env_version: str | None
     env_commit: str | None
 
+class TokenUsage(TypedDict, total=False):
+    input_tokens: float  # legacy name for cumulative_prefill_tokens
+    output_tokens: float  # legacy name for cumulative_decode_tokens
+    cumulative_prefill_tokens: float  # total prefill tokens across all turns
+    cumulative_decode_tokens: float  # total decode tokens across all turns
+    longest_context_completion_tokens: float  # completion tokens in longest context branch
+    longest_context_non_completion_tokens: float  # non-completion tokens in longest context branch
+
 class GenerateMetadata(TypedDict):
     env_id: str
     env_args: dict
@@ -237,6 +245,8 @@
 
 `version_info` captures the verifiers framework version/commit and the environment package version/commit at generation time. Populated automatically by `GenerateOutputsBuilder`.
 
+`usage` aggregates token usage across all rollouts. All fields in `TokenUsage` are optional. The new naming convention (`cumulative_prefill_tokens`, `cumulative_decode_tokens`) is preferred; legacy field names (`input_tokens`, `output_tokens`) are supported for backward compatibility.
+
 ### RolloutScore / RolloutScores
 
 ```python

diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -354,13 +354,15 @@
 
     def _make_tokens_row(self, usage: TokenUsage) -> Table | None:
         """Create a tokens row with prefill/decode and context values."""
+        prefill = usage.get("cumulative_prefill_tokens")
+        if prefill is None:
+            prefill = usage.get("input_tokens", 0.0)
+        decode = usage.get("cumulative_decode_tokens")
+        if decode is None:
+            decode = usage.get("output_tokens", 0.0)
         kv: dict[str, object] = {
-            "prefill": format_numeric(
-                usage.get("cumulative_prefill_tokens") or usage.get("input_tokens", 0.0)
-            ),
-            "decode": format_numeric(
-                usage.get("cumulative_decode_tokens") or usage.get("output_tokens", 0.0)
-            ),
+            "prefill": format_numeric(prefill),
+            "decode": format_numeric(decode),
         }
         ctx_non_completion = usage.get("longest_context_non_completion_tokens")
         ctx_completion = usage.get("longest_context_completion_tokens")
@@ -980,14 +982,14 @@
             else:
                 usage = env_state.usage
             if usage is not None:
-                prefill_tokens = format_numeric(
-                    usage.get("cumulative_prefill_tokens")
-                    or usage.get("input_tokens", 0.0)
-                )
-                decode_tokens = format_numeric(
-                    usage.get("cumulative_decode_tokens")
-                    or usage.get("output_tokens", 0.0)
-                )
+                prefill = usage.get("cumulative_prefill_tokens")
+                if prefill is None:
+                    prefill = usage.get("input_tokens", 0.0)
+                decode = usage.get("cumulative_decode_tokens")
+                if decode is None:
+                    decode = usage.get("output_tokens", 0.0)
+                prefill_tokens = format_numeric(prefill)
+                decode_tokens = format_numeric(decode)
 
             # error rate with color coding
             error_rate = env_state.error_rate

diff --git a/verifiers/utils/metric_utils.py b/verifiers/utils/metric_utils.py
--- a/verifiers/utils/metric_utils.py
+++ b/verifiers/utils/metric_utils.py
@@ -74,7 +74,9 @@
     def extract(self, output: RolloutOutput) -> float | None:
         usage = output.get("token_usage")
         if isinstance(usage, dict):
-            value = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens")
+            value = usage.get("cumulative_prefill_tokens")
+            if value is None:
+                value = usage.get("input_tokens")
             if value is not None:
                 return float(value)
         return None
@@ -86,7 +88,9 @@
     def extract(self, output: RolloutOutput) -> float | None:
         usage = output.get("token_usage")
         if isinstance(usage, dict):
-            value = usage.get("cumulative_decode_tokens") or usage.get("output_tokens")
+            value = usage.get("cumulative_decode_tokens")
+            if value is None:
+                value = usage.get("output_tokens")
             if value is not None:
                 return float(value)
         return None


snimu and others added 2 commits April 3, 2026 22:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `if val is None` instead of `or` for new-key/old-key fallback so
that a legitimate 0.0 value is not treated as missing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu changed the title from "Sebastian/prompt metrics 2026 04 03" to "Better token count metrics" on Apr 3, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 2 commits April 3, 2026 23:17
Previously the locally computed usage dict only had prefill/decode,
so longest_context_* was always None in the common path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu marked this pull request as draft April 3, 2026 21:46
snimu and others added 5 commits April 6, 2026 16:22
Replace the complex prefix-matching context token computation with a
linear-rollout assumption (last trajectory step). Rename cumulative
tracker keys from input_tokens/output_tokens to prefill_tokens/decode_tokens.
The new input_tokens/output_tokens now represent non-completion and
completion tokens in context respectively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu marked this pull request as ready for review April 6, 2026 15:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Responses where .usage is None are now skipped when finding the last
step and summing completion tokens, matching the guard used by the
legacy fallback in state_to_output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 2 commits April 6, 2026 18:39
The old-key fallback is unnecessary here since StateUsageTracker and
_coerce_token_usage already normalize to the new key names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@willccbb
Member

willccbb commented Apr 7, 2026

Not sure I agree with the renaming; let's keep input_tokens + output_tokens as-is, since these are user-facing metrics for inference usage.

…vel to final_*

Cumulative metrics (prefill_tokens/decode_tokens) revert to
input_tokens/output_tokens to match main. Context-level metrics
(input_tokens/output_tokens) become final_input_tokens/final_output_tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: All TokenUsage fields made optional unnecessarily
    • Replaced total=False with NotRequired annotations for only the new optional fields (final_input_tokens, final_output_tokens) while keeping input_tokens and output_tokens as required fields.


Or push these changes by commenting:

@cursor push 79808092d3
Preview (79808092d3)
diff --git a/verifiers/types.py b/verifiers/types.py
--- a/verifiers/types.py
+++ b/verifiers/types.py
@@ -203,11 +203,11 @@
     routed_experts: list[list[list[int]]] | None  # [seq_len, layers, topk]
 
 
-class TokenUsage(TypedDict, total=False):
+class TokenUsage(TypedDict):
     input_tokens: float
     output_tokens: float
-    final_input_tokens: float
-    final_output_tokens: float
+    final_input_tokens: NotRequired[float]
+    final_output_tokens: NotRequired[float]
 
 
 class VersionInfo(TypedDict):


Reviewed by Cursor Bugbot for commit f1b88c2.

snimu and others added 2 commits April 8, 2026 12:16
… required

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu requested a review from willccbb April 18, 2026 12:48
@samsja merged commit abf7708 into main Apr 18, 2026
6 checks passed


3 participants