Conversation
Track (prompt_tokens, completion_tokens) per turn in StateUsageTracker and compute branch-aware context metrics:
- cumulative_prefill_tokens: total prefill work (renamed from input_tokens)
- cumulative_decode_tokens: total decode work (renamed from output_tokens)
- longest_context_completion_tokens: model output in longest branch
- longest_context_non_completion_tokens: environment input in longest branch

Branching is detected via mark_branch(), called from RLM's summarize_turns. Without summarization, there is one branch and the metrics reflect the full conversation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New metrics computed from the trajectory at rollout end:
- longest_context_completion_tokens: model-generated tokens in context
- longest_context_non_completion_tokens: non-model tokens in context

Context metrics detect summarization automatically by counting assistant messages in the last trajectory step's prompt — dropped turns are simply absent. No env-specific hooks needed.

Rename display names (backward-compatible with old saved data):
- input_tokens → cumulative_prefill_tokens
- output_tokens → cumulative_decode_tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics-2026-04-03 merge in main
Replace assistant-counting heuristic with the same message-prefix matching approach used by best-effort TITO (PR #955). This auto-detects branching, context dropping, and history rewriting from trajectory data alone — no trajectory_id filtering or env modifications needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
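As a rough illustration of the message-prefix matching idea (data shapes and function names here are hypothetical, not the actual implementation from PR #955): each trajectory step carries the prompt it was generated from, and the surviving context branch is the set of steps whose prompts are message-level prefixes of the longest final prompt. Summarized or rewritten histories fail the prefix check and fall out automatically.

```python
# Hypothetical sketch of message-prefix branch detection; the real code in
# the verifiers repo may differ in names and details.

def is_prefix(shorter: list[dict], longer: list[dict]) -> bool:
    """True if `shorter` is a message-level prefix of `longer`."""
    return len(shorter) <= len(longer) and longer[: len(shorter)] == shorter

def longest_branch(steps: list[dict]) -> list[dict]:
    """Steps in the longest context branch: the step with the longest
    prompt, plus every step whose prompt is a prefix of it. Dropped or
    summarized turns are not prefixes, so they are excluded."""
    if not steps:
        return []
    last = max(steps, key=lambda s: len(s["prompt"]))
    return [s for s in steps if is_prefix(s["prompt"], last["prompt"])]
```

Because the check runs purely on trajectory data, no trajectory_id bookkeeping or environment hooks are required.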
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Falsy `or` fallback skips valid zero token values
  - Replaced the `or` operator with explicit `if value is None` checks in InputTokensMetric, OutputTokensMetric, and _make_tokens_row to properly handle zero token values.
- ✅ Fixed: Documentation not updated for TokenUsage changes
  - Updated docs/reference.md to document the new TokenUsage fields (cumulative_prefill_tokens, cumulative_decode_tokens, longest_context_completion_tokens, longest_context_non_completion_tokens) and explain the naming convention change.
Or push these changes by commenting:
@cursor push af0195d735
Preview (af0195d735)
diff --git a/docs/reference.md b/docs/reference.md
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -210,6 +210,14 @@
env_version: str | None
env_commit: str | None
+class TokenUsage(TypedDict, total=False):
+ input_tokens: float # legacy name for cumulative_prefill_tokens
+ output_tokens: float # legacy name for cumulative_decode_tokens
+ cumulative_prefill_tokens: float # total prefill tokens across all turns
+ cumulative_decode_tokens: float # total decode tokens across all turns
+ longest_context_completion_tokens: float # completion tokens in longest context branch
+ longest_context_non_completion_tokens: float # non-completion tokens in longest context branch
+
class GenerateMetadata(TypedDict):
env_id: str
env_args: dict
@@ -237,6 +245,8 @@
`version_info` captures the verifiers framework version/commit and the environment package version/commit at generation time. Populated automatically by `GenerateOutputsBuilder`.
+`usage` aggregates token usage across all rollouts. All fields in `TokenUsage` are optional. The new naming convention (`cumulative_prefill_tokens`, `cumulative_decode_tokens`) is preferred; legacy field names (`input_tokens`, `output_tokens`) are supported for backward compatibility.
+
### RolloutScore / RolloutScores
```python
diff --git a/verifiers/utils/eval_display.py b/verifiers/utils/eval_display.py
--- a/verifiers/utils/eval_display.py
+++ b/verifiers/utils/eval_display.py
@@ -354,13 +354,15 @@
def _make_tokens_row(self, usage: TokenUsage) -> Table | None:
"""Create a tokens row with prefill/decode and context values."""
+ prefill = usage.get("cumulative_prefill_tokens")
+ if prefill is None:
+ prefill = usage.get("input_tokens", 0.0)
+ decode = usage.get("cumulative_decode_tokens")
+ if decode is None:
+ decode = usage.get("output_tokens", 0.0)
kv: dict[str, object] = {
- "prefill": format_numeric(
- usage.get("cumulative_prefill_tokens") or usage.get("input_tokens", 0.0)
- ),
- "decode": format_numeric(
- usage.get("cumulative_decode_tokens") or usage.get("output_tokens", 0.0)
- ),
+ "prefill": format_numeric(prefill),
+ "decode": format_numeric(decode),
}
ctx_non_completion = usage.get("longest_context_non_completion_tokens")
ctx_completion = usage.get("longest_context_completion_tokens")
@@ -980,14 +982,14 @@
else:
usage = env_state.usage
if usage is not None:
- prefill_tokens = format_numeric(
- usage.get("cumulative_prefill_tokens")
- or usage.get("input_tokens", 0.0)
- )
- decode_tokens = format_numeric(
- usage.get("cumulative_decode_tokens")
- or usage.get("output_tokens", 0.0)
- )
+ prefill = usage.get("cumulative_prefill_tokens")
+ if prefill is None:
+ prefill = usage.get("input_tokens", 0.0)
+ decode = usage.get("cumulative_decode_tokens")
+ if decode is None:
+ decode = usage.get("output_tokens", 0.0)
+ prefill_tokens = format_numeric(prefill)
+ decode_tokens = format_numeric(decode)
# error rate with color coding
error_rate = env_state.error_rate
diff --git a/verifiers/utils/metric_utils.py b/verifiers/utils/metric_utils.py
--- a/verifiers/utils/metric_utils.py
+++ b/verifiers/utils/metric_utils.py
@@ -74,7 +74,9 @@
def extract(self, output: RolloutOutput) -> float | None:
usage = output.get("token_usage")
if isinstance(usage, dict):
- value = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens")
+ value = usage.get("cumulative_prefill_tokens")
+ if value is None:
+ value = usage.get("input_tokens")
if value is not None:
return float(value)
return None
@@ -86,7 +88,9 @@
def extract(self, output: RolloutOutput) -> float | None:
usage = output.get("token_usage")
if isinstance(usage, dict):
- value = usage.get("cumulative_decode_tokens") or usage.get("output_tokens")
+ value = usage.get("cumulative_decode_tokens")
+ if value is None:
+ value = usage.get("output_tokens")
if value is not None:
return float(value)
return None
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use `if val is None` instead of `or` for new-key/old-key fallback so that a legitimate 0.0 value is not treated as missing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
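The underlying footgun is Python truthiness: `0.0 or fallback` evaluates to the fallback, so a legitimate zero count gets silently replaced. A minimal demonstration:

```python
usage = {"cumulative_prefill_tokens": 0.0, "input_tokens": 512.0}

# Buggy: `or` treats a legitimate 0.0 as missing and falls back to 512.0.
buggy = usage.get("cumulative_prefill_tokens") or usage.get("input_tokens", 0.0)

# Fixed: only fall back when the new key is actually absent.
fixed = usage.get("cumulative_prefill_tokens")
if fixed is None:
    fixed = usage.get("input_tokens", 0.0)

assert buggy == 512.0  # zero was silently discarded
assert fixed == 0.0    # zero preserved
```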
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the locally computed usage dict only had prefill/decode, so longest_context_* was always None in the common path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics-2026-04-03 merge in main
Replace the complex prefix-matching context token computation with a linear-rollout assumption (last trajectory step). Rename cumulative tracker keys from input_tokens/output_tokens to prefill_tokens/decode_tokens. The new input_tokens/output_tokens now represent non-completion and completion tokens in context respectively. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Responses where .usage is None are now skipped when finding the last step and summing completion tokens, matching the guard used by the legacy fallback in state_to_output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old-key fallback is unnecessary here since StateUsageTracker and _coerce_token_usage already normalize to the new key names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Not sure I agree with the renaming; let's keep `input_tokens` + `output_tokens` as-is, these are user-facing metrics for inference usage.
…vel to final_* Cumulative metrics (prefill_tokens/decode_tokens) revert to input_tokens/output_tokens to match main. Context-level metrics (input_tokens/output_tokens) become final_input_tokens/final_output_tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
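The cumulative-vs-final distinction can be made concrete with synthetic numbers (the per-turn usage shape below is an assumption, not the tracker's actual data structure):

```python
# Two turns of a multi-turn rollout; the second prompt re-contains the
# first turn's prompt and completion (synthetic numbers).
turns = [
    {"prompt_tokens": 100, "completion_tokens": 20},
    {"prompt_tokens": 130, "completion_tokens": 25},
]

# Cumulative metrics (matching main): every turn's prompt is counted.
input_tokens = sum(t["prompt_tokens"] for t in turns)       # 230
output_tokens = sum(t["completion_tokens"] for t in turns)  # 45

# Context-level metric: only the final prompt's size matters.
final_input_tokens = turns[-1]["prompt_tokens"]             # 130

# In a multi-turn rollout, earlier prompts are counted again, so:
assert input_tokens > final_input_tokens
```

In a single-turn rollout the two coincide, which is why the rename only affects multi-turn reporting.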
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: All TokenUsage fields made optional unnecessarily
  - Replaced `total=False` with `NotRequired` annotations for only the new optional fields (final_input_tokens, final_output_tokens) while keeping input_tokens and output_tokens as required fields.
Or push these changes by commenting:
@cursor push 79808092d3
Preview (79808092d3)
diff --git a/verifiers/types.py b/verifiers/types.py
--- a/verifiers/types.py
+++ b/verifiers/types.py
@@ -203,11 +203,11 @@
routed_experts: list[list[list[int]]] | None # [seq_len, layers, topk]
-class TokenUsage(TypedDict, total=False):
+class TokenUsage(TypedDict):
input_tokens: float
output_tokens: float
- final_input_tokens: float
- final_output_tokens: float
+ final_input_tokens: NotRequired[float]
+ final_output_tokens: NotRequired[float]
class VersionInfo(TypedDict):
Reviewed by Cursor Bugbot for commit f1b88c2.
… required Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ics-2026-04-03 merge main


Description

Change the token metrics: `input_tokens` / `output_tokens` stay as cumulative totals, and new `final_input_tokens` / `final_output_tokens` measure the final context; the latter match `input_tokens` / `output_tokens` for single-turn rollouts.

In a single-turn rollout, `input_tokens == final_input_tokens` and `output_tokens == final_output_tokens`. In a multi-turn rollout, `input_tokens > final_input_tokens` because earlier turns' prompts are counted again.

The `final_*` metrics assume a single, continuously extended trajectory. Non-linear trajectories (multi-agent, context summarization, history rewriting) are not accounted for.

Type of Change
Testing

Ran `uv run pytest` locally.

Checklist
Note

Medium Risk

Adds new `token_usage` fields and changes how token usage is computed/aggregated from trajectories, which can alter reported metrics and downstream consumers expecting the old schema.

Overview

Adds new per-rollout token usage metrics (`final_input_tokens`, `final_output_tokens`) derived from the rollout trajectory's last step, alongside the existing `input_tokens` / `output_tokens` totals.

Plumbs these fields through result serialization and aggregation: `state_to_output` now enriches `token_usage`, `GenerateOutputsBuilder` / `print_usage` compute averages for the new metrics when available, and TUI/eval displays render the additional fields. Updates types/docs to include the new `TokenUsage` shape, refactors token metrics into a keyed base class, and adds focused tests for trajectory-based context token computation and metric classes.

Reviewed by Cursor Bugbot for commit 00c7d29.
state_to_outputnow enrichestoken_usage,GenerateOutputsBuilder/print_usagecompute averages for the new metrics when available, and TUI/eval displays render the additional fields. Updates types/docs to include the newTokenUsageshape, refactors token metrics into a keyed base class, and adds focused tests for trajectory-based context token computation and metric classes.Reviewed by Cursor Bugbot for commit 00c7d29. Bugbot is set up for automated code reviews on this repo. Configure here.