[FR-006] Orchestrator aggregates per-eval scores into RunAggregates summary #6

Description

@explosivebit

Trace: PRD prd-v0-1-smoke-evaluation-run · FR-006 · SPEC RunAggregates schema (architect finding #2 resolution)

Capability: At the status=aggregating step, compute counts_by_status, counts_by_error_class, total_cost_usd, total_wall_clock_ms, per_task_metrics, budget_breach, and available_models_count as defined by the SPEC RunAggregates schema.
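A minimal sketch of what the aggregating step could look like, for discussion only. The SPEC RunAggregates schema is authoritative; everything here beyond the seven field names is an assumption: the `EvalRecord` fields, the `aggregate` function name, the shape of `per_task_metrics` (mean score per task), `total_wall_clock_ms` as a sum of per-eval durations, and `budget_breach` as a simple comparison against a hypothetical `budget_usd` limit.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    # Hypothetical per-eval record; the real fields come from the SPEC.
    task_id: str
    status: str                       # e.g. "succeeded" or "failed"
    error_class: Optional[str] = None
    cost_usd: float = 0.0
    wall_clock_ms: int = 0
    score: Optional[float] = None


def aggregate(evals, budget_usd, available_models_count):
    """Fold per-eval records into a RunAggregates-like dict (assumed shape)."""
    counts_by_status = Counter(e.status for e in evals)
    counts_by_error_class = Counter(
        e.error_class for e in evals if e.error_class is not None
    )
    total_cost_usd = sum(e.cost_usd for e in evals)
    # Assumption: total wall clock is the sum of per-eval durations.
    total_wall_clock_ms = sum(e.wall_clock_ms for e in evals)

    # Assumption: per-task metrics are the count and mean of available scores.
    scores_by_task = defaultdict(list)
    for e in evals:
        if e.score is not None:
            scores_by_task[e.task_id].append(e.score)
    per_task_metrics = {
        task: {"n_scored": len(scores), "mean_score": sum(scores) / len(scores)}
        for task, scores in scores_by_task.items()
    }

    return {
        "counts_by_status": dict(counts_by_status),
        "counts_by_error_class": dict(counts_by_error_class),
        "total_cost_usd": total_cost_usd,
        "total_wall_clock_ms": total_wall_clock_ms,
        "per_task_metrics": per_task_metrics,
        "budget_breach": total_cost_usd > budget_usd,
        "available_models_count": available_models_count,
    }
```

Because counts_by_status is built by counting every record exactly once, its values sum to len(evals) by construction, including failed evals.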

Acceptance:

  • The counts_by_status values sum to len(evals[]) (invariant for FR-009)
  • total_cost_usd is cross-checked against LiteLLM proxy /credits
  • AC-5 from SPEC: a failed eval is present in evals[] and reflected in counts_by_status.failed
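The acceptance criteria above could be expressed as a check along these lines. This is a sketch under stated assumptions: `check_acceptance` is a hypothetical helper, evals are plain dicts with a `status` key, the proxy-reported cost is passed in as a number rather than fetched (how /credits is queried and what it returns is not specified here), and the `abs_tol` tolerance is an arbitrary placeholder.

```python
import math


def check_acceptance(evals, aggregates, proxy_cost_usd, abs_tol=0.01):
    """Return a list of violated acceptance criteria (empty if all hold)."""
    problems = []
    counts = aggregates["counts_by_status"]

    # Invariant for FR-009: every eval is counted exactly once.
    if sum(counts.values()) != len(evals):
        problems.append("counts_by_status does not sum to len(evals)")

    # Cost cross-check; proxy_cost_usd is assumed to be obtained elsewhere.
    if not math.isclose(
        aggregates["total_cost_usd"], proxy_cost_usd, abs_tol=abs_tol
    ):
        problems.append("total_cost_usd disagrees with proxy-reported cost")

    # AC-5: failed evals stay in evals[] and show up in the failed count.
    failed_in_evals = sum(1 for e in evals if e["status"] == "failed")
    if counts.get("failed", 0) != failed_in_evals:
        problems.append("counts_by_status.failed does not match evals[]")

    return problems
```

Returning a list of messages rather than raising on the first failure keeps all violations visible in a single run.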

Implementation locus: apps/eval-core-py/src/orchestrator/aggregates.py

Metadata

Assignees

No one assigned

    Labels

    kind/spec (API/data contract specification)
    phase/2-smoke (Phase 2: Smoke run execution)
    priority/p1 (High: current Phase scope)
    type/feat (New feature)
