feat(evaluator): promote generic agentic-use runtimes into the Agent-Eval SDK #256

Open
arpitsardhana wants to merge 3 commits into profbench-mvp-2 from aalgo-258-runner-sdk/arpsingh

arpitsardhana (Contributor) commented Jun 10, 2026

Summary

This MR adds the layers above and below the evaluator so you can run a full agent-evaluation pipeline end-to-end without writing the glue yourself.

  1. Orchestration — orchestrator.py
    AgentEvalOrchestrator is a thin driver that ties AgentEvaluator + gating into one call. Two entry points:
  • run_tasks(tasks, target=runtime, ...) — online: execute an agent runtime, score, gate.
  • score_attempts(tasks, attempts=..., ...) — offline: score already-captured attempts (no execution).

It stays backend-agnostic via two seams: extra_metrics (metrics to append, e.g. a reward metric) and a prepare_task hook (e.g. "build the image first"). It never introspects the runtime.
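
The two entry points and the two seams can be pictured with a minimal sketch. Everything below is a hypothetical stand-in (`SketchTask`, `SketchAttempt`, `SketchOrchestrator`, `EchoRuntime`, the `min_score` gate), not the SDK's actual classes or signatures; only the names `run_tasks`, `score_attempts`, `target`, `attempts`, `extra_metrics`, and `prepare_task` come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchTask:  # hypothetical stand-in for AgentEvalTask
    name: str


@dataclass
class SketchAttempt:  # hypothetical stand-in for AgentEvalAttempt
    task_name: str
    metadata: dict = field(default_factory=dict)


Metric = Callable[[SketchAttempt], float]


@dataclass
class SketchOrchestrator:
    metrics: list  # list of Metric callables
    min_score: float = 1.0  # assumed gate: every score must reach this value

    def run_tasks(self, tasks, *, target, extra_metrics=(), prepare_task=None):
        """Online path: execute the runtime per task, then score and gate."""
        attempts = []
        for task in tasks:
            if prepare_task is not None:
                prepare_task(task)  # seam 2: e.g. "build the image first"
            attempts.append(target.run(task))  # the runtime is only called, never inspected
        return self.score_attempts(tasks, attempts=attempts, extra_metrics=extra_metrics)

    def score_attempts(self, tasks, *, attempts, extra_metrics=()):
        """Offline path: score already-captured attempts (tasks is unused in this sketch)."""
        metrics = [*self.metrics, *extra_metrics]  # seam 1: appended metrics
        scores = {a.task_name: [m(a) for m in metrics] for a in attempts}
        gate_passed = all(v >= self.min_score for vals in scores.values() for v in vals)
        return {"scores": scores, "gate_passed": gate_passed}


class EchoRuntime:  # hypothetical runtime that always succeeds
    def run(self, task):
        return SketchAttempt(task.name, {"agent_ok": True, "reward": 1.0})


built = []
orchestrator = SketchOrchestrator(metrics=[lambda a: float(bool(a.metadata.get("agent_ok")))])
report = orchestrator.run_tasks(
    [SketchTask("demo")],
    target=EchoRuntime(),
    extra_metrics=[lambda a: float(a.metadata.get("reward", 0.0))],
    prepare_task=lambda task: built.append(task.name),
)
offline = orchestrator.score_attempts(
    [SketchTask("demo")], attempts=[SketchAttempt("demo", {"agent_ok": False})]
)
```

In this sketch the online path is just "produce attempts, then call the offline path", which is one way to keep both entry points scoring identically.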

  2. Offline attempt sourcing: AgentAttemptSource (types.py)
    A new protocol: the offline counterpart to AgentAttemptRuntime. Instead of executing an agent, an implementation adapts a stored artifact (a run dir/file) into an AgentEvalAttempt, so prior runs can be re-scored.

  3. Pluggable execution environment: runtimes/environment.py + environment_spec.py + docker.py
    AgentEnvironmentProvider → AgentEnvironmentHandle with a single run(spec, role) (roles: agent/verifier). DockerEnvironmentProvider is the default; swap in local/remote without touching runtime logic.
    EnvironmentSpec/load_environment_spec/plan_task_build: declarative environment.yaml → BuildPlan (with a Dockerfile escape hatch).
    docker.py: stdlib subprocess Docker helpers (docker_run, build_dockerfile, docker_image_exists).

  4. Coding-agent CLI driver seam: runtimes/coding_agent.py
    CliAgentDriver is a generic AgentAttemptRuntime for any CLI that takes a prompt on stdin and writes a final answer file; it captures workspace/stdout/stderr/output as evidence. CodingAgentSpec is the per-agent adapter (command builder + trajectory→evidence). Ships reference ClaudeCodeSpec/CursorAgentSpec.

  5. Attempt/evidence shaping: attempts.py + runtimes/layout.py
    resolve_attempt_status(agent_ok) → maps a ran-but-failed agent to partial (still scorable) vs failed.
    standard_evidence_descriptors(...) → the canonical evidence map (initial_state/trace/logs/final_state/verifier_logs).
    RunLayout + resolve_run_dir (abs-path for mounts) + prepare_run_layout: the on-disk run scaffold.

  6. Verifier mechanic: runtimes/verify.py
    VerifierOutcome + collect_verifier_outcome (reads reward.txt/stdout from a verifier log dir) + apply_verify_to_metadata (stamps reward/pass onto an attempt so a metric can score it).

  7. Metrics: common_metrics.py
    AgentPhaseSuccessMetric: scores from attempt metadata.
    EvidencePresenceMetric: a true metric-over-evidence; it reads candidate.evidence.filesystem(...) rather than a stamped reward.

  8. Typed measurements + deterministic gate: measurements.py + gating.py
    AttemptMeasurements: typed projection of tokens/runtime/reward/provenance from attempt metadata (one place that parses those keys).
    gating.py: summarize_run + evaluate_gate/GateThresholds/GateReport + write_gate_report + baseline loading: pass-rate / token-regression / runtime tie-breaker / cross-commit provenance checks → gate.json.
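
To make the deterministic gate concrete, here is a minimal sketch of a pass-rate plus token-regression check against a baseline. The names `SketchThresholds` and `sketch_gate`, the default threshold values, and the "reward >= 1.0 counts as a pass" rule are all assumptions for illustration; the actual evaluate_gate/GateThresholds API, the runtime tie-breaker, and the provenance checks are not reproduced here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchThresholds:  # hypothetical, not the SDK's GateThresholds
    min_pass_rate: float = 0.9          # assumed default
    max_token_regression: float = 0.10  # allowed fractional increase vs baseline (assumed)


def sketch_gate(rewards, tokens, baseline_tokens, thresholds=SketchThresholds()):
    """Return a gate report dict; the gate passes only if no check fails."""
    # Assumption: an attempt passes when its reward is at least 1.0.
    pass_rate = sum(1 for r in rewards if r >= 1.0) / len(rewards) if rewards else 0.0
    total_tokens = sum(tokens)
    regression = (total_tokens - baseline_tokens) / baseline_tokens if baseline_tokens else 0.0

    reasons = []
    if pass_rate < thresholds.min_pass_rate:
        reasons.append("pass_rate_below_threshold")
    if regression > thresholds.max_token_regression:
        reasons.append("token_regression_above_threshold")

    return {
        "pass_rate": pass_rate,
        "token_regression": regression,
        "gate_passed": not reasons,
        "reasons": reasons,
    }


good = sketch_gate(rewards=[1.0, 1.0], tokens=[100, 100], baseline_tokens=200)
bad = sketch_gate(rewards=[1.0, 0.0], tokens=[150, 150], baseline_tokens=200)
gate_json = json.dumps(good, indent=2)  # the kind of payload a gate.json could hold
```

Because the report is computed only from recorded measurements and fixed thresholds, re-running it on the same inputs gives the same verdict, which is the property the word "deterministic" points at.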

Guardrails

  • CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval/ free of NeMo-Platform imports.
  • All shared/* modules are pure re-export shims over their SDK homes (see runtimes/README.md shim→SDK table).
  • Also fixes a pre-existing SandboxSdk → SandboxSDK typo in test_docker_sandbox_runtime.py.
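
For readers unfamiliar with import-hygiene gates, the idea can be sketched as follows. This version parses imports with `ast` instead of grepping, so it is a swapped-in technique rather than the test in this PR, and the forbidden prefix `nemo_platform` is a hypothetical package name chosen for illustration.

```python
import ast

# Hypothetical forbidden top-level package; the real test's patterns may differ.
FORBIDDEN_PREFIXES = ("nemo_platform",)


def forbidden_imports(source, forbidden=FORBIDDEN_PREFIXES):
    """Return the imported module names in `source` that fall under a forbidden prefix."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative "from . import x" has no module name
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not look-alike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found


clean = forbidden_imports("import os\nimport nemo_platformx\n")
dirty = forbidden_imports("import os\nfrom nemo_platform.jobs import submit\n")
```

A test built on this would walk every `.py` file under agent_eval/ and assert that the returned list is empty for each one.
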
flowchart TB
    subgraph Inputs
        T["AgentEvalTask(+metrics)"]
    end

    T --> ORCH["AgentEvalOrchestrator"]

    subgraph ONLINE["Online path: run_tasks(target=runtime)"]
        ORCH -->|prepare_task hook| BUILD["environment_spec: plan_task_build → BuildPlan<br/>docker.py: build image"]
        ORCH --> RT["AgentAttemptRuntime<br/>(e.g. CliAgentDriver + CodingAgentSpec)"]
        RT --> LAY["layout.py: resolve_run_dir / prepare_run_layout"]
        RT --> ENV["AgentEnvironmentProvider.prepare()<br/>→ Handle.run(spec, role=agent)"]
        ENV --> EXEC["DockerEnvironmentHandle → docker.py"]
        EXEC --> VER["verify.py: run(role=verifier)<br/>collect_verifier_outcome → apply_verify_to_metadata"]
        VER --> ATT["attempts.py: resolve_attempt_status<br/>standard_evidence_descriptors → AgentEvalAttempt(+evidence)"]
    end

    subgraph OFFLINE["Offline path: score_attempts(attempts=...)"]
        SRC["AgentAttemptSource.load_attempt()"] --> ATT
    end

    ATT --> EVAL["AgentEvaluator.run()"]
    EVAL --> MET["Metrics score per candidate<br/>AgentPhaseSuccessMetric / EvidencePresenceMetric<br/>(read candidate.evidence + metadata)"]
    MET --> RES["AgentEvalRunResult<br/>(attempts, results, summary)"]

    RES --> GATE["gating.py: summarize_run<br/>(via AttemptMeasurements)<br/>evaluate_gate vs baseline"]
    GATE --> OUT["persist bundle + gate.json<br/>(pass-rate / tokens / runtime / provenance)"]

Deliberately deferred (documented)

  • Converging the profbench codex runtime onto the new driver, implementing Claude/Cursor inside the nmp-agentic-base Docker env, removing the agentic-use stubs, and rewiring runtime_for_backend — bespoke per agent and not verifiable without those CLIs/images.
  • Removing the re-export shims (pure re-exports; deleting them is churn with no functional gain).

Test plan

  • pytest tests/agentic-use/tests/test_agentic_runtimes.py packages/nemo_evaluator_sdk/tests/agent_eval/ → 107 passed
  • ty check clean on agent_eval; grep import-hygiene gate green
  • End-to-end CLI: run_agent_eval.py --task workspace-basic-cli-easy --backend workflow --skip-build → agent_ok: True, overall_score: 1.0, full persistence bundle + gate.json (gate_passed: True) written
  • CI green

arpitsardhana requested review from a team as code owners June 10, 2026 05:44
arpitsardhana self-assigned this Jun 10, 2026
arpitsardhana force-pushed the aalgo-258-runner-sdk/arpsingh branch from 19d307b to a3e08db June 10, 2026 05:55
Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
…agent drivers to agent-eval SDK

Extend nemo_evaluator_sdk.agent_eval from "evaluator + contracts" into a full
agent-evaluation pipeline by adding the layers above and below AgentEvaluator.

Orchestration:
- orchestrator.py: AgentEvalOrchestrator ties AgentEvaluator + gating into one
  call. run_tasks(target=runtime) (online) and score_attempts(attempts=...)
  (offline). Backend-agnostic via injected extra_metrics + a prepare_task hook;
  it never introspects the runtime.
- types.py: AgentAttemptSource protocol — the offline counterpart to
  AgentAttemptRuntime (adapt a stored artifact into an AgentEvalAttempt).

Execution layer (dependency-gated, no core import):
- runtimes/environment.py: AgentEnvironmentProvider/Handle with a single
  run(spec, role) (agent/verifier); DockerEnvironmentProvider default, swappable.
- runtimes/environment_spec.py: declarative environment.yaml -> BuildPlan
  (Dockerfile escape hatch); runtimes/docker.py: stdlib subprocess Docker helpers.
- runtimes/coding_agent.py: CliAgentDriver (generic AgentAttemptRuntime for
  stdin-prompt CLIs) + CodingAgentSpec adapter seam; reference Claude/Cursor specs.
- runtimes/layout.py: RunLayout + resolve_run_dir (abs paths for mounts) +
  prepare_run_layout.
- runtimes/verify.py: VerifierOutcome + collect_verifier_outcome +
  apply_verify_to_metadata.

Attempt + scoring:
- attempts.py: resolve_attempt_status (ran-but-failed -> scorable "partial") +
  standard_evidence_descriptors (initial_state/trace/logs/final_state/verifier_logs).
- common_metrics.py: AgentPhaseSuccessMetric and EvidencePresenceMetric, a real
  metric-over-evidence that reads candidate.evidence.filesystem(...).

Results + gating:
- measurements.py: AttemptMeasurements, one typed projection of
  tokens/runtime/reward/provenance from attempt metadata.
- gating.py: summarize_run + evaluate_gate/GateThresholds/GateReport +
  write_gate_report + baseline loading (pass-rate, token regression, runtime
  tie-breaker, cross-commit provenance) -> gate.json.

A CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval free of
external/platform imports. tests/agentic-use is rewired as a thin adapter over
these modules via pure re-export shims. Also fixes a pre-existing
SandboxSdk->SandboxSDK typo in test_docker_sandbox_runtime.py.

107 tests pass; ty and import-hygiene gate clean; e2e CLI run reaches
agent_ok=True, overall_score=1.0, gate_passed=True.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
Remove the compatibility shims under tests/agentic-use/runtimes/shared that
re-exported promoted agent_eval SDK symbols, and import those generics directly
from nemo_evaluator_sdk.agent_eval (docker, environment, environment_spec,
gating, verify) at their use sites.

Consolidate the remaining NeMo-Platform-only glue into a single module,
shared/platform.py: the run layout with the platform state_dir, task_image_tag
+ platform DockerEnvironmentProvider, the namespaced AgentPhaseSuccessMetric +
VerifierRewardMetric, agent-log/usage parsing and the shared container env,
attempt construction (live + result.json/ResultDirAttemptSource), the live
VERIFY phase, and the agentic-use task loader. shared/ now holds only
platform.py, config.py, and constants.py.

Update orchestrator/workflow/aut runtimes, the package __init__ re-exports, the
runtime tests, and README/COMPLIANCE docs accordingly. 107 tests pass; ruff
clean.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
arpitsardhana force-pushed the aalgo-258-runner-sdk/arpsingh branch from a3e08db to afb7dc8 June 10, 2026 06:20
final_output_path: Path


class CodingAgentSpec:

I don't think we should necessarily just consider Codex/Claude/Cursor "Coding" Agents. In the Agents Improving Agents reference, these get called "Generalized Agents" which I think is a good name for us. There's also a bit of an inconsistency in that we call these CliAgents below.

)


class ClaudeCodeSpec(CodingAgentSpec):

I'd likely split these specs out to their own modules.


class DockerEnvironmentHandle:
"""Docker-backed environment handle bound to one task image."""
class AbstractEnvironmentHandle:

Maybe inherit abc.ABC and mark the relevant methods as @abstractmethod? Alternatively, we could just inject a callable Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]] and avoid the inheritance.
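
The two options can be compared side by side in a small sketch. The field layouts of `EnvRunSpec` and `EnvCommandResult` are not shown in this PR, so the dataclasses below are stand-ins, and `EchoHandle`, `echo_runner`, and `run_verifier` are hypothetical names. The `run` method is written as async only because the suggested callable type returns an Awaitable.

```python
import abc
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Literal

EnvRole = Literal["agent", "verifier"]


@dataclass
class EnvRunSpec:  # stand-in; real fields are an assumption
    command: List[str]


@dataclass
class EnvCommandResult:  # stand-in; real fields are an assumption
    exit_code: int
    stdout: str


# Option A: an abstract base class; instantiating it directly raises TypeError.
class AbstractEnvironmentHandle(abc.ABC):
    @abc.abstractmethod
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult: ...


class EchoHandle(AbstractEnvironmentHandle):
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
        return EnvCommandResult(0, f"{role}: {' '.join(spec.command)}")


# Option B: inject a plain async callable; no inheritance needed.
EnvRunner = Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]]


async def echo_runner(spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
    return EnvCommandResult(0, f"{role}: {' '.join(spec.command)}")


async def run_verifier(runner: EnvRunner) -> EnvCommandResult:
    """Caller code depends only on the callable shape."""
    return await runner(EnvRunSpec(["pytest", "-q"]), "verifier")


via_callable = asyncio.run(run_verifier(echo_runner))
via_handle = asyncio.run(run_verifier(EchoHandle().run))  # a bound method fits option B too
```

Since a bound `run` method already satisfies the callable type, option B accepts option A's implementations, which makes the two approaches compatible rather than exclusive.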


from runtimes.shared.docker import docker_run
from runtimes.shared.layout import task_image_tag
EnvRole = Literal["agent", "verifier"]

Maybe not for this PR, but we should talk about env roles and how they differ. I'm also interested in whether we think the verifier needs to run in an isolated env.

Comment on lines +34 to +35
``yaml`` is imported lazily so that importing this module costs nothing for
callers that never load a spec.

Is yaml a heavy enough import that this matters?
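
One rough way to answer this is to time the import directly. The helper below, `time_cold_import`, is a hypothetical name and only an approximation: it evicts the module and its submodules from `sys.modules` before re-importing, but already-loaded dependencies stay cached. Running `python -X importtime -c "import yaml"` in a fresh interpreter gives a more faithful per-module breakdown.

```python
import importlib
import sys
import time


def time_cold_import(module_name):
    """Return seconds taken to import `module_name` after evicting it from sys.modules."""
    # Drop the module and its submodules so the import machinery runs again.
    for name in list(sys.modules):
        if name == module_name or name.startswith(module_name + "."):
            del sys.modules[name]
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# "colorsys" is a small stdlib module used here only so the sketch runs anywhere;
# substitute "yaml" in an environment where PyYAML is installed.
elapsed = time_cold_import("colorsys")
```
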

        return self.metric_type

    def output_spec(self) -> list[MetricOutputSpec]:
        return [MetricOutputSpec.continuous_score("agent_phase_success")]

Maybe [MetricOutputSpec.boolean("agent_phase_success")] instead? We can then update the score computation to:

agent_ok = bool(input.candidate.metadata.get("agent_ok"))
return MetricResult(outputs=[MetricOutput(name="agent_phase_success", value=agent_ok)])

Same point with EvidencePresenceMetric below.

Comment on lines +77 to +78
        except (KeyError, ValueError):
            score = 0.0

Maybe add a log here so we can surface more specific details on why the result is 0.
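
A minimal sketch of what that could look like, assuming the surrounding code reads a value out of attempt metadata. The function name `score_from_metadata`, the `reward` key, and the `task_name` argument are hypothetical, since the full method is not visible in this hunk.

```python
import logging

logger = logging.getLogger(__name__)


def score_from_metadata(metadata, task_name):
    """Parse a score from metadata, logging the reason whenever it falls back to 0.0."""
    try:
        score = float(metadata["reward"])  # "reward" is an assumed key
    except (KeyError, ValueError) as exc:
        # Record which task fell back and why, so a 0.0 is distinguishable
        # from a genuinely failed attempt when reading the logs.
        logger.warning(
            "No usable reward for task %r (%s: %s); scoring 0.0",
            task_name,
            type(exc).__name__,
            exc,
        )
        score = 0.0
    return score


parsed = score_from_metadata({"reward": "0.5"}, "demo")
missing = score_from_metadata({}, "demo")
malformed = score_from_metadata({"reward": "n/a"}, "demo")
```
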

baseline_summary_path: Path | None = None


class AgentEvalOrchestrator:

Another naming point we should discuss, I think. We haven't used Orchestrator as a type name in the past, so we might want to look to our prior art for consistency. I think in our model execution we call these pipelines.

    :class:`AgentEvalAttempt` so it can be (re)scored through ``AgentEvaluator``.
    """

    def load_attempt(self, source: str | Path, *, task: AgentEvalTask) -> AgentEvalAttempt: ...

I'm thinking we may want to make this a symmetric serde instead of a one-sided loader. Something like:

@runtime_checkable
class AgentAttemptSerde(Protocol):  # or maybe AgentAttemptCodec?
  def read(self) -> AgentEvalAttempt: ...
  def write(self, attempt: AgentEvalAttempt) -> None: ...

Arguments should be passed to the concrete type's __init__:

class ResultDirAttemptSource:

  def __init__(self, path: str | Path, *, task: AgentEvalTask):
    self._path = path
    self._task = task

  def read(self) -> AgentEvalAttempt:
    # open files at self._path and parse them into an AgentEvalAttempt
    ...

  def write(self, attempt: AgentEvalAttempt) -> None:
    # write outputs to the directory at self._path
    ...
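
A runnable round-trip version of this idea, as a sketch only: `SketchAttempt` stands in for AgentEvalAttempt (whose real fields are not shown here), and `AttemptCodec`, `JsonFileAttemptCodec`, and the single-file JSON layout are assumptions rather than the result-dir format this PR uses.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Protocol, Union, runtime_checkable


@dataclass
class SketchAttempt:  # hypothetical stand-in for AgentEvalAttempt
    task_name: str
    status: str
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class AttemptCodec(Protocol):
    def read(self) -> SketchAttempt: ...
    def write(self, attempt: SketchAttempt) -> None: ...


class JsonFileAttemptCodec:
    """Stores one attempt as a JSON file; location is fixed at construction time."""

    def __init__(self, path: Union[str, Path]) -> None:
        self._path = Path(path)

    def read(self) -> SketchAttempt:
        return SketchAttempt(**json.loads(self._path.read_text(encoding="utf-8")))

    def write(self, attempt: SketchAttempt) -> None:
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_text(json.dumps(asdict(attempt), indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    codec = JsonFileAttemptCodec(Path(tmp) / "run" / "result.json")
    original = SketchAttempt("demo", "completed", {"reward": 1.0})
    codec.write(original)
    restored = codec.read()
```

With both directions on one type, a round-trip test (`write` then `read` returns an equal attempt) becomes a cheap way to catch drift between what a run persists and what the offline path can re-score.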
