feat(evaluator): promote generic agentic-use runtimes into the Agent-Eval SDK by arpitsardhana · Pull Request #256 · NVIDIA-NeMo/nemo-platform

arpitsardhana · 2026-06-10T05:44:38Z

Summary

This PR adds the layers above and below the evaluator so you can run a full agent-evaluation pipeline end-to-end without writing the glue yourself.

Orchestration — orchestrator.py
AgentEvalOrchestrator is a thin driver that ties AgentEvaluator + gating into one call. Two entry points:

run_tasks(tasks, target=runtime, ...) — online: execute an agent runtime, score, gate.
score_attempts(tasks, attempts=..., ...) — offline: score already-captured attempts (no execution).

It stays backend-agnostic via two seams: extra_metrics (metrics to append, e.g. a reward metric) and a prepare_task hook (e.g. "build the image first"). It never introspects the runtime.
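
As a rough illustration of how the two entry points and the two seams could fit together, here is a minimal sketch. `Task`, `Attempt`, `SketchOrchestrator`, the metric and runtime callables, and the `min_pass_rate` gate are all hypothetical stand-ins chosen for this example; they are not the SDK's actual types or signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class Task:
    """Hypothetical stand-in for AgentEvalTask."""
    task_id: str


@dataclass
class Attempt:
    """Hypothetical stand-in for AgentEvalAttempt."""
    task_id: str
    metadata: dict = field(default_factory=dict)


Metric = Callable[[Attempt], float]  # returns a score in [0, 1]
Runtime = Callable[[Task], Attempt]  # executes one task and returns its attempt


class SketchOrchestrator:
    """Illustrative driver: optional prepare hook, execute, score, gate."""

    def __init__(
        self,
        metrics: Sequence[Metric],
        extra_metrics: Sequence[Metric] = (),
        prepare_task: Optional[Callable[[Task], None]] = None,
        min_pass_rate: float = 1.0,
    ) -> None:
        # extra_metrics are simply appended; the driver never inspects them.
        self._metrics = [*metrics, *extra_metrics]
        self._prepare_task = prepare_task
        self._min_pass_rate = min_pass_rate

    def run_tasks(self, tasks: Sequence[Task], target: Runtime) -> dict:
        """Online path: run each task through the runtime, then score."""
        attempts = []
        for task in tasks:
            if self._prepare_task is not None:
                self._prepare_task(task)  # e.g. build the task image first
            attempts.append(target(task))
        return self.score_attempts(tasks, attempts)

    def score_attempts(self, tasks: Sequence[Task], attempts: Sequence[Attempt]) -> dict:
        """Offline path: score already-captured attempts without executing."""
        known = {task.task_id for task in tasks}
        scores = {
            attempt.task_id: min((m(attempt) for m in self._metrics), default=0.0)
            for attempt in attempts
            if attempt.task_id in known
        }
        passed = sum(1 for score in scores.values() if score >= 1.0)
        pass_rate = passed / len(scores) if scores else 0.0
        return {
            "scores": scores,
            "pass_rate": pass_rate,
            "gate_passed": pass_rate >= self._min_pass_rate,
        }
```

In this sketch the online path is just the offline path with an execution step in front, which is one way to keep both entry points on the same scoring and gating code.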

Offline attempt sourcing — AgentAttemptSource (types.py)
A new protocol: the offline counterpart to AgentAttemptRuntime. Instead of executing an agent, an implementation adapts a stored artifact (a run dir/file) into an AgentEvalAttempt, so prior runs can be re-scored.
Pluggable execution environment — runtimes/environment.py + environment_spec.py + docker.py
AgentEnvironmentProvider → AgentEnvironmentHandle with a single run(spec, role) (roles: agent/verifier). DockerEnvironmentProvider is the default; swap in local/remote without touching runtime logic.
EnvironmentSpec/load_environment_spec/plan_task_build — declarative environment.yaml → BuildPlan (with a Dockerfile escape hatch).
docker.py — stdlib subprocess Docker helpers (docker_run, build_dockerfile, docker_image_exists).
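
To make the "single run(spec, role)" idea concrete, the sketch below shows a local, non-Docker handle that satisfies the same shape. Only the `EnvRole` alias comes from the diff; `RunSpec`, `CommandResult`, `LocalEnvironmentHandle`, `python_spec`, and `run_agent_then_verifier` are hypothetical names invented for this example, and the real handle may well be async.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import List, Literal, Protocol, Tuple

EnvRole = Literal["agent", "verifier"]  # same alias as in the diff


@dataclass
class RunSpec:
    """Hypothetical run spec: a command line plus optional stdin."""
    argv: List[str]
    stdin: str = ""


@dataclass
class CommandResult:
    """Hypothetical result shape."""
    returncode: int
    stdout: str
    stderr: str


class EnvironmentHandle(Protocol):
    def run(self, spec: RunSpec, role: EnvRole) -> CommandResult: ...


@dataclass
class LocalEnvironmentHandle:
    """Runs commands as local subprocesses instead of inside a container."""
    calls: List[str] = field(default_factory=list)

    def run(self, spec: RunSpec, role: EnvRole) -> CommandResult:
        self.calls.append(role)  # record which role each command ran under
        proc = subprocess.run(
            spec.argv, input=spec.stdin, capture_output=True, text=True, check=False
        )
        return CommandResult(proc.returncode, proc.stdout, proc.stderr)


def python_spec(code: str) -> RunSpec:
    """Demo helper: run a Python one-liner with the current interpreter."""
    return RunSpec([sys.executable, "-c", code])


def run_agent_then_verifier(
    handle: EnvironmentHandle, agent: RunSpec, verifier: RunSpec
) -> Tuple[CommandResult, CommandResult]:
    # The verifier runs even if the agent exits non-zero, so a
    # ran-but-failed attempt can still be scored.
    agent_result = handle.run(agent, "agent")
    verifier_result = handle.run(verifier, "verifier")
    return agent_result, verifier_result
```

Because the caller only depends on `run(spec, role)`, swapping this handle for a Docker-backed or remote one would not change `run_agent_then_verifier`.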
Coding-agent CLI driver seam — runtimes/coding_agent.py
CliAgentDriver is a generic AgentAttemptRuntime for any CLI that takes a prompt on stdin and writes a final answer file; it captures workspace/stdout/stderr/output as evidence. CodingAgentSpec is the per-agent adapter (command builder + trajectory→evidence). Ships reference ClaudeCodeSpec/CursorAgentSpec.
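
The stdin-prompt, answer-file contract can be sketched in a few lines of stdlib Python. This is an assumption-laden illustration, not the `CliAgentDriver` implementation: `CliEvidence` and `run_cli_agent` are hypothetical names, and only the `final_output_path` field name is taken from the diff.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class CliEvidence:
    """Hypothetical evidence bundle captured from one CLI run."""
    returncode: int
    stdout: str
    stderr: str
    final_output: Optional[str]  # None if the agent never wrote its answer file


def run_cli_agent(
    argv: List[str],
    prompt: str,
    workspace: Path,
    final_output_path: Path,
    timeout_s: float = 60.0,
) -> CliEvidence:
    """Feed the prompt on stdin, run inside the workspace, collect evidence."""
    proc = subprocess.run(
        argv,
        input=prompt,
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # a non-zero exit is evidence, not an exception
    )
    final_output = (
        final_output_path.read_text(encoding="utf-8")
        if final_output_path.exists()
        else None
    )
    return CliEvidence(proc.returncode, proc.stdout, proc.stderr, final_output)
```

A per-agent spec would then only need to supply `argv` and a way to turn the agent's trajectory into evidence, which is the role the description assigns to `CodingAgentSpec`.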
Attempt/evidence shaping — attempts.py + runtimes/layout.py
resolve_attempt_status(agent_ok) → maps a ran-but-failed agent to partial (still scorable) vs failed.
standard_evidence_descriptors(...) → the canonical evidence map (initial_state/trace/logs/final_state/verifier_logs).
RunLayout + resolve_run_dir (abs-path for mounts) + prepare_run_layout — the on-disk run scaffold.
Verifier mechanic — runtimes/verify.py
VerifierOutcome + collect_verifier_outcome (reads reward.txt/stdout from a verifier log dir) + apply_verify_to_metadata (stamps reward/pass onto an attempt so a metric can score it).
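
A minimal sketch of the verifier mechanic, assuming the verifier writes a single float to `reward.txt`. The stdout fallback mentioned above is left out, and the `pass_threshold` parameter and the metadata keys `verifier_reward` / `verifier_passed` are hypothetical choices for this example rather than the SDK's actual keys.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class VerifierOutcomeSketch:
    reward: Optional[float]  # None when no usable reward was found
    passed: bool


def collect_outcome_sketch(log_dir: Path, pass_threshold: float = 1.0) -> VerifierOutcomeSketch:
    """Read reward.txt from the verifier log dir and derive pass/fail."""
    try:
        reward = float((log_dir / "reward.txt").read_text(encoding="utf-8").strip())
    except (FileNotFoundError, ValueError):
        # Missing or unparsable reward: treat as "no verdict", not as a crash.
        return VerifierOutcomeSketch(reward=None, passed=False)
    return VerifierOutcomeSketch(reward=reward, passed=reward >= pass_threshold)


def apply_outcome_to_metadata(metadata: dict, outcome: VerifierOutcomeSketch) -> dict:
    """Return a copy of the attempt metadata with the verdict stamped on."""
    stamped = dict(metadata)
    stamped["verifier_reward"] = outcome.reward
    stamped["verifier_passed"] = outcome.passed
    return stamped
```

Returning a copy keeps the original attempt metadata untouched, so a metric that reads the stamped keys stays a pure function of its input.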
Metrics — common_metrics.py
AgentPhaseSuccessMetric — scores from attempt metadata.
EvidencePresenceMetric — a true metric-over-evidence: reads candidate.evidence.filesystem(...) rather than a stamped reward.
Typed measurements + deterministic gate — measurements.py + gating.py
AttemptMeasurements — typed projection of tokens/runtime/reward/provenance from attempt metadata (one place that parses those keys).
gating.py — summarize_run + evaluate_gate/GateThresholds/GateReport + write_gate_report + baseline loading: pass-rate / token-regression / runtime tie-breaker / cross-commit provenance checks → gate.json.
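
The deterministic gate can be pictured as a pure function from a run summary, an optional baseline, and thresholds to a report. The sketch below covers only the pass-rate and token-regression checks; the runtime tie-breaker and cross-commit provenance checks are omitted because their rules are not spelled out here. All names, default thresholds, and failure labels are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummarySketch:
    pass_rate: float    # fraction of attempts that passed
    mean_tokens: float  # average tokens per attempt


@dataclass(frozen=True)
class ThresholdsSketch:
    min_pass_rate: float = 0.8
    max_token_regression: float = 0.10  # allowed fractional increase over baseline


def evaluate_gate_sketch(
    current: RunSummarySketch,
    baseline: Optional[RunSummarySketch],
    thresholds: ThresholdsSketch,
) -> dict:
    """Compare the current run against absolute and baseline-relative limits."""
    failures = []
    if current.pass_rate < thresholds.min_pass_rate:
        failures.append("pass_rate_below_minimum")
    if baseline is not None:
        if current.pass_rate < baseline.pass_rate:
            failures.append("pass_rate_regressed")
        token_limit = baseline.mean_tokens * (1.0 + thresholds.max_token_regression)
        if current.mean_tokens > token_limit:
            failures.append("token_regression")
    return {
        "gate_passed": not failures,
        "failures": failures,
        "current": asdict(current),
        "baseline": asdict(baseline) if baseline is not None else None,
    }


def render_gate_report(report: dict) -> str:
    """Serialize the report the way a gate.json file might be written."""
    return json.dumps(report, indent=2, sort_keys=True)
```

Keeping the gate free of I/O and randomness means the same summary and baseline always produce the same verdict, which is what makes it usable as a CI check.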

Guardrails

CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval/ free of NeMo-Platform imports.
All shared/* modules are pure re-export shims over their SDK homes (see runtimes/README.md shim→SDK table).
Also fixes a pre-existing SandboxSdk→SandboxSDK typo in test_docker_sandbox_runtime.py.
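
For readers unfamiliar with grep-style import gates like the one in the first guardrail, a check of that kind can be approximated as follows. This is a sketch, not the contents of test_import_hygiene.py: the forbidden prefix `nemo_platform` and the function name are assumptions, and the line-based regex only inspects the first module named on each import line.

```python
import re
from pathlib import Path
from typing import List, Sequence

FORBIDDEN_PREFIXES = ("nemo_platform",)  # hypothetical package name

# Matches "import x.y" and "from x.y import z", capturing the module path.
_IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)")


def find_forbidden_imports(
    root: Path, forbidden: Sequence[str] = FORBIDDEN_PREFIXES
) -> List[str]:
    """Return 'file:line: source' entries for every forbidden import under root."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            match = _IMPORT_RE.match(line)
            if match is None:
                continue
            module = match.group(1)
            if any(module == p or module.startswith(p + ".") for p in forbidden):
                hits.append(f"{path.relative_to(root)}:{lineno}: {line.strip()}")
    return hits
```

A pytest wrapper would then assert that the returned list is empty for the `agent_eval/` directory, so any new platform import fails CI with the offending file and line in the message.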

flowchart TB
    subgraph Inputs
        T["AgentEvalTask(+metrics)"]
    end

    T --> ORCH["AgentEvalOrchestrator"]

    subgraph ONLINE["Online path: run_tasks(target=runtime)"]
        ORCH -->|prepare_task hook| BUILD["environment_spec: plan_task_build → BuildPlan<br/>docker.py: build image"]
        ORCH --> RT["AgentAttemptRuntime<br/>(e.g. CliAgentDriver + CodingAgentSpec)"]
        RT --> LAY["layout.py: resolve_run_dir / prepare_run_layout"]
        RT --> ENV["AgentEnvironmentProvider.prepare()<br/>→ Handle.run(spec, role=agent)"]
        ENV --> EXEC["DockerEnvironmentHandle → docker.py"]
        EXEC --> VER["verify.py: run(role=verifier)<br/>collect_verifier_outcome → apply_verify_to_metadata"]
        VER --> ATT["attempts.py: resolve_attempt_status<br/>standard_evidence_descriptors → AgentEvalAttempt(+evidence)"]
    end

    subgraph OFFLINE["Offline path: score_attempts(attempts=...)"]
        SRC["AgentAttemptSource.load_attempt()"] --> ATT
    end

    ATT --> EVAL["AgentEvaluator.run()"]
    EVAL --> MET["Metrics score per candidate<br/>AgentPhaseSuccessMetric / EvidencePresenceMetric<br/>(read candidate.evidence + metadata)"]
    MET --> RES["AgentEvalRunResult<br/>(attempts, results, summary)"]

    RES --> GATE["gating.py: summarize_run<br/>(via AttemptMeasurements)<br/>evaluate_gate vs baseline"]
    GATE --> OUT["persist bundle + gate.json<br/>(pass-rate / tokens / runtime / provenance)"]

Deliberately deferred (documented)

Converging the profbench codex runtime onto the new driver, implementing Claude/Cursor inside the nmp-agentic-base Docker env, removing the agentic-use stubs, and rewiring runtime_for_backend — bespoke per agent and not verifiable without those CLIs/images.
Removing the re-export shims (pure re-exports; deleting them is churn with no functional gain).

Test plan

pytest tests/agentic-use/tests/test_agentic_runtimes.py packages/nemo_evaluator_sdk/tests/agent_eval/ → 107 passed
ty check clean on agent_eval; grep import-hygiene gate green
End-to-end CLI: run_agent_eval.py --task workspace-basic-cli-easy --backend workflow --skip-build → agent_ok: True, overall_score: 1.0, full persistence bundle + gate.json (gate_passed: True) written
CI green

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>

…agent drivers to agent-eval SDK

Extend nemo_evaluator_sdk.agent_eval from "evaluator + contracts" into a full agent-evaluation pipeline by adding the layers above and below AgentEvaluator.

Orchestration:
- orchestrator.py: AgentEvalOrchestrator ties AgentEvaluator + gating into one call. run_tasks(target=runtime) (online) and score_attempts(attempts=...) (offline). Backend-agnostic via injected extra_metrics + a prepare_task hook; it never introspects the runtime.
- types.py: AgentAttemptSource protocol, the offline counterpart to AgentAttemptRuntime (adapt a stored artifact into an AgentEvalAttempt).

Execution layer (dependency-gated, no core import):
- runtimes/environment.py: AgentEnvironmentProvider/Handle with a single run(spec, role) (agent/verifier); DockerEnvironmentProvider default, swappable.
- runtimes/environment_spec.py: declarative environment.yaml -> BuildPlan (Dockerfile escape hatch); runtimes/docker.py: stdlib subprocess Docker helpers.
- runtimes/coding_agent.py: CliAgentDriver (generic AgentAttemptRuntime for stdin-prompt CLIs) + CodingAgentSpec adapter seam; reference Claude/Cursor specs.
- runtimes/layout.py: RunLayout + resolve_run_dir (abs paths for mounts) + prepare_run_layout.
- runtimes/verify.py: VerifierOutcome + collect_verifier_outcome + apply_verify_to_metadata.

Attempt + scoring:
- attempts.py: resolve_attempt_status (ran-but-failed -> scorable "partial") + standard_evidence_descriptors (initial_state/trace/logs/final_state/verifier_logs).
- common_metrics.py: AgentPhaseSuccessMetric and EvidencePresenceMetric, a real metric-over-evidence that reads candidate.evidence.filesystem(...).

Results + gating:
- measurements.py: AttemptMeasurements, one typed projection of tokens/runtime/reward/provenance from attempt metadata.
- gating.py: summarize_run + evaluate_gate/GateThresholds/GateReport + write_gate_report + baseline loading (pass-rate, token regression, runtime tie-breaker, cross-commit provenance) -> gate.json.

A CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval free of external/platform imports. tests/agentic-use is rewired as a thin adapter over these modules via pure re-export shims. Also fixes a pre-existing SandboxSdk->SandboxSDK typo in test_docker_sandbox_runtime.py.

107 tests pass; ty and import-hygiene gate clean; e2e CLI run reaches agent_ok=True, overall_score=1.0, gate_passed=True.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>

Remove the compatibility shims under tests/agentic-use/runtimes/shared that re-exported promoted agent_eval SDK symbols, and import those generics directly from nemo_evaluator_sdk.agent_eval (docker, environment, environment_spec, gating, verify) at their use sites.

Consolidate the remaining NeMo-Platform-only glue into a single module, shared/platform.py:
- the run layout with the platform state_dir
- task_image_tag + platform DockerEnvironmentProvider
- the namespaced AgentPhaseSuccessMetric + VerifierRewardMetric
- agent-log/usage parsing and the shared container env
- attempt construction (live + result.json/ResultDirAttemptSource)
- the live VERIFY phase
- the agentic-use task loader

shared/ now holds only platform.py, config.py, and constants.py. Update orchestrator/workflow/aut runtimes, the package __init__ re-exports, the runtime tests, and README/COMPLIANCE docs accordingly.

107 tests pass; ruff clean.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>

SandyChapman · 2026-06-10T13:08:49Z

+    final_output_path: Path
+
+
+class CodingAgentSpec:


I don't think we should necessarily just consider Codex/Claude/Cursor "Coding" Agents. In the Agents Improving Agents reference, these get called "Generalized Agents" which I think is a good name for us. There's also a bit of an inconsistency in that we call these CliAgents below.

SandyChapman · 2026-06-10T13:12:34Z

+        )
+
+
+class ClaudeCodeSpec(CodingAgentSpec):


I'd likely split these specs out to their own modules.

SandyChapman · 2026-06-10T13:16:00Z


-class DockerEnvironmentHandle:
-    """Docker-backed environment handle bound to one task image."""
+class AbstractEnvironmentHandle:


Maybe inherit abc.ABC and mark the relevant methods as @abstractmethod? Alternatively, we could just inject a callable Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]] and avoid the inheritance.
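
For comparison, the two options in this comment might look like the sketch below. The type names `EnvRunSpec`, `EnvRole`, and `EnvCommandResult` come from the comment and diff, but their fields here are hypothetical, as are `EchoHandle`, `echo_runner`, and `run_agent_step`.

```python
import abc
from dataclasses import dataclass
from typing import Awaitable, Callable, Literal

EnvRole = Literal["agent", "verifier"]


@dataclass
class EnvRunSpec:
    command: str  # hypothetical field


@dataclass
class EnvCommandResult:
    exit_code: int  # hypothetical fields
    output: str


# Option 1: an abstract base class with an enforced abstract method.
class AbstractEnvironmentHandle(abc.ABC):
    @abc.abstractmethod
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
        """Execute the spec under the given role."""


class EchoHandle(AbstractEnvironmentHandle):
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
        return EnvCommandResult(0, f"{role}:{spec.command}")


# Option 2: inject a plain async callable, no inheritance required.
EnvRunner = Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]]


async def echo_runner(spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
    return EnvCommandResult(0, f"{role}:{spec.command}")


async def run_agent_step(runner: EnvRunner, spec: EnvRunSpec) -> EnvCommandResult:
    """Runtime logic that depends only on the callable's signature."""
    return await runner(spec, "agent")
```

The two are compatible: a bound method such as `EchoHandle().run` already satisfies `EnvRunner`, so the runtime could accept the callable while providers that need state keep using a class.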

SandyChapman · 2026-06-10T13:19:51Z


-from runtimes.shared.docker import docker_run
-from runtimes.shared.layout import task_image_tag
+EnvRole = Literal["agent", "verifier"]


Maybe not for this PR, but we should talk about env roles and how they differ. I'm also interested in whether we think the verifier needs to run in an isolated env.

SandyChapman · 2026-06-10T13:20:34Z

+``yaml`` is imported lazily so that importing this module costs nothing for
+callers that never load a spec.


Is yaml a heavy enough import that this matters?

SandyChapman · 2026-06-10T13:22:09Z

+        return self.metric_type
+
+    def output_spec(self) -> list[MetricOutputSpec]:
+        return [MetricOutputSpec.continuous_score("agent_phase_success")]


Maybe [MetricOutputSpec.boolean("agent_phase_success")] instead? We can then update the compute scores to:

    agent_ok = bool(input.candidate.metadata.get("agent_ok"))
    return MetricResult(outputs=[MetricOutput(name="agent_phase_success", value=agent_ok)])

Same point with EvidencePresenceMetric below.

SandyChapman · 2026-06-10T13:24:00Z

+            except (KeyError, ValueError):
+                score = 0.0


Maybe add a log here so we can surface more specific details on why there's a 0 result.
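
One way to act on this, sketched with hypothetical names (the logger name, `score_from_metadata`, and the `reward` key are assumptions, not the code under review):

```python
import logging

logger = logging.getLogger("agent_eval.metrics")  # hypothetical logger name


def score_from_metadata(metadata: dict, key: str = "reward") -> float:
    """Parse a numeric score from attempt metadata, logging why it fell back to 0."""
    try:
        return float(metadata[key])
    except (KeyError, ValueError) as exc:
        # Same fallback as before, but the reason is now visible in the logs.
        logger.warning(
            "Scoring 0.0: could not read %r from attempt metadata (%s: %s)",
            key,
            type(exc).__name__,
            exc,
        )
        return 0.0
```

Note that a value of `None` would raise `TypeError`, which this except clause (like the original) does not catch; whether that should also score 0.0 is a separate decision.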

SandyChapman · 2026-06-10T13:25:59Z

+    baseline_summary_path: Path | None = None
+
+
+class AgentEvalOrchestrator:


Another thing we should discuss on naming I think. We haven't used Orchestrator as a type in the past so we might want to look to our prior art for consistency. I think in our model execution we call these pipelines.

SandyChapman · 2026-06-10T13:32:25Z

+    :class:`AgentEvalAttempt` so it can be (re)scored through ``AgentEvaluator``.
+    """
+
+    def load_attempt(self, source: str | Path, *, task: AgentEvalTask) -> AgentEvalAttempt: ...


I'm thinking we may want to make these a symmetric serde instead of it being one-sided. Something like:

    @runtime_checkable
    class AgentAttemptSerde(Protocol):  # or maybe AgentAttemptCodec?
        def read(self) -> AgentEvalAttempt: ...
        def write(self, attempt: AgentEvalAttempt) -> None: ...

Arguments should be passed to the concrete type's init:

    class ResultDirAttemptSource:
        def __init__(self, path: str | Path, *, task: AgentEvalTask):
            self._path = path
            self._task = task

        def read(self) -> AgentEvalAttempt:
            # open files at self._path and parse to AgentEvalAttempt
            ...

        def write(self, attempt: AgentEvalAttempt) -> None:
            # write outputs to directory self._path
            ...
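
A runnable version of this idea, under stated assumptions: `AttemptSketch` is a hypothetical stand-in for `AgentEvalAttempt`, the codec takes a `task_id` instead of a full task, and the `result.json` layout (a `status` string plus a `metadata` object) is invented for the example.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, Union, runtime_checkable


@dataclass
class AttemptSketch:
    """Hypothetical stand-in for AgentEvalAttempt."""
    task_id: str
    status: str
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class AttemptCodec(Protocol):
    def read(self) -> AttemptSketch: ...
    def write(self, attempt: AttemptSketch) -> None: ...


class ResultDirCodec:
    """Reads and writes one attempt as <path>/result.json (assumed layout)."""

    def __init__(self, path: Union[str, Path], *, task_id: str) -> None:
        self._file = Path(path) / "result.json"
        self._task_id = task_id

    def read(self) -> AttemptSketch:
        data = json.loads(self._file.read_text(encoding="utf-8"))
        return AttemptSketch(self._task_id, data["status"], data.get("metadata", {}))

    def write(self, attempt: AttemptSketch) -> None:
        self._file.parent.mkdir(parents=True, exist_ok=True)
        payload = {"status": attempt.status, "metadata": attempt.metadata}
        self._file.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

With both directions on one type, a round-trip test (`write` then `read` returns an equal attempt) becomes the natural contract check for any new storage format.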

arpitsardhana requested review from a team as code owners June 10, 2026 05:44

arpitsardhana self-assigned this Jun 10, 2026

arpitsardhana force-pushed the aalgo-258-runner-sdk/arpsingh branch from 19d307b to a3e08db Compare June 10, 2026 05:55

arpitsardhana added 3 commits June 9, 2026 23:17

fix layout

b0a68bd

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>

arpitsardhana force-pushed the aalgo-258-runner-sdk/arpsingh branch from a3e08db to afb7dc8 Compare June 10, 2026 06:20

arpitsardhana requested review from SandyChapman and ngoncharenko June 10, 2026 06:25

SandyChapman reviewed Jun 10, 2026

SandyChapman assigned SandyChapman and arpitsardhana and unassigned arpitsardhana and SandyChapman Jun 11, 2026


arpitsardhana wants to merge 3 commits into profbench-mvp-2 from aalgo-258-runner-sdk/arpsingh
