feat(evaluator): promote generic agentic-use runtimes into the Agent-Eval SDK #256
arpitsardhana wants to merge 3 commits into
19d307b to a3e08db
Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
…agent drivers to agent-eval SDK

Extend nemo_evaluator_sdk.agent_eval from "evaluator + contracts" into a full agent-evaluation pipeline by adding the layers above and below AgentEvaluator.

Orchestration:
- orchestrator.py: AgentEvalOrchestrator ties AgentEvaluator + gating into one call. run_tasks(target=runtime) (online) and score_attempts(attempts=...) (offline). Backend-agnostic via injected extra_metrics + a prepare_task hook; it never introspects the runtime.
- types.py: AgentAttemptSource protocol, the offline counterpart to AgentAttemptRuntime (adapt a stored artifact into an AgentEvalAttempt).

Execution layer (dependency-gated, no core import):
- runtimes/environment.py: AgentEnvironmentProvider/Handle with a single run(spec, role) (agent/verifier); DockerEnvironmentProvider default, swappable.
- runtimes/environment_spec.py: declarative environment.yaml -> BuildPlan (Dockerfile escape hatch); runtimes/docker.py: stdlib subprocess Docker helpers.
- runtimes/coding_agent.py: CliAgentDriver (generic AgentAttemptRuntime for stdin-prompt CLIs) + CodingAgentSpec adapter seam; reference Claude/Cursor specs.
- runtimes/layout.py: RunLayout + resolve_run_dir (abs paths for mounts) + prepare_run_layout.
- runtimes/verify.py: VerifierOutcome + collect_verifier_outcome + apply_verify_to_metadata.

Attempt + scoring:
- attempts.py: resolve_attempt_status (ran-but-failed -> scorable "partial") + standard_evidence_descriptors (initial_state/trace/logs/final_state/verifier_logs).
- common_metrics.py: AgentPhaseSuccessMetric and EvidencePresenceMetric, a real metric-over-evidence that reads candidate.evidence.filesystem(...).

Results + gating:
- measurements.py: AttemptMeasurements, one typed projection of tokens/runtime/reward/provenance from attempt metadata.
- gating.py: summarize_run + evaluate_gate/GateThresholds/GateReport + write_gate_report + baseline loading (pass-rate, token regression, runtime tie-breaker, cross-commit provenance) -> gate.json.

A CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval free of external/platform imports. tests/agentic-use is rewired as a thin adapter over these modules via pure re-export shims. Also fixes a pre-existing SandboxSdk->SandboxSDK typo in test_docker_sandbox_runtime.py.

107 tests pass; ty and import-hygiene gate clean; e2e CLI run reaches agent_ok=True, overall_score=1.0, gate_passed=True.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
Remove the compatibility shims under tests/agentic-use/runtimes/shared that re-exported promoted agent_eval SDK symbols, and import those generics directly from nemo_evaluator_sdk.agent_eval (docker, environment, environment_spec, gating, verify) at their use sites.

Consolidate the remaining NeMo-Platform-only glue into a single module, shared/platform.py: the run layout with the platform state_dir, task_image_tag + platform DockerEnvironmentProvider, the namespaced AgentPhaseSuccessMetric + VerifierRewardMetric, agent-log/usage parsing and the shared container env, attempt construction (live + result.json/ResultDirAttemptSource), the live VERIFY phase, and the agentic-use task loader. shared/ now holds only platform.py, config.py, and constants.py.

Update orchestrator/workflow/aut runtimes, the package __init__ re-exports, the runtime tests, and README/COMPLIANCE docs accordingly.

107 tests pass; ruff clean.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
a3e08db to afb7dc8
| final_output_path: Path |
| class CodingAgentSpec: |
I don't think we should necessarily just consider Codex/Claude/Cursor "Coding" Agents. In the Agents Improving Agents reference, these get called "Generalized Agents" which I think is a good name for us. There's also a bit of an inconsistency in that we call these CliAgents below.
| ) |
| class ClaudeCodeSpec(CodingAgentSpec): |
I'd likely split these specs out to their own modules.
| class DockerEnvironmentHandle: |
| """Docker-backed environment handle bound to one task image.""" |
| class AbstractEnvironmentHandle: |
Maybe inherit abc.ABC and mark the relevant methods as @abstractmethod? Alternatively, we could just inject a callable Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]] and avoid the inheritance.
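For comparison, the two options raised in this comment could look roughly like the sketch below. The type names EnvRunSpec, EnvRole, and EnvCommandResult come from the comment, but their fields here are hypothetical stand-ins, and EchoHandle, echo_runner, and run_agent_phase are invented purely for illustration.

```python
import abc
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Literal, Tuple

EnvRole = Literal["agent", "verifier"]


@dataclass(frozen=True)
class EnvRunSpec:  # hypothetical stand-in; the real fields may differ
    command: Tuple[str, ...]


@dataclass(frozen=True)
class EnvCommandResult:  # hypothetical stand-in
    exit_code: int
    stdout: str


# Option 1: an abstract base class. A subclass that does not override run()
# cannot be instantiated (TypeError at construction time).
class AbstractEnvironmentHandle(abc.ABC):
    @abc.abstractmethod
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult: ...


class EchoHandle(AbstractEnvironmentHandle):
    async def run(self, spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
        return EnvCommandResult(0, f"{role}: {' '.join(spec.command)}")


# Option 2: inject a plain async callable and skip inheritance entirely.
EnvRunner = Callable[[EnvRunSpec, EnvRole], Awaitable[EnvCommandResult]]


async def echo_runner(spec: EnvRunSpec, role: EnvRole) -> EnvCommandResult:
    return EnvCommandResult(0, f"{role}: {' '.join(spec.command)}")


async def run_agent_phase(runner: EnvRunner, spec: EnvRunSpec) -> EnvCommandResult:
    # The caller only needs something awaitable with the right signature.
    return await runner(spec, "agent")
```

The callable form keeps test doubles trivial (any async function works), while the ABC form gives a named type to document and extend if the handle later grows more methods.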
| from runtimes.shared.docker import docker_run |
| from runtimes.shared.layout import task_image_tag |
| EnvRole = Literal["agent", "verifier"] |
Maybe not for this PR, but we should talk about env roles and how they differ. I'm also interested in whether we think the verifier needs to run in an isolated env.
| ``yaml`` is imported lazily so that importing this module costs nothing for | ||
| callers that never load a spec. |
Is yaml a heavy enough import that this matters?
| return self.metric_type |
| def output_spec(self) -> list[MetricOutputSpec]: |
| return [MetricOutputSpec.continuous_score("agent_phase_success")] |
Maybe [MetricOutputSpec.boolean("agent_phase_success")] instead? We can then update the score computation to:
    agent_ok = bool(input.candidate.metadata.get("agent_ok"))
    return MetricResult(outputs=[MetricOutput(name="agent_phase_success", value=agent_ok)])
Same point with EvidencePresenceMetric below.
| except (KeyError, ValueError): | ||
| score = 0.0 |
Maybe add a log here so we can surface more specific details on why there's a 0 result.
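A minimal sketch of what that logging could look like. The helper name parse_reward and the "reward" metadata key are hypothetical; the actual try body in the PR is not shown in this hunk.

```python
import logging

logger = logging.getLogger(__name__)


def parse_reward(metadata: dict) -> float:
    """Hypothetical helper: read a reward from metadata, falling back to 0.0."""
    try:
        return float(metadata["reward"])
    except (KeyError, ValueError) as exc:
        # Record which failure mode produced the zero instead of hiding it.
        logger.warning(
            "reward unavailable (%s: %s); scoring 0.0", type(exc).__name__, exc
        )
        return 0.0
```

Logging the exception type separates "key missing" from "value not numeric", which are otherwise indistinguishable in the final score.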
| baseline_summary_path: Path | None = None |
| class AgentEvalOrchestrator: |
Another naming point we should discuss: we haven't used Orchestrator as a type in the past, so we might want to look to our prior art for consistency. I think in our model execution we call these pipelines.
| :class:`AgentEvalAttempt` so it can be (re)scored through ``AgentEvaluator``. |
| """ |
| def load_attempt(self, source: str | Path, *, task: AgentEvalTask) -> AgentEvalAttempt: ... |
I'm thinking we may want to make this a symmetric serde instead of it being one-sided. Something like:
    @runtime_checkable
    class AgentAttemptSerde(Protocol):  # or maybe AgentAttemptCodec?
        def read(self) -> AgentEvalAttempt: ...
        def write(self, attempt: AgentEvalAttempt) -> None: ...
Arguments should be passed to the concrete type's __init__:
    class ResultDirAttemptSource:
        def __init__(self, path: str | Path, *, task: AgentEvalTask):
            self._path = Path(path)
            self._task = task

        def read(self) -> AgentEvalAttempt:
            # open files at self._path and parse them into an AgentEvalAttempt
            ...

        def write(self, attempt: AgentEvalAttempt) -> None:
            # write outputs to the directory self._path
            ...
Summary
This MR adds the layers above and below the evaluator so you can run a full agent-evaluation pipeline end-to-end without writing the glue yourself.
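AgentEvalOrchestrator is a thin driver that ties AgentEvaluator + gating into one call. Two entry points: run_tasks(target=runtime) for the online path and score_attempts(attempts=...) for the offline path.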
It stays backend-agnostic via two seams: extra_metrics (metrics to append, e.g. a reward metric) and a prepare_task hook (e.g. "build the image first"). It never introspects the runtime.
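To make the two seams concrete, here is a toy sketch of the shape described above. MiniOrchestrator, Task, and Attempt are stand-ins invented for illustration; only the names run_tasks, score_attempts, extra_metrics, and prepare_task come from this PR, and the real AgentEvalOrchestrator signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Task:  # stand-in for AgentEvalTask
    name: str


@dataclass
class Attempt:  # stand-in for AgentEvalAttempt
    task: Task
    metadata: dict = field(default_factory=dict)


Metric = Callable[[Attempt], float]


@dataclass
class MiniOrchestrator:
    """Illustrative only; not the SDK's AgentEvalOrchestrator."""

    metrics: Sequence[Metric]
    extra_metrics: Sequence[Metric] = ()
    prepare_task: Optional[Callable[[Task], None]] = None

    def run_tasks(
        self, tasks: Sequence[Task], *, target: Callable[[Task], Attempt]
    ) -> List[Dict[str, float]]:
        attempts = []
        for task in tasks:
            if self.prepare_task is not None:
                self.prepare_task(task)  # e.g. "build the image first"
            # The runtime is only called, never inspected.
            attempts.append(target(task))
        return self.score_attempts(attempts=attempts)

    def score_attempts(self, *, attempts: Sequence[Attempt]) -> List[Dict[str, float]]:
        # Injected extra metrics are appended to the base metrics.
        all_metrics = [*self.metrics, *self.extra_metrics]
        return [{m.__name__: m(a) for m in all_metrics} for a in attempts]
```

Because the online path funnels into score_attempts, stored attempts and freshly executed ones are scored by exactly the same code.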
Offline attempt sourcing — AgentAttemptSource (types.py)
A new protocol: the offline counterpart to AgentAttemptRuntime. Instead of executing an agent, an implementation adapts a stored artifact (a run dir/file) into an AgentEvalAttempt, so prior runs can be re-scored.
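A rough sketch of an offline source follows. The load_attempt signature mirrors the one shown in the review hunk; the dataclasses are stand-ins for the SDK types, and JsonResultSource plus the result.json schema ("status" and "metadata" keys) are assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, Union, runtime_checkable


@dataclass
class AgentEvalTask:  # stand-in for the SDK type
    name: str


@dataclass
class AgentEvalAttempt:  # stand-in for the SDK type
    task: AgentEvalTask
    status: str
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class AgentAttemptSource(Protocol):
    def load_attempt(
        self, source: Union[str, Path], *, task: AgentEvalTask
    ) -> AgentEvalAttempt: ...


class JsonResultSource:
    """Hypothetical source that adapts <run_dir>/result.json into an attempt."""

    def load_attempt(
        self, source: Union[str, Path], *, task: AgentEvalTask
    ) -> AgentEvalAttempt:
        payload = json.loads((Path(source) / "result.json").read_text(encoding="utf-8"))
        return AgentEvalAttempt(
            task=task,
            status=payload.get("status", "failed"),  # assumed key and default
            metadata=payload.get("metadata", {}),  # assumed key
        )
```

Nothing is executed here: the source only reads what a previous run left on disk, which is what makes re-scoring cheap.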
Pluggable execution environment — runtimes/environment.py + environment_spec.py + docker.py
AgentEnvironmentProvider → AgentEnvironmentHandle with a single run(spec, role) (roles: agent/verifier). DockerEnvironmentProvider is the default; swap in local/remote without touching runtime logic.
EnvironmentSpec/load_environment_spec/plan_task_build — declarative environment.yaml → BuildPlan (with a Dockerfile escape hatch).
docker.py — stdlib subprocess Docker helpers (docker_run, build_dockerfile, docker_image_exists).
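As an illustration of what a stdlib-only Docker helper can look like, the sketch below builds the argument list for a `docker run` call. The function name build_docker_run_argv and its parameters are hypothetical; the PR's docker_run signature is not shown here, so this is not a reproduction of it.

```python
from pathlib import Path
from typing import List, Mapping, Optional, Sequence


def build_docker_run_argv(
    image: str,
    command: Sequence[str],
    *,
    mounts: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: assemble argv for `docker run` without executing it."""
    argv = ["docker", "run", "--rm"]
    for host_path, container_path in (mounts or {}).items():
        # Bind mounts need an absolute host path; a relative value passed to
        # -v would be interpreted as a named volume instead.
        argv += ["-v", f"{Path(host_path).resolve()}:{container_path}"]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    return [*argv, image, *command]
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True, check=False)`; keeping argv construction separate from execution makes it testable on machines without Docker.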
Coding-agent CLI driver seam — runtimes/coding_agent.py
CliAgentDriver is a generic AgentAttemptRuntime for any CLI that takes a prompt on stdin and writes a final answer file; it captures workspace/stdout/stderr/output as evidence. CodingAgentSpec is the per-agent adapter (command builder + trajectory→evidence). Ships reference ClaudeCodeSpec/CursorAgentSpec.
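The stdin-prompt contract described above can be sketched with the standard library alone. CliRunEvidence and run_cli_agent are hypothetical names, not the PR's CliAgentDriver API; the sketch only shows the mechanics of piping a prompt in and collecting stdout, stderr, and the final answer file.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class CliRunEvidence:  # hypothetical container for captured evidence
    exit_code: int
    stdout: str
    stderr: str
    final_output: Optional[str]


def run_cli_agent(
    argv: Sequence[str],
    prompt: str,
    workspace: Path,
    final_output_path: Path,
    timeout_s: float = 60.0,
) -> CliRunEvidence:
    # Feed the prompt on stdin and run inside the workspace directory.
    proc = subprocess.run(
        list(argv),
        input=prompt,
        capture_output=True,
        text=True,
        cwd=workspace,
        timeout=timeout_s,
        check=False,  # a non-zero exit is evidence, not an exception
    )
    final = (
        final_output_path.read_text(encoding="utf-8")
        if final_output_path.exists()
        else None
    )
    return CliRunEvidence(proc.returncode, proc.stdout, proc.stderr, final)
```

A per-agent spec would then only need to supply argv and interpret the captured output, which is the adapter role the description assigns to CodingAgentSpec.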
Attempt/evidence shaping — attempts.py + runtimes/layout.py
resolve_attempt_status(agent_ok) → maps a ran-but-failed agent to partial (still scorable) vs failed.
standard_evidence_descriptors(...) → the canonical evidence map (initial_state/trace/logs/final_state/verifier_logs).
RunLayout + resolve_run_dir (abs-path for mounts) + prepare_run_layout — the on-disk run scaffold.
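A hedged sketch of how these pieces might fit together. The function names and evidence names are taken from the list above, but everything else is assumed: the one-directory-per-evidence-kind layout, the extra `ran` flag, and the "completed" status name are illustrative guesses, not the PR's actual behavior.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Union

EVIDENCE_NAMES = ("initial_state", "trace", "logs", "final_state", "verifier_logs")


@dataclass(frozen=True)
class RunLayout:  # stand-in; the real class may carry more fields
    run_dir: Path

    def evidence_dir(self, name: str) -> Path:
        return self.run_dir / name


def resolve_run_dir(run_dir: Union[str, Path]) -> Path:
    # Absolute paths are required when the directory is bind-mounted.
    return Path(run_dir).expanduser().resolve()


def prepare_run_layout(run_dir: Union[str, Path]) -> RunLayout:
    layout = RunLayout(resolve_run_dir(run_dir))
    for name in EVIDENCE_NAMES:
        # Assumption: one directory per evidence kind.
        layout.evidence_dir(name).mkdir(parents=True, exist_ok=True)
    return layout


def resolve_attempt_status(agent_ok: bool, *, ran: bool = True) -> str:
    # "partial" and "failed" come from the PR text; "completed" and the
    # `ran` flag are assumptions added for this sketch.
    if agent_ok:
        return "completed"
    return "partial" if ran else "failed"
```

The point of the "partial" status is that an agent which ran but did not succeed still leaves evidence worth scoring, whereas an attempt that never ran has nothing to score.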
Verifier mechanic — runtimes/verify.py
VerifierOutcome + collect_verifier_outcome (reads reward.txt/stdout from a verifier log dir) + apply_verify_to_metadata (stamps reward/pass onto an attempt so a metric can score it).
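A minimal sketch of the verifier mechanic, assuming reward.txt holds a single float. The pass threshold, the handling of a missing or malformed file, and the metadata keys "reward" and "verifier_passed" are assumptions; the stdout fallback mentioned above is omitted.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class VerifierOutcome:  # stand-in with assumed fields
    reward: Optional[float]
    passed: bool


def collect_verifier_outcome(log_dir: Path, pass_threshold: float = 1.0) -> VerifierOutcome:
    reward_file = Path(log_dir) / "reward.txt"
    try:
        reward = float(reward_file.read_text(encoding="utf-8").strip())
    except (FileNotFoundError, ValueError):
        # No usable reward: treat as not passed rather than raising.
        return VerifierOutcome(reward=None, passed=False)
    return VerifierOutcome(reward=reward, passed=reward >= pass_threshold)


def apply_verify_to_metadata(metadata: dict, outcome: VerifierOutcome) -> dict:
    # Return a new dict so the caller's metadata is not mutated in place.
    return {**metadata, "reward": outcome.reward, "verifier_passed": outcome.passed}
```

Stamping the outcome onto metadata keeps the metric itself trivial: it reads two keys and never touches the verifier's log directory.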
Metrics — common_metrics.py
AgentPhaseSuccessMetric — scores from attempt metadata.
EvidencePresenceMetric — a true metric-over-evidence: reads candidate.evidence.filesystem(...) rather than a stamped reward.
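The difference between the two metrics can be sketched as follows. Evidence and Candidate are stand-ins, and the shape of evidence.filesystem(name) (returning a directory path) is an assumption; the fractional score is also illustrative, since the real metric may be binary.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Sequence


@dataclass
class Evidence:  # stand-in; assumes filesystem(name) returns a directory
    roots: Dict[str, Path]

    def filesystem(self, name: str) -> Path:
        return self.roots[name]


@dataclass
class Candidate:  # stand-in for the metric input's candidate
    evidence: Evidence
    metadata: dict = field(default_factory=dict)


def agent_phase_success(candidate: Candidate) -> float:
    # Metadata-based: trusts a value stamped onto the attempt earlier.
    return 1.0 if candidate.metadata.get("agent_ok") else 0.0


def evidence_presence(candidate: Candidate, name: str, required_files: Sequence[str]) -> float:
    # Evidence-based: inspects what the run actually left on disk.
    try:
        root = candidate.evidence.filesystem(name)
    except KeyError:
        return 0.0
    if not required_files:
        return 0.0
    present = sum((root / rel).is_file() for rel in required_files)
    return present / len(required_files)
```

A metric that reads evidence directly cannot be satisfied by a runtime that merely stamps a favorable value into metadata, which is the distinction the description draws.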
Typed measurements + deterministic gate — measurements.py + gating.py
AttemptMeasurements — typed projection of tokens/runtime/reward/provenance from attempt metadata (one place that parses those keys).
gating.py — summarize_run + evaluate_gate/GateThresholds/GateReport + write_gate_report + baseline loading: pass-rate / token-regression / runtime tie-breaker / cross-commit provenance checks → gate.json.
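A simplified sketch of a deterministic gate. It covers only the pass-rate and token-regression checks; the runtime tie-breaker and provenance checks are left out, and all field names and default thresholds are hypothetical rather than the PR's.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class RunSummary:  # assumed fields
    pass_rate: float
    mean_tokens: float


@dataclass(frozen=True)
class GateThresholds:  # assumed fields and defaults
    min_pass_rate: float = 0.8
    max_token_regression: float = 0.10  # allowed fractional increase over baseline


@dataclass(frozen=True)
class GateReport:
    gate_passed: bool
    reasons: List[str]


def evaluate_gate(
    summary: RunSummary,
    thresholds: GateThresholds,
    baseline: Optional[RunSummary] = None,
) -> GateReport:
    reasons = []
    if summary.pass_rate < thresholds.min_pass_rate:
        reasons.append(
            f"pass_rate {summary.pass_rate:.2f} < {thresholds.min_pass_rate:.2f}"
        )
    if baseline is not None and baseline.mean_tokens > 0:
        regression = (summary.mean_tokens - baseline.mean_tokens) / baseline.mean_tokens
        if regression > thresholds.max_token_regression:
            reasons.append(
                f"token regression {regression:.1%} > {thresholds.max_token_regression:.1%}"
            )
    # The gate passes only when no check produced a failure reason.
    return GateReport(gate_passed=not reasons, reasons=reasons)


def write_gate_report(report: GateReport, out_dir: Path) -> Path:
    path = Path(out_dir) / "gate.json"
    # sort_keys keeps the file byte-stable across runs.
    path.write_text(json.dumps(asdict(report), indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Collecting every failure reason instead of stopping at the first one makes gate.json useful as a diagnostic, not just a pass/fail flag.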
Guardrails
A CI grep gate (tests/agent_eval/test_import_hygiene.py) keeps agent_eval/ free of NeMo-Platform imports.
shared/* modules are pure re-export shims over their SDK homes (see runtimes/README.md shim→SDK table).
Fixes a pre-existing SandboxSdk→SandboxSDK typo in test_docker_sandbox_runtime.py.
flowchart TB
  subgraph Inputs
    T["AgentEvalTask(+metrics)"]
  end
  T --> ORCH["AgentEvalOrchestrator"]
  subgraph ONLINE["Online path: run_tasks(target=runtime)"]
    ORCH -->|prepare_task hook| BUILD["environment_spec: plan_task_build → BuildPlan<br/>docker.py: build image"]
    ORCH --> RT["AgentAttemptRuntime<br/>(e.g. CliAgentDriver + CodingAgentSpec)"]
    RT --> LAY["layout.py: resolve_run_dir / prepare_run_layout"]
    RT --> ENV["AgentEnvironmentProvider.prepare()<br/>→ Handle.run(spec, role=agent)"]
    ENV --> EXEC["DockerEnvironmentHandle → docker.py"]
    EXEC --> VER["verify.py: run(role=verifier)<br/>collect_verifier_outcome → apply_verify_to_metadata"]
    VER --> ATT["attempts.py: resolve_attempt_status<br/>standard_evidence_descriptors → AgentEvalAttempt(+evidence)"]
  end
  subgraph OFFLINE["Offline path: score_attempts(attempts=...)"]
    SRC["AgentAttemptSource.load_attempt()"] --> ATT
  end
  ATT --> EVAL["AgentEvaluator.run()"]
  EVAL --> MET["Metrics score per candidate<br/>AgentPhaseSuccessMetric / EvidencePresenceMetric<br/>(read candidate.evidence + metadata)"]
  MET --> RES["AgentEvalRunResult<br/>(attempts, results, summary)"]
  RES --> GATE["gating.py: summarize_run<br/>(via AttemptMeasurements)<br/>evaluate_gate vs baseline"]
  GATE --> OUT["persist bundle + gate.json<br/>(pass-rate / tokens / runtime / provenance)"]
Deliberately deferred (documented)
nmp-agentic-base Docker env, removing the agentic-use stubs, and rewiring runtime_for_backend: bespoke per agent and not verifiable without those CLIs/images.
Test plan
pytest tests/agentic-use/tests/test_agentic_runtimes.py packages/nemo_evaluator_sdk/tests/agent_eval/ → 107 passed
ty check clean on agent_eval; grep import-hygiene gate green
run_agent_eval.py --task workspace-basic-cli-easy --backend workflow --skip-build → agent_ok: True, overall_score: 1.0, full persistence bundle + gate.json (gate_passed: True) written
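For readers unfamiliar with import-hygiene gates, the idea can be sketched as below. This is not the PR's test: the PR describes a grep-based gate, whereas this sketch swaps in the ast module so that comments and strings are ignored, and the forbidden prefix "nemo_platform" is a hypothetical placeholder.

```python
import ast
import tempfile
from pathlib import Path
from typing import List, Sequence, Union

FORBIDDEN_PREFIXES = ("nemo_platform",)  # hypothetical module prefix


def forbidden_imports(
    package_dir: Union[str, Path],
    forbidden: Sequence[str] = FORBIDDEN_PREFIXES,
) -> List[str]:
    """Return 'file:line: module' for every import of a forbidden package."""
    hits = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or ""]  # module is None for "from . import x"
            else:
                continue
            for module in modules:
                # Match the package itself or any of its submodules.
                if any(module == p or module.startswith(p + ".") for p in forbidden):
                    hits.append(f"{path.name}:{node.lineno}: {module}")
    return hits
```

In a test, asserting that the returned list is empty for the agent_eval package directory gives the same pass/fail signal as the grep gate, with the offending file and line in the failure message.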