Add experimental Harbor integration for GRPO environment training by adithya-s-k · Pull Request #6018 · huggingface/trl

adithya-s-k · 2026-06-11T22:35:43Z

Adds an experimental integration for training on Harbor agentic task suites with GRPOTrainer via environment_factory. It lives at trl.experimental.harbor, is gated behind a new trl[harbor] extra, and is lazy-imported so non-users pay nothing.

It mirrors the structure of the OpenReward integration (#5752, #5729, #5696): a single HarborSpec maps one task suite to the three trainer slots —

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.harbor import HarborSpec

spec = HarborSpec("AdithyaSK/data_agent_rl_environment_train", agent="bash", num_tasks=64)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(num_generations=8, max_steps=50, max_tool_calling_iterations=25),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()

HarborEnv wraps a Harbor sandbox + verifier. TRL drives the rollout loop: it calls the env's tool methods during generation and reads env.reward after. The base agent (harness) is pluggable via agent= — a built-in name, a HarborEnv subclass, or an import/file path.
Built-in bash harness, plus jupyter and terminal_notes example harnesses under examples/scripts/harbor/harnesses/ (folder-per-harness, each with a README listing its tools).
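
As a rough illustration of the contract described above (TRL calls the env's public tool methods during generation, then reads `env.reward`), here is a minimal stand-in. `ToySandboxEnv`, its `bash` tool, and the in-memory "filesystem" are hypothetical and do not use Harbor; a real harness would subclass `HarborEnv` and exec into an actual sandbox.

```python
class ToySandboxEnv:
    """Hypothetical stand-in for a harness: public methods act as tools, `reward` is read after the rollout."""

    def __init__(self):
        self._files = None  # no "sandbox" exists until reset() is called

    def reset(self, task_dir: str = "demo-task") -> str:
        # A real env would start the task's sandbox here; this toy fakes a tiny filesystem.
        self._files = {}
        return f"[{task_dir}] Write the text 'done' to result.txt."

    def bash(self, command: str) -> str:
        """Toy tool: only understands `echo <word> > <file>` and `cat <file>`."""
        parts = command.split()
        if len(parts) == 4 and parts[0] == "echo" and parts[2] == ">":
            self._files[parts[3]] = parts[1]
            return ""
        if len(parts) == 2 and parts[0] == "cat":
            return self._files.get(parts[1], f"cat: {parts[1]}: No such file")
        return f"unsupported command: {command}"

    @property
    def reward(self) -> float:
        # Stand-in "verifier": 1.0 only if the expected file content exists.
        if self._files is None:
            return 0.0
        return 1.0 if self._files.get("result.txt") == "done" else 0.0


env = ToySandboxEnv()
instruction = env.reset()
env.bash("echo done > result.txt")  # a trainer would issue this as a tool call
final_reward = env.reward
```

The point of the sketch is only the shape: one `reset` that returns the instruction, public methods that act on per-rollout state, and a reward that is computed from that state afterwards.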

External agents only (for now)

Harbor supports external agents (run outside the sandbox, drive the loop via exec) and installed agents (installed into the image, run headless inside the container, trajectory parsed after). Only the external pattern is supported, because RL needs the trainer to drive generation turn-by-turn and capture the policy's tokens/log-probs + env mask — which an opaque in-container agent can't expose. A HarborEnv is therefore an external agent: tool methods exec into the sandbox, but the loop and the model under training stay in TRL. (Documented in docs/source/harbor.md.)
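
To make the external-agent argument concrete, the sketch below shows a trainer-side loop that alternates policy generation with environment observations and records which tokens came from which side. Everything here is illustrative: `run_external_rollout`, the integer "tokens", and the mask convention (1 for policy tokens, 0 for environment tokens) are assumptions of this example, not TRL's or Harbor's actual code.

```python
from typing import Callable


def run_external_rollout(
    policy: Callable[[list[int]], list[int]],
    env_step: Callable[[list[int]], list[int]],
    prompt_ids: list[int],
    max_turns: int = 3,
):
    """Trainer-side loop: the policy generates, the environment answers, a mask records who wrote each token."""
    context = list(prompt_ids)
    completion_ids, env_mask = [], []
    for _ in range(max_turns):
        action = policy(context)             # tokens sampled from the model under training
        completion_ids += action
        env_mask += [1] * len(action)        # 1 = policy token (would count toward the loss)
        observation = env_step(action)       # e.g. tool output coming back from a sandbox
        if not observation:                  # in this toy, an empty observation ends the episode
            break
        completion_ids += observation
        env_mask += [0] * len(observation)   # 0 = environment token (would be masked out)
        context = prompt_ids + completion_ids
    return completion_ids, env_mask


# Toy policy and environment operating on integer "tokens".
toy_policy = lambda ctx: [len(ctx), len(ctx) + 1]
toy_env = lambda action: [99] if action[0] < 6 else []

ids, mask = run_external_rollout(toy_policy, toy_env, prompt_ids=[1, 2, 3])
```

An agent that runs headless inside the container would hand back only a finished trajectory, so neither the per-token mask nor the sampling step in this loop would be available to the trainer.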

Notable changes

pyproject.toml: new harbor extra (harbor>=0.13.0; python_version >= '3.12'); relaxes the vLLM cap vllm>=0.12.0,<=0.19.0 → vllm>=0.22.0. The old cap pins transformers<5, which breaks environment_factory (needs transformers>=5.2). This change is separable from the rest of the PR if you'd prefer it split out.
E2B COPY workaround: E2B's from_dockerfile build honors RUN but silently drops files COPY'd from the build context, breaking task healthchecks that run those files (e.g. a data-pull hook). HarborEnv replicates the Dockerfile's COPY directives at runtime (upload as the sandbox user, mv into place as root).
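
The first step of such a workaround is finding out which files the Dockerfile expected to be copied. The following is a minimal sketch of that parsing step only, not the code in `HarborEnv`: `parse_copy_directives` is a hypothetical helper, and it assumes single-line, shell-form `COPY` instructions (no line continuations, JSON form, or heredocs).

```python
import shlex


def parse_copy_directives(dockerfile_text: str) -> list[tuple[list[str], str]]:
    """Return (sources, destination) for each shell-form COPY line that reads from the build context."""
    directives = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line.upper().startswith("COPY "):
            continue
        tokens = shlex.split(line)[1:]
        # `COPY --from=...` copies from another build stage, not from the build context.
        if any(t.startswith("--from=") for t in tokens):
            continue
        args = [t for t in tokens if not t.startswith("--")]  # drop flags such as --chown
        if len(args) >= 2:
            directives.append((args[:-1], args[-1]))
    return directives


example = """
FROM python:3.12-slim
RUN pip install pandas
COPY hooks/pull_data.sh /opt/hooks/pull_data.sh
COPY --chown=agent:agent data/ notes.md /workspace/
COPY --from=builder /out/bin /usr/local/bin
"""
copies = parse_copy_directives(example)
```

Each `(sources, destination)` pair could then be replayed at runtime in the way the description outlines: upload the sources as the sandbox user, then move them into place as root.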

Testing

tests/experimental/test_harbor.py — 12 tests (agent resolution, dataset/metadata columns, reward func), green; gated by require_harbor.
Verified end-to-end: gpt-4.1 over the bash harness loads task data and lands reward=1.0 on a data-agent task.

Note

Medium Risk
New experimental integration touches GRPO rollouts, sandbox execution, and optional cloud backends (Docker/E2B); dependency stack needs vLLM ≥0.22 and transformers ≥5.2, but changes are isolated behind trl[harbor] and lazy imports.

Overview
Adds experimental Harbor support so GRPOTrainer can train on Harbor task suites (instruction + sandbox image + in-sandbox verifier) through the same environment_factory slot as OpenEnv/OpenReward.

Core API (trl.experimental.harbor, trl[harbor] extra): HarborSpec maps one suite to train_dataset, a picklable environment_factory, and default verifier-based reward_funcs. HarborEnv / HarborBashEnv run Harbor sandboxes in-process (lazy harbor import), with per-env async loops, external-agent tool exec, lazy verifier rewards, and an E2B workaround that re-uploads Dockerfile COPY files at runtime.
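
Why a picklable `environment_factory` matters can be shown without Harbor at all. In the sketch below, `closure_factory` and `partial_factory` are hypothetical, and `types.SimpleNamespace` merely stands in for an environment class; the only claim is the standard-library behavior that a `functools.partial` over an importable callable pickles, while a locally defined function does not.

```python
import pickle
from functools import partial
from types import SimpleNamespace


def closure_factory(task_suite: str):
    # Returns a local function; pickle cannot serialize it by reference.
    def make_env():
        return SimpleNamespace(task_suite=task_suite)

    return make_env


def partial_factory(task_suite: str):
    # A partial over an importable callable pickles fine (SimpleNamespace stands in for an env class).
    return partial(SimpleNamespace, task_suite=task_suite)


def is_picklable(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


closure_ok = is_picklable(closure_factory("demo-suite"))
partial_ok = is_picklable(partial_factory("demo-suite"))

# Round-trip the partial as a separate worker process would, then call it with zero arguments.
restored = pickle.loads(pickle.dumps(partial_factory("demo-suite")))
env = restored()
```

A factory that has to cross a process boundary needs the second form; within a single process either would work.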

Examples & harnesses: examples/scripts/harbor/data_agent.py plus optional jupyter and terminal_notes HarborEnv subclasses (stateful kernel, shell + notes tools).

Docs & tests: New docs/source/harbor.md, README/index “What’s New”, GRPO/OpenEnv integration comparison table, example overview entry; tests/experimental/test_harbor.py and is_harbor_available / require_harbor.

Reviewed by Cursor Bugbot for commit 678bbc9.

Trainer support. Works with GRPOTrainer today (reads env.reward off the live env instances). AsyncGRPOTrainer support is forward-wired: _outcome_reward_func also accepts an environment_reward= list, so it activates automatically once #5892 (merged) + #6031 (forwards env rewards from the rollout worker) land — no change needed here. This PR carries no dependency on those; the GRPO path is self-contained.

Train on Harbor agentic task suites with GRPOTrainer via environment_factory. HarborSpec maps one task suite to the three trainer slots (train_dataset / environment_factory / reward_funcs), mirroring the OpenReward integration (huggingface#5752, huggingface#5729, huggingface#5696).

HarborEnv wraps a Harbor sandbox + verifier; the base agent (harness) is pluggable: built-in `bash`, plus `jupyter` and `terminal_notes` example harnesses (folder-per-harness, each with a README). HarborEnv follows Harbor's *external agent* pattern (the policy drives the loop and tool methods exec into the sandbox); Harbor's *installed agents* are not supported, since RL needs the trainer to drive generation and capture the policy's tokens/log-probs, which an opaque in-container agent can't expose.

- trl/experimental/harbor: HarborEnv (+ HarborBashEnv) and HarborSpec
- examples/scripts/harbor: data_agent.py + harnesses/
- docs/source/harbor.md (+ toctree, example_overview entries)
- tests/experimental/test_harbor.py (require_harbor + is_harbor_available)
- pyproject: add `harbor` extra; relax vllm cap to >=0.22.0 (0.19 pins transformers<5, which breaks environment_factory; it needs transformers>=5.2)

E2B's from_dockerfile build honors RUN but silently drops COPY'd build-context files, so HarborEnv replicates the Dockerfile's COPY directives at runtime (upload as the sandbox user, mv into place as root). Healthchecks that run those files (e.g. a data-pull hook) then work.

Verified end-to-end: gpt-4.1 over the bash harness lands reward=1.0 on a data-agent task.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5c076888f


chatgpt-codex-connector · 2026-06-11T22:38:39Z

+        verifier = VerifierFactory.create_verifier_from_config(
+            VerifierConfig(), task=self._task, trial_paths=self._paths, environment=self._env
+        )


Use the task's verifier config

For Harbor tasks that define a [verifier] block, this constructs a fresh default VerifierConfig() instead of using self._task.config.verifier. That silently drops task-provided verifier settings such as env credentials/model parameters, a custom import_path, user, or separate verifier environment, so those valid Harbor tasks will either run the wrong verifier or fail during reward computation.


chatgpt-codex-connector · 2026-06-11T22:38:39Z

+harbor = [
+    "harbor>=0.13.0; python_version >= '3.12'",  # harbor requires Python 3.12+ (pulls its sandbox backends)
+]


Include the cloud backend extras used by the example

The new example and docs recommend running with --env e2b, but installing trl[vllm,harbor] only pulls Harbor's base package; Harbor keeps providers such as e2b, daytona, and runloop behind optional extras. A user following the script metadata or quick-start will therefore hit an import error as soon as environment_type="e2b" is selected, even with E2B_API_KEY set.


…rainer

Review fixes (PR huggingface#6018):
- _verify: forward the task's [verifier].env to the verifier (override_env=), mirroring Harbor's trial runner, instead of dropping it with a bare VerifierConfig().
- HarborSpec dataset: task_index now reflects the selected suite position when `indices=` is used (was always 0..len-1).
- _resolve_agent: split the agent selector on the last ':' (rpartition) so a Windows drive path (D:\...\harness.py:Class) isn't misparsed as a module.
- docs/example: install the chosen sandbox backend's Harbor extra (e.g. `harbor[e2b]` for --env e2b); the trl[harbor] extra stays backend-free.

AsyncGRPO compatibility:
- HarborSpec.environment_factory returns a functools.partial (picklable) instead of a closure, so AsyncGRPOTrainer can ship it to its separate rollout-worker process.
- HarborEnv runs its asyncio loop on a daemon thread and submits via run_coroutine_threadsafe, so sync tool methods work both from a plain thread (GRPOTrainer) and from inside a running event loop (AsyncGRPOTrainer's worker), where loop.run_until_complete would otherwise raise.

Verified: 12 unit tests pass; gpt-4.1 over the bash harness still lands reward=1.0 end-to-end.
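
The daemon-thread pattern named in the AsyncGRPO notes can be sketched with the standard library alone. `LoopThread` and `fake_exec` are hypothetical names for this example, and the real `HarborEnv` may be organized differently; the sketch only shows why `run_coroutine_threadsafe` works from both a plain thread and a caller that already sits inside a running event loop.

```python
import asyncio
import threading


class LoopThread:
    """Owns a private event loop on a daemon thread so sync code can run coroutines on it."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout: float = 10.0):
        # Thread-safe submission; blocks the caller until the coroutine finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()


async def fake_exec(command: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an async sandbox call
    return f"ran: {command}"


runner = LoopThread()

# Case 1: called from a plain (non-async) thread.
from_plain_thread = runner.run(fake_exec("ls"))


# Case 2: a sync call made while another event loop is already running in this thread.
async def caller_inside_running_loop():
    return runner.run(fake_exec("pwd"))


from_running_loop = asyncio.run(caller_inside_running_loop())
runner.close()
```

One trade-off to keep in mind: in the second case the blocking `future.result()` stalls the caller's own event loop until the coroutine on the private loop completes.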

GRPOTrainer passes live env instances as environments=; AsyncGRPOTrainer runs envs in its rollout worker and passes already-captured per-rollout rewards as environment_reward=. _outcome_reward_func now accepts both so the same HarborSpec plugs into either trainer. Additive: GRPO runs are unaffected.
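
A reward function that accepts both call styles could look roughly like this. It is a sketch under stated assumptions: `outcome_reward` is a hypothetical stand-in for `_outcome_reward_func`, and only the two keyword names `environments` and `environment_reward` come from the description above.

```python
from types import SimpleNamespace


def outcome_reward(completions=None, environments=None, environment_reward=None, **kwargs):
    """Hypothetical reward func that accepts either live envs or already-captured rewards."""
    if environment_reward is not None:
        # Async-style path: rewards were already collected by a rollout worker.
        return [float(r) for r in environment_reward]
    if environments is not None:
        # Sync-style path: read the reward off each live environment instance.
        return [float(env.reward) for env in environments]
    raise ValueError("expected `environments=` or `environment_reward=`")


# SimpleNamespace objects stand in for live env instances exposing `.reward`.
live_envs = [SimpleNamespace(reward=1.0), SimpleNamespace(reward=0.0)]
from_envs = outcome_reward(completions=["a", "b"], environments=live_envs)
from_captured = outcome_reward(completions=["a", "b"], environment_reward=[0.5, 1])
```

Because the second keyword is optional, callers that only ever pass `environments=` see no change, which matches the "additive" claim.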

Avoids doc-builder reflowing the triple-quoted TOML block, which corrupted the fixture under some doc-builder versions.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


Reviewed by Cursor Bugbot for commit a76c804.

doc-builder imports trl.experimental.harbor on 3.10 to autodoc HarborSpec, but tomllib is stdlib only on 3.11+. Move the import into _read_task_meta (guarded by its existing try/except) so the module imports everywhere; real usage is 3.12+.
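
The guarded lazy import described in this commit can be illustrated as follows. `read_task_meta` is a hypothetical simplification of `_read_task_meta`, and the `[metadata]` table in the sample file is invented for the example.

```python
import os
import tempfile


def read_task_meta(task_toml_path: str) -> dict:
    """Hypothetical helper: parse task.toml, returning {} if tomllib or the file is unavailable."""
    try:
        import tomllib  # stdlib only on Python 3.11+, so import it lazily

        with open(task_toml_path, "rb") as f:
            return tomllib.load(f)
    except (ImportError, OSError, ValueError):
        # tomllib.TOMLDecodeError subclasses ValueError, so malformed files land here too.
        return {}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "task.toml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('[metadata]\ndifficulty = "easy"\n')
    meta = read_task_meta(path)
    missing = read_task_meta(os.path.join(tmp, "absent.toml"))
```

With the import inside the function, merely importing the module never touches `tomllib`, so tools running on an older interpreter can still load it.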

bot-ci-comment · 2026-06-15T14:43:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

GRPOTrainer/AsyncGRPOTrainer discover tool methods via inspect.getmembers, which evaluates properties. On a fresh env (never reset) the reward property would start the Harbor backend and import harbor (absent in the trainer venv), crashing init. Return 0.0 when no sandbox exists yet.
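
The failure mode is easy to reproduce in isolation, because `inspect.getmembers` calls `getattr` on every attribute name and therefore runs property getters. `LazyRewardEnv` below is a hypothetical class that applies the same guard (return 0.0 until a sandbox exists); it is not the actual `HarborEnv`.

```python
import inspect


class LazyRewardEnv:
    """Hypothetical env whose `reward` property is safe to evaluate before any sandbox exists."""

    def __init__(self):
        self._sandbox = None
        self.backend_starts = 0

    def reset(self):
        self._sandbox = {"score": 1.0}  # stands in for starting the real backend
        self.backend_starts += 1

    def run_tool(self, command: str) -> str:
        return f"ran: {command}"

    @property
    def reward(self) -> float:
        if self._sandbox is None:
            return 0.0  # fresh env: do not start (or import) the backend
        return self._sandbox["score"]


env = LazyRewardEnv()
members = dict(inspect.getmembers(env))  # evaluates the `reward` property as a side effect
tool_names = [
    name
    for name, value in members.items()
    if inspect.ismethod(value) and not name.startswith("_")
]
```

Here tool discovery on a never-reset env reads `reward` without starting anything, which is the behavior the commit aims for.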

sergiopaniego

great work as always!
just a few comments

sergiopaniego · 2026-06-16T13:40:59Z

    title: Kernels Hub
  - local: liger_kernel_integration
    title: Liger Kernel
+  - local: harbor


this should be ordered alphabetically

Done — reordered to harbor → openenv → openreward.

sorry for the misunderstanding 😅 the whole Integrations section should be ordered alphabetically, not only the env integrations

- sections:
  - local: deepspeed_integration
    title: DeepSpeed
  - local: harbor
    title: Harbor
  - local: kernels_hub
    title: Kernels Hub
  - local: liger_kernel_integration
    title: Liger Kernel
  - local: openenv
    title: OpenEnv
  - local: openreward
    title: OpenReward
  - local: peft_integration
    title: PEFT
  - local: ptt_integration
    title: Post Training Toolkit
  - local: rapidfire_integration
    title: RapidFire AI
  - local: trackio_integration
    title: Trackio
  - local: unsloth_integration
    title: Unsloth
  - local: vllm_integration
    title: vLLM
  title: Integrations

sergiopaniego · 2026-06-16T13:43:57Z

+Under the hood `HarborSpec` does three things, lazily on first access:
+
+1. **`spec.train_dataset`**: resolves the task suite to local task directories (downloading the HF dataset if needed) and builds a `datasets.Dataset` with `prompt` (empty — the env's instruction is appended at `reset`), `task_dir`, `task_index`, plus per-task `task.toml` metadata columns.
+2. **`spec.environment_factory`**: returns a zero-arg callable producing a fresh per-rollout [`~trl.experimental.harbor.HarborEnv`]. On `reset(task_dir)` it starts the task's Harbor sandbox and returns its instruction; tool methods exec into the sandbox; `env.reward` runs the verifier once after the rollout.


~trl.experimental.harbor.HarborEnv is not rendered as a link. Same for ~...HarborBashEnv

Fixed — all references now use the [~trl.experimental.harbor.HarborEnv] autodoc form (HarborEnv, HarborBashEnv, HarborSpec), so they render as links. Doc build passes.

these aren't links yet for me.
we would need:

[[autodoc]] trl.experimental.harbor.HarborSpec
[[autodoc]] trl.experimental.harbor.HarborEnv
[[autodoc]] trl.experimental.harbor.HarborBashEnv

no?

Good catch — the cross-refs only resolve when the target has an autodoc entry. Added [[autodoc]] blocks for HarborEnv and HarborBashEnv alongside HarborSpec under the ## API section, so the [~trl.experimental.harbor.HarborEnv] / HarborBashEnv references now render as links. Verified all three import without harbor installed and doc-builder passes.

Docs: compare environment integrations (OpenEnv, OpenReward, Harbor)


sergiopaniego · 2026-06-17T11:57:49Z

 ]
 vllm = [
-    "vllm>=0.12.0,<=0.19.0",
+    "vllm>=0.22.0",  # cap relaxed from <=0.19.0: 0.19 pins transformers<5, which breaks environment_factory (needs transformers>=5.2)


Let's be careful with this change. We usually do the vLLM bump in its own dedicated PR, and always one minor at a time (e.g. #5879, #5547). This one is bundled into the Harbor PR, drops the upper cap, and jumps 0.12 → 0.22, which affects all trl[vllm] users

For a specific experimental integration, we don't have to be consistent with pyproject; you can just document that Harbor requires vllm>=0.22 (check async grpo, we did the same).

qgallouedec · 2026-06-17T12:35:01Z

 ```

-You can also provide tools through `environment_factory`. In this mode, [`GRPOTrainer`] creates one environment instance per rollout and exposes the environment's public methods as tools.
+You can also provide tools through `environment_factory`. In this mode, [`GRPOTrainer`] creates one environment instance per rollout and exposes the environment's public methods as tools. See the [OpenEnv guide](openenv) for the `environment_factory` contract and a comparison of the available integrations (OpenEnv, OpenReward, Harbor).


Not super important, but I don't understand why we have to go to the OpenEnv section to see the integration comparison. Maybe it makes more sense to have the comparison table here?

qgallouedec

Nice, I didn't run it myself, but for an initial step, it looks good.

A few things before approval: address the pyproject comment + subsection ordering

Also, you could update the "What's new" section in README.md + docs/source/index.md


qgallouedec

lgtm, a few final comments!

qgallouedec · 2026-06-18T15:00:23Z

 **TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https://huggingface.co/blog/trl-v1) to learn more.



Suggested change

**TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https://huggingface.co/blog/trl-v1) to learn more.

qgallouedec · 2026-06-18T15:00:58Z


 **TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https://huggingface.co/blog/trl-v1) to learn more.

+**🌍 Environment integrations:** Train agents against real environments via [`GRPOTrainer`]'s `environment_factory` — pick from [OpenEnv](openenv), [OpenReward](openreward), and [Harbor](harbor) task suites.


I was thinking more about something focused on Harbor, like "We now support harbor!"

Done — made it Harbor-focused in both index.md and README.md: "🚢 Harbor: We now support Harbor — train agents against sandboxed task suites (instruction + sandbox image + in-sandbox verifier) via GRPOTrainer's environment_factory." Dropped the generic env-integrations line.

qgallouedec · 2026-06-18T15:03:14Z

 **TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https://huggingface.co/blog/trl-v1) to learn more.



Suggested change

**TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https://huggingface.co/blog/trl-v1) to learn more.

qgallouedec · 2026-06-18T15:04:07Z

+> Harbor drives generation through vLLM and uses `environment_factory`, which requires `vllm>=0.22.0` and `transformers>=5.2.0`. The `trl[vllm]` extra currently pins `vllm<=0.19` (which in turn pins `transformers<5`), so install these explicitly. Their constraints conflict, so install vLLM first and then force-install transformers:
+>
+> ```bash
+> pip install 'vllm>=0.22.0'
+> pip install 'transformers>=5.2.0' --no-deps
+> ```
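
Since the two pins are installed by hand, a quick runtime check of the resulting environment can be useful. The sketch below is an assumption-laden convenience, not part of TRL: `parse_release`, `meets_minimum`, and `check_requirement` are hypothetical helpers, and the parser is deliberately simplistic (it keeps only leading numeric components, so a pre-release such as `0.22.0rc1` is treated like `0.22.0`).

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Very small parser: keeps the leading numeric components, e.g. '0.22.0rc1' -> (0, 22, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    pad = lambda t: t + (0,) * (width - len(t))  # so "5.2" compares equal to "5.2.0"
    return pad(a) >= pad(b)


def check_requirement(package: str, minimum: str) -> str:
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed (need >={minimum})"
    status = "ok" if meets_minimum(installed, minimum) else f"too old (need >={minimum})"
    return f"{package} {installed}: {status}"


# Minimums taken from the note above.
for pkg, minimum in [("vllm", "0.22.0"), ("transformers", "5.2.0")]:
    print(check_requirement(pkg, minimum))
```

For anything beyond a sanity check, a proper version parser would be the safer choice.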


sergiopaniego

LGTM just one comment

sergiopaniego · 2026-06-18T15:22:13Z

+Under the hood `HarborSpec` does three things, lazily on first access:
+
+1. **`spec.train_dataset`**: resolves the task suite to local task directories (downloading the HF dataset if needed) and builds a `datasets.Dataset` with `prompt` (empty — the env's instruction is appended at `reset`), `task_dir`, `task_index`, plus per-task `task.toml` metadata columns.
+2. **`spec.environment_factory`**: returns a zero-arg callable producing a fresh per-rollout [`~trl.experimental.harbor.HarborEnv`]. On `reset(task_dir)` it starts the task's Harbor sandbox and returns its instruction; tool methods exec into the sandbox; `env.reward` runs the verifier once after the rollout.


these aren't links yet for me.
we would need:

[[autodoc]] trl.experimental.harbor.HarborSpec
[[autodoc]] trl.experimental.harbor.HarborEnv
[[autodoc]] trl.experimental.harbor.HarborBashEnv

no?


cursor Bot reviewed Jun 11, 2026

Comment thread trl/experimental/harbor/_spec.py Outdated

Comment thread trl/experimental/harbor/_spec.py

chatgpt-codex-connector Bot reviewed Jun 11, 2026

adithya-s-k mentioned this pull request Jun 11, 2026

async grpo native weight sync with vllm>=0.22.0 #5892

Merged

cursor Bot reviewed Jun 11, 2026

Comment thread examples/scripts/harbor/harnesses/jupyter/env.py

Comment thread trl/experimental/harbor/_env.py

adithya-s-k added 2 commits June 15, 2026 13:48

harbor: build task.toml test fixture from a line list

bb446bb


cursor Bot reviewed Jun 15, 2026

Comment thread trl/experimental/harbor/_spec.py

adithya-s-k added 2 commits June 15, 2026 13:56

Merge remote-tracking branch 'upstream/main' into experimental-harbor

001cd6e

harbor: apply doc-builder docstring style

a76c804

cursor Bot reviewed Jun 15, 2026

Comment thread examples/scripts/harbor/harnesses/terminal_notes/env.py

sergiopaniego reviewed Jun 16, 2026

sergiopaniego added 3 commits June 16, 2026 15:58

Merge branch 'main' into experimental-harbor

ba28a18

Docs: compare environment integrations (OpenEnv, OpenReward, Harbor)

7600e96

remove redundancy

08aae44

sergiopaniego mentioned this pull request Jun 16, 2026

Docs: compare environment integrations (OpenEnv, OpenReward, Harbor) adithya-s-k/trl#2

Merged

8 tasks

adithya-s-k added 3 commits June 17, 2026 11:47

Merge pull request #2 from huggingface/docs-environment-integrations

0dafb6a

Docs: compare environment integrations (OpenEnv, OpenReward, Harbor)

Merge remote-tracking branch 'upstream/main' into experimental-harbor

f73e92f

docs(harbor): fix harness usage paths to folder layout (jupyter/env.py, terminal_notes/env.py) + show both in guide

adcf95d

sergiopaniego reviewed Jun 17, 2026

qgallouedec reviewed Jun 17, 2026

adithya-s-k added 2 commits June 18, 2026 13:29

Address review: revert global vllm pin (document Harbor's vllm>=0.22 req), alphabetize Integrations TOC, move env-integration comparison to grpo_trainer.md, add env integrations to What's New

47fff48

Merge remote-tracking branch 'upstream/main' into experimental-harbor

c39cd77

adithya-s-k requested review from qgallouedec and sergiopaniego June 18, 2026 14:11

qgallouedec approved these changes Jun 18, 2026

qgallouedec reviewed Jun 18, 2026

sergiopaniego approved these changes Jun 18, 2026

Address review: Harbor-focused What's New entry + add HarborEnv/HarborBashEnv autodoc blocks so cross-refs render as links

678bbc9

adithya-s-k merged commit 9e2c72a into huggingface:main Jun 18, 2026
14 checks passed

qgallouedec mentioned this pull request Jun 18, 2026

Add license to Harbor examples #6104

Merged
