[codex] Optimize Terra stepping and reset accounting; expose masks #34
Open
Idate96 wants to merge 5 commits into
175a629 to 3398456
Contributor
I think the last commit is faulty, because the line metadata and the agent position use different conventions. metadata: x = column and y = row. agent: so `line_dist = jnp.abs(abc[0] * current_pos[1] + abc[1] * current_pos[0] + abc[2]) / denom` was correct.
Contributor
Author
Thanks @SawneyX, you were right. I had trusted the x/y names too much;
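The convention mismatch in this exchange can be illustrated with a minimal sketch. It assumes, as the quoted formula implies, that the line coefficients `(a, b, c)` describe `a*x + b*y + c = 0` with x = column and y = row, while the agent position is indexed so that `current_pos[0]` is the row and `current_pos[1]` is the column. The function name and sample values are hypothetical, and plain Python stands in for the `jnp` calls.

```python
import math


def line_distance(abc, current_pos):
    """Distance from an agent at (row, col) to the line a*x + b*y + c = 0.

    Assumes the line is non-degenerate (a and b are not both zero).
    """
    a, b, c = abc
    row, col = current_pos   # assumed agent layout: index 0 = row, index 1 = col
    x, y = col, row          # metadata convention: x = column, y = row
    denom = math.hypot(a, b)
    return abs(a * x + b * y + c) / denom


# Hypothetical example: the vertical line x = 3, i.e. "column 3".
abc = (1.0, 0.0, -3.0)
pos = (10, 5)  # row 10, column 5

correct = line_distance(abc, pos)

# Pairing abc[0] with pos[0] silently treats the row as x.
swapped = abs(abc[0] * pos[0] + abc[1] * pos[1] + abc[2]) / math.hypot(abc[0], abc[1])

print(correct, swapped)  # 2.0 7.0
```

The agent is two columns away from the line, which the first form reports. The swapped form measures the row offset instead, so the error only shows up on maps where rows and columns differ.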
Summary
This PR contains the Terra environment side of the fast-reset, timeout-accounting, and diagnostic action-mask work for multi-agent.

- `step_no_reset`, so batched training can avoid composing full resets unless an env is actually done
- `info["final_observation"]` through auto-reset, so PPO can bootstrap max-step truncations from the pre-reset observation
- `info["timeout_done"]` and observation/`info["episode_progress"]`
- `task_done`, not time-limit timeout
- `action_mask` and edge-progress features through observations/info for diagnostics and optional probes

Important framing: the paired `terra-baselines` PR now keeps PPO action masking disabled by default. Env-side `action_mask` remains useful as an observation/debug signal, for masked-vs-unmasked diagnostic scripts, and for future robot-side safety checks, but the normal train/eval actor path is unmasked.

Why

The old path did not separate true task terminals from max-step truncations cleanly enough for PPO accounting. Synchronized timeouts can cause spikes in value loss and explained variance, and bootstrapping from the reset observation is the wrong target for a time-limit truncation. This branch gives the PPO side explicit reset/timeout information while keeping task success semantics separate from horizon expiry.
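The distinction between a task terminal and a time-limit truncation can be sketched as a one-step value target. This is an illustrative sketch only, not the code from either branch: the function name, the scalar arguments, and the rule that `task_done` takes priority when both flags are set are assumptions. Only the `task_done`, `timeout_done`, and `final_observation` concepts come from the description above.

```python
def one_step_target(reward, gamma, task_done, timeout_done,
                    value_next, value_final):
    """One-step TD target that separates task terminals from timeouts.

    value_next:  value estimate for the observation returned by the step,
                 which is the post-reset observation if the env auto-reset.
    value_final: value estimate for info["final_observation"], i.e. the
                 pre-reset observation of the episode that just ended.
    """
    if task_done:
        # True terminal: there is no future return to bootstrap from.
        return reward
    if timeout_done:
        # Horizon expiry: the episode was cut off, so bootstrap from the
        # last observation of that episode, not from the reset observation.
        return reward + gamma * value_final
    # Ordinary transition inside an episode.
    return reward + gamma * value_next


# Hypothetical numbers: the reset observation looks much more valuable
# than the state where the episode was truncated.
print(one_step_target(1.0, 0.5, False, True, value_next=10.0, value_final=4.0))   # 3.0
print(one_step_target(1.0, 0.5, True, False, value_next=10.0, value_final=4.0))   # 1.0
print(one_step_target(1.0, 0.5, False, False, value_next=10.0, value_final=4.0))  # 6.0
```

In the timeout case, bootstrapping from `value_next` would have produced 6.0 instead of 3.0. That gap is the kind of target error that shows up when many envs hit the horizon on the same step.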
The paired `terra-baselines` branch uses this to:

- `final_observation`

Current Baseline And Learning Evidence

The clean reference run is:

- `terra-clean-multiagent-4x4090-autotune0-euler-pr-2026-05-13-19-49-55`

Local stochastic rollout probe shape for the comparison below:

- `solo_excavator`, 32 envs, 550 max steps, seeds `0 1 2 3`, first episode only, stochastic actions.

- `terra-clean-multiagent-4x4090-autotune0-euler-pr-2026-05-13-19-49-55.pkl`
- `terra-mask-multiagent-4gpu-online-euler-pr-2026-05-15-00-50-06/` W&B ti3k3tdp04e8dada

Caveats:

- `50/128 = 39.1%`, roughly tied with the clean baseline, with lower entropy (`0.035` vs `0.116`) and less remaining dig (`0.087` vs `0.131`). W&B also shows the old count-style masked eval peaking around step `73,200` before declining toward the latest checkpoint.
- `eval/success_rate` can mean successes per env and can exceed 1. The paired baselines branch fixes future logs so bounded first-episode success is `eval/success_rate` and the old count-style quantity is logged separately as `eval/successes_per_env`.

Validation
Validated from the paired local trees:

- `PYTHONPATH=/home/lorenzo/moleworks/terra_mask_wip:/home/lorenzo/moleworks/terra-baselines_mask_wip`
- `python -m py_compile terra/env.py terra/state.py`
- `git diff --check`
- `scripts/validation/validate_edge_mask_changes.py --case all --jax-platforms cpu --dataset-path /home/lorenzo/moleworks/terra_data/train --dataset-size 1`

The full validation sweep passed:

- `reset_prepared` wiring
- `final_observation`
- `step_no_reset` parity with `TerraEnv.step`
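On the metric caveat under Current Baseline And Learning Evidence: the difference between the old count-style quantity and a bounded first-episode success rate can be sketched as follows. The function name and the per-env inputs are hypothetical; only the two metric names, `eval/success_rate` and `eval/successes_per_env`, come from the description, and how the baselines branch actually computes them is not shown here.

```python
def success_metrics(successes_per_env, first_episode_success):
    """Return (count-style, bounded) success metrics over a batch of envs.

    successes_per_env:     number of successful episodes each env completed
                           during the eval window (auto-reset allows > 1).
    first_episode_success: whether each env's first episode succeeded.
    """
    n = len(successes_per_env)
    count_style = sum(successes_per_env) / n   # mean successes per env, can exceed 1
    bounded = sum(first_episode_success) / n   # fraction of envs, always in [0, 1]
    return count_style, bounded


# Hypothetical 4-env eval window.
count_style, bounded = success_metrics([3, 0, 2, 1], [True, False, True, True])
print(count_style, bounded)  # 1.5 0.75

# The bounded figure quoted in the caveats: 50 first-episode successes in 128.
print(round(100 * 50 / 128, 1))  # 39.1
```

Under this reading, the count-style value rewards envs that finish quickly and restart, so it is not comparable across runs with different episode lengths. The bounded value is.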