[codex] Optimize Terra stepping and reset accounting; expose masks #34

Open
Idate96 wants to merge 5 commits into multi-agent from codex/mask-speedup-wip

Idate96 (Contributor) commented May 15, 2026

Summary

This PR contains the Terra environment side of the fast-reset, timeout-accounting, and diagnostic action-mask work for multi-agent.

  • factors env stepping into step_no_reset so batched training can avoid composing full resets unless an env is actually done
  • preserves info["final_observation"] through auto-reset so PPO can bootstrap max-step truncations from the pre-reset observation
  • adds info["timeout_done"] and observation/info["episode_progress"]
  • gates terminal success reward on true task_done, not time-limit timeout
  • keeps reachability recomputation gated to effective terrain-changing DO actions
  • exposes coarse action_mask and edge-progress features through observations/info for diagnostics and optional probes
  • keeps reset info pytree structure aligned with step outputs
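The split between a non-resetting step and an auto-resetting wrapper described above can be sketched roughly as follows. This is a toy, single-env, pure-Python illustration, not the Terra implementation (which is batched JAX code): `ToyState`, `MAX_STEPS`, `goal`, and the scalar observation are hypothetical, and only the `info` keys and the name `step_no_reset` are taken from this PR.

```python
from dataclasses import dataclass

MAX_STEPS = 3  # hypothetical horizon for the toy env


@dataclass
class ToyState:
    position: int = 0
    step_count: int = 0


def reset():
    """Return a fresh state and its observation."""
    state = ToyState()
    return state, state.position


def step_no_reset(state, action, goal=5):
    """Advance one step without resetting; report why the episode ended."""
    new_state = ToyState(state.position + action, state.step_count + 1)
    task_done = new_state.position >= goal
    timeout_done = (not task_done) and new_state.step_count >= MAX_STEPS
    # Terminal success reward is gated on task_done only, never on timeout.
    reward = 1.0 if task_done else 0.0
    info = {
        "task_done": task_done,
        "timeout_done": timeout_done,
        "episode_progress": new_state.step_count / MAX_STEPS,
    }
    done = task_done or timeout_done
    return new_state, new_state.position, reward, done, info


def step(state, action):
    """Auto-resetting step: reset only when done, keep the pre-reset observation."""
    state, obs, reward, done, info = step_no_reset(state, action)
    info["final_observation"] = obs  # observation before any reset
    if done:
        state, obs = reset()
    return state, obs, reward, done, info
```

In this sketch a timeout returns `done=True` with zero terminal reward, while `info["final_observation"]` still holds the observation the critic would need for bootstrapping.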

Important framing: the paired terra-baselines PR now keeps PPO action masking disabled by default. Env-side action_mask remains useful as an observation/debug signal, for masked-vs-unmasked diagnostic scripts, and for future robot-side safety checks, but the normal train/eval actor path is unmasked.

Why

The old path did not separate true task terminals from max-step truncations cleanly enough for PPO accounting. Synchronized timeouts can produce spikes in value loss and explained variance, and bootstrapping from the reset observation is the wrong target for a time-limit truncation. This branch gives the PPO side explicit reset/timeout information while keeping task success semantics separate from horizon expiry.

The paired terra-baselines branch uses this to:

  • bootstrap timeouts from final_observation
  • stop GAE at reset boundaries
  • randomize initial episode ages so timeout phases are not synchronized
  • log bounded first-episode eval success separately from legacy successes-per-env counters
  • train/evaluate with an unmasked actor by construction while retaining masks for diagnostics
  • feed edge/progress affordances to the critic only in the ResMap setup
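To make the first two bullets concrete, here is a minimal single-env sketch of GAE that bootstraps timeouts from the value of the pre-reset observation and cuts the recursion at reset boundaries. It is an assumption about the general technique, not the terra-baselines code: the function name, argument layout, and plain-list inputs are hypothetical, and `final_values[t]` stands for the critic evaluated on `info["final_observation"]` at step `t`.

```python
def gae_with_timeout_bootstrap(rewards, values, final_values, task_dones,
                               timeout_dones, last_value,
                               gamma=0.99, lam=0.95):
    """Compute GAE advantages for one env's rollout.

    values[t]        critic value of the observation acted on at step t
    final_values[t]  critic value of the pre-reset observation reached by step t
    last_value       critic value of the observation after the last step
    """
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        if task_dones[t]:
            bootstrap = 0.0              # true terminal: no future return
        elif timeout_dones[t]:
            bootstrap = final_values[t]  # truncation: use final_observation
        else:
            bootstrap = next_value
        delta = rewards[t] + gamma * bootstrap - values[t]
        episode_ended = task_dones[t] or timeout_dones[t]
        # Stop the recursion at reset boundaries so episodes do not leak.
        carry = 0.0 if episode_ended else next_advantage
        advantages[t] = delta + gamma * lam * carry
        next_advantage = advantages[t]
        next_value = values[t]
    return advantages
```

The only difference between the two kinds of episode end is the bootstrap target: zero for a true terminal, the value of the final observation for a time-limit truncation. In both cases the value of the post-reset observation is never used.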

Current Baseline And Learning Evidence

The clean reference run is:

terra-clean-multiagent-4x4090-autotune0-euler-pr-2026-05-13-19-49-55

Local stochastic rollout probe shape for the comparison below: solo_excavator, 32 envs, 550 max steps, seeds 0 1 2 3, first episode only, stochastic actions.

| checkpoint/run | mode | success | avg return | policy entropy | notes |
| --- | --- | --- | --- | --- | --- |
| clean baseline terra-clean-multiagent-4x4090-autotune0-euler-pr-2026-05-13-19-49-55.pkl | unmasked | 48/128 = 37.5% | 3.956 | 0.035 | per-seed successes: 14, 13, 8, 13; 278 invalid sampled actions |
| masked run terra-mask-multiagent-4gpu-online-euler-pr-2026-05-15-00-50-06 / W&B ti3k3tdp | masked | 37/128 = 28.9% | 2.983 | 0.116 | invalid sampled actions: 0; W&B throughput about 95k FPS |
| ResMap terminal-fix run W&B 04e8dada | unmasked | 76/128 = 59.4% | 5.484 | 0.143 | per-seed successes: 19, 21, 17, 19; W&B throughput about 19k FPS |

Caveats:

  • These are local rollout probes, not a full eval sweep.
  • The masked result above is the latest synced checkpoint, not the best checkpoint from that run. An earlier local sync of the same masked run produced 50/128 = 39.1%, roughly tied with the clean baseline, with lower entropy (0.035 vs 0.116) and less remaining dig (0.087 vs 0.131). W&B also shows the old count-style masked eval peaking around step 73,200 before declining toward the latest checkpoint.
  • Existing online W&B jobs still use the old in-memory eval counter where eval/success_rate can mean successes per env and can exceed 1. The paired baselines branch fixes future logs so bounded first-episode success is eval/success_rate and the old count-style quantity is logged separately as eval/successes_per_env.
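The difference between the bounded and the count-style success metric can be illustrated with a small sketch. The data layout (a per-env list of episode outcomes) and the function name are assumptions made for illustration; only the two metric keys come from this PR, and how the paired baselines branch actually accumulates them is not shown here.

```python
def eval_success_metrics(episode_outcomes_per_env):
    """Summarize one eval window.

    episode_outcomes_per_env[i] lists True/False success for each finished
    episode of env i, in order.
    """
    num_envs = len(episode_outcomes_per_env)
    # Bounded metric: only the first finished episode of each env counts.
    first_episode_successes = sum(
        1 for outcomes in episode_outcomes_per_env if outcomes and outcomes[0]
    )
    # Count-style metric: every success counts, so an auto-resetting env can
    # contribute more than one and the ratio can exceed 1.
    total_successes = sum(sum(outcomes) for outcomes in episode_outcomes_per_env)
    return {
        "eval/success_rate": first_episode_successes / num_envs,
        "eval/successes_per_env": total_successes / num_envs,
    }
```

For example, four envs with outcomes `[[True, True, True], [False, True], [True], []]` give a bounded rate of 0.5 but 1.25 successes per env.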
  • The ResMap run changes architecture and capacity, so it is not a pure reset-only ablation. The clean run above is the baseline reference for policy quality.

Validation

Validated from the paired local trees:

PYTHONPATH=/home/lorenzo/moleworks/terra_mask_wip:/home/lorenzo/moleworks/terra-baselines_mask_wip

  • python -m py_compile terra/env.py terra/state.py
  • git diff --check
  • scripts/validation/validate_edge_mask_changes.py --case all --jax-platforms cpu --dataset-path /home/lorenzo/moleworks/terra_data/train --dataset-size 1

The full validation sweep passed:

  • PPO mask logits
  • multi-device training accounting and reset_prepared wiring
  • model policy input compatibility
  • compact reward logging
  • edge/no-mask model shape
  • critic affordance shapes
  • checkpoint config restore
  • timeout value bootstrap from final_observation
  • initial episode progress randomization
  • GAE timeout bootstrap without reset-episode leakage
  • state action mask and step dispatch
  • synthetic and dataset-backed env action masks
  • batched fast reset parity
  • step_no_reset parity with TerraEnv.step
  • episode progress plus final pre-reset observation

Idate96 marked this pull request as ready for review May 15, 2026 10:35
Idate96 changed the title from "[codex] Optimize Terra stepping and expose coarse masks" to "[codex] Optimize Terra stepping, masks, and reset accounting" May 18, 2026
Idate96 force-pushed the codex/mask-speedup-wip branch from 175a629 to 3398456 May 18, 2026 08:58
Idate96 changed the title from "[codex] Optimize Terra stepping, masks, and reset accounting" to "[codex] Optimize Terra stepping and reset accounting; expose masks" May 18, 2026
SawneyX (Contributor) commented May 21, 2026

I think the last commit is faulty, because the line metadata and the agent position use different conventions.

metadata: x = column and y = row.

agent:
current_pos[0] = row / y
current_pos[1] = col / x

so line_dist = jnp.abs(abc[0] * current_pos[1] + abc[1] * current_pos[0] + abc[2]) / denom was correct.

Idate96 (Contributor, Author) commented May 21, 2026

Thanks @SawneyX, you were right. I had trusted the x/y names too much; pos_base is used as [row, col], while the metadata line is A*x + B*y + C with x=col and y=row. I reverted the faulty commit and pushed da3a54d4, which keeps the original formula but makes the row/col <-> x/y conversion explicit and adds a regression for the exact counterexample plus the cabin-angle rejection.
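For reference, the convention being discussed can be written out as a small standalone sketch. This is plain Python with `math` rather than the `jnp` code in the branch, and `line_distance` is a hypothetical helper name. It assumes the position is `[row, col]`, the line is `A*x + B*y + C = 0` with `x = col` and `y = row`, and that `A` and `B` are not both zero.

```python
import math


def line_distance(pos_row_col, abc):
    """Distance from an agent position to the line A*x + B*y + C = 0.

    pos_row_col is [row, col]; the line uses x = col and y = row.
    """
    row, col = pos_row_col
    x, y = col, row  # explicit row/col -> x/y conversion
    a, b, c = abc
    denom = math.hypot(a, b)  # assumes (a, b) != (0, 0)
    return abs(a * x + b * y + c) / denom
```

With the vertical line `x = 3`, i.e. `abc = (1.0, 0.0, -3.0)`, an agent at `[row=10, col=3]` lies on the line, while an agent at `[row=3, col=10]` is 7 cells away. Swapping the two position indices would reverse those answers.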
