fix(gateway): s6 lifecycle supervision + container_environment API key #32

Open
5kahoisaac wants to merge 2 commits into somratpro:main from 5kahoisaac:pr/4-gateway-s6-lifecycle

@5kahoisaac

Contributor

Split of #26 (part 4/5).

Gateway / s6 lifecycle + container_environment API key

  • start.sh: replace PID-based gateway supervision with a health-endpoint monitor loop; use hermes gateway run/restart (s6 hand-off aware), add wait_for_port_free, graceful CLI-based shutdown, and the HERMES_GATEWAY_NO_SUPERVISE export. Fixes the s6-supervise restart storm / shutdown hang.
  • Dockerfile: ARG HERMES_AGENT_VERSION with the default applied at FROM (:-latest); API_SERVER_* as Docker ENV (read from s6 container_environment, not start.sh exports); COPY the cont-init hook.
  • cont-init.d/016-huggingmes-api-server-key: alias GATEWAY_TOKEN -> API_SERVER_KEY in the gateway's container_environment so its API server can bind 8642.
  • docker-compose.yml: HERMES_GATEWAY_NO_SUPERVISE default.
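
For reviewers who want a feel for the start.sh change, here is a rough sketch of how a wait_for_port_free helper and a health-probe tick could be structured. It is not the code in this PR: the `port_in_use` and `gateway_healthy` probes, the `/health` URL path, `monitor_tick`, `POLL_INTERVAL`, and the thresholds are all assumptions made for illustration. Only `wait_for_port_free`, port 8642, and `hermes gateway restart` are names taken from the description above.

```shell
#!/bin/sh
# Hypothetical sketch; not the PR's actual start.sh.

# Assumed probe: succeeds when something listens on 127.0.0.1:$1.
# Relies on a netcat that supports -z; reports "free" if nc is missing.
port_in_use() {
    command -v nc >/dev/null 2>&1 && nc -z 127.0.0.1 "$1" >/dev/null 2>&1
}

# Poll until the port is free. Returns 0 when free, 1 after $2 polls (default 10).
wait_for_port_free() {
    port=$1
    timeout=${2:-10}
    waited=0
    while port_in_use "$port"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "${POLL_INTERVAL:-1}"
        waited=$((waited + 1))
    done
    return 0
}

# Assumed health probe; the /health path is a placeholder, not from the PR.
gateway_healthy() {
    curl -fsS --max-time 3 "http://127.0.0.1:8642/health" >/dev/null 2>&1
}

# One monitor tick: track consecutive failures in FAILS.
# Returns 0 while healthy or below the threshold, 1 once the threshold is hit.
FAILS=0
monitor_tick() {
    threshold=${1:-3}
    if gateway_healthy; then
        FAILS=0
        return 0
    fi
    FAILS=$((FAILS + 1))
    [ "$FAILS" -lt "$threshold" ]
}

# Possible use inside start.sh (left as a comment so the sketch has no side effects):
#   while sleep 10; do
#       monitor_tick 3 || { hermes gateway restart; FAILS=0; }
#   done
```

Probing health rather than watching a PID means the loop does not depend on which process owns the gateway after the s6 hand-off, which is the failure mode the description attributes to the restart storm.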

Note: happy to discuss the HERMES_AGENT_VERSION default-at-FROM behavior you flagged.

Test plan

  • Gateway survives a restart without the supervise storm.
  • API_SERVER_* reach the s6-supervised gateway; 8642 binds; dashboard shows Gateway: Online.

@5kahoisaac 5kahoisaac force-pushed the pr/4-gateway-s6-lifecycle branch from af3ee89 to 5b35cf9 Compare June 23, 2026 05:21
@5kahoisaac 5kahoisaac marked this pull request as ready for review June 23, 2026 05:53
@somratpro

Owner

Thanks for splitting this out. I’m going to hold off on merging this one for now because it changes the highest-risk runtime path: gateway supervision, s6 handoff, container environment propagation, shutdown behavior, and Docker defaults.

A couple of concerns I’d like resolved first:

  • ARG HERMES_AGENT_VERSION now defaults only in FROM, but the later ENV HERMES_AGENT_VERSION=${HERMES_AGENT_VERSION} may be empty when no build arg is supplied. I’d prefer preserving a runtime-visible default of latest.
  • If GATEWAY_TOKEN is absent, start.sh still generates an ephemeral API_SERVER_KEY, but the cont-init hook cannot see that generated value because it runs before start.sh. In that case the s6-supervised gateway may still miss API_SERVER_KEY.
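
The first concern can be illustrated outside Docker, since Dockerfile variable substitution supports the same `${VAR:-default}` form as POSIX shell. The sketch below only mimics the expansion; the two function names are made up for the illustration and nothing here is taken from the Dockerfile itself.

```shell
#!/bin/sh
# Illustration only: mimics how the value in ENV HERMES_AGENT_VERSION=... expands.

# Plain expansion: yields an empty string when no value was supplied.
plain_version() {
    printf '%s' "${HERMES_AGENT_VERSION}"
}

# Defaulted expansion: falls back to "latest" when the variable is unset or empty.
defaulted_version() {
    printf '%s' "${HERMES_AGENT_VERSION:-latest}"
}
```

With an empty HERMES_AGENT_VERSION, the plain form prints nothing while the defaulted form prints latest; with a value supplied, both print that value.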

Could you add live Docker/HF evidence for this PR specifically? I’d want to see logs showing:

  • API server binds 127.0.0.1:8642
  • dashboard reports Gateway online
  • no s6 restart storm
  • restart/shutdown works cleanly
  • behavior when GATEWAY_TOKEN is unset or explicitly set

The direction may be right, but I want runtime proof before merging this one.

@5kahoisaac

Contributor Author

Addressed both concerns:

  1. HERMES_AGENT_VERSION default: Changed ENV HERMES_AGENT_VERSION=${HERMES_AGENT_VERSION} to ${HERMES_AGENT_VERSION:-latest} so the runtime variable always has a visible value even when no --build-arg is supplied.

  2. API_SERVER_KEY race with cont-init: The hook 016-huggingmes-api-server-key now handles the no-GATEWAY_TOKEN case directly — it generates an ephemeral key and writes it to /run/s6/container_environment/API_SERVER_KEY before any gateway service can start. The start.sh path is still there as a fallback for the non-s6 code path, but the gateway's s6 service will now always have the key available at launch.
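
A minimal sketch of the hook logic described in item 2, for readers who have not opened the diff. It is not the contents of 016-huggingmes-api-server-key: the function name `write_api_server_key`, the `ENV_DIR` override, and the 32-byte hex key format are assumptions. The directory path and variable names are the ones quoted in this thread.

```shell
#!/bin/sh
# Hypothetical sketch of the cont-init hook logic; not the PR's actual file.

# Directory the s6-supervised services read their environment from.
# ENV_DIR is an assumed override to make the sketch testable.
ENV_DIR=${ENV_DIR:-/run/s6/container_environment}

write_api_server_key() {
    if [ -n "${GATEWAY_TOKEN:-}" ]; then
        # Token configured: alias it so the gateway's API server can start.
        key=$GATEWAY_TOKEN
    else
        # No token: generate an ephemeral key (32 random bytes, hex-encoded).
        key=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
    fi
    mkdir -p "$ENV_DIR"
    # One file per variable: the file name is the variable, the content its value.
    ( umask 077; printf '%s' "$key" > "$ENV_DIR/API_SERVER_KEY" )
}
```

Because cont-init hooks run before services start, writing the file at this stage is what removes the ordering problem with start.sh.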

Regarding the live Docker/HF evidence request: I don't have a running HF Space to attach logs from, but if you can share one, I'm happy to validate there. Alternatively, if you want to run it locally, the docker-compose.yml on pr/4-gateway-s6-lifecycle sets HERMES_GATEWAY_NO_SUPERVISE: ${HERMES_GATEWAY_NO_SUPERVISE:-true} to test the supervised path.

@somratpro

Owner

Thanks for updating this. The two concerns I raised earlier look addressed now:

  • HERMES_AGENT_VERSION keeps a runtime-visible latest default via HERMES_AGENT_VERSION=${HERMES_AGENT_VERSION:-latest}.
  • The cont-init hook now generates an ephemeral API_SERVER_KEY when GATEWAY_TOKEN is absent, so the s6-supervised gateway should not miss the key in that path.

One blocker remains: this PR is no longer mergeable against current main. After the recent merges, docker-compose.yml now conflicts around DEV_MODE vs HERMES_GATEWAY_NO_SUPERVISE; both should be preserved.

Could you rebase/update this PR on the latest main and resolve that conflict?

After that, I still want runtime evidence before merging because this changes the gateway supervision path. Please include logs showing:

  • 127.0.0.1:8642 binds successfully
  • dashboard reports Gateway online
  • no s6 restart storm
  • restart/shutdown works cleanly
  • behavior with GATEWAY_TOKEN set and unset

Once it is up to date and has that Docker/HF runtime proof, I’m open to merging it.

…nt API key

- start.sh: replace PID-based gateway supervision with a health-endpoint monitor
  loop; use `hermes gateway run/restart` (s6 hand-off aware), add wait_for_port_free,
  graceful CLI-based shutdown, and HERMES_GATEWAY_NO_SUPERVISE export.
- Dockerfile: ARG HERMES_AGENT_VERSION (default at FROM), API_SERVER_* as Docker ENV
  (read from s6 container_environment), and COPY the cont-init.d hook.
- cont-init.d/016-huggingmes-api-server-key: alias GATEWAY_TOKEN -> API_SERVER_KEY in
  the gateway's container_environment so its API server can bind 8642.
- docker-compose.yml: HERMES_GATEWAY_NO_SUPERVISE default.
…PI_SERVER_KEY

- Dockerfile: use ${HERMES_AGENT_VERSION:-latest} in ENV so the runtime
  variable is never empty when no --build-arg is supplied (preserves the
  same default already used in FROM).

- cont-init.d/016-huggingmes-api-server-key: when GATEWAY_TOKEN is absent
  generate an ephemeral API_SERVER_KEY directly in container_environment so
  the s6-supervised gateway can start its API server. Previously the key was
  only generated in start.sh (which runs after cont-init), creating a race
  where the gateway would launch without the key and refuse to bind port 8642.
@5kahoisaac 5kahoisaac force-pushed the pr/4-gateway-s6-lifecycle branch from 14ee06e to 1202878 Compare June 24, 2026 15:19
@5kahoisaac

Contributor Author

Rebased onto current main and resolved the docker-compose.yml conflict — both DEV_MODE (from the merged PR #31) and HERMES_GATEWAY_NO_SUPERVISE are now present. The diff is back to 4 files only.
