fix(transport): reuse inbound socket for replies in WebSocket fallback (EHOSTUNREACH on stale point-to-point routes) #163
Open
leoburti wants to merge 1 commit into
The WebSocketFallbackTransport always dials a NEW outbound connection to reach a peer. On a direct point-to-point link (e.g. a macOS Thunderbolt bridge) the kernel's cloned route to the peer can go stale, so every fresh outbound connect fails with `connect EHOSTUNREACH <peer> - Local (<self>:<ephemeral>)` (the OS picks the correct source and still reports no route), while the inbound socket from that peer is perfectly usable. There was also no liveness/idle handling (`maxIdleTimeoutMs` was accepted but never used; `getOrCreateConnection` trusted only `readyState===OPEN`).

This makes the transport:
- index each server-accepted (inbound) socket by peer IP (`inboundByHost`)
- in `getOrCreateConnection`, before dialing: reuse a live outbound, else a live inbound socket for the same peer IP (WebSocket is full-duplex), else dial (unchanged fallback, so no regression when no inbound exists)
- add a 5s ping/pong liveness probe + `__alive` gating so a half-open socket is detected and skipped (a direct link won't RST a dropped peer)
- retry-once with eviction in `send()` if the chosen socket dies mid-send

Validated in production over a Thunderbolt link (the failing direction went from 0/N to N/N, attempts:1, zero EHOSTUNREACH) and with a new unit test asserting the reply rides the inbound socket (`created` stays 0).

Refs ruvnet#162

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
`WebSocketFallbackTransport` (the federation transport from #153) always dials a new outbound connection to reach a peer. On a direct point-to-point link (e.g. a macOS Thunderbolt bridge, no tailnet) the kernel's cloned route to the peer can go stale, so every fresh outbound connect fails with `connect EHOSTUNREACH <peer> - Local (<self>:<ephemeral>)`. The OS picks the correct source IP and still reports "no route". Meanwhile the inbound socket from that peer is perfectly usable, because WebSocket is full-duplex.

There's also no liveness/idle handling: `maxIdleTimeoutMs` is accepted but never used, and `getOrCreateConnection` trusts only `readyState === OPEN`, so a half-open socket on a link that never sends RST is reused blindly.

Full diagnosis in #162.
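To illustrate why `readyState === OPEN` alone can mislead, here is a minimal sketch of ping/pong bookkeeping that flags a half-open socket. It is not the code from this PR: the `ProbedSocket` shape, the `awaitingPong` field, and the `probeTick`/`onPong`/`isUsable` helpers are hypothetical names, and only `__alive` is taken from the description. A real transport would presumably call `probeTick` from a timer and `onPong` from the socket's pong event.

```typescript
const OPEN = 1; // numeric value of WebSocket.OPEN

// Hypothetical minimal socket shape; a real WebSocket has far more surface.
interface ProbedSocket {
  readyState: number;
  __alive: boolean;      // flag name from the PR; exact semantics assumed here
  awaitingPong: boolean; // hypothetical: a ping is outstanding
  ping(): void;
  terminate(): void;
}

// Assumed to be wired to the socket's pong event.
function onPong(s: ProbedSocket): void {
  s.awaitingPong = false;
  s.__alive = true;
}

// One probe tick. A socket that never answered the previous ping is
// marked dead, terminated, and dropped; every other socket is pinged again.
function probeTick(sockets: Set<ProbedSocket>): void {
  for (const s of sockets) {
    if (s.awaitingPong) {
      s.__alive = false;
      s.terminate();
      sockets.delete(s); // deleting during Set iteration is safe in JS
    } else {
      s.awaitingPong = true;
      s.ping();
    }
  }
}

// Gate reuse on both the state machine and the probe result.
function isUsable(s: ProbedSocket): boolean {
  return s.readyState === OPEN && s.__alive;
}
```

In this sketch a peer that silently disappears keeps `readyState === OPEN`, but `isUsable` turns false on the second tick, one probe interval after the first unanswered ping.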
Fix
- Index each server-accepted (inbound) socket by peer IP (`inboundByHost`).
- In `getOrCreateConnection`, before dialing: reuse a live outbound (current behavior) → else a live inbound socket for the same peer IP → else dial (unchanged fallback, so no regression when no inbound exists).
- Add a 5s ping/pong liveness probe with `__alive` gating (a direct link won't RST a dropped peer, so `readyState` lags a dead TCP connection by seconds).
- `send()` retries once with eviction if the chosen socket dies mid-send.

Because each peer keeps one socket alive via its heartbeat, traffic rides it and no fresh (stale-route-prone) dial is needed. Backward compatible: with no inbound socket present, behavior is identical to before.
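The selection order and the retry-once behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `ReuseFirstPool`, `PeerSocket`, `outboundByHost`, `registerInbound`, and `evict` are hypothetical names, dialing is modeled as a synchronous callback (a real dial is asynchronous), and `send` on a dead socket is assumed to throw. Only `inboundByHost`, `getOrCreateConnection`, `send`, `__alive`, and the `created` counter come from the description.

```typescript
const OPEN = 1; // numeric value of WebSocket.OPEN

// Hypothetical minimal socket shape standing in for a real WebSocket.
interface PeerSocket {
  readyState: number;
  __alive: boolean;
  send(data: string): void; // assumed to throw when the socket is dead
}

class ReuseFirstPool {
  private outboundByHost = new Map<string, PeerSocket>(); // hypothetical name
  private inboundByHost = new Map<string, PeerSocket>();  // name from the PR
  private dial: (host: string) => PeerSocket;
  created = 0; // number of fresh dials performed

  constructor(dial: (host: string) => PeerSocket) {
    this.dial = dial;
  }

  // Called when the server side accepts a connection from `host`.
  registerInbound(host: string, s: PeerSocket): void {
    this.inboundByHost.set(host, s);
  }

  private static live(s: PeerSocket | undefined): s is PeerSocket {
    return s !== undefined && s.readyState === OPEN && s.__alive;
  }

  // Order: live outbound, else live inbound, else dial.
  getOrCreateConnection(host: string): PeerSocket {
    const out = this.outboundByHost.get(host);
    if (ReuseFirstPool.live(out)) return out;
    const inb = this.inboundByHost.get(host);
    if (ReuseFirstPool.live(inb)) return inb;
    const fresh = this.dial(host);
    this.created++;
    this.outboundByHost.set(host, fresh);
    return fresh;
  }

  private evict(host: string, s: PeerSocket): void {
    if (this.outboundByHost.get(host) === s) this.outboundByHost.delete(host);
    if (this.inboundByHost.get(host) === s) this.inboundByHost.delete(host);
  }

  // Retry once: evict the socket that failed, pick again, resend.
  send(host: string, data: string): void {
    const first = this.getOrCreateConnection(host);
    try {
      first.send(data);
    } catch {
      this.evict(host, first);
      this.getOrCreateConnection(host).send(data); // a second failure propagates
    }
  }
}
```

With an inbound socket registered for a host, `send` in this sketch goes out on that socket and `created` stays 0. Once the inbound socket is gated out or evicted, the next call falls through to the dial path.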
Validation
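- Validated in production over a Thunderbolt link (the failing direction went from 0/N to N/N, `attempts:1`, zero EHOSTUNREACH).
- New unit test (`tests/transport/quic-loader.test.ts`): a reply rides the inbound socket; the receiver gets the message while the replier's `getStats().created` stays 0 (proving reuse, not a dial).
- `tsc` clean.

Refs #162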