Repeater login & polling resilience: rejection vs timeout, auto path-reset, bounded mesh calls #296
drewmccal wants to merge 4 commits into
Adding a repeater showed "Failed to log in ... Check password" on ANY failure, including a timeout.

Root cause: the SDK's send_login_sync only waits for LOGIN_SUCCESS within a tight window (suggested_timeout/800 ~ a few seconds), so a slow/multi-hop login reply times out and a rejected password (a distinct LOGIN_FAILED frame) looks identical to no-response.

Add _login_to_repeater(): registers waiters for both LOGIN_SUCCESS and LOGIN_FAILED, sends the login, and races them with generous headroom (20s). The add_repeater step now reports:
- login_failed -> password rejected (LOGIN_FAILED received)
- login_timeout -> no response (likely busy/unreachable; not the password)

and the two error strings are reworded to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
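The race between the two reply frames can be sketched as follows. This is a minimal illustration, not the PR's `_login_to_repeater`: `race_login`, `LoginResult`, and the two futures standing in for the LOGIN_SUCCESS / LOGIN_FAILED waiters are hypothetical names, and how the SDK actually exposes those events is not shown here.

```python
import asyncio
from enum import Enum


class LoginResult(Enum):
    SUCCESS = "success"
    REJECTED = "login_failed"   # a LOGIN_FAILED frame arrived
    TIMEOUT = "login_timeout"   # neither frame arrived in time


async def race_login(send_login, success_waiter, failed_waiter, timeout=20.0):
    """Send the login, then wait for whichever reply frame arrives first.

    send_login is a hypothetical coroutine function that transmits the login.
    The two waiters are futures assumed to resolve when the matching frame
    is received.
    """
    await send_login()
    done, pending = await asyncio.wait(
        {success_waiter, failed_waiter},
        timeout=timeout,
        return_when=asyncio.FIRST_COMPLETED,
    )
    # Stop listening for the frame that did not arrive.
    for waiter in pending:
        waiter.cancel()
    if failed_waiter in done:
        return LoginResult.REJECTED
    if success_waiter in done:
        return LoginResult.SUCCESS
    return LoginResult.TIMEOUT
```

Registering both waiters before sending is what lets a rejection be reported the moment it arrives instead of being folded into the timeout case.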
Root cause confirmed in testing: a repeater with a stale stored path never receives a direct login, so it times out. Logging in from the app only worked after clearing the path (flood). The add-repeater flow had the same problem.

_login_to_repeater now does a direct attempt (12s) and, on timeout, resets the path (reset_path -> flood) and retries once with more headroom (20s), mirroring the manual fix. A LOGIN_FAILED (wrong password) returns immediately without touching the path.

login_timeout message reworded to note the path was already reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
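The direct-then-flood sequence described in this commit could look roughly like the sketch below. It assumes a hypothetical `attempt_login(timeout)` coroutine that returns `"success"`, `"rejected"`, or `"timeout"`, and a hypothetical `reset_path()` coroutine; only the 12s / 20s windows come from the commit message.

```python
from collections.abc import Awaitable, Callable

DIRECT_TIMEOUT = 12.0  # first attempt over the stored direct path
FLOOD_TIMEOUT = 20.0   # retry after the path has been reset to flood


async def login_with_path_reset(
    attempt_login: Callable[[float], Awaitable[str]],
    reset_path: Callable[[], Awaitable[None]],
) -> str:
    """Try a direct login; on timeout only, reset the path and retry once."""
    result = await attempt_login(DIRECT_TIMEOUT)
    if result != "timeout":
        # Success or an explicit rejection: the path is left untouched.
        return result
    await reset_path()
    return await attempt_login(FLOOD_TIMEOUT)
```

A second timeout is returned as-is, so the caller can report that the path was already reset.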
A freshly-added or moved repeater has a stale stored direct path, so the first req_status_sync silently times out and the Online sensor sits at "unknown" (gray card). Previously the path was only reset after MAX_FAILURES_BEFORE_PATH_RESET (3) backoff-spaced failures, so recovery took minutes.

Add _call_with_path_recovery: when a mesh request times out and a direct path exists, reset the path to flood, refresh the contact, and retry once within the same poll, which is what the iOS app's manual "clear path" did. Wire it into the repeater status request, the repeater recovery login, and the telemetry request.

It only resets when out_path_len > -1, so once a node is flooding there is nothing to reset and it cannot loop; the disable_path_reset flag is still honored and the post-3-failures reset remains as a fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
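The guard that keeps this from looping is easier to see in code. The sketch below is an assumption-laden stand-in for `_call_with_path_recovery`: the `Contact` dataclass, the `request` and `reset_path` callables, and the convention that a timed-out request returns `None` are all hypothetical; `out_path_len` and `disable_path_reset` are the names used in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    out_path_len: int  # -1 is taken to mean "no direct path, flooding"


async def call_with_path_recovery(request, contact, reset_path,
                                  disable_path_reset=False):
    """Run a mesh request; on no-response, reset a direct path and retry once.

    request() is assumed to return None when the node does not answer.
    reset_path(contact) is assumed to switch the contact to flood and
    refresh it, leaving contact.out_path_len at -1.
    """
    result = await request()
    if result is not None:
        return result
    # Nothing to reset when already flooding, or when the user opted out.
    if disable_path_reset or contact.out_path_len <= -1:
        return None
    await reset_path(contact)
    return await request()
```

Because the retry happens only while `out_path_len > -1`, a node that is already flooding gets exactly one request per call.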
No mesh call in a repeater poll had a timeout. The "_sync" SDK commands have short internal timeouts, but a wedged BLE/serial link can leave an await hanging indefinitely. When that happens the whole _update_repeater task never completes, and the "still running, skipping" guard then skips that repeater on every tick forever: no status polls, no neighbor fetches, until HA restarts.

This surfaced after enabling repeater neighbors: fetch_all_neighbours (44 neighbors) is a large multi-frame query that hung, freezing the poll task. The status sensors had already updated (card green) but the neighbor list never populated and polling stalled.

Wrap every mesh call in the poll path with asyncio.wait_for:
- req_status_sync / send_login_sync / req_telemetry_sync via the existing _call_with_path_recovery helper (a timed-out call is treated as a normal no-response, which flows into path reset + backoff)
- reset_path in _reset_node_path
- fetch_all_neighbours in _fetch_repeater_neighbours (own NEIGHBOR_FETCH_TIMEOUT)

A hang now becomes an ordinary recoverable failure instead of a permanent wedge. Adds MESH_COMMAND_TIMEOUT / NEIGHBOR_FETCH_TIMEOUT (60s) and a test that a hanging command is treated as no-response and triggers recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
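A minimal sketch of the bounding pattern, assuming the hang is an await that never returns. `bounded_call` is a hypothetical helper name, and the timeout is passed in by the caller rather than read from the integration's constants.

```python
import asyncio


async def bounded_call(make_call, timeout):
    """Await make_call() but give up after `timeout` seconds.

    make_call is a zero-argument function returning the coroutine to run.
    A timeout is mapped to None so callers can treat it like any other
    no-response result.
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the hung call at this point.
        return None
```

Since `asyncio.wait_for` cancels the inner awaitable on timeout, the poll task can finish and the "still running, skipping" guard stops blocking later ticks.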
awolden
left a comment
the bounded-call timeouts are good, keep those. drop the coordinator first-timeout path reset (_call_with_path_recovery on login/status/telemetry):
- stale paths already recover via `MAX_FAILURES_BEFORE_PATH_RESET` (3 failures); this just moves the trigger to the 1st timeout
- `reset_path` forces a flood (region-wide airtime); the 3-strike threshold is a deliberate throttle, we want fewer floods not more
- flip-flop: a successful flood-retry resets the failure counter so backoff never engages, so a marginal node re-floods nearly every poll
- reset + retry sends also bypass the rate limiter
zooming out: a lot of the integration's design is about being a good citizen on the mesh (rate limiter, backoff, the 3-strike threshold). flooding on the first timeout cuts directly against that, it makes us one of the noisier nodes in the region instead of one of the quieter ones. that's the opposite of the direction we want to push.
keep the existing 3-strike reset. rejection-vs-timeout in the add-repeater flow is fine (one-time).
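The flip-flop concern can be illustrated with a toy model. Everything in it is an assumption made for illustration and not measured behavior: the node's direct path always times out, every flooded request succeeds, and a successful flood re-learns a direct path. Backoff spacing is ignored, which only understates the gap.

```python
MAX_FAILURES_BEFORE_PATH_RESET = 3


def count_path_resets(polls: int, reset_on_first_timeout: bool) -> int:
    """Count path resets (floods forced by reset_path) over a number of polls."""
    has_direct_path = True
    failures = 0
    resets = 0
    for _ in range(polls):
        if not has_direct_path:
            # Flooded request succeeds; the node re-learns a direct path.
            has_direct_path = True
            failures = 0
            continue
        # The direct request times out (toy assumption).
        if reset_on_first_timeout:
            # Reset and flood-retry within the same poll. The retry succeeds,
            # so the failure counter never grows and a direct path returns.
            resets += 1
            failures = 0
            continue
        failures += 1
        if failures >= MAX_FAILURES_BEFORE_PATH_RESET:
            resets += 1
            has_direct_path = False
            failures = 0
    return resets
```

Under these assumptions, 12 polls give 12 resets with first-timeout recovery and 3 with the 3-strike threshold.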
the test is a standalone copy of the method and has drifted from the real one (sleep(0) vs sleep(0.5)); drive the real coordinator method instead.
Closes #202
What
Three related robustness fixes for repeater login and polling, all stemming from the same lesson: a stale stored path (or a wedged link) shouldn't look like a generic failure.
1. Distinguish login rejection from timeout (config flow)
`send_login_sync` used a tight (~5 s) window and only waited for `LOGIN_SUCCESS`, ignoring `LOGIN_FAILED`, so a wrong password and an unreachable repeater both surfaced as the same vague error. Now `_attempt_login` waits for both `LOGIN_SUCCESS` and `LOGIN_FAILED` on a wider window and reports `success` / `rejected` / `timeout` distinctly (`login_failed` vs `login_timeout` error strings).

2. Reset stale path → flood and retry on first timeout
A freshly-added or moved repeater has a stale direct path, so the first request silently times out and the Online sensor sits at "unknown" (gray card) until `MAX_FAILURES_BEFORE_PATH_RESET` backoff-spaced failures elapse. `_login_to_repeater` (config flow) and `_call_with_path_recovery` (coordinator) now reset the path to flood and retry once on the first timeout, which is what the iOS app's manual "clear path" did. Applied to the repeater status request, recovery login, and telemetry request. Only fires when a direct path exists (`out_path_len > -1`), so it can't loop while flooding, and it honors `disable_path_reset`.

3. Bound every mesh call so a hang can't wedge polling
No mesh call in a repeater poll had a timeout. A wedged BLE/serial link could leave one `await` hanging forever; the poll task then never completed and the "still running, skipping" guard skipped that repeater on every tick until restart (observed with a large `fetch_all_neighbours`). Every mesh call in the poll path is now wrapped in `asyncio.wait_for` (`MESH_COMMAND_TIMEOUT` / `NEIGHBOR_FETCH_TIMEOUT`), turning a hang into an ordinary recoverable failure.

Tests
`tests/test_path_recovery.py` covers the path-reset-on-timeout logic and the hang-treated-as-no-response behavior.

Tested with a Thinknode M1 and Raspberry Pi 4 running HA.
🤖 Generated with Claude Code