Repeater login & polling resilience: rejection vs timeout, auto path-reset, bounded mesh calls by drewmccal · Pull Request #296 · meshcore-dev/meshcore-ha

drewmccal · 2026-06-24T20:02:58Z

Closes #202

What

Three related robustness fixes for repeater login and polling, all stemming from the same lesson: a stale stored path (or a wedged link) shouldn't look like a generic failure.

1. Distinguish login rejection from timeout (config flow)

send_login_sync used a tight (~5 s) window and only waited for LOGIN_SUCCESS, ignoring LOGIN_FAILED — so a wrong password and an unreachable repeater both surfaced as the same vague error. Now _attempt_login waits for both LOGIN_SUCCESS and LOGIN_FAILED on a wider window and reports success / rejected / timeout distinctly (login_failed vs login_timeout error strings).
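The success / rejected / timeout distinction described above can be sketched as a race between two waiters. This is a minimal illustration, not the integration's code: `attempt_login`, `LoginResult`, and the three callables passed in are hypothetical stand-ins for the SDK's login command and event waiters, and the 20-second default only mirrors the headroom mentioned in the commit message.

```python
import asyncio
from enum import Enum


class LoginResult(Enum):
    SUCCESS = "success"
    REJECTED = "login_failed"
    TIMEOUT = "login_timeout"


async def attempt_login(send_login, wait_success, wait_failed, timeout=20.0):
    """Race a success waiter against a failure waiter, bounded by `timeout`.

    All three callables are hypothetical async functions standing in for the
    SDK's login command and its LOGIN_SUCCESS / LOGIN_FAILED waiters.
    """
    # Register both waiters before sending so a fast reply is not missed.
    success = asyncio.create_task(wait_success())
    failed = asyncio.create_task(wait_failed())
    try:
        await send_login()
        done, _ = await asyncio.wait(
            {success, failed},
            timeout=timeout,
            return_when=asyncio.FIRST_COMPLETED,
        )
    finally:
        # Whichever waiter did not fire is no longer needed.
        for task in (success, failed):
            if not task.done():
                task.cancel()

    if success in done:
        return LoginResult.SUCCESS
    if failed in done:
        return LoginResult.REJECTED
    return LoginResult.TIMEOUT
```

A caller in the config flow could then map `REJECTED` and `TIMEOUT` directly to the `login_failed` and `login_timeout` error strings.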

2. Reset stale path → flood and retry on first timeout

A freshly-added or moved repeater has a stale direct path, so the first request silently times out and the Online sensor sits at "unknown" (gray card) until MAX_FAILURES_BEFORE_PATH_RESET backoff-spaced failures elapse. _login_to_repeater (config flow) and _call_with_path_recovery (coordinator) now reset the path to flood and retry once on the first timeout — the same thing the iOS app's manual "clear path" did. Applied to the repeater status request, recovery login, and telemetry request. Only fires when a direct path exists (out_path_len > -1), so it can't loop while flooding, and it honors disable_path_reset.
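The reset-and-retry-once behavior can be sketched roughly as below. The function name matches the PR text, but the signature is an assumption: `request` and `reset_path` are hypothetical async callables, and a `None` return is assumed to mean "no response".

```python
import asyncio


async def call_with_path_recovery(request, reset_path, out_path_len,
                                  disable_path_reset=False):
    """Run `request`; on a first timeout with a direct path, flood and retry once.

    `request` and `reset_path` are hypothetical async callables. `request` is
    assumed to return None when the mesh call gets no response.
    """
    result = await request()
    if result is not None:
        return result

    # out_path_len == -1 means the node is already flooding: there is nothing
    # to reset, so this branch cannot loop. The opt-out flag is also honored.
    if disable_path_reset or out_path_len <= -1:
        return None

    await reset_path()      # drop the stale direct path, fall back to flood
    return await request()  # exactly one retry within the same poll
```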

3. Bound every mesh call so a hang can't wedge polling

No mesh call in a repeater poll had a timeout. A wedged BLE/serial link could leave one await hanging forever; the poll task then never completed and the "still running, skipping" guard skipped that repeater on every tick until restart (observed with a large fetch_all_neighbours). Every mesh call in the poll path is now wrapped in asyncio.wait_for (MESH_COMMAND_TIMEOUT / NEIGHBOR_FETCH_TIMEOUT), turning a hang into an ordinary recoverable failure.
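A minimal sketch of the bounding pattern, assuming a hung call should be reported as an ordinary no-response (`None`). The helper name `bounded` and the timeout value used here are illustrative only; the constant name comes from the PR, its value here does not.

```python
import asyncio

MESH_COMMAND_TIMEOUT = 30.0  # illustrative value, not taken from the PR


async def bounded(awaitable, timeout=MESH_COMMAND_TIMEOUT):
    """Await `awaitable`, returning None instead of hanging past `timeout`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner call; report no-response
        # so the poll task finishes and the next tick is not skipped.
        return None
```

Because `asyncio.wait_for` cancels the wrapped call on timeout, the poll task always completes, which is what lets the "still running, skipping" guard release the repeater on the next tick.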

Tests

tests/test_path_recovery.py covers the path-reset-on-timeout logic and the hang-treated-as-no-response behavior.
Tested with a Thinknode M1 and Raspberry Pi 4 running HA.

🤖 Generated with Claude Code

Adding a repeater showed "Failed to log in ... Check password" on ANY failure, including a timeout.

Root cause: the SDK's send_login_sync only waits for LOGIN_SUCCESS within a tight window (suggested_timeout/800 ~ a few seconds), so a slow/multi-hop login reply times out and a rejected password (a distinct LOGIN_FAILED frame) looks identical to no-response.

Add _login_to_repeater(): registers waiters for both LOGIN_SUCCESS and LOGIN_FAILED, sends the login, and races them with generous headroom (20s). The add_repeater step now reports:

- login_failed -> password rejected (LOGIN_FAILED received)
- login_timeout -> no response (likely busy/unreachable; not the password)

and the two error strings are reworded to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Root cause confirmed in testing: a repeater with a stale stored path never receives a direct login, so it times out; logging in from the app only worked after clearing the path (flood). The add-repeater flow had the same problem.

_login_to_repeater now does a direct attempt (12s) and, on timeout, resets the path (reset_path -> flood) and retries once with more headroom (20s), mirroring the manual fix. A LOGIN_FAILED (wrong password) returns immediately without touching the path.

login_timeout message reworded to note the path was already reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A freshly-added or moved repeater has a stale stored direct path, so the first req_status_sync silently times out and the Online sensor sits at "unknown" (gray card). Previously the path was only reset after MAX_FAILURES_BEFORE_PATH_RESET (3) backoff-spaced failures, so recovery took minutes.

Add _call_with_path_recovery: when a mesh request times out and a direct path exists, reset the path to flood, refresh the contact, and retry once within the same poll (the same thing the iOS app's manual "clear path" did). Wire it into the repeater status request, the repeater recovery login, and the telemetry request.

It only resets when out_path_len > -1, so once a node is flooding there is nothing to reset and it cannot loop; the disable_path_reset flag is still honored and the post-3-failures reset remains as a fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

No mesh call in a repeater poll had a timeout. The "_sync" SDK commands have short internal timeouts, but a wedged BLE/serial link can leave an await hanging indefinitely. When that happens the whole _update_repeater task never completes, and the "still running, skipping" guard then skips that repeater on every tick forever: no status polls, no neighbor fetches, until HA restarts.

This surfaced after enabling repeater neighbors: fetch_all_neighbours (44 neighbors) is a large multi-frame query that hung, freezing the poll task. The status sensors had already updated (card green) but the neighbor list never populated and polling stalled.

Wrap every mesh call in the poll path with asyncio.wait_for:

- req_status_sync / send_login_sync / req_telemetry_sync via the existing _call_with_path_recovery helper (a timed-out call is treated as a normal no-response, which flows into path reset + backoff)
- reset_path in _reset_node_path
- fetch_all_neighbours in _fetch_repeater_neighbours (own NEIGHBOR_FETCH_TIMEOUT)

A hang now becomes an ordinary recoverable failure instead of a permanent wedge. Adds MESH_COMMAND_TIMEOUT / NEIGHBOR_FETCH_TIMEOUT (60s) and a test that a hanging command is treated as no-response and triggers recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

awolden

the bounded-call timeouts are good, keep those. drop the coordinator first-timeout path reset (_call_with_path_recovery on login/status/telemetry):

- stale paths already recover via MAX_FAILURES_BEFORE_PATH_RESET (3 failures); this just moves the trigger to the 1st timeout
- reset_path forces a flood (region-wide airtime); the 3-strike threshold is a deliberate throttle, we want fewer floods not more
- flip-flop: a successful flood-retry resets the failure counter so backoff never engages, so a marginal node re-floods nearly every poll
- reset + retry sends also bypass the rate limiter

zooming out: a lot of the integration's design is about being a good citizen on the mesh (rate limiter, backoff, the 3-strike threshold). flooding on the first timeout cuts directly against that, it makes us one of the noisier nodes in the region instead of one of the quieter ones. that's the opposite of the direction we want to push.

keep the existing 3-strike reset. rejection-vs-timeout in the add-repeater flow is fine (one-time).
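for reference, the 3-strike throttle described above can be sketched as a small counter. this is an illustration only: the class name `PathResetThrottle` is hypothetical, and restarting the streak after a reset is an assumption rather than something confirmed from the coordinator code.

```python
MAX_FAILURES_BEFORE_PATH_RESET = 3  # threshold named in the review


class PathResetThrottle:
    """Count consecutive failures and allow a flood reset only at the threshold."""

    def __init__(self, threshold: int = MAX_FAILURES_BEFORE_PATH_RESET) -> None:
        self.threshold = threshold
        self.failures = 0

    def record_success(self) -> None:
        # Any successful poll clears the streak.
        self.failures = 0

    def record_failure(self) -> bool:
        """Return True when this failure should trigger a path reset."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # assumption: the streak restarts after a reset
            return True
        return False
```

with this shape a node has to miss three polls in a row before a flood is sent, whereas a reset on the first timeout is equivalent to `threshold=1`.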

the test is a standalone copy of the method and has drifted from the real one (sleep(0) vs sleep(0.5)); drive the real coordinator method instead.
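one way to drive the real method without paying for its delay is to patch `asyncio.sleep` rather than copy the method with `sleep(0)`. the sketch below assumes a hypothetical `Coordinator` class and `_reset_and_retry` method as stand-ins; in the real test the actual coordinator would be imported instead, and where the 0.5 s sleep sits in the real method is not known from this page.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class Coordinator:
    """Hypothetical stand-in for the integration's real coordinator class."""

    async def _reset_and_retry(self, request, reset_path):
        await reset_path()
        await asyncio.sleep(0.5)  # assumed delay; patched out in the test
        return await request()


async def run_recovery_test():
    coordinator = Coordinator()
    request = AsyncMock(return_value={"uptime": 42})
    reset_path = AsyncMock()
    # Patch the delay instead of maintaining a copy of the method.
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = await coordinator._reset_and_retry(request, reset_path)
    reset_path.assert_awaited_once()
    fake_sleep.assert_awaited_once_with(0.5)
    return result
```

because the test calls the production method, a later change to the delay or the call order fails the test instead of silently drifting.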

drewmccal and others added 4 commits June 24, 2026 13:19

awolden requested changes Jun 26, 2026

drewmccal wants to merge 4 commits into meshcore-dev:main from drewmccal:feature/login-feedback
