Repeater login & polling resilience: rejection vs timeout, auto path-reset, bounded mesh calls #296

Open
drewmccal wants to merge 4 commits into meshcore-dev:main from drewmccal:feature/login-feedback

@drewmccal drewmccal commented Jun 24, 2026

Closes #202

What

Three related robustness fixes for repeater login and polling, all stemming from the same lesson: a stale stored path (or a wedged link) shouldn't look like a generic failure.

1. Distinguish login rejection from timeout (config flow)

send_login_sync used a tight (~5 s) window and only waited for LOGIN_SUCCESS, ignoring LOGIN_FAILED — so a wrong password and an unreachable repeater both surfaced as the same vague error. Now _attempt_login waits for both LOGIN_SUCCESS and LOGIN_FAILED on a wider window and reports success / rejected / timeout distinctly (login_failed vs login_timeout error strings).
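
The race between the two login outcomes can be sketched roughly as follows. This is a minimal illustration, not the integration's code: `attempt_login`, `LoginResult`, and the three callables passed in are hypothetical stand-ins for the SDK's event waiters and login command.

```python
import asyncio
from enum import Enum


class LoginResult(Enum):
    SUCCESS = "success"
    REJECTED = "login_failed"   # error string named in this PR
    TIMEOUT = "login_timeout"   # error string named in this PR


async def attempt_login(wait_success, wait_failed, send_login, timeout):
    """Race a success waiter against a failure waiter under one deadline.

    wait_success, wait_failed and send_login are hypothetical coroutine
    functions; the real SDK calls are not shown here.
    """
    success = asyncio.ensure_future(wait_success())
    failed = asyncio.ensure_future(wait_failed())
    try:
        # Yield once so both waiter tasks start before the login goes out.
        await asyncio.sleep(0)
        await send_login()
        done, _ = await asyncio.wait(
            {success, failed},
            timeout=timeout,
            return_when=asyncio.FIRST_COMPLETED,
        )
    finally:
        # Cancel whichever waiter is still pending and let it unwind.
        for task in (success, failed):
            task.cancel()
        await asyncio.gather(success, failed, return_exceptions=True)
    if failed in done:
        return LoginResult.REJECTED
    if success in done:
        return LoginResult.SUCCESS
    return LoginResult.TIMEOUT
```

An empty `done` set means neither frame arrived before the deadline, which is the only case reported as a timeout.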

2. Reset stale path → flood and retry on first timeout

A freshly-added or moved repeater has a stale direct path, so the first request silently times out and the Online sensor sits at "unknown" (gray card) until MAX_FAILURES_BEFORE_PATH_RESET backoff-spaced failures elapse. _login_to_repeater (config flow) and _call_with_path_recovery (coordinator) now reset the path to flood and retry once on the first timeout — the same thing the iOS app's manual "clear path" did. Applied to the repeater status request, recovery login, and telemetry request. Only fires when a direct path exists (out_path_len > -1), so it can't loop while flooding, and it honors disable_path_reset.
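
The reset-and-retry-once rule can be sketched as below, under stated assumptions: the function name, its parameters, and the use of `None` to mean "no response" are illustrative and not taken from the coordinator's actual signature.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_path_recovery(
    request: Callable[[], Awaitable[Optional[T]]],
    reset_path: Callable[[], Awaitable[None]],
    out_path_len: int,
    disable_path_reset: bool = False,
) -> Optional[T]:
    """Run request(); on no-response, reset the path to flood and retry once.

    Assumption: request() returns None when the repeater did not answer.
    """
    result = await request()
    if result is not None:
        return result
    # Only a stored direct path can be stale; -1 means already flooding,
    # so there is nothing to reset and no way to loop.
    if disable_path_reset or out_path_len <= -1:
        return None
    await reset_path()
    return await request()
```

The retry happens at most once per call, so a node that still does not answer after the flood falls through to the normal failure handling.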

3. Bound every mesh call so a hang can't wedge polling

No mesh call in a repeater poll had a timeout. A wedged BLE/serial link could leave one await hanging forever; the poll task then never completed and the "still running, skipping" guard skipped that repeater on every tick until restart (observed with a large fetch_all_neighbours). Every mesh call in the poll path is now wrapped in asyncio.wait_for (MESH_COMMAND_TIMEOUT / NEIGHBOR_FETCH_TIMEOUT), turning a hang into an ordinary recoverable failure.
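
A minimal sketch of the bounding pattern, assuming a hypothetical `bounded` helper; only `asyncio.wait_for` and the constant name `MESH_COMMAND_TIMEOUT` come from this PR, and the value shown is illustrative.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")

MESH_COMMAND_TIMEOUT = 30.0  # seconds; illustrative value, not the PR's


async def bounded(call: Awaitable[T], timeout: float) -> Optional[T]:
    """Await a mesh call, turning a hang into an ordinary no-response."""
    try:
        return await asyncio.wait_for(call, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the hung call at this point.
        return None
```

Because the poll task always returns, the "still running, skipping" guard can no longer pin a repeater in the skipped state.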

Tests

tests/test_path_recovery.py covers the path-reset-on-timeout logic and the hang-treated-as-no-response behavior.
Tested with a Thinknode M1 and Raspberry Pi 4 running HA.

🤖 Generated with Claude Code

drewmccal and others added 4 commits June 24, 2026 13:19
Adding a repeater showed "Failed to log in ... Check password" on ANY
failure, including a timeout. Root cause: the SDK's send_login_sync only
waits for LOGIN_SUCCESS within a tight window (suggested_timeout/800 ~ a
few seconds), so a slow/multi-hop login reply times out and a rejected
password (a distinct LOGIN_FAILED frame) looks identical to no-response.

Add _login_to_repeater(): registers waiters for both LOGIN_SUCCESS and
LOGIN_FAILED, sends the login, and races them with generous headroom
(20s). The add_repeater step now reports:
  - login_failed  -> password rejected (LOGIN_FAILED received)
  - login_timeout -> no response (likely busy/unreachable; not the password)
and the two error strings are reworded to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Root cause confirmed in testing: a repeater with a stale stored path
never receives a direct login, so it times out — logging in from the app
only worked after clearing the path (flood). The add-repeater flow had
the same problem.

_login_to_repeater now does a direct attempt (12s) and, on timeout,
resets the path (reset_path -> flood) and retries once with more headroom
(20s) — mirroring the manual fix. A LOGIN_FAILED (wrong password) returns
immediately without touching the path. login_timeout message reworded to
note the path was already reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A freshly-added or moved repeater has a stale stored direct path, so the
first req_status_sync silently times out and the Online sensor sits at
"unknown" (gray card). Previously the path was only reset after
MAX_FAILURES_BEFORE_PATH_RESET (3) backoff-spaced failures, so recovery
took minutes.

Add _call_with_path_recovery: when a mesh request times out and a direct
path exists, reset the path to flood, refresh the contact, and retry once
within the same poll — the same thing the iOS app's manual "clear path"
did. Wire it into the repeater status request, the repeater recovery
login, and the telemetry request.

It only resets when out_path_len > -1, so once a node is flooding there is
nothing to reset and it cannot loop; the disable_path_reset flag is still
honored and the post-3-failures reset remains as a fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

No mesh call in a repeater poll had a timeout. The "_sync" SDK commands
have short internal timeouts, but a wedged BLE/serial link can leave an
await hanging indefinitely. When that happens the whole _update_repeater
task never completes, and the "still running, skipping" guard then skips
that repeater on every tick forever — no status polls, no neighbor fetches,
until HA restarts.

This surfaced after enabling repeater neighbors: fetch_all_neighbours
(44 neighbors) is a large multi-frame query that hung, freezing the poll
task. The status sensors had already updated (card green) but the neighbor
list never populated and polling stalled.

Wrap every mesh call in the poll path with asyncio.wait_for:
- req_status_sync / send_login_sync / req_telemetry_sync via the existing
  _call_with_path_recovery helper (a timed-out call is treated as a normal
  no-response, which flows into path reset + backoff)
- reset_path in _reset_node_path
- fetch_all_neighbours in _fetch_repeater_neighbours (own NEIGHBOR_FETCH_TIMEOUT)

A hang now becomes an ordinary recoverable failure instead of a permanent
wedge. Adds MESH_COMMAND_TIMEOUT / NEIGHBOR_FETCH_TIMEOUT (60s) and a test
that a hanging command is treated as no-response and triggers recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@awolden awolden left a comment

Collaborator

the bounded-call timeouts are good, keep those. drop the coordinator first-timeout path reset (_call_with_path_recovery on login/status/telemetry):

  • stale paths already recover via MAX_FAILURES_BEFORE_PATH_RESET (3 failures); this just moves the trigger to the 1st timeout
  • reset_path forces a flood (region-wide airtime); the 3-strike threshold is a deliberate throttle, we want fewer floods not more
  • flip-flop: a successful flood-retry resets the failure counter so backoff never engages, so a marginal node re-floods nearly every poll
  • reset + retry sends also bypass the rate limiter
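
The flood-count difference can be illustrated with a toy model. Everything in it is an assumption made for illustration: a direct request to the marginal node always times out, a flood request always succeeds, every success re-learns a direct path, and backoff and the rate limiter are ignored. `count_floods` is a hypothetical helper, not integration code.

```python
def count_floods(polls: int, threshold: int, retry_in_same_poll: bool) -> int:
    """Count flood sends over a number of polls for a toy marginal node."""
    floods = 0
    failures = 0
    direct_path = True
    for _ in range(polls):
        if direct_path:
            failures += 1            # the direct attempt times out
            if failures < threshold:
                continue             # below the threshold: no reset yet
            direct_path = False      # reset_path: the next send floods
            failures = 0
            if not retry_in_same_poll:
                continue             # the flood send waits for the next poll
        floods += 1                  # flood send, which succeeds
        failures = 0                 # success clears the failure counter
        direct_path = True           # the reply re-learns a direct path
    return floods


first_timeout = count_floods(12, threshold=1, retry_in_same_poll=True)
three_strike = count_floods(12, threshold=3, retry_in_same_poll=False)
```

In this model the first-timeout policy floods on all 12 polls, while the 3-strike policy floods on 3 of them, before any backoff is taken into account.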

zooming out: a lot of the integration's design is about being a good citizen on the mesh (rate limiter, backoff, the 3-strike threshold). flooding on the first timeout cuts directly against that, it makes us one of the noisier nodes in the region instead of one of the quieter ones. that's the opposite of the direction we want to push.

keep the existing 3-strike reset. rejection-vs-timeout in the add-repeater flow is fine (one-time).

the test is a standalone copy of the method and has drifted from the real one (sleep(0) vs sleep(0.5)); drive the real coordinator method instead.
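
One way to drive a real method instead of a copy is sketched below. The `Coordinator` class here is only a stand-in so the example runs on its own; its method body and signature are assumptions, and in the actual test it would be replaced by an import of the integration's coordinator. The point is that the test calls the class's own method and patches `asyncio.sleep`, so a changed delay cannot drift unnoticed.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class Coordinator:
    """Stand-in for the real coordinator; replace with the real import."""

    async def _call_with_path_recovery(self, request, reset_path):
        result = await request()
        if result is None:
            await reset_path()
            await asyncio.sleep(0.5)  # assumed delay before the retry
            result = await request()
        return result


async def run_real_method():
    # Skip __init__ so no Home Assistant objects are needed.
    coordinator = Coordinator.__new__(Coordinator)
    request = AsyncMock(side_effect=[None, "ok"])  # timeout, then success
    reset_path = AsyncMock()
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = await coordinator._call_with_path_recovery(request, reset_path)
    # Fails if the method's delay ever changes without the test changing.
    fake_sleep.assert_awaited_once_with(0.5)
    return result, request.await_count, reset_path.await_count
```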

Development

Successfully merging this pull request may close these issues.

Login timing out with larger hop numbers despite repeater responding and providing path.
