Skip to content

feat(ble-proxy): serialize BLE connects + structured slot-exhaustion errors#704

Draft
Apollon77 wants to merge 14 commits into
mainfrom
fix/better-errorhandling
Draft

feat(ble-proxy): serialize BLE connects + structured slot-exhaustion errors#704
Apollon77 wants to merge 14 commits into
mainfrom
fix/better-errorhandling

Conversation

@Apollon77

@Apollon77 Apollon77 commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

A single physical device advertising several rotating BLE addresses caused matter.js parallel-PASE to fire concurrent BLE connect attempts at it. The proxy backend (HA/ESPHome) ran out of connection slots and commissioning failed with a cascade of timeouts (8 of 9 started attempt(s) failed).

This serializes BLE connects through the proxy and surfaces slot exhaustion as a clear, structured error. Scope: JS/TS server side only — see "Deferred" below for the python-client counterpart.

What changed

  • Connect gate (ProxyBleCentralInterface): a matter.js Semaphore (concurrency = 1) acquired before Connect and held for the channel lifetime, released idempotently on close/failure. At most one BLE connection in flight, so the connect storm never forms. A queued connect aborts on interface close, an incoming options.abort, or a 20 s bounded wait.
  • Orphan teardown: matter.js never signals a blocked openChannel (it only races the promise via Abort.attempt, never passes options.abort on the BLE path) and Transport.close() only fires at node teardown. So a connect finishing after its commission was abandoned would hold the single slot forever. We track handed-out channels (#openChannels) and a generation check tears down a connect that completes after an abort; ProxyBleCentralInterface.abortPendingConnects() rejects queued waiters and closes orphans.
  • Generic commissioning lifecycle: ControllerCommandHandler emits policy-free commissioningStarted / commissioningEnded (per attempt; never for invalid requests). matter-server/src/bleConnectGate.ts wireBleConnectGate() ref-counts them and calls abortPendingConnects() only when no commission is in progress — so finishing one attempt can't tear down a concurrent attempt's connect. These events are generic and reusable by other JS consumers.
  • Structured errors (server side): new wire codes out_of_connection_slots + connection_aborted, propagated as a typed BleProxyError (carries .code). isOutOfConnectionSlotsError() classifies by structured code or a case-insensitive connection_failed message-text fallback, so an un-updated python client still gets the clear, actionable error ("increase connection_slots / add proxies").
  • Python client: only the discover_faileddiscovery_failed wire-code fix is included here (independent bug — the canonical code in the enum + docs is discovery_failed).

Tests

  • TS: serialization (only one Connect in flight), queue-abort on interface close, abortPendingConnects reject-but-reusable, orphan teardown (already-open + finishes-after-abort), clear slot-exhaustion error, typed-error round-trip; commissioningStarted/commissioningEnded balanced/per-attempt/not-on-invalid-args; wireBleConnectGate ref-count (release only after last concurrent attempt).
  • Full suite 247/247 (incl. live Commission-On-Network); python 6 passed; format/lint/build green.

Deferred to the bleak_retry_connector connect_strategy PR

The python client's structured connect-error classification (_classify_connect_error + bleak marker matching) was reverted from this PR to keep it JS-only; it belongs with the python-side connect path in the connect_strategy work. The server's message-text fallback keeps things working until then. The python/HA side also needs to wire the commissioning gate behavior. Tracked locally in ble-proxy/python-structured-connect-errors.

Other follow-ups (pre-existing, not in this PR)

  • _handle_connect TimeoutError returns connection_failed rather than the existing timeout code.
  • BleProxyHandler.#cleanupClient rejects with a plain Error (could be BleProxyError).
  • matter.js upstream: ControllerCommissioner omits { abort } to ble.openChannel (TCP path passes it); plumbing it would let the gate cancel queued connects instantly instead of via the 20 s backstop.

🤖 Generated with Claude Code

Apollon77 and others added 8 commits June 1, 2026 14:41
Add out_of_connection_slots and connection_aborted wire codes, a
BleProxyError carrying the code, and isOutOfConnectionSlotsError() which
classifies by structured code with a connection_failed message-text
fallback for proxy clients predating the code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#handleResponse now rejects failed commands with BleProxyError so callers
can read the structured .code (e.g. out_of_connection_slots) instead of
only a flattened message. Adds a TestProxyClientError test helper so the
test client can emit a chosen wire error code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odes

_handle_connect now maps BLE connect exceptions to out_of_connection_slots,
connection_aborted, device_not_found, or bluetooth_unavailable (else
connection_failed), matching by exception class name + message markers so
bleak_retry_connector stays a soft dependency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a Semaphore(1) connect gate in ProxyBleCentralInterface so matter.js
parallel-PASE against a device's rotating BLE addresses can't exhaust the
proxy backend's connection slots. The slot is held for the channel lifetime
and released idempotently on close/failure; a queued connect aborts on
interface close, an incoming options.abort, or a 20s bounded wait.

Because matter.js never signals a blocked openChannel, also track handed-out
channels and tear down orphans (plus reject queued waiters) via
abortPendingConnects(), and a generation check tears down a connect that
completes after an abort — so a cancelled/timed-out commission can't leave a
channel holding the single slot forever. Surface a clear actionable error on
slot exhaustion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Emit a commissioningEnded event from ControllerCommandHandler.commissionNode
(finally, so success and failure both fire) and wire it in MatterServer to
ProxyBle.abortPendingConnects(). BLE is only used during commissioning, so
this releases queued connect waiters and tears down any orphaned channel
holding the single connection slot once an attempt finishes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted codes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orphan branch threw into the connect-failure catch, which already sends
the proxy Disconnect — so it was sent twice. Let the catch own it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
node_modules/lockfile were on 0.17.0 while package.json pins the alpha; the
branch did not type-check until synced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 15:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens BLE Proxy commissioning by preventing concurrent BLE connect storms (which can exhaust proxy backend connection slots) and by introducing structured, typed error codes for clearer diagnosis across the TS server and Python proxy client.

Changes:

  • Serialize BLE connects through the proxy with a single-slot semaphore, plus lifecycle abort/cleanup to prevent “orphaned” connections from holding the only slot.
  • Add structured wire error codes (out_of_connection_slots, connection_aborted) and propagate them as typed errors (BleProxyError) end-to-end.
  • Add focused TS + Python tests covering connect gating, abort behavior, orphan cleanup, and error classification/round-trip.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python_ble_proxy/tests/test_connect_errors.py Adds unit tests for Python-side connect-error classification into structured wire codes.
python_ble_proxy/matter_ble_proxy/client.py Implements connect-error classification and uses it when connect fails.
packages/ws-controller/test/CommissioningLifecycleTest.ts Adds tests asserting a new commissioning lifecycle event fires on success/failure.
packages/ws-controller/src/controller/ControllerCommandHandler.ts Introduces commissioningEnded event emitted after commissionNode attempt finishes.
packages/matter-server/src/MatterServer.ts Hooks commissioning lifecycle to abort pending proxy BLE connects when commissioning ends.
packages/ble-proxy/test/BleProxyTestClient.ts Enhances test proxy client to return structured error codes from handlers.
packages/ble-proxy/test/BleProxyProtocolTest.ts Adds tests for slot-exhaustion classification and typed error preservation.
packages/ble-proxy/test/BleProxyIntegrationTest.ts Adds integration tests for connect serialization, abort behavior, orphan cleanup, and error messaging.
packages/ble-proxy/src/ProxyBleChannel.ts Implements connect concurrency gate, abort generation handling, and slot-exhaustion error surfacing.
packages/ble-proxy/src/ProxyBle.ts Exposes abortPendingConnects() to trigger central-interface connect abort/cleanup.
packages/ble-proxy/src/BleProxyProtocol.ts Adds new error codes, slot-exhaustion message marker matching, and BleProxyError.
packages/ble-proxy/src/BleProxyHandler.ts Preserves structured error codes by rejecting pending commands with BleProxyError.
package-lock.json Updates workspace Node engine constraints in the lockfile to >=22.13.0.
docs/ble-proxy-protocol.md Documents the new structured connect error codes.

Comment thread python_ble_proxy/matter_ble_proxy/client.py Outdated
Comment thread packages/ws-controller/src/controller/ControllerCommandHandler.ts
Apollon77 and others added 3 commits June 1, 2026 17:38
The client sent discover_failed; the protocol enum and docs use
discovery_failed. Align the client with the canonical wire code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror the TS isOutOfConnectionSlotsError lowercasing so mixed-case backend
messages still classify as slot-exhaustion/aborted instead of falling back
to connection_failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keep commissioningStarted/commissioningEnded as plain lifecycle facts on the
controller (emitted per attempt, never for invalid requests) and move the
"release the BLE connect gate only when no commission is in progress" policy
into wireBleConnectGate, the consumer-side ref-counter. Finishing one
commission can no longer abort a concurrent attempt's connect, and the
controller event stays generic and reusable.

Addresses PR review feedback on the commissioningEnded finally placement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

python_ble_proxy/matter_ble_proxy/client.py:498

  • The PR description's follow-ups mention a pre-existing wire-code drift (discover_failed vs discovery_failed) as not addressed in this PR, but this change updates the Python client to emit discovery_failed. Please update the PR description/follow-ups so it matches what the code actually does (or revert this change if it was meant to stay out of scope).
        if conn.client.is_connected:
            await conn.client.disconnect()
        await self._send_success(cmd_id)

    async def _handle_discover_services(self, cmd_id: int, args: dict[str, Any]) -> None:
        conn = self._get_connection(args["connection_handle"])

Apollon77 and others added 2 commits June 1, 2026 19:41
Keep PR #704 JS-only. The python client's _classify_connect_error (and its
case-insensitive markers) move to the bleak_retry_connector connect_strategy
PR, where the python-side connect path lives. Server still classifies
slot-exhaustion via the connection_failed message-text fallback, so an
un-updated client keeps working. discovery_failed fix retained.

Tracked in todo: ble-proxy/python-structured-connect-errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Comment on lines 484 to 488
async close() {
this.#connected = false;
this.#releaseSlot();
this.#cleanupObservers();
this.#terminateIterator();
Comment on lines +332 to +335
if (this.#closed || this.#abortGeneration !== abortGenAtStart) {
await proxyChannel.close();
throw new BleError(`BLE connect to ${peripheralAddress} aborted before completion`);
}
@Apollon77 Apollon77 marked this pull request as draft June 1, 2026 19:19
@mxssmu

mxssmu commented Jul 2, 2026

Copy link
Copy Markdown

We've been chasing what looks like exactly the failure class this PR targets, on a
~47-node Home Assistant Matter deployment (HAOS addon 9.0.2/9.0.3 = matterjs-server
1.1.1/1.1.2): BLE-proxy commissioning reliably fails post-C3 with GATT error 133 /
handle loss, with logs showing parallel connect attempts selecting a stale advertisement
address. Reproduced 4/4 on Govee H801D devices on a clean 9.0.2 validation run
(2026-06-25); the signature persists through 1.1.2.

Independent validation of this branch on macOS 26.5.1 arm64 (Apple M5), Node v26.3.1:
clean npm ci + build; full test suite 247/247 passing including the new
connect-gate/orphan-teardown/slot-exhaustion tests and the live Commission-On-Network
test; python_ble_proxy 6/6. One note: the branch currently conflicts with main,
v1.1.2, and v1.1.7 on 4 files (post-#708/#710 refactor), though it combines cleanly
with #677.

Since the branch conflicts with main post-#708/#710, we went ahead and did the rebase:
https://github.com/smalluniverse-net/matterjs-server/tree/pr-704-rebased-main (single
squashed commit onto 1.1.7, authored to you; all 14-commit-series semantics ported, both
test lineages passing 251/251, #677 still merges cleanly on top). One design note for
your judgment: under #708's multi-client model we scoped the connect gate PER
proxy-client connection (per physical adapter) rather than globally -- identical behavior
for single-client deployments, and a ~10-line change to a global semaphore if you prefer
the literal #704 semantics. We also folded in two of the open Copilot items
(slot-release-after-BTP-close, options.abort.aborted in the post-connect check, with a
regression test). Take whatever is useful -- cherry-pick, or we can open a PR against
your branch if that is easier.

Separately, we're looking at contributing two mock-transport tests the suite lacks for
this class: rotating-address advertisement simulation and scripted post-handshake
disconnect injection.

@Apollon77

Copy link
Copy Markdown
Collaborator Author

@mxssmu Which HA is running? There were changes and optimizations in HA core in the new 2026.07 ... please try that first because we were about to abandon this in favor of these changes

@mxssmu

mxssmu commented Jul 3, 2026

Copy link
Copy Markdown

@Apollon77 Thanks -- answering directly, plus fresh data from a 6-device commissioning
session we ran today, which happens to be exactly the "try 2026.07 and report back" you
asked for.

Environment: HA Core 2026.7.0b1 (current 2026.07 beta), Matter Server add-on 9.0.3
(matterjs-server 1.1.2), Bluetooth exclusively via ESPHome BLE proxies, 55 nodes on the
fabric (the "~47" in our first comment was stale; the fabric also grew by 6 today).

Result: 2026.7.0b1 is clearly better than what we reported before -- but the BLE-relay
class is still present, and it is device-sticky.
Today we commissioned 6 factory-fresh
Govee H6013 (Matter-over-WiFi; BLE is only the onboarding transport). In a single
interleaved run, 3 of the 6 succeeded over the BLE-proxy path (31-59 s each) -- a real
improvement over the 0/4 we saw on 1.1.1/1.1.2 with H801D, so credit where due on the
2026.07 changes. The other 3 failed in that same run (interleaved order
fail/ok/ok/fail/ok/fail: two operation_aborted at ~27 s, one 180 s timeout), and the
failures were sticky to the same 3 devices: across roughly 15 further BLE attempts today
(multiple factory resets, fresh commissioning windows, different orderings) those 3 failed
every single time, while their 3 siblings had succeeded immediately. Later attempts failed
at the discovery stage -- No commissionable device was discovered -- while the ESPHome
proxies' scans were seeing exactly those devices' 0xFFF6 commissionable advertisements
(discriminators 1382/3910/3697, RSSI -84..-89, 4-5 adverts per scan window) the whole time.
The adverts reach the proxy layer; commissioning discovery never matches them. (One honesty
note: that discovery-stage signature is earlier in the pipeline than the post-C3 GATT-133 /
stale-address data in our first comment, and a different model, H6013 vs H801D -- but same
BLE-relay lane.)

Controls that isolate the residual failure to the BLE lane (same HA instance, same
devices, same setup codes, minutes apart):

  • We joined the same 3 stuck bulbs to WiFi via the vendor app (vendor BLE provisioning, no
    Matter involvement), confirmed _matterc._udp CM=1 advertisements via mDNS, then
    commissioned each over IP (network_only): 3/3 success, 15-41 s each, first try.
  • Also falsified along the way: Matter Server add-on restart (no change), power-cycling the
    BLE proxy (no change), attempting ~12 minutes into an open commissioning window in case it
    was advert-timing (no change).

So devices, setup codes, fabric, and controller are all demonstrably healthy, and the
residual failure is deterministic per device rather than random radio flakiness -- which to
us smells like the advertisement/address-selection behavior this PR touches, and is why we
would gently push back on abandoning it outright: 2026.07 moved the needle from 0/4 to 3/6
for us, but not to 6/6.

Two caveats we are happy to close ourselves: (1) this is beta 1 -- if the changes you mean
landed after b1 we will rerun on 2026.7 final the day it ships; (2) our add-on is 9.0.3/1.1.2
-- if the optimizations live on the add-on side we will rerun on 9.0.4/1.1.7. Could you point
us at the specific core/add-on PRs you had in mind so we test the right combination? We can
also run an instrumented build with whatever debug logging would help, and the rebased branch
of this PR (previous comment) is still available for a native-BLE comparison run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants