htlcswitch: separate onchain and offchain intercpeted HTLCs cleanly#10895
htlcswitch: separate onchain and offchain intercpeted HTLCs cleanly#10895ziggie1984 wants to merge 5 commits into
Conversation
PR Severity: CRITICAL
Critical (2 files):
Medium (1 file):
Low (4 files - excluded from severity bump counts):
AnalysisThis PR modifies core htlcswitch package files. Changes to held_htlc_set.go and interceptable_switch.go affect the HTLC forwarding and payment routing state machine, warranting expert review. Bump check: 4 non-test files (threshold >20), 467 lines changed (threshold >500), 1 critical package. No bump applied. To override, add a severity-override-{critical,high,medium,low} label. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the HTLC interception mechanism to accurately manage HTLCs that transition from off-chain forwarding to on-chain resolution, particularly in scenarios involving channel force-closures. By introducing distinct handling for off-chain and on-chain held HTLCs, the changes prevent incorrect auto-failing or misdirection of settlement attempts, thereby enhancing the robustness and reliability of HTLC management within the system. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the heldHtlcSet to distinguish between off-chain and on-chain held HTLCs using a new heldEntry interface, ensuring proper handling of resolutions and expirations for both flows. It also updates the InterceptableSwitch and witness_beacon to support this, and adds comprehensive unit and integration tests. A critical deadlock risk was identified in witness_beacon.go where SubscribeUpdates holds a lock while making a blocking call to the interceptor, which can conflict with the switch's event loop attempting to acquire the same lock during resolution; releasing the lock early is recommended.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
cf2e20c to
02347b0
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the heldHtlcSet to track held HTLCs as either off-chain or on-chain entries using a new heldEntry interface. This change ensures that on-chain re-offers can replace old off-chain holds, allowing settlements to reach the witness beacon after an incoming channel force-closes. Additionally, integration tests are added to verify on-chain settlement scenarios, and minor fixes are made to the witness beacon subscription flow. The reviewer's feedback correctly points out a style guide violation where the newHeldHtlcSet function lacks a documentation comment.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
02347b0 to
6495243
Compare
5e3c3c3 to
424d1a8
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the heldHtlcSet to support both off-chain and on-chain held HTLCs via a new heldEntry interface, resolving an issue where on-chain forward interceptor settlements failed to reach the witness beacon after a channel force-close. The changes also include updates to InterceptableSwitch and witness_beacon.go, comprehensive unit and integration tests, and updated documentation. Feedback is provided to add a documentation comment to newHeldHtlcSet to align with the repository's style guide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
✅ Re-review posted: #10895 (review) Prior findings: 4 addressed, 0 still unresolved, 0 withdrawn. 🔁 Need another re-review after pushing changes? Reply with |
There was a problem hiding this comment.
Re-review of the on-chain/off-chain interceptor split. All prior findings are now resolved or have been dismissed by the maintainer: the two major items (F1 on-chain non-settle rejection, F4 exported-field breaking-change note) stay addressed, and the two release-note defects (F5 Git conflict markers, F6 hard-line-break separator) are fixed in docs/release-notes/release-notes-0.21.1.md.
A fresh pass over the current diff surfaced no genuinely-new defects. The concerns the specialists raised either restate findings the maintainer already dismissed with sound reasoning (the auto_fail_height overload — F3; the stream-terminating on-chain rejection — F7; the broad addOffChain exists-guard — F8), describe the PR's intended and itest-covered design (on-chain holds surviving interceptor disconnect; type-aware expire pruning on-chain entries rather than failing them), or were already debated and settled in PR review (the deliberately unbuffered onchainInterceptDone channel).
Status of prior findings
- F1 addressed: Resolved —
onChainHeld.resolveinhtlcswitch/held_htlc_set.gonow returnsErrCannotFailOnChainforFailandErrCannotResumeOnChainforResume/ResumeModified, matching theForwardHtlcInterceptResponsecontract documented inrouter.proto. Non-settle actions no longer silently forward to a dead link. - F4 addressed: Resolved — the bug-fix bullet in
docs/release-notes/release-notes-0.21.1.mdnow documents thehtlcswitch.InterceptedPacketchange for Go callers, directing them to the newDeadlinefield to distinguish off-chain auto-fail heights from on-chain settlement deadlines, orAutoFailHeight()for the legacy flattened value. - F5 addressed: Resolved — the Deprecations section of
docs/release-notes/release-notes-0.21.1.mdno longer contains the<<<<<<</=======/>>>>>>>Git conflict markers; both### ⚠️ Warningdeprecation entries render cleanly. - F6 addressed: Resolved — the separator between the tor bullet and the new interceptor bullet is now a genuine empty line rather than a two-space hard line break.
|
/gateway approve |
|
❌ Cannot approve: 2 finding(s) still open.
Address them and run |
|
/gateway dismiss F7, we are aware of the interceptor stream error feedback capabilities right now |
|
❌ |
|
/gateway dismiss F7 we are aware of the interceptor stream error feedback capabilities right now |
|
🚫 Dismissed F7 (minor) by @saubyk — we are aware of the interceptor stream error feedback capabilities right now Open findings on this PR: 🟡 F8 (minor) |
|
/gateway dismiss F8 onchain is always the final state in case a channel goes onchain, your explicit case can not happen or only in some edge case race scenario where I think it is ok no not log anything but just ignore this call and exit early. |
|
🚫 Dismissed F8 (minor) by @saubyk — onchain is always the final state in case a channel goes onchain, your explicit case can not happen or only in some edge case race scenario where I think it is ok no not log anything but just ignore this call and exit early. All findings on this PR are addressed or dismissed. |
|
/gateway approve |
There was a problem hiding this comment.
✅ Approved — all prior findings addressed.
Findings recap
Findings:
F1(major) — addressed:onChainHeld.resolveacceptsFwdActionResume,FwdActionResumeModified, andFwdActionFailand forwards them toh.fwd.Resume()/`h.fwd…F2(minor) — unresolved:interceptOnChainonly callssendForwardwhen!wasHeld, so when an existing off-chain hold is promoted to on-chain the interceptor clie…F3(minor) — unresolved:InterceptedPacket.AutoFailHeight()projects both either-arms to a singleint32, and the RPC layer (forward_interceptor.go) maps that i…F4(minor) — addressed: The exportedAutoFailHeight int32field onhtlcswitch.InterceptedPacket(stable since 2022) is removed in favor of the `Deadline fn.Eith…F5(major) — addressed: The Deprecations section contains unresolved Git conflict markers committed into the file:<<<<<<< HEAD(line 63),=======(line 68), an…F6(minor) — addressed: The separator line between the tor bullet and the new interceptor bullet is(two spaces) rather than an empty line. Two trailing spaces…F7(minor) — unresolved:onChainHeld.resolvereturningErrCannotFailOnChain/ErrCannotResumeOnChainis correct per the documented contract, but that error propa…F8(minor) — unresolved:addOffChaindoesif h.exists(key) { return nil }, which no-ops for any existing entry — including anonChainHeld. That correctly pre…
Dismissed:
F2by @ziggie1984 —F3by @ziggie1984 —F7by @saubyk — we are aware of the interceptor stream error feedback capabilities right nowF8by @saubyk — onchain is always the final state in case a channel goes onchain, your explicit case can not happen or only in some edge case race scenario where I think it is ok no not log anything but just ignore this call and exit early.
Approved by @saubyk via /gateway approve. Last reviewed at b6bf686. Skill v0.3.0, model claude-opus-4-8.
| @@ -944,7 +948,8 @@ message ForwardHtlcInterceptRequest { | |||
| bytes onion_blob = 9; | |||
|
|
|||
| // The block height at which this htlc will be auto-failed to prevent the | |||
There was a problem hiding this comment.
Overloading auto_fail_height leaves clients with no machine-readable way to know whether this request is still off-chain or has moved on-chain. The same field now means auto-fail height for off-chain and settlement deadline for on-chain, and forward_interceptor.go only sends the flattened value, so a client receiving an on-chain offer after reconnect cannot know that RESUME/FAIL are invalid/stream-terminating or how to interpret the deadline. Please add an explicit type/phase field or another reliable RPC invariant. (openai/gpt-5.5)
There was a problem hiding this comment.
I see the concern, but I’m not convinced an extra type field is the right fix in this PR. The deeper issue is that the interceptor RPC does not have a good per-resolution failure communication path. A type discriminator would only help clients pre-classify this one case, while invalid actions still fail through the stream/error path.
There was a problem hiding this comment.
I agree that the stream/error model is a deeper limitation, and a phase/type field would not fix per-resolution ACK/error handling by itself.
The part I still think is unresolved is that the type field prevents clients from making the invalid choice in the first place. Without it, auto_fail_height has two meanings, and the request does not give clients a machine-readable way to know whether RESUME/FAIL are valid or whether the height is an auto-fail height vs an on-chain settle deadline.
The possible client-side issues are:
- A client may keep using the old off-chain
auto_fail_height, assume the HTLC is no longer actionable after that height, and drop it even though it moved on-chain and can still be settled. - A client may send
RESUME,RESUME_MODIFIED, orFAILfor an on-chain hold, which now returns an error and terminates the interceptor stream. - A continuously connected client is not re-notified on off-chain -> on-chain promotion, so it may never learn the new on-chain deadline unless it reconnects.
- Monitoring/state machines cannot distinguish “this will auto-fail off-chain” from “this can only be settled on-chain until this deadline” without heuristics.
So I’m fine with improving per-resolution errors in a follow-up, but I don’t think that removes the need for either an explicit request phase/deadline type, or a documented reconnect/replay invariant that clients can rely on to refresh promoted holds.
| return err | ||
| } | ||
|
|
||
| if s.interceptor != nil && !wasHeld { |
There was a problem hiding this comment.
Skipping the send when wasHeld means a connected interceptor that first saw this as an off-chain hold never receives the replacement on-chain packet/deadline. After this call the held entry has been replaced with onChainHeld, so Resume/Fail now error and the useful deadline is htlc.RefundTimeout, but the client still only has the old off-chain auto-fail height until it reconnects. This can cause a continuously connected interceptor to stop trying or take the wrong action and miss a valid on-chain settle window; please re-offer/update the packet on promotion. (openai/gpt-5.5)
There was a problem hiding this comment.
see above, this design decision was made because we did not want to re-notify clients, clients would just see the same forward with another deadline and might be confused. I think the bigger problem as described in the upper comment is the failure feedback of the stream which should potentially be addressed in a followup PR.
|
No issues found by |
yyforyongyu
left a comment
There was a problem hiding this comment.
We still use human review as the last shield, so I think it's important to make sure the final commits are easy to review. One rule I use in my agent is to stick to the commit hygiene, sth like this,
Commit hygiene rules
- fold in the changes - don't introduce new commits to address comments, always address on the original commits
- No introduce-then-remove / introduce-then-rename within a stack — build toward the end-state shape directly; don't add a type/file only to delete or rename it in a later commit of the same series.
This way we can move things way faster - when commits are clean, reviewers can approve more quickly. In addition to the issues posted, here is my agent's analysis,
Main Violations
- eb5a25d54 htlcswitch: track held HTLC source introduces releaseAll() and onChainHeld.release().
- ba9ac97d3 htlcswitch: retain on-chain holds on disconnect removes that shape and renames/changes it to releaseOffChainHeld().
- This violates “no introduce-then-remove / introduce-then-rename.”
- eb5a25d54 overloads InterceptedPacket.AutoFailHeight for on-chain deadlines.
- d2fe6ad76 htlcswitch: split interceptor deadline semantics replaces that with Deadline fn.Either[...] and AutoFailHeight().
- This should be built in the end-state shape directly, not introduced as one field then replaced.
- 18a73f235 htlcswitch: ignore non-settle on-chain resolutions introduces errIgnoreOnChainResolution and documents non-settle actions as silently ignored.
- ecd0e1ada htlcswitch: surface invalid on-chain resolutions removes that behavior and replaces it with ErrCannotResumeOnChain / ErrCannotFailOnChain.
- This is a direct semantic reversal inside the same stack and should be folded into the original source-tracking commit.
- 1e572701e routerrpc: add clarifying docs for the intercepted forward documents “silently ignored.”
- ecd0e1ada later rewrites that to “stream-terminating error.”
- The docs should only introduce the final behavior.
- 7519d149e htlcswitch: use mock.Mock in held HTLC set tests is pure review-comment cleanup of tests introduced earlier.
- This should be folded into the commit that introduces/rewrites held_htlc_set_test.go.
- c72c5cc31 htlcswitch: evict resolved on-chain holds adds cleanup for on-chain entries after a previous commit had already defined on-chain retention semantics.
- This is a lifecycle completion/fixup and should likely be folded into the core track held HTLC source commit or a single “on-chain lifecycle” commit.
Suggested Rewrite Shape
- Commit 1: itest: cover on-chain interceptor settlement
- Commit 2: htlcswitch: track off-chain and on-chain held HTLCs
- Commit 3: witnessbeacon: avoid interceptor deadlock
- Commit 4: routerrpc: document on-chain interceptor semantics
- Commit 5: docs: add v0.21.1 release note
Commit 2 should already include:
- offChainHeld / onChainHeld
- Deadline fn.Either[...]
- releaseOffChainHeld
- removeOnChain
- ErrCannotResumeOnChain / ErrCannotFailOnChain
- final tests using mock.Mock
- final promotion/disconnect/expiry behavior
Commit 4 should only document final behavior, not intermediate “silently ignored” behavior.
Given that agents do the coding, I think it's now a low cost to maintain the commit hygiene in our repo, just one more rule/prompt.
| // errIgnoreOnChainResolution is an internal signal used when a | ||
| // non-settle resolution for an on-chain HTLC should be ignored without | ||
| // removing the held entry. | ||
| errIgnoreOnChainResolution = errors.New( |
There was a problem hiding this comment.
I'm not sure about the reasoning of your agent - aren't all these changes made in the same PR?
I think it should be both. I chose to have new commits on-top of the old ones to make the review easier, because once people review the PR and have comments, the new commits show that your comments were addressed without going through the older commits again. Especially on a bigger PR like this. But I agree that the final PR can have the squashed version so that we do not have the changes of changes in one PR. However I would not like do a strict rule we already had cases were we introduce new code to make the commit story easier to understand and later drop it again. So I would say it should be a case by case evaluation. That being said, I am going to squash the changes as proposed by your review comment. |
Add coverage for held forwards that move on chain after the incoming channel force closes. The restart case exercises the path where Bob loses the in-memory held set and contractcourt re-offers the HTLC through the witness beacon. The no-restart case keeps the original off-chain hold and proves that settlement must still reach the on-chain resolver.
Store held forwards as off-chain or on-chain entries instead of a raw InterceptedForward map. Off-chain entries keep the existing resume, fail, settle and auto-fail behavior. On-chain entries are settle-only and expire by pruning local interceptor state. When contractcourt re-offers a circuit that is already held off-chain, replace the stored entry with the on-chain forward so a later SETTLE reaches the witness beacon instead of the old link mailbox path. Also set the on-chain interceptor deadline to the HTLC refund timeout. This keeps the public interceptor deadline populated while ensuring only off-chain held entries use that value to fail back. Only off-chain held HTLCs can be released when an optional interceptor disconnects, because they can resume into the link forwarding flow. On-chain held HTLCs have no link flow to resume. Keep them in the held set so a reconnecting interceptor can replay and settle them while contractcourt waits for the preimage or on-chain expiry. Use distinct internal deadline types for off-chain auto-fail heights and on-chain settlement deadlines instead of overloading the intercepted packet field. Project both variants back into the existing router RPC auto_fail_height field to preserve wire compatibility. Reject mismatched held HTLC deadline types in tests. On-chain intercepted HTLCs can only be settled. Resume and fail actions already return concrete errors through the on-chain intercepted forward, so let those errors propagate to the interceptor client instead of converting them to success. Keep the held entry tracked on these errors so the client can reconnect and settle the HTLC later.
Release the preimage beacon lock before invoking the on-chain interceptor. The interceptor path can block on the htlcswitch event loop, while resolution of another held on-chain HTLC can call back into the beacon to add a preimage. If interceptor delivery fails after the subscriber was registered, cancel the subscription before returning the error. On-chain held entries are replay handles for the interceptor while contractcourt waits for a preimage or on-chain expiry. Once the resolver tears down, keeping the handle until the refund timeout can replay a stale HTLC to a reconnecting interceptor. Thread a dedicated cleanup signal from the witness subscription cancel path back through the interceptable switch event loop. The held set only removes on-chain entries for that signal, leaving off-chain entries under the link flow lifecycle.
routerrpc: document on-chain interceptor responses
b6bf686 to
dc02a60
Compare
| } | ||
|
|
||
| // RemoveOnChainIntercept removes an on-chain intercepted forward from the held | ||
| // set once its contractcourt resolver has finished. |
There was a problem hiding this comment.
Probably we could just rely on this for the onchain held HTLCs and not prune them at all, but they are pruned as soon as the resolver gets cleared.
There was a problem hiding this comment.
but I vote for keeping the pruning logic for the onchain HTLCs in the switch. Just a second redundancy to clean the held htlc map.
yyforyongyu
left a comment
There was a problem hiding this comment.
LGTM⛩️
I've left a concern about the current RPC design, as using one field to mean two things is always a bad idea. We can keep auto_fail_height for the existing off-chain meaning and add settle_deadline for the on-chain case. That preserves compatibility while avoiding a field whose name means “auto-fail” in a state where no auto-fail happens However given this is a minor release, adding a new field can be too heavy - but we should definitely fix it in our next release.
Agree let's fix it in a followup PR for 22. |
Change Description
Fixes #10892
Look at the first commit which shows via itests the failure cases of the old Interceptor Implementation
The second commit introduces a new interface and distingishes between onchain and offchain HTLC for the interceptor. It does not change any public interface.