sweep: don't ratchet starting fee rate on non-fee failures#10898
sweep: don't ratchet starting fee rate on non-fee failures#10898jtobin wants to merge 2 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where the sweeper would incorrectly increase the starting fee rate for sweep attempts that failed due to resource constraints rather than fee-related issues. By distinguishing between fee-related failures and resource-related failures, the system now avoids ratcheting the fee rate when it cannot resolve the underlying problem, ensuring that inputs are not unnecessarily stranded. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request modifies the sweeper's fee bumping logic to avoid ratcheting up the starting fee rate on resource-related failures (such as ErrNotEnoughInputs and ErrNotEnoughBudget). Instead, these failures preserve the existing starting fee rate to prevent inputs from being permanently stranded. Corresponding unit tests have been added and updated to verify this behavior. A review comment suggests correcting a grammatical error in a warning log message from 'Fail to' to 'Failed to'.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| errors.Is(err, ErrNotEnoughInputs) { | ||
|
|
||
| // Calculate the next fee rate for the retry. | ||
| log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err) |
There was a problem hiding this comment.
The log message uses the incorrect verb tense "Fail to". It should be "Failed to" to be grammatically correct and consistent with other log messages in this file (e.g., "Failed to bump tx").
| log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err) | |
| log.Warnf("Failed to fee bump tx %v: %v", oldTx.TxHash(), err) |
|
PR Severity: CRITICAL. All changed files are in sweep/* (fee_bumper.go and sweeper.go), which handles output sweeping, fund recovery, and fee bumping — a CRITICAL package. 2 non-test files, 86 non-test lines changed (well under bump thresholds). Test files excluded: fee_bumper_test.go, sweeper_test.go. To override, add a severity-override-{critical,high,medium,low} label. <!-- pr-severity-bot --> |
When a sweep fails to publish, the sweeper computes a higher retry fee rate and stores it as the input's new starting rate. That's the right move for fee-related failures, where the broadcast was rejected for not paying enough, but the same machinery also fires on resource failures (no wallet UTXO available, the input set's budget exceeded), which carry no fee-rate signal. Repeatedly ratcheting the rate on these can push it past an input's intrinsic budget, after which the aggregator silently skips the input and the sweep is stranded forever. Distinguish fee-rate-bearing failures from resource ones at the point where the bump result is produced, and skip the rate ratchet for the latter: the input's existing starting fee rate is preserved across non-fee failures, so the next retry starts at the rate the original sweep request specified.
893025b to
b1da0b2
Compare
|
@jtobin, remember to re-request review from reviewers when ready |
|
/gateway review |
|
✅ Review posted: #10898 (review) 6 finding(s); 6 inline, 0 in body. 🔁 Need a re-review after pushing changes? Reply with |
There was a problem hiding this comment.
This PR stops the sweeper from ratcheting an input's starting fee rate on resource failures (ErrNotEnoughInputs, ErrNotEnoughBudget), which previously could push the starting rate past an input's budget and silently strand the sweep. The intent is sound and the fix is well-targeted, and the new tests directly assert the no-ratchet behavior on both the initial and replacement paths.
The mechanism, however, is a magic value: a BumpResult with FeeRate == 0 now means "preserve the existing starting rate," consumed by markInputsPublishFailed. This is only correct if zero is produced exclusively by the two resource-failure branches and never by a legitimate fee computation — and the retained ErrMaxPosition branch and the "all other failures" branch in handleReplacementTxError both still attach calculateRetryFeeRate(r) to a TxFailed result. The correctness of the entire change rests on that disjointness, which the visible diff does not establish. On a CRITICAL fund-recovery path this needs to be confirmed (or the intent carried explicitly) before merge. Secondary concerns: the two sibling error handlers now classify the same error set with divergent code shapes, and a comment carried over from the old shared-case structure now misdescribes ErrMaxPosition.
Findings: 🔴 0 Blocker · 🟠 2 Major · 🟡 4 Minor · 🔵 0 Nit
| // FeeRate at zero; preserving the input's existing starting | ||
| // rate in that case avoids stranding inputs whose intrinsic | ||
| // budget can't accommodate a higher rate. | ||
| if feeRate == 0 { |
There was a problem hiding this comment.
🟠 Major F1: FeeRate==0 overloaded as a 'preserve' sentinel; collision with a legitimate zero is unproven
markInputsPublishFailed now treats feeRate == 0 as "no fee dimension — preserve the input's existing StartingFeeRate" and continues without writing it. This invariant holds only if a successful fee computation can never yield zero. It is not closed in the changed code: the retained ErrMaxPosition case in handleInitialTxError (sweep/fee_bumper.go:1118) and the "all other fee-bump failures" tail of handleReplacementTxError both assign result.FeeRate = calculateRetryFeeRate(r) to a TxFailed result, and handleInitialTxError assigns that value even when calculateRetryFeeRate returned an error (only Event/Err flip to fatal). BumpResult.Validate() requires a non-zero FeeRate only for TxConfirmed, so a zero on TxFailed passes silently. If calculateRetryFeeRate or feeFunction.FeeRate() can legitimately return 0 (a fee floor of zero, a zero-weight/zero-deadline edge, or the pre-first-ratchet starting rate), that result is silently reinterpreted as "preserve," freezing the input at a stale rate — the same class of mis-pricing this PR is trying to eliminate. calculateRetryFeeRate and FeeRate() are not present in the provided file_contents (the file was truncated), so I cannot confirm reachability; this is a blocker if zero is reachable on any non-resource path. The robust fix is to not overload the value channel — carry the intent explicitly, e.g. fn.Option[chainfee.SatPerKWeight] or a preserveStartingFeeRate bool on BumpResult, so "no fee signal" and "a computed rate of zero" are distinguishable.
| case errors.Is(err, ErrNotEnoughInputs), | ||
| errors.Is(err, ErrNotEnoughBudget): | ||
|
|
||
| result.Event = TxFailed |
There was a problem hiding this comment.
🟠 Major F2: Preserving the rate on budget failure may strand a deadline-sensitive sweep
For ErrNotEnoughBudget/ErrNotEnoughInputs the input now retries indefinitely at its original starting fee rate. For a budget-exhausted input that is the intended fix, but the same branch is reachable for inputs with a confirmation deadline (e.g. an outgoing-HTLC timeout sweep) whose budget is momentarily insufficient while the deadline still approaches. If deadline pressure escalates the fee only through the now-skipped ratchet path, such a sweep can sit at a low rate and miss its deadline — a fund-safety regression and a potential griefing lever for a counterparty that can keep a channel near its budget ceiling. I cannot tell from the diff whether a separate deadline-driven escalation (independent of this ratchet) keeps time-sensitive sweeps moving; the relevant fee-function/deadline code is not in the loaded context. Confirm that preserving the starting rate on a budget failure cannot strand a deadline-bound sweep before relying on this behavior; if it can, the resource-failure handling needs to distinguish deadline-bound inputs.
| // so these inputs can be retried with a different group in the next | ||
| // block. | ||
| // block. The error is fee-related (the linear fee function has run | ||
| // past its budget), so we still calculate a retry fee rate. |
There was a problem hiding this comment.
🟡 Minor F3: Retained ErrMaxPosition ratchet branch lacks an asserting test
The refactor splits the former fallthrough so ErrMaxPosition keeps calling calculateRetryFeeRate while the resource errors no longer do. This divergence is the load-bearing semantic of the PR, and the deleted fallthrough is easy to misread. The new TestHandleInitialTxErrorNoRatchetOnResourceFailure covers the resource branch, but the diff shows no test asserting that ErrMaxPosition still ratchets (still attaches a non-zero FeeRate). Add coverage so a future edit can't silently fold ErrMaxPosition back into the no-ratchet set.
| @@ -1114,20 +1114,13 @@ func (t *TxPublisher) handleInitialTxError(r *monitorRecord, err error) { | |||
|
|
|||
There was a problem hiding this comment.
🟡 Minor F4: ErrMaxPosition comment conflates position exhaustion with budget exhaustion
The rewritten comment on case errors.Is(err, ErrMaxPosition) opens "When the error is due to budget being used up" — wording carried over from when this case shared a block with ErrNotEnoughBudget. Now that the budget errors live in a separate case below, this conflates the fee-function position error with a budget error and will mislead the next reader. Reword to describe position exhaustion of the linear fee function.
| // at zero signals the sweeper to preserve the input's existing | ||
| // starting fee rate; otherwise repeated resource failures would | ||
| // ratchet the rate past the input's intrinsic budget and strand it. | ||
| if errors.Is(err, ErrNotEnoughBudget) || |
There was a problem hiding this comment.
🟡 Minor F5: Resource-failure classification duplicated across two handlers with divergent shapes
errors.Is(err, ErrNotEnoughInputs) || errors.Is(err, ErrNotEnoughBudget), plus a near-identical multi-line justification, now appears in both handleInitialTxError (as a switch case) and handleReplacementTxError (as an if/early-return). This is the second reorganization of this classification in ~4 months (blame: PR #9447 2025-02-12, PR #9627 2025-03-20), so the contract is still churning — and the two sites must stay in lockstep. Extract a single predicate (e.g. isResourceFailure(err error) bool) and reference it from both, so a future third resource error can't be added to one handler and missed in the other.
| errors.Is(err, ErrNotEnoughInputs) { | ||
|
|
||
| // Calculate the next fee rate for the retry. | ||
| log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err) |
There was a problem hiding this comment.
🟡 Minor F6: Relocated log line keeps grammar error 'Fail to fee bump'
The relocated warning logs "Fail to fee bump tx %v: %v", inconsistent with the sibling log.Errorf("Failed to bump tx %v: %v", ...) a few lines down (also already flagged in review). Since the surrounding block is being rewritten anyway, fix it in this pass.
| log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err) | |
| log.Warnf("Failed to fee bump tx %v: %v", oldTx.TxHash(), err) |
|
🤖 gateway audit metadata for this PR — auto-generated, please don't edit. |
Supplementary to #10897 (and replaces #10816).
The sweeper responds to a failed sweep attempt by ratcheting the input's fee rate for the next one. For fee-related failures this is the right response, but the same machinery also fires on resource failures, e.g. ErrNotEnoughInputs or ErrNotEnoughBudget. These failures carry no fee rate signal, so ratcheting the fee rate can't resolve them.
Each spurious ratchet raises the bar the input's budget must clear, and once the starting rate exceeds an input's budget, the aggregator filters it out of every future input set and the sweep is silently stranded. This PR skips the ratchet for failures with no fee dimension, preserving the input's existing starting fee rate across retries.
The TxUnknownSpend path also attaches a retry fee rate. It's deliberately left untouched here, since an unknown spend can reflect a genuine fee race with a counterparty; with #10897, that residual ratchet can no longer strand small-budget inputs.