Skip to content

sweep: don't ratchet starting fee rate on non-fee failures#10898

Open
jtobin wants to merge 2 commits into
lightningnetwork:masterfrom
jtobin:sweep-fee-ratchet
Open

sweep: don't ratchet starting fee rate on non-fee failures#10898
jtobin wants to merge 2 commits into
lightningnetwork:masterfrom
jtobin:sweep-fee-ratchet

Conversation

@jtobin

@jtobin jtobin commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Supplementary to #10897 (and replaces #10816).

The sweeper responds to a failed sweep attempt by ratcheting the input's fee rate for the next one. For fee-related failures this is the right response, but the same machinery also fires on resource failures, e.g. ErrNotEnoughInputs or ErrNotEnoughBudget. These failures carry no fee rate signal, so ratcheting the fee rate can't resolve them.

Each spurious ratchet raises the bar the input's budget must clear, and once the starting rate exceeds an input's budget, the aggregator filters it out of every future input set and the sweep is silently stranded. This PR skips the ratchet for failures with no fee dimension, preserving the input's existing starting fee rate across retries.

The TxUnknownSpend path also attaches a retry fee rate. It's deliberately left untouched here, since an unknown spend can reflect a genuine fee race with a counterparty; with #10897, that residual ratchet can no longer strand small-budget inputs.

@jtobin jtobin self-assigned this Jun 11, 2026
@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where the sweeper would incorrectly increase the starting fee rate for sweep attempts that failed due to resource constraints rather than fee-related issues. By distinguishing between fee-related failures and resource-related failures, the system now avoids ratcheting the fee rate when it cannot resolve the underlying problem, ensuring that inputs are not unnecessarily stranded.

Highlights

  • Prevent unnecessary fee rate ratcheting: Modified the sweeper logic to skip fee rate increments when encountering resource-related failures like ErrNotEnoughInputs or ErrNotEnoughBudget, which do not have a fee-rate dimension.
  • Preserve existing fee rates: Updated the sweeper to preserve the existing starting fee rate when a failure is reported with a zero fee rate, preventing inputs from being stranded due to artificially inflated fee requirements.
  • Comprehensive testing: Added new unit tests in both fee_bumper_test.go and sweeper_test.go to verify that resource failures do not trigger fee rate increments and that existing rates are correctly preserved.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the sweeper's fee bumping logic to avoid ratcheting up the starting fee rate on resource-related failures (such as ErrNotEnoughInputs and ErrNotEnoughBudget). Instead, these failures preserve the existing starting fee rate to prevent inputs from being permanently stranded. Corresponding unit tests have been added and updated to verify this behavior. A review comment suggests correcting a grammatical error in a warning log message from 'Fail to' to 'Failed to'.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sweep/fee_bumper.go
errors.Is(err, ErrNotEnoughInputs) {

// Calculate the next fee rate for the retry.
log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message uses the incorrect verb tense "Fail to". It should be "Failed to" to be grammatically correct and consistent with other log messages in this file (e.g., "Failed to bump tx").

Suggested change
log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err)
log.Warnf("Failed to fee bump tx %v: %v", oldTx.TxHash(), err)

@github-actions github-actions Bot added the severity-critical Requires expert review - security/consensus critical label Jun 11, 2026
@github-actions

Copy link
Copy Markdown

PR Severity: CRITICAL. All changed files are in sweep/* (fee_bumper.go and sweeper.go), which handles output sweeping, fund recovery, and fee bumping — a CRITICAL package. 2 non-test files, 86 non-test lines changed (well under bump thresholds). Test files excluded: fee_bumper_test.go, sweeper_test.go. To override, add a severity-override-{critical,high,medium,low} label. <!-- pr-severity-bot -->

jtobin added 2 commits June 11, 2026 13:14
When a sweep fails to publish, the sweeper computes a higher retry
fee rate and stores it as the input's new starting rate. That's the
right move for fee-related failures, where the broadcast was rejected
for not paying enough, but the same machinery also fires on resource
failures (no wallet UTXO available, the input set's budget exceeded),
which carry no fee-rate signal. Repeatedly ratcheting the rate on
these can push it past an input's intrinsic budget, after which the
aggregator silently skips the input and the sweep is stranded
forever.

Distinguish fee-rate-bearing failures from resource ones at the
point where the bump result is produced, and skip the rate ratchet
for the latter: the input's existing starting fee rate is preserved
across non-fee failures, so the next retry starts at the rate the
original sweep request specified.
@jtobin jtobin force-pushed the sweep-fee-ratchet branch from 893025b to b1da0b2 Compare June 11, 2026 15:48
@jtobin jtobin requested a review from ziggie1984 June 12, 2026 11:35
@lightninglabs-deploy

Copy link
Copy Markdown
Collaborator

@jtobin, remember to re-request review from reviewers when ready

@saubyk

saubyk commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

/gateway review

@lightninglabs-gateway

lightninglabs-gateway Bot commented Jun 25, 2026

Copy link
Copy Markdown

✅ Review posted: #10898 (review)

6 finding(s); 6 inline, 0 in body.

🔁 Need a re-review after pushing changes? Reply with /gateway re-review.
Maintainers can also /gateway dismiss <id> to silence specific findings, or anyone can /gateway explain <id> for elaboration.

@lightninglabs-gateway lightninglabs-gateway Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR stops the sweeper from ratcheting an input's starting fee rate on resource failures (ErrNotEnoughInputs, ErrNotEnoughBudget), which previously could push the starting rate past an input's budget and silently strand the sweep. The intent is sound and the fix is well-targeted, and the new tests directly assert the no-ratchet behavior on both the initial and replacement paths.

The mechanism, however, is a magic value: a BumpResult with FeeRate == 0 now means "preserve the existing starting rate," consumed by markInputsPublishFailed. This is only correct if zero is produced exclusively by the two resource-failure branches and never by a legitimate fee computation — and the retained ErrMaxPosition branch and the "all other failures" branch in handleReplacementTxError both still attach calculateRetryFeeRate(r) to a TxFailed result. The correctness of the entire change rests on that disjointness, which the visible diff does not establish. On a CRITICAL fund-recovery path this needs to be confirmed (or the intent carried explicitly) before merge. Secondary concerns: the two sibling error handlers now classify the same error set with divergent code shapes, and a comment carried over from the old shared-case structure now misdescribes ErrMaxPosition.

Findings: 🔴 0 Blocker · 🟠 2 Major · 🟡 4 Minor · 🔵 0 Nit

Comment thread sweep/sweeper.go
// FeeRate at zero; preserving the input's existing starting
// rate in that case avoids stranding inputs whose intrinsic
// budget can't accommodate a higher rate.
if feeRate == 0 {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major F1: FeeRate==0 overloaded as a 'preserve' sentinel; collision with a legitimate zero is unproven

markInputsPublishFailed now treats feeRate == 0 as "no fee dimension — preserve the input's existing StartingFeeRate" and continues without writing it. This invariant holds only if a successful fee computation can never yield zero. It is not closed in the changed code: the retained ErrMaxPosition case in handleInitialTxError (sweep/fee_bumper.go:1118) and the "all other fee-bump failures" tail of handleReplacementTxError both assign result.FeeRate = calculateRetryFeeRate(r) to a TxFailed result, and handleInitialTxError assigns that value even when calculateRetryFeeRate returned an error (only Event/Err flip to fatal). BumpResult.Validate() requires a non-zero FeeRate only for TxConfirmed, so a zero on TxFailed passes silently. If calculateRetryFeeRate or feeFunction.FeeRate() can legitimately return 0 (a fee floor of zero, a zero-weight/zero-deadline edge, or the pre-first-ratchet starting rate), that result is silently reinterpreted as "preserve," freezing the input at a stale rate — the same class of mis-pricing this PR is trying to eliminate. calculateRetryFeeRate and FeeRate() are not present in the provided file_contents (the file was truncated), so I cannot confirm reachability; this is a blocker if zero is reachable on any non-resource path. The robust fix is to not overload the value channel — carry the intent explicitly, e.g. fn.Option[chainfee.SatPerKWeight] or a preserveStartingFeeRate bool on BumpResult, so "no fee signal" and "a computed rate of zero" are distinguishable.

Comment thread sweep/fee_bumper.go
case errors.Is(err, ErrNotEnoughInputs),
errors.Is(err, ErrNotEnoughBudget):

result.Event = TxFailed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Major F2: Preserving the rate on budget failure may strand a deadline-sensitive sweep

For ErrNotEnoughBudget/ErrNotEnoughInputs the input now retries indefinitely at its original starting fee rate. For a budget-exhausted input that is the intended fix, but the same branch is reachable for inputs with a confirmation deadline (e.g. an outgoing-HTLC timeout sweep) whose budget is momentarily insufficient while the deadline still approaches. If deadline pressure escalates the fee only through the now-skipped ratchet path, such a sweep can sit at a low rate and miss its deadline — a fund-safety regression and a potential griefing lever for a counterparty that can keep a channel near its budget ceiling. I cannot tell from the diff whether a separate deadline-driven escalation (independent of this ratchet) keeps time-sensitive sweeps moving; the relevant fee-function/deadline code is not in the loaded context. Confirm that preserving the starting rate on a budget failure cannot strand a deadline-bound sweep before relying on this behavior; if it can, the resource-failure handling needs to distinguish deadline-bound inputs.

Comment thread sweep/fee_bumper.go
// so these inputs can be retried with a different group in the next
// block.
// block. The error is fee-related (the linear fee function has run
// past its budget), so we still calculate a retry fee rate.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor F3: Retained ErrMaxPosition ratchet branch lacks an asserting test

The refactor splits the former fallthrough so ErrMaxPosition keeps calling calculateRetryFeeRate while the resource errors no longer do. This divergence is the load-bearing semantic of the PR, and the deleted fallthrough is easy to misread. The new TestHandleInitialTxErrorNoRatchetOnResourceFailure covers the resource branch, but the diff shows no test asserting that ErrMaxPosition still ratchets (still attaches a non-zero FeeRate). Add coverage so a future edit can't silently fold ErrMaxPosition back into the no-ratchet set.

Comment thread sweep/fee_bumper.go
@@ -1114,20 +1114,13 @@ func (t *TxPublisher) handleInitialTxError(r *monitorRecord, err error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor F4: ErrMaxPosition comment conflates position exhaustion with budget exhaustion

The rewritten comment on case errors.Is(err, ErrMaxPosition) opens "When the error is due to budget being used up" — wording carried over from when this case shared a block with ErrNotEnoughBudget. Now that the budget errors live in a separate case below, this conflates the fee-function position error with a budget error and will mislead the next reader. Reword to describe position exhaustion of the linear fee function.

Comment thread sweep/fee_bumper.go
// at zero signals the sweeper to preserve the input's existing
// starting fee rate; otherwise repeated resource failures would
// ratchet the rate past the input's intrinsic budget and strand it.
if errors.Is(err, ErrNotEnoughBudget) ||

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor F5: Resource-failure classification duplicated across two handlers with divergent shapes

errors.Is(err, ErrNotEnoughInputs) || errors.Is(err, ErrNotEnoughBudget), plus a near-identical multi-line justification, now appears in both handleInitialTxError (as a switch case) and handleReplacementTxError (as an if/early-return). This is the second reorganization of this classification in ~4 months (blame: PR #9447 2025-02-12, PR #9627 2025-03-20), so the contract is still churning — and the two sites must stay in lockstep. Extract a single predicate (e.g. isResourceFailure(err error) bool) and reference it from both, so a future third resource error can't be added to one handler and missed in the other.

Comment thread sweep/fee_bumper.go
errors.Is(err, ErrNotEnoughInputs) {

// Calculate the next fee rate for the retry.
log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor F6: Relocated log line keeps grammar error 'Fail to fee bump'

The relocated warning logs "Fail to fee bump tx %v: %v", inconsistent with the sibling log.Errorf("Failed to bump tx %v: %v", ...) a few lines down (also already flagged in review). Since the surrounding block is being rewritten anyway, fix it in this pass.

Suggested change
log.Warnf("Fail to fee bump tx %v: %v", oldTx.TxHash(), err)
log.Warnf("Failed to fee bump tx %v: %v", oldTx.TxHash(), err)

@lightninglabs-gateway

Copy link
Copy Markdown

🤖 gateway audit metadata for this PR — auto-generated, please don't edit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway-active severity-critical Requires expert review - security/consensus critical

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[flake]: suspected custom_channels_htlc_force_close_MPP flake [bug]: After Taproot Assets channel force-closure, funds are not swept.

3 participants