maintenance: split rescue + promote into a separate tokio task (v0.7) #303

Description

@hardbyte

Status

Conditional follow-up — open for visibility, not a foregone conclusion.

Context

The 0.6 release decision for #242 (recorded at #242 comment 4597482450) deliberately deferred the architectural split of the maintenance leader's tokio::select! arm. Instead, 0.6 ships the observability that lets us answer "is starvation actually happening in real fleets?":

  • awa.maintenance.branch.duration histogram (per-branch wall-clock body time)
  • awa.maintenance.branch.overrun counter (Prometheus: awa_maintenance_branch_overrun_total{branch=...})
  • tracing::warn! on the on-time → delayed transition

Landed in PR #302.
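
For readers who have not looked at PR #302, the bookkeeping behind those three signals can be approximated as below. This is a minimal std-only sketch, not the code that landed: `BranchTracker`, `Transition`, and the definition of an overrun as "body time exceeds the branch's tick interval" are all assumptions made for illustration. The `Vec` stands in for the histogram and the `u64` for the counter.

```rust
use std::time::Duration;

/// State change reported after recording one branch run.
#[derive(Debug, PartialEq, Eq)]
pub enum Transition {
    /// State unchanged (still on time, or still delayed).
    None,
    /// Was on time, has just overrun: the point where a warning would be logged.
    BecameDelayed,
    /// Was delayed, has just finished within its interval again.
    Recovered,
}

/// Hypothetical per-branch bookkeeping; names are not taken from awa.
#[derive(Debug)]
pub struct BranchTracker {
    interval: Duration,
    delayed: bool,
    overruns: u64,
    durations: Vec<Duration>,
}

impl BranchTracker {
    pub fn new(interval: Duration) -> Self {
        Self { interval, delayed: false, overruns: 0, durations: Vec::new() }
    }

    /// Record the wall-clock time of one branch body.
    /// Assumption: an overrun is a body that takes longer than the tick interval.
    pub fn record(&mut self, body: Duration) -> Transition {
        self.durations.push(body); // stand-in for a histogram observation
        let overran = body > self.interval;
        if overran {
            self.overruns += 1; // stand-in for the overrun counter
        }
        let transition = match (self.delayed, overran) {
            (false, true) => Transition::BecameDelayed,
            (true, false) => Transition::Recovered,
            _ => Transition::None,
        };
        self.delayed = overran;
        transition
    }

    pub fn overruns(&self) -> u64 {
        self.overruns
    }

    pub fn observations(&self) -> usize {
        self.durations.len()
    }
}
```

Only the `BecameDelayed` edge would produce a log line in this model, so a branch that stays delayed for many ticks increments the counter each time but warns once.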

When this v0.7 issue activates

Only if real fleet telemetry shows non-trivial awa_maintenance_branch_overrun_total rates on the user-visible branches (promote_scheduled, rescue_stale_heartbeats, rescue_expired_deadlines, rescue_expired_callbacks). If the counter stays at zero across all production deployments through the 0.6 cycle, this issue closes as wontfix.
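
The activation check could be expressed roughly as follows, given two scrapes of the counter keyed by the `branch` label. This is a sketch under stated assumptions: `branches_over_threshold` and its parameters are hypothetical, the threshold is passed in because X is still TBD, and a branch with no increase never qualifies regardless of the threshold.

```rust
use std::collections::HashMap;

/// The branches this issue treats as user-visible.
pub const USER_VISIBLE: [&str; 4] = [
    "promote_scheduled",
    "rescue_stale_heartbeats",
    "rescue_expired_deadlines",
    "rescue_expired_callbacks",
];

/// Returns the user-visible branches whose overrun rate (events per second)
/// between two counter samples is non-zero and at least `threshold`.
/// Hypothetical helper; the threshold value is not defined by the issue yet.
pub fn branches_over_threshold(
    earlier: &HashMap<String, u64>,
    later: &HashMap<String, u64>,
    window_secs: f64,
    threshold: f64,
) -> Vec<&'static str> {
    if window_secs <= 0.0 {
        return Vec::new(); // no meaningful rate without a positive window
    }
    USER_VISIBLE
        .iter()
        .copied()
        .filter(|branch| {
            let before = earlier.get(*branch).copied().unwrap_or(0);
            let after = later.get(*branch).copied().unwrap_or(0);
            // A counter reset (after < before) is treated as "no increase".
            let delta = after.saturating_sub(before);
            delta > 0 && (delta as f64 / window_secs) >= threshold
        })
        .collect()
}
```

Hygiene branches are ignored by construction, so a noisy `cleanup` counter alone would not activate this issue.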

What the split would look like (sketch)

  • Move the rescue trio + promote_scheduled (the user-visible / SLO-relevant branches) into their own tokio::spawn-ed task with its own advisory-lock-aware select loop.
  • Hygiene branches (refresh_admin_metadata, cleanup, cron_sync, queue_stats, recompute_dirty_admin_metadata, priority_aging, rotate_*) stay in the existing leader loop.
  • Both tasks share the leader advisory lock (already session-scoped on leader_conn); the split is purely about tokio scheduling, not about distributed coordination.
  • TLA+ doesn't need re-modeling — every branch is still its own SQL transaction.
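
To make the proposed partition concrete, the branch-to-task assignment above can be written as a lookup. This is only a sketch of the grouping, not of the tokio wiring or the advisory-lock handling: `MaintenanceTask` and `task_for` are hypothetical names, and because the issue elides the rotate branches as `rotate_*`, they are matched by prefix rather than enumerated.

```rust
/// Which loop a maintenance branch would run in after the split.
/// Hypothetical type, used only to illustrate the grouping in this issue.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MaintenanceTask {
    /// Proposed separate task: the rescue trio plus promote_scheduled.
    UserVisible,
    /// Existing leader loop: hygiene branches.
    Hygiene,
}

/// Maps a branch name to its task, or `None` for a name this issue does not list.
pub fn task_for(branch: &str) -> Option<MaintenanceTask> {
    match branch {
        "promote_scheduled"
        | "rescue_stale_heartbeats"
        | "rescue_expired_deadlines"
        | "rescue_expired_callbacks" => Some(MaintenanceTask::UserVisible),
        "refresh_admin_metadata"
        | "cleanup"
        | "cron_sync"
        | "queue_stats"
        | "recompute_dirty_admin_metadata"
        | "priority_aging" => Some(MaintenanceTask::Hygiene),
        // The issue writes these as `rotate_*`; the concrete names are not given here.
        other if other.starts_with("rotate_") => Some(MaintenanceTask::Hygiene),
        _ => None,
    }
}
```

Returning `None` for unknown names keeps a newly added branch from silently landing in either loop until someone decides where it belongs.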

Acceptance (once activated)

  • Real fleet evidence of awa_maintenance_branch_overrun_total rate ≥ X on at least one user-visible branch (X TBD when the data arrives).
  • A design ADR (probably ADR-030 or later) describing the split, the advisory-lock interaction, and a roll-forward path.
  • A long-horizon benchmark validating that the split reduces overrun events without introducing new races.

Metadata

Assignees

No one assigned

    Labels

    operational: Operational tooling and configuration
