maintenance: split rescue + promote into a separate tokio task (v0.7)

## Status

**Conditional follow-up — open for visibility, not a foregone conclusion.**

## Context

The 0.6 release decision for #242 (recorded at [#242 comment 4597482450](https://github.com/hardbyte/awa/issues/242#issuecomment-4597482450)) deliberately deferred the architectural split of the maintenance leader's `tokio::select!` arm. Instead, 0.6 ships the observability that lets us answer "is starvation actually happening in real fleets?":

- `awa.maintenance.branch.duration` histogram (per-branch wall-clock body time)
- `awa.maintenance.branch.overrun` counter (Prometheus: `awa_maintenance_branch_overrun_total{branch=...}`)
- `tracing::warn!` on the on-time → delayed transition

Landed in PR #302.
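To make the overrun counter and the on-time → delayed warning concrete, here is a minimal std-only sketch of that bookkeeping. It is not the code from PR #302. `BranchTracker`, `record`, and the rule "a body overruns when it takes longer than its tick period" are all assumptions made for illustration; the real threshold may be defined differently.

```rust
use std::time::Duration;

/// Hypothetical per-branch state: a running overrun count plus whether the
/// previous run was delayed, so a warning fires once per transition only.
#[derive(Debug, Default)]
struct BranchTracker {
    overruns: u64,
    delayed: bool,
}

impl BranchTracker {
    /// Records one branch body duration against its tick period.
    /// Returns true only on the on-time -> delayed transition.
    /// ASSUMPTION: "overrun" means the body took longer than the period.
    fn record(&mut self, body: Duration, period: Duration) -> bool {
        let overran = body > period;
        if overran {
            self.overruns += 1; // would back the overrun counter
        }
        let transitioned = overran && !self.delayed;
        self.delayed = overran;
        transitioned
    }
}

fn main() {
    let period = Duration::from_millis(10);
    let mut tracker = BranchTracker::default();
    for body_ms in [5u64, 15, 20, 1, 11] {
        if tracker.record(Duration::from_millis(body_ms), period) {
            // The real code would call tracing::warn! here.
            eprintln!("branch went from on-time to delayed ({body_ms} ms body)");
        }
    }
    println!("overruns = {}", tracker.overruns);
}
```

Warning only on the transition keeps a persistently slow branch from flooding the logs, while the counter still records every overrun.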

## When this v0.7 issue activates

Only if real fleet telemetry shows non-trivial `awa_maintenance_branch_overrun_total` rates on the user-visible branches (`promote_scheduled`, `rescue_stale_heartbeats`, `rescue_expired_deadlines`, `rescue_expired_callbacks`). If the counter stays at zero across all production deployments through the 0.6 cycle, this issue closes as wontfix.

## What the split would look like (sketch)

- Move the rescue trio + `promote_scheduled` (the user-visible / SLO-relevant branches) into their own `tokio::spawn`-ed task with its own advisory-lock-aware select loop.
- Hygiene branches (`refresh_admin_metadata`, `cleanup`, `cron_sync`, `queue_stats`, `recompute_dirty_admin_metadata`, `priority_aging`, `rotate_*`) stay in the existing leader loop.
- Both tasks share the leader advisory lock (already session-scoped on `leader_conn`); the split is purely about tokio scheduling, not about distributed coordination.
- TLA+ doesn't need re-modeling — every branch is still its own SQL transaction.

## Related

- #242 — the umbrella decision and observability work (closed via #302)
- ADR-028 (`maintenance-only-runtime-role`): a distinct concern. ADR-028 proposes a separate *process* dedicated to maintenance, whereas this issue is an intra-process *task* split. Both could ship; neither implies the other.

## Acceptance (once activated)

- Real fleet evidence of `awa_maintenance_branch_overrun_total` rate ≥ X on at least one user-visible branch (X TBD when the data arrives).
- A design ADR (probably ADR-030 or later) describing the split, the advisory-lock interaction, and a roll-forward path.
- A long-horizon benchmark validating that the split reduces overrun events without introducing new races.


maintenance: split rescue + promote into a separate tokio task (v0.7) #303
