Status
Conditional follow-up — open for visibility, not a foregone conclusion.
Context
The 0.6 release decision for #242 (recorded at #242 comment 4597482450) deliberately deferred the architectural split of the maintenance leader's tokio::select! arm. Instead, 0.6 ships the observability that lets us answer "is starvation actually happening in real fleets?":
- awa.maintenance.branch.duration histogram (per-branch wall-clock body time)
- awa.maintenance.branch.overrun counter (Prometheus: awa_maintenance_branch_overrun_total{branch=...})
- tracing::warn! on the on-time → delayed transition
Landed in PR #302.
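For readers unfamiliar with the three signals, here is a minimal std-only sketch of how they could relate. It is not the code from PR #302. It assumes that an "overrun" means a branch body took longer than that branch's tick interval, which this issue does not define. `BranchTracker`, `Observation`, and `observe` are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Hypothetical per-branch state. Not the awa API.
#[derive(Debug, Default)]
struct BranchTracker {
    /// Stands in for awa_maintenance_branch_overrun_total{branch=...}.
    overrun_total: u64,
    /// True if the previous run of this branch overran.
    delayed: bool,
}

#[derive(Debug, PartialEq)]
enum Observation {
    OnTime,
    /// First overrun after an on-time run: the point where a
    /// tracing::warn! would fire in this sketch.
    BecameDelayed,
    /// Consecutive overrun: counted, but no new warning.
    StillDelayed,
}

impl BranchTracker {
    /// Record one branch run. ASSUMPTION: overrun means body > interval.
    fn observe(&mut self, body: Duration, interval: Duration) -> Observation {
        if body <= interval {
            self.delayed = false;
            return Observation::OnTime;
        }
        self.overrun_total += 1;
        if self.delayed {
            Observation::StillDelayed
        } else {
            self.delayed = true;
            Observation::BecameDelayed
        }
    }
}

fn main() {
    let interval = Duration::from_secs(1);
    let mut tracker = BranchTracker::default();
    for ms in [200u64, 1500, 1800, 300] {
        let obs = tracker.observe(Duration::from_millis(ms), interval);
        println!("body={ms}ms -> {obs:?} (overruns so far: {})", tracker.overrun_total);
    }
}
```

In this model the counter increments on every overrun, while the warning fires only on the transition, so a persistently slow branch shows up as a rising counter rather than as log spam.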
When this v0.7 issue activates
Only if real fleet telemetry shows non-trivial awa_maintenance_branch_overrun_total rates on the user-visible branches (promote_scheduled, rescue_stale_heartbeats, rescue_expired_deadlines, rescue_expired_callbacks). If the counter stays at zero across all production deployments through the 0.6 cycle, this issue closes as wontfix.
What the split would look like (sketch)
- Move the rescue trio + promote_scheduled (the user-visible / SLO-relevant branches) into their own tokio::spawn-ed task with its own advisory-lock-aware select loop.
- Hygiene branches (refresh_admin_metadata, cleanup, cron_sync, queue_stats, recompute_dirty_admin_metadata, priority_aging, rotate_*) stay in the existing leader loop.
- Both tasks share the leader advisory lock (already session-scoped on leader_conn); the split is purely about tokio scheduling, not about distributed coordination.
- TLA+ doesn't need re-modeling — every branch is still its own SQL transaction.
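The partition in the list above can be written down as a small routing function. This is only a std-only sketch of which loop would host which branch. It does not model the tokio tasks, the select loops, or the advisory lock. `LoopKind` and `loop_for` are hypothetical names. Treating every branch not listed as user-visible as hygiene is an assumption that follows the list above.

```rust
/// Which of the two loops would host a branch in the sketched split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LoopKind {
    /// The new task: rescue trio + promote_scheduled.
    UserVisible,
    /// The existing leader loop: everything else.
    Hygiene,
}

/// Branch names taken from this issue's list of user-visible branches.
const USER_VISIBLE: [&str; 4] = [
    "promote_scheduled",
    "rescue_stale_heartbeats",
    "rescue_expired_deadlines",
    "rescue_expired_callbacks",
];

/// Route a branch by name. ASSUMPTION: any branch not in USER_VISIBLE
/// is a hygiene branch and stays in the existing leader loop.
fn loop_for(branch: &str) -> LoopKind {
    if USER_VISIBLE.iter().any(|name| *name == branch) {
        LoopKind::UserVisible
    } else {
        LoopKind::Hygiene
    }
}

fn main() {
    for branch in ["promote_scheduled", "cleanup", "rescue_stale_heartbeats", "cron_sync"] {
        println!("{branch} -> {:?}", loop_for(branch));
    }
}
```

Keeping the user-visible set as an explicit allowlist means a newly added branch lands in the hygiene loop by default and has to be promoted deliberately.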
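Related
- maintenance-only-runtime-role. Distinct concern: ADR-028 is a separate process dedicated to maintenance, whereas this issue is an intra-process task split. Both could ship; neither implies the other.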
Acceptance (once activated)
- Real fleet evidence of awa_maintenance_branch_overrun_total rate ≥ X on at least one user-visible branch (X TBD when the data arrives).
- A design ADR (probably ADR-030 or later) describing the split, the advisory-lock interaction, and a roll-forward path.
- A long-horizon benchmark validating that the split reduces overrun events without introducing new races.