Proposal: RPC Refactor + Retry Queue Worker (Rescanner Replacement)

## 1) Summary
Refactor RPC infrastructure into a cleaner package layout and replace `rescanner` with a `retry` worker that reuses the same queue engine as `manual`, backed by an additional dedicated retry queue.

## 2) Why We Need This
Current pain points in `main`:

1. RPC construction and provider wiring are duplicated across chain builder functions in `internal/worker/factory.go`.
2. Failover, transport, auth, and chain-specific logic are too tightly coupled in `internal/rpc`.
3. Rescanner depends on a shared `failedChan` fan-in/fan-out model.
4. Sends to `failedChan` are non-blocking and are dropped when the channel is full (`failedChan full, dropping block event`), which can delay retries.
5. `FailedBlockEvent` includes `Chain`, but the rescanner listener currently consumes from the shared channel without filtering by chain, which creates a risk of cross-chain contamination.
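
To make pain points 4 and 5 concrete, here is a minimal sketch of a non-blocking channel send that drops when the buffer is full. The `FailedBlockEvent` fields other than `Chain`, and the helper name `trySend`, are assumptions for illustration and are not taken from the codebase.

```go
package main

import "fmt"

// FailedBlockEvent carries the Chain field named in the proposal.
// The Block field is an assumption made for this sketch.
type FailedBlockEvent struct {
	Chain string
	Block uint64
}

// trySend (hypothetical helper) performs a non-blocking send and reports
// whether the event was accepted. When the buffer is full, the default
// branch runs and the event is dropped.
func trySend(ch chan<- FailedBlockEvent, ev FailedBlockEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

func main() {
	// A tiny buffer forces the second send to be dropped.
	failedChan := make(chan FailedBlockEvent, 1)

	fmt.Println(trySend(failedChan, FailedBlockEvent{Chain: "evm", Block: 100}))
	fmt.Println(trySend(failedChan, FailedBlockEvent{Chain: "tron", Block: 7}))

	// A single shared consumer receives events from every chain unless it
	// filters on ev.Chain itself.
	ev := <-failedChan
	fmt.Println(ev.Chain, ev.Block)
}
```

The dropped event is not persisted anywhere in this model, which is the behavior the queue-driven design is meant to replace.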

## 3) Goals
1. Standardize RPC composition (`transport` + `failover` + chain manager).
2. Remove the channel-driven rescanner path and replace it with a deterministic, queue-driven retry.
3. Reuse one queue worker implementation for `manual` and `retry`.
4. Preserve operational behavior for regular/catchup/manual/mempool workers.
5. Provide safe migration from legacy failed-block storage.

## 4) Non-Goals
1. No change to business event payload schema.
2. No change to chain parser semantics.
3. No hard cutover requiring immediate config rewrite by operators.

## 5) Proposed Architecture

### 5.1 RPC Package Refactor
Target package layout:

1. `pkg/rpc/failover`
2. `pkg/rpc/transport/httpx`
3. `pkg/rpc/transport/jsonrpc`
4. `pkg/rpc/{evm,tron,bitcoin,solana,sui,cosmos,aptos,ton}`
5. `pkg/rpc/bootstrap`

Key design:

1. Each chain package exposes `NewProviderManager(chainName, chainCfg)`.
2. `bootstrap` has generic builders:
   - `BuildHTTPFailover(...)`
   - `BuildGRPCFailover(...)`
3. `ProviderManager[T]` owns provider selection, retries, blacklisting, and metrics.
4. Transport concerns (`auth`, request/response handling, JSON-RPC batching) are isolated from failover policy.
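
As a rough illustration of point 3, the sketch below shows one way a generic manager could own provider selection and blacklisting independently of transport. It is a simplified stand-in, not the `ProviderManager[T]` on the `refactor-rpc` branch: the constructor signature, `Add`, `Do`, the cooldown parameter, and `ErrNoProvider` are all assumptions, and metrics are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// provider pairs a client of any type with its blacklist deadline.
type provider[T any] struct {
	name             string
	client           T
	blacklistedUntil time.Time
}

// ProviderManager is a hypothetical, simplified failover manager.
type ProviderManager[T any] struct {
	providers []*provider[T]
	cooldown  time.Duration
	now       func() time.Time // injectable clock
}

// ErrNoProvider is returned when every provider is blacklisted.
var ErrNoProvider = errors.New("no healthy provider available")

func NewProviderManager[T any](cooldown time.Duration) *ProviderManager[T] {
	return &ProviderManager[T]{cooldown: cooldown, now: time.Now}
}

// Add registers a provider; earlier providers are preferred.
func (m *ProviderManager[T]) Add(name string, client T) {
	m.providers = append(m.providers, &provider[T]{name: name, client: client})
}

// Do tries each non-blacklisted provider in order. A failing provider is
// blacklisted for the cooldown period and the next one is tried. It returns
// the name of the provider that served the call.
func (m *ProviderManager[T]) Do(call func(T) error) (string, error) {
	lastErr := ErrNoProvider
	for _, p := range m.providers {
		if m.now().Before(p.blacklistedUntil) {
			continue // still cooling down
		}
		if err := call(p.client); err != nil {
			p.blacklistedUntil = m.now().Add(m.cooldown)
			lastErr = err
			continue
		}
		return p.name, nil
	}
	return "", lastErr
}

func main() {
	// The client type here is just a URL string; a real chain package
	// would plug in its own client type.
	m := NewProviderManager[string](30 * time.Second)
	m.Add("primary", "https://primary.invalid")
	m.Add("backup", "https://backup.invalid")

	call := func(url string) error {
		if url == "https://primary.invalid" {
			return errors.New("timeout")
		}
		return nil
	}
	used, err := m.Do(call)
	fmt.Println(used, err)
}
```

Because the manager only sees a `func(T) error`, auth and request handling can live entirely in the transport packages.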

Notes:

1. Branch `refactor-rpc` currently uses `internal/rpc/*` with this architecture already applied.
2. We can land the behavior refactor first under `internal/rpc`, then move to `pkg/rpc` in a follow-up rename for a lower-risk rollout.

### 5.2 Worker Refactor: Rescanner -> Retry Queue
Replace rescanner with a queue worker model:

1. Keep one generic queue worker runtime:
   - `internal/worker/queue/worker.go`
2. `ManualWorker` and `RetryWorker` both wrap the queue worker core.
3. Two queue stores:
   - Manual queue: `missing_blocks:*`
   - Retry queue: `retry_blocks:*`
4. The retry queue uses single-block ranges (`MaxBlocksPerRange = 1`) so that each failed block is retried independently.
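
The effect of `MaxBlocksPerRange` can be sketched as a range splitter. The function `splitRange` and the `BlockRange` type are hypothetical names for this illustration; the actual queue store may bound ranges differently.

```go
package main

import "fmt"

// BlockRange is an inclusive [Start, End] range of block numbers.
type BlockRange struct {
	Start, End uint64
}

// splitRange cuts [start, end] into consecutive ranges that hold at most
// maxBlocks blocks each. It returns nil for an invalid input.
func splitRange(start, end, maxBlocks uint64) []BlockRange {
	if maxBlocks == 0 || start > end {
		return nil
	}
	var out []BlockRange
	for s := start; ; {
		e := s + maxBlocks - 1
		if e > end || e < s { // clamp to end, and guard against uint64 overflow
			e = end
		}
		out = append(out, BlockRange{Start: s, End: e})
		if e == end {
			break
		}
		s = e + 1
	}
	return out
}

func main() {
	// With maxBlocks = 1, every block becomes its own range.
	fmt.Println(splitRange(100, 102, 1))
	// A larger limit groups blocks into bigger ranges.
	fmt.Println(splitRange(100, 104, 2))
}
```

With a limit of 1, a single bad block cannot hold up its neighbours in the same range.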

Block failure flow:

1. Regular/Catchup/Manual processing hits a block error.
2. The runtime processor enqueues `result.Number` to the retry queue (`AddRange(network, n, n)`).
3. The retry worker consumes the queue and reprocesses the block.
4. On success, queue progress is updated and the range is removed.
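
The flow above can be sketched with an in-memory queue standing in for the Redis-backed store. Only the `AddRange(network, n, n)` call shape comes from this proposal; `retryQueue`, `processBlock`, `drain`, and `errRPC` are hypothetical names, and progress tracking is reduced to removing a range once it succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

var errRPC = errors.New("rpc unavailable")

type blockRange struct{ start, end uint64 }

// retryQueue is an in-memory stand-in for the retry store.
// Ranges are kept per network, so chains never share entries.
type retryQueue struct {
	ranges map[string][]blockRange
}

func newRetryQueue() *retryQueue {
	return &retryQueue{ranges: make(map[string][]blockRange)}
}

// AddRange enqueues [start, end] for one network, skipping exact duplicates.
func (q *retryQueue) AddRange(network string, start, end uint64) {
	for _, r := range q.ranges[network] {
		if r.start == start && r.end == end {
			return
		}
	}
	q.ranges[network] = append(q.ranges[network], blockRange{start, end})
}

// Len reports how many ranges are pending for a network.
func (q *retryQueue) Len(network string) int { return len(q.ranges[network]) }

// processBlock is what a regular/catchup/manual worker would do per block:
// on error, enqueue the block number as a single-block range.
func processBlock(q *retryQueue, network string, n uint64, handle func(uint64) error) {
	if err := handle(n); err != nil {
		q.AddRange(network, n, n)
	}
}

// drain is one retry-worker pass. Ranges hold one block each
// (MaxBlocksPerRange = 1), so only r.start is reprocessed. Ranges that
// succeed are removed; ranges that still fail stay queued.
func drain(q *retryQueue, network string, handle func(uint64) error) {
	var remaining []blockRange
	for _, r := range q.ranges[network] {
		if err := handle(r.start); err != nil {
			remaining = append(remaining, r)
		}
	}
	q.ranges[network] = remaining
}

func main() {
	q := newRetryQueue()
	attempts := 0
	flaky := func(n uint64) error {
		attempts++
		if attempts == 1 {
			return errRPC // first attempt fails, later attempts succeed
		}
		return nil
	}

	processBlock(q, "evm", 42, flaky)
	fmt.Println("pending after failure:", q.Len("evm"))

	drain(q, "evm", flaky)
	fmt.Println("pending after retry:", q.Len("evm"))
}
```

Unlike a dropped channel send, a failed block stays visible in queue state until a retry succeeds.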

Result:

1. Remove `failedChan` and `FailedBlockEvent` from worker runtime path.
2. Eliminate the shared-channel race and cross-chain misrouting.
3. Retry behavior becomes observable and deterministic through Redis queue state.

## 6) Data Migration and Compatibility

### 6.1 Legacy Failed Block Migration
At manager bootstrap (per chain):

1. Read legacy failed blocks from `blockStore.GetFailedBlocks(internalCode)`.
2. Enqueue each block into the retry queue.
3. Remove migrated entries from the legacy failed-block store.
4. Log migrated count.
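
A minimal sketch of these migration steps follows. The interfaces are assumptions: the proposal only names `blockStore.GetFailedBlocks(internalCode)` and `AddRange(network, n, n)`, so the return types, the `RemoveFailedBlocks` method, and the use of `internalCode` as the queue's network key are illustrative. The sketch also assumes that re-adding an already queued block is harmless, so a failed migration can simply be rerun at the next startup.

```go
package main

import "fmt"

// legacyStore is a hypothetical subset of the legacy failed-block store.
type legacyStore interface {
	GetFailedBlocks(internalCode string) ([]uint64, error)
	RemoveFailedBlocks(internalCode string, blocks []uint64) error
}

// retryEnqueuer is a hypothetical subset of the retry queue store.
type retryEnqueuer interface {
	AddRange(network string, start, end uint64) error
}

// migrateFailedBlocks enqueues every legacy failed block and then removes
// the legacy entries. Removal happens only after all blocks are enqueued,
// so an error leaves the legacy store intact for the next attempt.
func migrateFailedBlocks(legacy legacyStore, retry retryEnqueuer, internalCode string) (int, error) {
	blocks, err := legacy.GetFailedBlocks(internalCode)
	if err != nil {
		return 0, fmt.Errorf("read legacy failed blocks: %w", err)
	}
	if len(blocks) == 0 {
		return 0, nil
	}
	for _, n := range blocks {
		if err := retry.AddRange(internalCode, n, n); err != nil {
			return 0, fmt.Errorf("enqueue block %d: %w", n, err)
		}
	}
	if err := legacy.RemoveFailedBlocks(internalCode, blocks); err != nil {
		return 0, fmt.Errorf("remove legacy entries: %w", err)
	}
	return len(blocks), nil
}

// In-memory fakes used to exercise the sketch.
type memLegacy struct{ failed map[string][]uint64 }

func (m *memLegacy) GetFailedBlocks(code string) ([]uint64, error) {
	return append([]uint64(nil), m.failed[code]...), nil
}

func (m *memLegacy) RemoveFailedBlocks(code string, blocks []uint64) error {
	drop := make(map[uint64]bool, len(blocks))
	for _, b := range blocks {
		drop[b] = true
	}
	var kept []uint64
	for _, b := range m.failed[code] {
		if !drop[b] {
			kept = append(kept, b)
		}
	}
	m.failed[code] = kept
	return nil
}

type memRetry struct{ added map[string][]uint64 }

func (m *memRetry) AddRange(network string, start, end uint64) error {
	// Single-block ranges only in this sketch, so start == end.
	m.added[network] = append(m.added[network], start)
	return nil
}

func main() {
	legacy := &memLegacy{failed: map[string][]uint64{"evm": {10, 11}, "tron": {5}}}
	retry := &memRetry{added: map[string][]uint64{}}

	n, err := migrateFailedBlocks(legacy, retry, "evm")
	fmt.Println("migrated:", n, "err:", err)
	fmt.Println("legacy evm left:", len(legacy.failed["evm"]))
}
```

Running the migration per chain keeps each chain's legacy entries mapped to its own retry queue.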

### 6.2 Config Compatibility
Short-term compatibility policy:

1. Keep parsing `services.worker.rescanner.enabled`.
2. Mark it deprecated and ignore it at runtime.
3. Add `services.worker.retry.enabled` (recommended) or keep retry always-on (if queue idle cost is acceptable).

Recommended:

1. Introduce explicit `retry` config flag with default `true`.
2. Keep `rescanner` key accepted for at least 2 release cycles with warning logs.
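
The compatibility policy could be resolved along these lines. The `WorkerConfig` struct and `resolveRetry` function are hypothetical; only the two config key names come from this proposal. Pointer fields are used so that an absent key can be told apart from an explicit `false`.

```go
package main

import "fmt"

// WorkerConfig is a hypothetical parsed view of the worker settings.
type WorkerConfig struct {
	RescannerEnabled *bool // services.worker.rescanner.enabled (deprecated)
	RetryEnabled     *bool // services.worker.retry.enabled
}

// resolveRetry returns whether the retry worker should run, plus any
// deprecation warnings to log. The deprecated key is accepted but never
// changes the result.
func resolveRetry(cfg WorkerConfig) (enabled bool, warnings []string) {
	enabled = true // recommended default
	if cfg.RetryEnabled != nil {
		enabled = *cfg.RetryEnabled
	}
	if cfg.RescannerEnabled != nil {
		warnings = append(warnings,
			"services.worker.rescanner.enabled is deprecated and ignored; use services.worker.retry.enabled")
	}
	return enabled, warnings
}

// boolPtr is a small helper for building configs in examples.
func boolPtr(b bool) *bool { return &b }

func main() {
	// Legacy config: only the old key is present.
	enabled, warnings := resolveRetry(WorkerConfig{RescannerEnabled: boolPtr(false)})
	fmt.Println(enabled, len(warnings))

	// New config: retry explicitly disabled.
	enabled, warnings = resolveRetry(WorkerConfig{RetryEnabled: boolPtr(false)})
	fmt.Println(enabled, len(warnings))
}
```

An existing config that sets only the old key still boots, with retry on by default and one warning logged.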

### 6.3 CLI / UX
1. Replace user-facing wording from `rescanner` to `automatic retry worker`.
2. No extra flag required in phase 1 if retry is default-enabled.

## 7) Rollout Plan

### Phase 0: Preparation
1. Add queue store abstraction for retry (`pkg/store/blockrangestore`).
2. Add metrics names for retry queue depth/throughput.

### Phase 1: RPC Layer Refactor (No Behavior Change)
1. Introduce provider-manager builders per chain.
2. Keep existing indexer behavior untouched.
3. Add unit tests for failover manager and transport helpers.

### Phase 2: Queue Runtime Unification
1. Introduce runtime core + queue worker.
2. Switch manual worker to queue worker implementation.
3. Add retry worker using same queue worker engine.

### Phase 3: Rescanner Deprecation
1. Remove `failedChan` writes and listeners from worker flow.
2. Migrate legacy failed blocks at startup.
3. Keep the backward-compatibility warning for the `rescanner` config key.

### Phase 4: Cleanup
1. Remove dead rescanner code paths.
2. Finalize docs and runbook updates.
3. Optionally rename `internal/rpc` to `pkg/rpc` if not done yet.

## 8) Testing Strategy
1. Unit tests:
   - Queue add/merge/claim/remove semantics.
   - Retry enqueue on block error in processor.
   - Retry worker range progress + removal behavior.
   - Failover provider switching and blacklist recovery.
2. Integration tests:
   - Multi-chain run with injected RPC failures to verify no cross-chain retry pollution.
   - Migration test from legacy failed blocks into retry queue.
3. Regression tests:
   - Event emission path unchanged for successful blocks.
   - Catchup/manual behavior unchanged.

## 9) Observability
Add or expose:

1. Retry queue depth per chain.
2. Retry processed/success/failure counts.
3. Retry enqueue rate by worker mode source.
4. Failover metrics snapshot per chain/provider.
5. Warning count for deprecated rescanner config usage.

## 10) Risks and Mitigations
1. Risk: Duplicate retries if legacy store and retry queue are both active.
   - Mitigation: one-time migration then delete legacy entries.
2. Risk: Redis queue growth under persistent RPC outage.
   - Mitigation: queue depth alerting + provider failover tuning + retry backoff.
3. Risk: Behavior drift during package move (`internal/rpc` -> `pkg/rpc`).
   - Mitigation: split into separate PRs (behavior first, path rename second).
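
For the retry backoff mentioned under risk 2, one common option is capped exponential backoff. The sketch below is an assumption about how it could look; the proposal does not specify a backoff policy, and the function name and parameters are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the delay before retry attempt n (0-based):
// base * 2^n, capped at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		if d > max/2 {
			return max // doubling would exceed the cap
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, backoff(attempt, time.Second, 30*time.Second))
	}
}
```

Capping the delay keeps blocks from waiting indefinitely once providers recover, while the growth slows down re-enqueue churn during a long outage.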

## 11) Acceptance Criteria
1. No worker path writes to or depends on `failedChan`.
2. Failed blocks are retried only through retry queue.
3. Cross-chain retry contamination is impossible by design.
4. Manual and retry workers share one queue runtime implementation.
5. RPC builders are standardized by chain manager and failover abstraction.
6. Existing production config still boots with deprecation warnings only.

