1) Summary
Refactor RPC infrastructure into a cleaner package layout and replace rescanner with a retry worker that reuses the same queue engine as manual, backed by an additional dedicated retry queue.
2) Why We Need This
Current pain points in main:
- RPC construction and provider wiring are duplicated across chain builder functions in internal/worker/factory.go.
- Failover, transport, auth, and chain-specific logic are too tightly coupled in internal/rpc.
- Rescanner depends on a shared failedChan fan-in/fan-out model:
  - failedChan is non-blocking and drops events when full (logged as "failedChan full, dropping block event"), which can delay retries.
  - FailedBlockEvent includes Chain, but the rescanner listener currently consumes from the shared channel without chain filtering, creating cross-chain contamination risk.
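To make the two failedChan problems concrete, here is a minimal sketch of the pattern described above. It is not the code in main: the FailedBlockEvent field set and the helper names trySend and drainUnfiltered are assumptions made for illustration.

```go
package main

// FailedBlockEvent mirrors the event named above. Only Chain is confirmed
// by this document; the Block field is an assumption.
type FailedBlockEvent struct {
	Chain string
	Block uint64
}

// trySend (hypothetical helper) performs a non-blocking send. When the
// buffer is full it returns false and the event is dropped, which is the
// "failedChan full, dropping block event" case.
func trySend(ch chan<- FailedBlockEvent, ev FailedBlockEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

// drainUnfiltered (hypothetical helper) models a listener that serves
// wantChain but reads the shared channel without checking ev.Chain.
// It returns how many events belonging to other chains it consumed.
func drainUnfiltered(ch <-chan FailedBlockEvent, wantChain string) int {
	foreign := 0
	for {
		select {
		case ev := <-ch:
			if ev.Chain != wantChain {
				foreign++ // consumed an event that another chain needed
			}
		default:
			return foreign
		}
	}
}
```

With a buffer of size 1, a second trySend returns false and that block is never handed to the rescanner through the channel. Any non-zero result from drainUnfiltered is a failed block that reached the wrong chain's listener.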
3) Goals
- Standardize RPC composition (transport + failover + chain manager).
- Remove channel-driven rescanner path and replace with deterministic queue-driven retry.
- Reuse one queue worker implementation for manual and retry.
- Preserve operational behavior for regular/catchup/manual/mempool workers.
- Provide safe migration from legacy failed-block storage.
4) Non-Goals
- No change to business event payload schema.
- No change to chain parser semantics.
- No hard cutover requiring immediate config rewrite by operators.
5) Proposed Architecture
5.1 RPC Package Refactor
Target package layout:
pkg/rpc/failover
pkg/rpc/transport/httpx
pkg/rpc/transport/jsonrpc
pkg/rpc/{evm,tron,bitcoin,solana,sui,cosmos,aptos,ton}
pkg/rpc/bootstrap
Key design:
- Each chain package exposes NewProviderManager(chainName, chainCfg).
- bootstrap has generic builders:
  - BuildHTTPFailover(...)
  - BuildGRPCFailover(...)
- ProviderManager[T] owns provider selection, retries, blacklisting, and metrics.
- Transport concerns (auth, request/response handling, JSON-RPC batching) are isolated from failover policy.
Notes:
- Branch refactor-rpc currently uses internal/rpc/* with this architecture already applied.
- We can land behavior refactor first under internal/rpc, then move to pkg/rpc in a follow-up rename if we want a lower-risk rollout.
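As a rough illustration of how a generic ProviderManager[T] can keep failover policy separate from transport, consider the sketch below. Only the name ProviderManager[T] comes from this proposal. The fields, NewManager, Add, Do, and the cooldown-based blacklist are assumptions, and retries and metrics are left out for brevity.

```go
package main

import (
	"errors"
	"time"
)

// ErrNoProvider is returned when no provider was available to try.
var ErrNoProvider = errors.New("no healthy provider available")

// provider pairs a transport client of any type T with failover state.
type provider[T any] struct {
	name             string
	client           T
	blacklistedUntil time.Time
}

// ProviderManager owns selection and blacklisting (assumed shape).
// It knows nothing about how T talks to the node.
type ProviderManager[T any] struct {
	providers []*provider[T]
	cooldown  time.Duration
	now       func() time.Time // injectable clock for tests
}

// NewManager creates a manager that blacklists a failing provider for cooldown.
func NewManager[T any](cooldown time.Duration) *ProviderManager[T] {
	return &ProviderManager[T]{cooldown: cooldown, now: time.Now}
}

// Add registers a provider; earlier providers are preferred.
func (m *ProviderManager[T]) Add(name string, client T) {
	m.providers = append(m.providers, &provider[T]{name: name, client: client})
}

// Do runs call against each non-blacklisted provider in order. A failing
// provider is blacklisted for the cooldown and the next one is tried.
// It returns the name of the provider that served the call.
func (m *ProviderManager[T]) Do(call func(client T) error) (string, error) {
	lastErr := ErrNoProvider
	for _, p := range m.providers {
		if m.now().Before(p.blacklistedUntil) {
			continue // still cooling down
		}
		if err := call(p.client); err != nil {
			p.blacklistedUntil = m.now().Add(m.cooldown)
			lastErr = err
			continue
		}
		return p.name, nil
	}
	return "", lastErr
}
```

Under this shape, a chain package's NewProviderManager(chainName, chainCfg) would only need to build its typed clients and register them. The transport packages never see blacklist state.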
5.2 Worker Refactor: Rescanner -> Retry Queue
Replace rescanner with a queue worker model:
- Keep one generic queue worker runtime: internal/worker/queue/worker.go
- ManualWorker and RetryWorker both wrap queue worker core.
- Two queue stores:
  - Manual queue: missing_blocks:*
  - Retry queue: retry_blocks:*
- Retry queue uses small range granularity (MaxBlocksPerRange = 1) for per-block retries.
Block failure flow:
- Regular/Catchup/Manual processing hits block error.
- Runtime processor enqueues result.Number to retry queue (AddRange(network, n, n)).
- Retry worker consumes queue and reprocesses.
- On success, queue progress is updated and range removed.
Result:
- Remove failedChan and FailedBlockEvent from worker runtime path.
- Eliminate shared channel race/cross-chain misrouting.
- Retry behavior becomes observable and deterministic through Redis queue state.
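The block failure flow above can be sketched with an in-memory stand-in for the Redis-backed retry queue. AddRange(network, n, n) is the call named in this proposal. The RetryQueue type, Pending, Remove, and RetryOnce are hypothetical names, and the real store's key layout and claim semantics are not modeled.

```go
package main

import "sort"

// RetryQueue is an in-memory stand-in for the retry_blocks:* store.
// Entries are keyed by network, so chains cannot see each other's blocks.
type RetryQueue struct {
	pending map[string]map[uint64]struct{}
}

// NewRetryQueue returns an empty queue.
func NewRetryQueue() *RetryQueue {
	return &RetryQueue{pending: make(map[string]map[uint64]struct{})}
}

// AddRange enqueues every block in [start, end] for network.
// Re-adding a block that is already queued is a no-op.
func (q *RetryQueue) AddRange(network string, start, end uint64) {
	if start > end {
		return
	}
	set := q.pending[network]
	if set == nil {
		set = make(map[uint64]struct{})
		q.pending[network] = set
	}
	for n := start; ; n++ {
		set[n] = struct{}{}
		if n == end { // explicit break avoids uint64 wraparound
			break
		}
	}
}

// Pending returns the queued blocks for network in ascending order.
func (q *RetryQueue) Pending(network string) []uint64 {
	out := make([]uint64, 0, len(q.pending[network]))
	for n := range q.pending[network] {
		out = append(out, n)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// Remove drops a block once it has been reprocessed successfully.
func (q *RetryQueue) Remove(network string, n uint64) {
	delete(q.pending[network], n)
}

// RetryOnce reprocesses each pending block for network, removes the ones
// that succeed, and returns how many are still queued.
func RetryOnce(q *RetryQueue, network string, process func(n uint64) error) int {
	for _, n := range q.Pending(network) {
		if process(n) == nil {
			q.Remove(network, n)
		}
	}
	return len(q.pending[network])
}
```

In this model the runtime processor would call q.AddRange(network, n, n) when block n fails, and the retry worker would call RetryOnce on its own network only. Queue state can be inspected at any time through Pending, which is what makes retries observable.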
6) Data Migration and Compatibility
6.1 Legacy Failed Block Migration
At manager bootstrap (per chain):
- Read legacy failed blocks from blockStore.GetFailedBlocks(internalCode).
- Enqueue each block into retry queue.
- Remove migrated entries from legacy failed block store.
- Log migrated count.
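The migration steps above might look roughly like the following. GetFailedBlocks(internalCode) is the call named in this proposal, but its return type here is assumed. RemoveFailedBlocks, the interface names, and the in-memory memLegacy and memRetry stand-ins are hypothetical and exist only to make the sketch self-contained.

```go
package main

import (
	"fmt"
	"log"
)

// LegacyStore is the assumed shape of the legacy failed-block store.
type LegacyStore interface {
	GetFailedBlocks(internalCode string) ([]uint64, error)
	RemoveFailedBlocks(internalCode string, blocks []uint64) error // hypothetical
}

// RetryEnqueuer is the part of the retry queue the migration needs.
type RetryEnqueuer interface {
	AddRange(network string, start, end uint64) error
}

// MigrateFailedBlocks moves one chain's legacy failed blocks into the retry
// queue. Only blocks that were enqueued successfully are deleted from the
// legacy store, so a partial failure never loses a block.
func MigrateFailedBlocks(legacy LegacyStore, retry RetryEnqueuer, internalCode string) (int, error) {
	blocks, err := legacy.GetFailedBlocks(internalCode)
	if err != nil {
		return 0, fmt.Errorf("read legacy failed blocks for %s: %w", internalCode, err)
	}
	migrated := make([]uint64, 0, len(blocks))
	var enqueueErr error
	for _, n := range blocks {
		if enqueueErr = retry.AddRange(internalCode, n, n); enqueueErr != nil {
			break
		}
		migrated = append(migrated, n)
	}
	if len(migrated) > 0 {
		if err := legacy.RemoveFailedBlocks(internalCode, migrated); err != nil {
			return len(migrated), fmt.Errorf("remove migrated blocks for %s: %w", internalCode, err)
		}
	}
	log.Printf("migrated %d legacy failed blocks for %s", len(migrated), internalCode)
	return len(migrated), enqueueErr
}

// memLegacy and memRetry are in-memory stand-ins for demonstration.
type memLegacy struct{ failed map[string][]uint64 }

func (m *memLegacy) GetFailedBlocks(code string) ([]uint64, error) {
	return m.failed[code], nil
}

func (m *memLegacy) RemoveFailedBlocks(code string, blocks []uint64) error {
	drop := make(map[uint64]bool, len(blocks))
	for _, n := range blocks {
		drop[n] = true
	}
	var kept []uint64
	for _, n := range m.failed[code] {
		if !drop[n] {
			kept = append(kept, n)
		}
	}
	m.failed[code] = kept
	return nil
}

type memRetry struct{ queued map[string][]uint64 }

func (m *memRetry) AddRange(network string, start, end uint64) error {
	for n := start; n <= end; n++ {
		m.queued[network] = append(m.queued[network], n)
		if n == end { // avoid uint64 wraparound
			break
		}
	}
	return nil
}
```

Enqueueing before deleting means a crash between the two steps leaves a block in both places rather than in neither. That is safe as long as the retry queue merges duplicate entries on the next startup.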
6.2 Config Compatibility
Short-term compatibility policy:
- Keep parsing services.worker.rescanner.enabled.
- Mark it deprecated and ignore at runtime.
- Add services.worker.retry.enabled (recommended) or keep retry always-on (if queue idle cost is acceptable).
Recommended:
- Introduce explicit retry config flag with default true.
- Keep rescanner key accepted for at least 2 release cycles with warning logs.
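One way to implement the recommended policy is to resolve both keys in a single place, as in the sketch below. The struct and function names are hypothetical and config file loading is omitted. Pointer fields are used so that an unset key can be told apart from an explicit false.

```go
package main

// Toggle holds an optional enabled flag; nil means the key was not set.
type Toggle struct {
	Enabled *bool
}

// WorkerConfig models only the two keys discussed above (assumed layout).
type WorkerConfig struct {
	Rescanner Toggle // services.worker.rescanner.enabled (deprecated)
	Retry     Toggle // services.worker.retry.enabled
}

// ResolveRetry decides whether the retry worker runs and collects
// deprecation warnings for the caller to log.
//   - retry defaults to true when services.worker.retry.enabled is unset
//   - the rescanner key is accepted but never changes the outcome
func ResolveRetry(cfg WorkerConfig) (enabled bool, warnings []string) {
	enabled = true
	if cfg.Retry.Enabled != nil {
		enabled = *cfg.Retry.Enabled
	}
	if cfg.Rescanner.Enabled != nil {
		warnings = append(warnings,
			"services.worker.rescanner.enabled is deprecated and ignored; use services.worker.retry.enabled")
	}
	return enabled, warnings
}
```

With this shape, an existing config that only sets the rescanner key still boots with retry enabled and exactly one warning. The same warning list can feed the deprecated-config counter proposed under Observability.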
6.3 CLI / UX
- Replace user-facing wording from "rescanner" to "automatic retry worker".
- No extra flag required in phase 1 if retry is default-enabled.
7) Rollout Plan
Phase 0: Preparation
- Add queue store abstraction for retry (pkg/store/blockrangestore).
- Add metrics names for retry queue depth/throughput.
Phase 1: RPC Layer Refactor (No Behavior Change)
- Introduce provider-manager builders per chain.
- Keep existing indexer behavior untouched.
- Add unit tests for failover manager and transport helpers.
Phase 2: Queue Runtime Unification
- Introduce runtime core + queue worker.
- Switch manual worker to queue worker implementation.
- Add retry worker using same queue worker engine.
Phase 3: Rescanner Deprecation
- Remove failedChan writes and listeners from worker flow.
- Migrate legacy failed blocks at startup.
- Keep config backward compatibility warning for rescanner.
Phase 4: Cleanup
- Remove dead rescanner code paths.
- Finalize docs and runbook updates.
- Optionally rename internal/rpc to pkg/rpc if not done yet.
8) Testing Strategy
- Unit tests:
  - Queue add/merge/claim/remove semantics.
  - Retry enqueue on block error in processor.
  - Retry worker range progress + removal behavior.
  - Failover provider switching and blacklist recovery.
- Integration tests:
  - Multi-chain run with injected RPC failures to verify no cross-chain retry pollution.
  - Migration test from legacy failed blocks into retry queue.
- Regression tests:
  - Event emission path unchanged for successful blocks.
  - Catchup/manual behavior unchanged.
9) Observability
Add or expose:
- Retry queue depth per chain.
- Retry processed/success/failure counts.
- Retry enqueue rate by worker mode source.
- Failover metrics snapshot per chain/provider.
- Warning count for deprecated rescanner config usage.
10) Risks and Mitigations
- Risk: Duplicate retries if legacy store and retry queue are both active.
  - Mitigation: one-time migration then delete legacy entries.
- Risk: Redis queue growth under persistent RPC outage.
  - Mitigation: queue depth alerting + provider failover tuning + retry backoff.
- Risk: Behavior drift during package move (internal/rpc -> pkg/rpc).
  - Mitigation: split into separate PRs (behavior first, path rename second).
11) Acceptance Criteria
- No worker path writes to or depends on failedChan.
- Failed blocks are retried only through retry queue.
- Cross-chain retry contamination is impossible by design.
- Manual and retry workers share one queue runtime implementation.
- RPC builders are standardized by chain manager and failover abstraction.
- Existing production config still boots with deprecation warnings only.