A rough, advisory sketch of how to approach the DB vs on-chain state inconsistency problem, written by @rvagg:
Curio PDP Doctor Command Design
A diagnostic tool to check consistency between local DB state and on-chain state across PDPVerifier, FWSS, and FilecoinPay contracts.
Project Structure
Initial approach: Start as a standalone external project/tool. This allows rapid iteration without curio release cycles and keeps the diagnostic tooling decoupled from production code.
Future integration: May eventually be integrated into the curio CLI as a subcommand (e.g., curio pdp doctor) depending on quality and maintainer preference. Design with this in mind - use curio's existing contract bindings and database utilities where possible.
Dependencies (for standalone version):
- github.com/filecoin-project/curio/pdp/contract - Contract bindings (PDPVerifier, FWSS, FilecoinPay)
- github.com/filecoin-project/curio/harmony/harmonydb - Database connection utilities
- github.com/ethereum/go-ethereum - Ethereum client
- PostgreSQL driver for Yugabyte/Postgres connection
Context: Where This Fits
Repository: Initially standalone; eventual integration target is github.com/filecoin-project/curio (branch: pdpv0)
Related Files:
- tasks/pdp/ - All PDP task implementations
- tasks/pay/ - Settlement task (settle_task.go), watcher
- pdp/ - HTTP API handlers
- pdp/contract/ - Generated contract bindings
- pdp/contract/addresses.go - Contract addresses per network
- lib/chainsched/ - Chain event system
- lib/filecoinpayment/ - Payment utilities, Multicall3 batching
- harmony/harmonydb/ - Database utilities
Contract Addresses (from pdp/contract/addresses.go):
| Contract | Calibnet | Mainnet |
|---|---|---|
| PDPVerifier | 0x85e366Cf9DD2c0aE37E963d9556F5f4718d6417C | 0xBADd0B92C1c71d02E7d520f64c0876538fa2557F |
| FWSS | 0x02925630df557F957f70E112bA06e50965417CA0 | 0x8408502033C418E1bbC97cE9ac48E5528F371A9f |
| ServiceRegistry | 0x839e5c9988e4e9977d40708d0094103c0839Ac9D | 0xf55dDbf63F1b55c3F1D4FA7e339a68AB7b64A5eB |
| USDFC | 0xb3042734b608a1B16e9e86B374A3f3e389B4cDf0 | 0x80B98d3aa09ffff255c3ba4A241111Ff1262F045 |
For devnet: addresses come from env vars (CURIO_DEVNET_PDP_VERIFIER_ADDRESS, etc.) or ~/.foc-localnet/state/latest/contract_addresses.json
FilecoinPay address: Discovered from FWSS via paymentsContractAddress() method
Data Sources
Local Database Tables
pdp_data_sets (Main data set tracking)
-- Schema (after all migrations)
CREATE TABLE pdp_data_sets (
id BIGINT PRIMARY KEY, -- On-chain data set ID from PDPVerifier
prev_challenge_request_epoch BIGINT, -- Last challenge request
challenge_request_task_id BIGINT, -- FK to harmony_task
challenge_request_msg_hash TEXT, -- Pending nextProvingPeriod tx
proving_period BIGINT, -- Epochs between proofs (e.g., 2880 = 1 day)
challenge_window BIGINT, -- Epochs to submit proof
prove_at_epoch BIGINT, -- Next challenge window start
init_ready BOOLEAN DEFAULT FALSE, -- First piece added, can init proving
create_message_hash TEXT NOT NULL, -- Data set creation tx
service TEXT NOT NULL, -- FK to pdp_services.service_label
-- Added by 20260110-pdp-termination-handling.sql:
terminated_at_epoch BIGINT, -- Block when termination detected (NULL = active)
consecutive_prove_failures INT DEFAULT 0, -- Backoff counter
next_prove_attempt_at BIGINT -- Backoff delay (don't prove before this epoch)
);
-- Key query: Active data sets for proving
SELECT * FROM pdp_data_sets
WHERE terminated_at_epoch IS NULL
AND init_ready = TRUE
AND (next_prove_attempt_at IS NULL OR next_prove_attempt_at <= $current_epoch);
pdp_data_set_pieces (Pieces in data sets)
CREATE TABLE pdp_data_set_pieces (
data_set BIGINT NOT NULL, -- FK to pdp_data_sets.id
piece TEXT NOT NULL, -- PieceCID v2 string
add_message_hash TEXT NOT NULL, -- TX that added this piece
add_message_index BIGINT NOT NULL, -- Index in batch add
piece_id BIGINT NOT NULL, -- On-chain piece ID within data set
sub_piece TEXT NOT NULL, -- Sub-piece CID (same as piece if no aggregation)
sub_piece_offset BIGINT NOT NULL, -- Byte offset in aggregated piece
sub_piece_size BIGINT NOT NULL, -- Padded size in bytes
pdp_pieceref BIGINT NOT NULL, -- FK to pdp_piecerefs.id
rm_message_hash TEXT DEFAULT NULL, -- Removal tx hash (if scheduled)
removed BOOLEAN DEFAULT FALSE, -- Scheduled for removal
PRIMARY KEY (data_set, piece_id, sub_piece_offset)
);
-- Key query: Pieces in a data set
SELECT * FROM pdp_data_set_pieces WHERE data_set = $data_set_id AND removed = FALSE;
-- Key query: Pieces pending removal
SELECT * FROM pdp_data_set_pieces WHERE removed = TRUE AND rm_message_hash IS NOT NULL;
pdp_piecerefs (Local piece storage references)
CREATE TABLE pdp_piecerefs (
id BIGSERIAL PRIMARY KEY,
service TEXT NOT NULL, -- FK to pdp_services
piece_cid TEXT NOT NULL, -- PieceCID v2
piece_ref BIGINT NOT NULL, -- FK to parked_piece_refs.ref_id (actual storage)
created_at TIMESTAMP WITH TIME ZONE,
data_set_refcount BIGINT DEFAULT 0, -- Maintained by triggers (how many data sets use this)
-- Added by 20251004-pdp-indexing.sql:
indexing_task_id BIGINT DEFAULT NULL, -- Current indexing task
needs_indexing BOOLEAN DEFAULT FALSE, -- Queued for indexing
ipni_task_id BIGINT DEFAULT NULL, -- Current IPNI announcement task
needs_ipni BOOLEAN DEFAULT FALSE -- Queued for IPNI
);
-- Key query: Orphaned pieces (refcount 0, can be cleaned)
SELECT * FROM pdp_piecerefs WHERE data_set_refcount = 0;
-- Key query: Pieces needing indexing
SELECT * FROM pdp_piecerefs WHERE needs_indexing = TRUE AND indexing_task_id IS NULL;
pdp_delete_data_set (Cleanup workflow state)
CREATE TABLE pdp_delete_data_set (
id BIGINT PRIMARY KEY, -- Data set ID being deleted
terminate_service_task_id BIGINT, -- FK to harmony_task for TerminateFWSSTask
after_terminate_service BOOLEAN DEFAULT FALSE, -- FWSS.terminateService() completed
terminate_tx_hash TEXT, -- Termination tx
service_termination_epoch BIGINT, -- When termination happened (pdpEndEpoch)
delete_data_set_task_id BIGINT NOT NULL, -- FK to harmony_task for DeleteDataSetTask
after_delete_data_set BOOLEAN DEFAULT FALSE, -- PDPVerifier.deleteDataSet() completed
delete_tx_hash TEXT, -- Deletion tx
terminated BOOLEAN DEFAULT FALSE -- Full cleanup done (cascade delete triggered)
);
-- Lifecycle states:
-- 1. Entry created: after_terminate_service=FALSE, after_delete_data_set=FALSE
-- 2. FWSS terminated: after_terminate_service=TRUE (or skipped if pdpEndEpoch != 0)
-- 3. PDPVerifier deleted: after_delete_data_set=TRUE, delete_tx_hash set
-- 4. Cascade complete: terminated=TRUE (then row can be deleted)
-- Key query: Stuck in termination phase
SELECT * FROM pdp_delete_data_set
WHERE after_terminate_service = FALSE
AND terminate_service_task_id IS NULL;
-- Key query: Stuck in deletion phase
SELECT * FROM pdp_delete_data_set
WHERE after_terminate_service = TRUE
AND after_delete_data_set = FALSE
AND delete_data_set_task_id IS NULL;
message_waits_eth (Transaction tracking)
-- Tracks all ETH transactions sent by Curio
CREATE TABLE message_waits_eth (
signed_tx_hash TEXT PRIMARY KEY,
tx_status TEXT, -- 'pending', 'confirmed', 'failed'
tx_success BOOLEAN, -- Did execution succeed
-- ... other fields
);
-- Key query: Stale pending transactions
SELECT * FROM message_waits_eth
WHERE tx_status = 'pending'
AND created_at < NOW() - INTERVAL '1 hour';
On-Chain: PDPVerifier
Contract binding: pdp/contract/PDPVerifier.go (generated from ABI)
// Instantiation
pdpAddr := contract.ContractAddresses().PDPVerifier
verifier, err := contract.NewPDPVerifier(pdpAddr, ethClient)
// Key read methods (all use &bind.CallOpts{Context: ctx})
verifier.DataSetLive(opts, big.NewInt(setId)) // bool - Is data set active
verifier.GetDataSetLeafCount(opts, big.NewInt(setId)) // *big.Int - Total 32-byte leaves
verifier.GetNextPieceId(opts, big.NewInt(setId)) // *big.Int - Next piece slot
verifier.GetNextChallengeEpoch(opts, big.NewInt(setId)) // *big.Int - When to prove next
verifier.GetDataSetListener(opts, big.NewInt(setId)) // common.Address - Listener (FWSS?)
verifier.GetDataSetStorageProvider(opts, big.NewInt(setId)) // (address, address) - SP, proposed
verifier.GetDataSetLastProvenEpoch(opts, big.NewInt(setId)) // *big.Int - Last proven epoch
verifier.GetPieceCid(opts, big.NewInt(setId), big.NewInt(pieceId)) // Cid struct
verifier.GetPieceLeafCount(opts, big.NewInt(setId), big.NewInt(pieceId)) // *big.Int - Leaves in piece
verifier.PieceLive(opts, big.NewInt(setId), big.NewInt(pieceId)) // bool - Is piece active
verifier.GetScheduledRemovals(opts, big.NewInt(setId)) // []*big.Int - Pending piece removals
Leaf count to bytes: Each leaf is 32 bytes, so leafCount * 32 = bytes.
On-Chain: FWSS (if listener is FWSS)
Contract binding: pdp/contract/FilecoinWarmStorageService.go (main) and FilecoinWarmStorageServiceStateView.go (read ops)
// Instantiation (use StateView for read operations)
fwssAddr := contract.ContractAddresses().AllowedPublicRecordKeepers.FWSService
fwssv, err := contract.NewFilecoinWarmStorageServiceStateView(fwssAddr, ethClient)
// Key read methods
fwssv.ProvenThisPeriod(opts, big.NewInt(dataSetId)) // bool - Proven current period
fwssv.ProvenPeriods(opts, big.NewInt(dataSetId), big.NewInt(periodId)) // bool - Proven specific
fwssv.ProvingActivationEpoch(opts, big.NewInt(dataSetId)) // *big.Int - When proving started
fwssv.ProvingDeadline(opts, big.NewInt(setId)) // *big.Int - Deadline for current period
fwssv.RailToDataSet(opts, big.NewInt(railId)) // *big.Int - Reverse lookup
// Get proving configuration (don't hardcode!)
config, _ := fwssv.GetPDPConfig(opts)
// Returns: maxProvingPeriod, challengeWindow, challengesPerProof
// Check if listener is FWSS
listener, _ := verifier.GetDataSetListener(opts, big.NewInt(setId))
isFWSS := listener == fwssAddr
Detecting FWSS: Compare the listener address to the known FWSS address from contract.ContractAddresses().
On-Chain: FilecoinPay
Contract binding: lib/filecoinpayment/Payments.go (generated from ABI)
// Instantiation
paymentAddr, err := filecoinpayment.PaymentContractAddress()
payment, err := filecoinpayment.NewPayments(paymentAddr, ethClient)
// Get payment contract address from FWSS (if needed)
// paymentsAddr := fwss.PaymentsContractAddress(opts)
// Key read methods
payment.GetRail(opts, big.NewInt(railId)) // RailView struct
// RailView struct fields:
type RailView struct {
Token common.Address // ERC20 token (USDFC)
From common.Address // Payer (client)
To common.Address // Payee (SP)
Rate *big.Int // Tokens per epoch
SettledUpTo *big.Int // Last settled epoch
EndEpoch *big.Int // 0 if active, >0 if terminated
LockupPeriod *big.Int // Lockup duration in epochs
LockupFixed *big.Int // Fixed lockup amount
LockupRate *big.Int // Rate-based lockup
Operator common.Address // Who controls the rail (FWSS)
Validator common.Address // Payment validator (FWSS or 0x0)
}
// Get rails for payee
tokenAddr, _ := contract.USDFCAddress()
rails, err := payment.GetRailsForPayeeAndToken(opts, payeeAddr, tokenAddr, big.NewInt(0), big.NewInt(0))
// Returns: RailInfo[] with {RailId, IsTerminated, EndEpoch}
Rail states:
- EndEpoch == 0: Active rail (streaming payments)
- EndEpoch > 0 && SettledUpTo < EndEpoch: Terminated but not fully settled
- EndEpoch > 0 && SettledUpTo >= EndEpoch: Terminated and fully settled (ready for cleanup)
Doctor Checks
1. Data Set Liveness
Local vs On-Chain Liveness:
For each local data set in pdp_data_sets:
- If dataSetLive(id) = false AND terminated_at_epoch IS NULL:
→ STALE: On-chain dead but local doesn't know
- If dataSetLive(id) = true AND terminated_at_epoch IS NOT NULL:
→ PREMATURE: Local thinks terminated but on-chain is live
- If dataSetLive(id) = false AND not in pdp_delete_data_set:
→ STUCK: Dead on-chain but cleanup not triggered
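A minimal sketch of this liveness check, assuming a pgx connection pool and the PDPVerifier binding shown above (imports: context, fmt, math/big, github.com/jackc/pgx/v5/pgxpool, github.com/ethereum/go-ethereum/accounts/abi/bind, github.com/filecoin-project/curio/pdp/contract). Illustrative only, not the actual doctor code:
func checkDataSetLiveness(ctx context.Context, db *pgxpool.Pool, verifier *contract.PDPVerifier) ([]string, error) {
	// Pull every local data set plus whether it already has a cleanup entry.
	rows, err := db.Query(ctx, `
		SELECT ds.id, ds.terminated_at_epoch,
		       EXISTS (SELECT 1 FROM pdp_delete_data_set d WHERE d.id = ds.id) AS in_cleanup
		FROM pdp_data_sets ds`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var findings []string
	opts := &bind.CallOpts{Context: ctx}
	for rows.Next() {
		var id int64
		var terminatedAt *int64
		var inCleanup bool
		if err := rows.Scan(&id, &terminatedAt, &inCleanup); err != nil {
			return nil, err
		}
		live, err := verifier.DataSetLive(opts, big.NewInt(id))
		if err != nil {
			return nil, fmt.Errorf("dataSetLive(%d): %w", id, err)
		}
		if !live && terminatedAt == nil {
			findings = append(findings, fmt.Sprintf("data set %d: STALE (dead on-chain, local DB unaware)", id))
		}
		if live && terminatedAt != nil {
			findings = append(findings, fmt.Sprintf("data set %d: PREMATURE (local terminated, on-chain live)", id))
		}
		if !live && !inCleanup {
			findings = append(findings, fmt.Sprintf("data set %d: STUCK (dead on-chain, cleanup not triggered)", id))
		}
	}
	return findings, rows.Err()
}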
Orphaned On-Chain Data Sets:
Query PDPVerifier for all data sets owned by SP
Compare against pdp_data_sets
Any on-chain data set not in local DB → ORPHAN (may be from another node)
2. Piece Consistency
Piece-by-Piece Verification:
For each piece in pdp_data_set_pieces:
- pieceLive(setId, pieceId) should match (not removed)
- getPieceCid(setId, pieceId) should match local piece column
- getPieceLeafCount(setId, pieceId) should match sub_piece_size / 32
Scheduled Removals:
getScheduledRemovals(setId) from chain
Compare against local pieces with removed = TRUE
- On-chain has removal not in local → LOCAL_BEHIND
- Local has removal not on-chain → TX_PENDING or FAILED
Reference Count Integrity:
For each pdp_pieceref:
Actual count = SELECT COUNT(*) FROM pdp_data_set_pieces WHERE pdp_pieceref = ?
If data_set_refcount != actual count → REFCOUNT_MISMATCH
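A sketch of the refcount check as a single SQL pass from Go (pgx assumed, as above). Whether removed rows still count toward the trigger-maintained data_set_refcount is an assumption here; verify against the actual triggers:
func checkRefcounts(ctx context.Context, db *pgxpool.Pool) ([]string, error) {
	// Compare the trigger-maintained counter against the actual number of referencing rows.
	rows, err := db.Query(ctx, `
		SELECT pr.id, pr.data_set_refcount, COUNT(dsp.pdp_pieceref) AS actual
		FROM pdp_piecerefs pr
		LEFT JOIN pdp_data_set_pieces dsp ON dsp.pdp_pieceref = pr.id
		GROUP BY pr.id, pr.data_set_refcount
		HAVING pr.data_set_refcount <> COUNT(dsp.pdp_pieceref)`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var findings []string
	for rows.Next() {
		var id, expected, actual int64
		if err := rows.Scan(&id, &expected, &actual); err != nil {
			return nil, err
		}
		findings = append(findings, fmt.Sprintf("pdp_pieceref %d: REFCOUNT_MISMATCH (expected %d, actual %d)", id, expected, actual))
	}
	return findings, rows.Err()
}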
3. Proving State
Challenge Window Sync:
For each active data set:
on_chain_next = getNextChallengeEpoch(setId)
local_next = prove_at_epoch
If significantly different → CHALLENGE_DESYNC
Proving History (if FWSS):
last_proven = getDataSetLastProvenEpoch(setId)
current_deadline = provingDeadline(setId)
proven_this_period = provenThisPeriod(dataSetId)
If current_epoch > current_deadline AND !proven_this_period:
→ MISSED_DEADLINE (data set at risk)
If consecutive_prove_failures > 0 but proven_this_period:
→ STALE_FAILURE_COUNT (should have been reset)
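A sketch of the deadline check for a single FWSS-listened data set. Method names and return types are taken from the StateView listing above (assumed to return (value, error) pairs); consecutiveFailures comes from pdp_data_sets.consecutive_prove_failures:
func checkProvingDeadline(ctx context.Context, fwssv *contract.FilecoinWarmStorageServiceStateView, setID, currentEpoch, consecutiveFailures int64) ([]string, error) {
	opts := &bind.CallOpts{Context: ctx}
	deadline, err := fwssv.ProvingDeadline(opts, big.NewInt(setID)) // *big.Int
	if err != nil {
		return nil, err
	}
	proven, err := fwssv.ProvenThisPeriod(opts, big.NewInt(setID)) // bool
	if err != nil {
		return nil, err
	}
	var findings []string
	if big.NewInt(currentEpoch).Cmp(deadline) > 0 && !proven {
		findings = append(findings, fmt.Sprintf("data set %d: MISSED_DEADLINE (deadline %v, current %d)", setID, deadline, currentEpoch))
	}
	if consecutiveFailures > 0 && proven {
		findings = append(findings, fmt.Sprintf("data set %d: STALE_FAILURE_COUNT (%d failures recorded but period proven)", setID, consecutiveFailures))
	}
	return findings, nil
}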
Recent Proving History Check (FWSS stores proven periods as bitmap):
Implementation Note: The provenPeriods mapping is marked private in FilecoinWarmStorageService.sol, but IS accessible via FilecoinWarmStorageServiceStateView. The StateView uses extsload (external storage load) to read directly from the main contract's storage slots, bypassing Solidity's visibility restrictions. See FilecoinWarmStorageServiceStateLibrary.sol:161-173 for the implementation.
// Get config from FWSS (don't hardcode!)
config, _ := fwssv.GetPDPConfig(opts)
maxProvingPeriod := config.MaxProvingPeriod // e.g., 2880 mainnet, 480 calibnet
// Period ID calculation
activationEpoch, _ := fwssv.ProvingActivationEpoch(opts, big.NewInt(dataSetId)) // *big.Int
currentPeriod := (currentBlock - activationEpoch.Int64()) / int64(maxProvingPeriod)
// Check last N periods (via StateView.provenPeriods)
for periodId := currentPeriod - 5; periodId < currentPeriod; periodId++ {
proven, _ := fwssv.ProvenPeriods(opts, big.NewInt(dataSetId), big.NewInt(periodId))
if !proven {
→ MISSED_PERIOD (periodId)
}
}
Period Deadline: activationEpoch + (periodId + 1) * maxProvingPeriod
4. Settlement Health (FWSS data sets)
Background: Lockup period is 30 days (DEFAULT_LOCKUP_PERIOD = 86400 epochs). Settlement task (tasks/pay/settle_task.go) runs every 12 hours (time.Hour*12), but only settles rails that need it.
Settlement Logic (from lib/filecoinpayment/utils.go:SettleLockupPeriod()):
// Active rails: settle if approaching lockup limit
threshold := settledUpTo + lockupPeriod
thresholdWithGrace := threshold - EpochsInDay // 1 day buffer
if thresholdWithGrace < currentBlock {
toSettle = append(toSettle, rail)
}
// Terminated rails: settle if not fully settled
if rail.endEpoch > 0 && settledUpTo < endEpoch {
toSettle = append(toSettle, rail)
}
What This Means: Active rails get settled when they fall ~29 days behind (the lockup period minus a 1-day buffer). This is normal and by design - no point settling more frequently.
Rail Settlement Status (for doctor):
rail = getRail(railId)
lockup_period = 30 days = 86400 epochs
If rail.endEpoch == 0 (active rail):
settlement_lag = current_epoch - rail.settledUpTo
threshold = lockup_period - EpochsInDay # ~29 days
# Expected: lag can be up to ~29 days before settlement triggers
If settlement_lag > lockup_period → SETTLEMENT_CRITICAL (should have been settled)
If settlement_lag > lockup_period - EpochsInDay → SETTLEMENT_DUE (should settle soon)
Otherwise → OK (lag is normal)
If rail.endEpoch != 0 (terminated rail):
If rail.settledUpTo < rail.endEpoch → UNSETTLED_TERMINATED (needs settlement)
If rail.settledUpTo >= rail.endEpoch → READY_FOR_CLEANUP
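The classification above as a small Go helper, using the RailView fields listed earlier. The exact generated type name for RailView and the EpochsInDay value (2880) are assumptions taken from the surrounding discussion:
const epochsInDay = 2880                    // assumption: 30s epochs
const lockupPeriodEpochs = 30 * epochsInDay // DEFAULT_LOCKUP_PERIOD = 86400

func classifyRail(rail RailView, currentEpoch int64) string {
	settledUpTo := rail.SettledUpTo.Int64()
	endEpoch := rail.EndEpoch.Int64()
	if endEpoch == 0 { // active rail
		lag := currentEpoch - settledUpTo
		switch {
		case lag > lockupPeriodEpochs:
			return "SETTLEMENT_CRITICAL"
		case lag > lockupPeriodEpochs-epochsInDay:
			return "SETTLEMENT_DUE"
		default:
			return "OK"
		}
	}
	if settledUpTo < endEpoch { // terminated, not fully settled
		return "UNSETTLED_TERMINATED"
	}
	return "READY_FOR_CLEANUP"
}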
Settlement Batching: Uses Multicall3 to batch up to 10 settleRail() calls per transaction for gas efficiency. If one rail in batch fails, entire batch reverts (the terminated dataset problem).
5. IPNI Indexing Consistency
How Metadata Works (from pdp/indexing.go and pdp/contract/utils.go):
// Check if data set has withIPFSIndexing metadata
// Uses CheckIfIndexingNeeded() from pdp/indexing.go
func CheckIfIndexingNeeded(ethClient *ethclient.Client, dataSetId uint64) (bool, error) {
pdpVerifier, _ := contract.NewPDPVerifier(contract.ContractAddresses().PDPVerifier, ethClient)
listenerAddr, _ := pdpVerifier.GetDataSetListener(nil, big.NewInt(dataSetId))
// GetDataSetMetadataAtKey handles FWSS StateView contract
mustIndex, _, _ := contract.GetDataSetMetadataAtKey(listenerAddr, ethClient, dataSetId, "withIPFSIndexing")
return mustIndex, nil
}
Metadata Storage: Stored on-chain in listener contract (FWSS). Retrieved via getDataSetMetadata(dataSetId, key) on the StateView contract.
Metadata Key: "withIPFSIndexing" (from synapse-sdk/src/utils/constants.ts:METADATA_KEYS.WITH_IPFS_INDEXING)
Doctor Checks:
For each data set:
has_indexing_flag = CheckIfIndexingNeeded(ethClient, dataSetId)
For each piece in data set (pdp_data_set_pieces):
Get pdp_pieceref via pdp_pieceref FK
If has_indexing_flag AND needs_indexing = FALSE AND indexing_task_id IS NULL:
→ INDEXING_NOT_TRIGGERED
If needs_indexing = TRUE for > N hours:
→ INDEXING_STALLED
If needs_ipni = TRUE for > N hours:
→ IPNI_ANNOUNCEMENT_STALLED
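A sketch of the INDEXING_NOT_TRIGGERED check for one data set, combining CheckIfIndexingNeeded() (shown above) with the local flags. The stall checks ("> N hours") need a timestamp for when needs_indexing/needs_ipni were set, so they are left out here; pgx assumed as in the earlier sketches:
func checkIndexingTriggered(ctx context.Context, db *pgxpool.Pool, ethClient *ethclient.Client, dataSetID uint64) ([]string, error) {
	mustIndex, err := CheckIfIndexingNeeded(ethClient, dataSetID)
	if err != nil || !mustIndex {
		return nil, err // no withIPFSIndexing flag: nothing to check
	}
	rows, err := db.Query(ctx, `
		SELECT pr.id, pr.piece_cid
		FROM pdp_data_set_pieces dsp
		JOIN pdp_piecerefs pr ON pr.id = dsp.pdp_pieceref
		WHERE dsp.data_set = $1
		  AND dsp.removed = FALSE
		  AND pr.needs_indexing = FALSE
		  AND pr.indexing_task_id IS NULL`, int64(dataSetID))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var findings []string
	for rows.Next() {
		var refID int64
		var pieceCid string
		if err := rows.Scan(&refID, &pieceCid); err != nil {
			return nil, err
		}
		findings = append(findings, fmt.Sprintf("data set %d piece %s (ref %d): INDEXING_NOT_TRIGGERED", dataSetID, pieceCid, refID))
	}
	return findings, rows.Err()
}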
Future Enhancement: Query IPNI infrastructure directly to verify content is actually discoverable. IPNI announcements are dual:
- PieceCIDs - the piece identifiers themselves
- IPLD blocks - content blocks extracted from CAR files within pieces (when withIPFSIndexing is set)
Verification would query IPNI endpoints (e.g., cid.contact) to confirm both PieceCIDs and extracted block CIDs appear in the index. More complex but would catch cases where announcements succeed locally but fail to propagate.
6. Cleanup Flow Health
Stuck Deletions:
For each entry in pdp_delete_data_set:
If after_terminate_service = FALSE:
If terminate_service_task_id IS NULL:
→ TERMINATE_NOT_SCHEDULED
If task exists but old:
→ TERMINATE_TASK_STALLED
If after_terminate_service = TRUE AND after_delete_data_set = FALSE:
If delete_data_set_task_id IS NULL:
→ DELETE_NOT_SCHEDULED
If delete_tx_hash IS NOT NULL but after_delete_data_set = FALSE:
→ DELETE_TX_PENDING (check message_waits_eth)
If after_delete_data_set = TRUE AND terminated = FALSE:
→ CASCADE_DELETE_PENDING
If dataSetLive(id) = false AND terminated = FALSE:
elapsed = current_epoch - service_termination_epoch
If elapsed > EXPECTED_CLEANUP_TIME:
→ CLEANUP_STALLED
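A sketch of the stuck-deletion checks driven by one query over pdp_delete_data_set (columns as in the schema above). The task-staleness and CLEANUP_STALLED rules need extra inputs (harmony_task age, current epoch) and are omitted here:
func checkCleanupFlow(ctx context.Context, db *pgxpool.Pool) ([]string, error) {
	rows, err := db.Query(ctx, `
		SELECT id, after_terminate_service, terminate_service_task_id,
		       after_delete_data_set, delete_data_set_task_id,
		       delete_tx_hash, terminated
		FROM pdp_delete_data_set`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var findings []string
	for rows.Next() {
		var id int64
		var afterTerm, afterDelete, terminated bool
		var termTaskID, deleteTaskID *int64
		var deleteTxHash *string
		if err := rows.Scan(&id, &afterTerm, &termTaskID, &afterDelete, &deleteTaskID, &deleteTxHash, &terminated); err != nil {
			return nil, err
		}
		switch {
		case !afterTerm && termTaskID == nil:
			findings = append(findings, fmt.Sprintf("data set %d: TERMINATE_NOT_SCHEDULED", id))
		case afterTerm && !afterDelete && deleteTaskID == nil:
			findings = append(findings, fmt.Sprintf("data set %d: DELETE_NOT_SCHEDULED", id))
		case afterTerm && !afterDelete && deleteTxHash != nil:
			findings = append(findings, fmt.Sprintf("data set %d: DELETE_TX_PENDING (check message_waits_eth)", id))
		case afterDelete && !terminated:
			findings = append(findings, fmt.Sprintf("data set %d: CASCADE_DELETE_PENDING", id))
		}
	}
	return findings, rows.Err()
}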
7. Transaction State
Pending Transactions:
SELECT * FROM message_waits_eth
WHERE tx_status = 'pending'
AND created_at < NOW() - INTERVAL '1 hour'
For each:
→ STALE_PENDING_TX (may need retry or investigation)
Failed Create/Add Messages:
SELECT * FROM pdp_data_set_creates WHERE ok = FALSE
SELECT * FROM pdp_data_set_piece_adds WHERE add_message_ok = FALSE
For each:
→ FAILED_TX (needs attention)
8. Wallet/Gas Health
PDP Wallet Balance:
// Get PDP wallet address from DB
var privateKey []byte
db.QueryRow(ctx, `SELECT private_key FROM eth_keys WHERE role = 'pdp'`).Scan(&privateKey)
pdpKey, _ := crypto.ToECDSA(privateKey)
wallet := crypto.PubkeyToAddress(pdpKey.PublicKey)
// Check balance
balance, _ := ethClient.BalanceAt(ctx, wallet, nil)
// Warn if balance is getting low
// Note: CreateDataSet has sybil fee (0.1 FIL), plus gas for proving, settlements, etc.
// Don't hardcode threshold - make configurable or just report balance
if balance.Cmp(configuredThreshold) < 0 {
→ LOW_GAS_BALANCE (balance: {balance}, may affect operations)
}
Pending Nonce Gap:
pendingNonce, _ := ethClient.PendingNonceAt(ctx, wallet)
confirmedNonce, _ := ethClient.NonceAt(ctx, wallet, nil)
if pendingNonce - confirmedNonce > 10 {
→ NONCE_GAP (many pending transactions, possible stuck queue)
}
9. Harmony Task Health
Schema Reference (harmony_task_history):
-- Columns: id, task_id, name, posted, work_start, work_end, result, err, completed_by_host_and_port
-- result = TRUE means success, FALSE means failure
-- err contains error message on failure
Stuck Tasks (in harmony_task, not yet completed):
SELECT name, COUNT(*) as stuck_count,
MIN(posted_time) as oldest_stuck
FROM harmony_task
WHERE owner_id IS NOT NULL
AND posted_time < NOW() - INTERVAL '2 hours'
GROUP BY name
-- Tasks stuck for hours may indicate node crash or deadlock
For each with stuck_count > 0:
→ STUCK_TASK (task type: {name}, count: {stuck_count}, oldest: {oldest_stuck})
Recent Failure Summary (from harmony_task_history):
-- PDP-related task failure counts in last 24h
SELECT name,
COUNT(*) FILTER (WHERE result = FALSE) as failures,
COUNT(*) FILTER (WHERE result = TRUE) as successes,
COUNT(*) as total
FROM harmony_task_history
WHERE name IN ('Prove', 'InitPP', 'NextPP', 'TermFWSS', 'DelDS')
AND work_end > NOW() - INTERVAL '24 hours'
GROUP BY name
-- High failure rate is concerning
For each where failures > successes:
→ HIGH_FAILURE_RATE (task: {name}, failures: {failures}/{total})
Repeated Error Patterns:
SELECT name, err, COUNT(*) as fail_count,
MAX(work_end) as last_failure
FROM harmony_task_history
WHERE result = FALSE
AND err IS NOT NULL
AND work_end > NOW() - INTERVAL '24 hours'
GROUP BY name, err
HAVING COUNT(*) > 3
ORDER BY fail_count DESC
For each:
→ REPEATED_FAILURE (task: {name}, count: {fail_count}, error: {err})
Task Timing Analysis (spot slow tasks):
SELECT name,
AVG(EXTRACT(EPOCH FROM (work_end - work_start))) as avg_duration_sec,
MAX(EXTRACT(EPOCH FROM (work_end - work_start))) as max_duration_sec
FROM harmony_task_history
WHERE result = TRUE
AND work_end > NOW() - INTERVAL '24 hours'
AND name IN ('Prove', 'InitPP', 'NextPP')
GROUP BY name
-- Prove tasks taking >5 min may indicate storage issues
For each where avg_duration_sec > 300:
→ SLOW_TASK (task: {name}, avg: {avg_duration_sec}s)
Per-Node Health (if multi-node cluster):
SELECT completed_by_host_and_port as node,
COUNT(*) FILTER (WHERE result = FALSE) as failures,
COUNT(*) as total
FROM harmony_task_history
WHERE work_end > NOW() - INTERVAL '24 hours'
GROUP BY completed_by_host_and_port
-- One node with many more failures may have local issues
10. Provider Registration Health
Registry Status (optional but useful):
registryAddr, _ := contract.ServiceRegistryAddress()
registry, _ := contract.NewServiceProviderRegistry(registryAddr, ethClient)
// Get provider info for our wallet
providerInfo, err := registry.GetProviderByOwner(opts, wallet)
if err != nil {
→ NOT_REGISTERED (SP not registered in ServiceProviderRegistry)
}
// Check capabilities match expectations
capabilities, _ := registry.GetProductCapabilities(opts, providerInfo.ProviderId, 0) // ProductType 0 = PDP
// Compare serviceURL, pricing, etc. against local config
11. Piece Storage Spot Check
Verify Random Pieces Readable (optional, slower):
// Sample a few pieces to verify storage backend health
pieces := db.Query(`SELECT piece_cid, piece_ref FROM pdp_piecerefs
ORDER BY RANDOM() LIMIT 5`)
for each piece:
// Try to open piece from storage
reader, err := pieceStore.GetReader(pieceCid)
if err != nil {
→ PIECE_UNREADABLE (cid: {piece_cid}, error: {err})
}
// Optionally verify first/last bytes or CommP
Output Format
PDP Doctor Report
=================
Network: calibnet (chain ID 314159)
PDP Wallet: 0x1234...5678
Wallet Balance: 4.2 FIL
Data Sets Checked: 42
Pieces Checked: 1,847
Rails Checked: 38
CRITICAL (requires immediate attention):
- WALLET: LOW_GAS_BALANCE - 0.3 FIL remaining (operations may fail)
- Data Set 123: MISSED_DEADLINE - proving deadline passed, not proven
- Data Set 456: STALE - dead on-chain but local DB unaware
- TASK: STUCK_TASK - ProveTask stuck for 3 hours (count: 2)
WARNINGS:
- Data Set 789: SETTLEMENT_DUE - approaching lockup limit (28 days unsettled)
- Data Set 555: MISSED_PERIOD - period 42 not proven
- Piece xyz (Data Set 123): INDEXING_STALLED - needs_indexing=true for 48h
- Rail 42: UNSETTLED_TERMINATED - terminated but not fully settled
- TASK: REPEATED_FAILURE - NextProvingPeriodTask failed 7x in 24h
INFO:
- Data Set 111: Cleanup in progress (after_terminate_service=true)
- 3 pieces pending IPNI announcement
- Provider registered: ID 42, serviceURL matches config
Reference Count Issues:
- pdp_pieceref 999: expected 2, actual 1
Summary:
Critical: 4
Warnings: 5
Info: 3
Implementation Notes
Contract Calls
Simplest approach: Goroutines with bounded concurrency. For a diagnostic tool, this is sufficient:
sem := make(chan struct{}, 10) // limit concurrent RPC calls
var wg sync.WaitGroup
for _, setId := range dataSetIds {
wg.Add(1)
go func(id uint64) {
defer wg.Done()
sem <- struct{}{}
defer func() { <-sem }()
live, _ := pdpVerifier.DataSetLive(opts, big.NewInt(int64(id)))
// ... process result
}(setId)
}
wg.Wait()
Alternative: go-ethereum's rpc.Client supports BatchCall()/BatchCallContext() for JSON-RPC-level batching (multiple calls in one HTTP request) without Multicall3 complexity. The underlying RPC client is available via ethclient.Client().
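A sketch of that JSON-RPC batching approach applied to the dataSetLive checks (imports: github.com/ethereum/go-ethereum/rpc, github.com/ethereum/go-ethereum/common/hexutil). It assumes the standard abigen metadata (contract.PDPVerifierMetaData.GetAbi()) is exposed and that the Solidity method is named dataSetLive; adjust to the real ABI if those assumptions don't hold:
rpcClient := ethClient.Client() // underlying *rpc.Client
parsedABI, err := contract.PDPVerifierMetaData.GetAbi() // assumption: abigen-style metadata
if err != nil {
	return err
}
pdpAddr := contract.ContractAddresses().PDPVerifier
batch := make([]rpc.BatchElem, len(dataSetIds))
raw := make([]string, len(dataSetIds))
for i, id := range dataSetIds {
	callData, _ := parsedABI.Pack("dataSetLive", big.NewInt(int64(id)))
	batch[i] = rpc.BatchElem{
		Method: "eth_call",
		Args: []interface{}{
			map[string]interface{}{"to": pdpAddr.Hex(), "data": hexutil.Encode(callData)},
			"latest",
		},
		Result: &raw[i],
	}
}
if err := rpcClient.BatchCallContext(ctx, batch); err != nil {
	return err // whole batch failed (transport error)
}
for i, elem := range batch {
	if elem.Error != nil {
		continue // this individual call failed; report it separately
	}
	ret, _ := hexutil.Decode(raw[i])
	if out, err := parsedABI.Unpack("dataSetLive", ret); err == nil && len(out) == 1 {
		live, _ := out[0].(bool)
		_ = live // feed into the liveness checks
	}
}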
FWSS Detection and StateView Resolution
The FWSS main contract exposes read-only state via a separate StateView contract. To access StateView methods like provenPeriods(), getDataSet(), etc., you must first resolve the StateView address.
Pattern (from pdp/contract/utils.go:GetProvingScheduleFromListener()):
import "github.com/filecoin-project/curio/pdp/contract"
// Get listener address for a data set
listenerAddr, _ := pdpVerifier.GetDataSetListener(opts, big.NewInt(setId))
// Try to get the view contract address from the listener
viewAddr := listenerAddr // fallback
// Check if listener supports viewContractAddress method (FWSS does)
listenerService, err := contract.NewListenerServiceWithViewContract(listenerAddr, ethClient)
if err == nil {
va, err := listenerService.ViewContractAddress(nil)
if err == nil && va != (common.Address{}) {
viewAddr = va // Use the StateView contract
}
}
// Now use viewAddr for read operations
fwssv, _ := contract.NewFilecoinWarmStorageServiceStateView(viewAddr, ethClient)
ds, _ := fwssv.GetDataSet(opts, big.NewInt(setId))
proven, _ := fwssv.ProvenPeriods(opts, big.NewInt(setId), big.NewInt(periodId))
Helper functions in curio (from pdp/contract/utils.go):
- GetProvingScheduleFromListener() - Returns an IPDPProvingSchedule interface bound to the correct address
- GetDataSetMetadataAtKey() - Resolves the view contract and queries metadata
Detecting FWSS vs other listeners:
knownFWSSAddr := contract.ContractAddresses().AllowedPublicRecordKeepers.FWSService
listener, _ := pdpVerifier.GetDataSetListener(opts, big.NewInt(setId))
if listener == knownFWSSAddr {
// FWSS - can use StateView for provenPeriods, pricing, etc.
} else {
// Generic PDPListener - skip FWSS-specific checks
}
Rail Discovery
FWSS doesn't expose a direct rail lookup by data set. Options (a sketch of the second follows below):
- Store rail IDs locally when created
- Query FilecoinPay getRailsForPayerAndToken() and filter
- Add a dataSetToRail() view to the FWSS contract
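A sketch of the second option; it uses the payee-side lookup shown in the FilecoinPay section (GetRailsForPayeeAndToken, since the SP is the rail payee) plus the StateView's railToDataSet reverse lookup. The RailInfo field names and the meaning of a zero result from railToDataSet are assumptions:
opts := &bind.CallOpts{Context: ctx}
tokenAddr, _ := contract.USDFCAddress()
rails, err := payment.GetRailsForPayeeAndToken(opts, payeeAddr, tokenAddr, big.NewInt(0), big.NewInt(0))
if err != nil {
	return err
}
railToSet := map[uint64]uint64{}
for _, ri := range rails {
	setID, err := fwssv.RailToDataSet(opts, ri.RailId) // reverse lookup on FWSS StateView
	if err != nil || setID.Sign() == 0 {
		continue // assumption: zero means the rail is not known to FWSS
	}
	railToSet[ri.RailId.Uint64()] = setID.Uint64()
}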
CLI Interface
Standalone version (initial implementation):
# Full check (requires DB and RPC connection)
pdp-doctor --db "postgres://..." --rpc "https://api.calibration.node.glif.io/rpc/v1"
# Check specific data set
pdp-doctor --db "..." --rpc "..." --data-set 123
# JSON output for automation
pdp-doctor --db "..." --rpc "..." --json
# Only show problems
pdp-doctor --db "..." --rpc "..." --problems-only
# Check specific subsystem
pdp-doctor --db "..." --rpc "..." --check settlement
pdp-doctor --db "..." --rpc "..." --check proving
pdp-doctor --db "..." --rpc "..." --check cleanup
Integrated version (future, if merged into curio):
# Uses curio's existing config for DB/RPC
curio pdp doctor
# Same flags as standalone
curio pdp doctor --data-set 123
curio pdp doctor --json
curio pdp doctor --problems-only
curio pdp doctor --check settlement
Configuration:
- DB connection: Same Yugabyte/PostgreSQL used by curio
- RPC endpoint: Filecoin RPC (Glif, local lotus, etc.)
- Network auto-detected from chain ID (314159 = calibnet, 314 = mainnet); see the detection sketch below
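A minimal sketch of the chain-ID detection (ethclient's ChainID call is standard go-ethereum):
chainID, err := ethClient.ChainID(ctx)
if err != nil {
	return fmt.Errorf("detect network: %w", err)
}
var network string
switch chainID.Int64() {
case 314:
	network = "mainnet"
case 314159:
	network = "calibnet"
default:
	network = "devnet/unknown" // fall back to env vars / contract_addresses.json
}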
Priority Order
- Wallet/gas health - no gas = nothing works
- Data set liveness - most critical, affects proving
- Proving state - missed deadlines mean payment issues
- Harmony task health - stuck tasks block everything
- Settlement health - stale settlement means delayed payment
- Cleanup flow - stuck cleanups waste resources
- Provider registration - affects discoverability
- IPNI indexing - affects discoverability but not payments
- Piece storage - spot check, slow but catches storage issues
- Reference counts - internal consistency, less urgent