
PDP Doctor CMD DB state with On-Chain State #889

@rjan90

Description

Rough, advisory sketch of how to take a stab at the DB vs On-Chain state inconsistency problem, written by @rvagg:

Curio PDP Doctor Command Design

A diagnostic tool to check consistency between local DB state and on-chain state across PDPVerifier, FWSS, and FilecoinPay contracts.

Project Structure

Initial approach: Start as a standalone external project/tool. This allows rapid iteration without curio release cycles and keeps the diagnostic tooling decoupled from production code.

Future integration: May eventually be integrated into the curio CLI as a subcommand (e.g., curio pdp doctor) depending on quality and maintainer preference. Design with this in mind - use curio's existing contract bindings and database utilities where possible.

Dependencies (for standalone version):

  • github.com/filecoin-project/curio/pdp/contract - Contract bindings (PDPVerifier, FWSS, FilecoinPay)
  • github.com/filecoin-project/curio/harmony/harmonydb - Database connection utilities
  • github.com/ethereum/go-ethereum - Ethereum client
  • PostgreSQL driver for Yugabyte/Postgres connection

Context: Where This Fits

Repository: Initially standalone; eventual integration target is github.com/filecoin-project/curio (branch: pdpv0)

Related Files:

  • tasks/pdp/ - All PDP task implementations
  • tasks/pay/ - Settlement task (settle_task.go), watcher
  • pdp/ - HTTP API handlers
  • pdp/contract/ - Generated contract bindings
  • pdp/contract/addresses.go - Contract addresses per network
  • lib/chainsched/ - Chain event system
  • lib/filecoinpayment/ - Payment utilities, Multicall3 batching
  • harmony/harmonydb/ - Database utilities

Contract Addresses (from pdp/contract/addresses.go):

Contract         Calibnet                                     Mainnet
PDPVerifier      0x85e366Cf9DD2c0aE37E963d9556F5f4718d6417C   0xBADd0B92C1c71d02E7d520f64c0876538fa2557F
FWSS             0x02925630df557F957f70E112bA06e50965417CA0   0x8408502033C418E1bbC97cE9ac48E5528F371A9f
ServiceRegistry  0x839e5c9988e4e9977d40708d0094103c0839Ac9D   0xf55dDbf63F1b55c3F1D4FA7e339a68AB7b64A5eB
USDFC            0xb3042734b608a1B16e9e86B374A3f3e389B4cDf0   0x80B98d3aa09ffff255c3ba4A241111Ff1262F045

For devnet: addresses come from env vars (CURIO_DEVNET_PDP_VERIFIER_ADDRESS, etc.) or ~/.foc-localnet/state/latest/contract_addresses.json

FilecoinPay address: Discovered from FWSS via paymentsContractAddress() method

Data Sources

Local Database Tables

pdp_data_sets (Main data set tracking)

-- Schema (after all migrations)
CREATE TABLE pdp_data_sets (
    id BIGINT PRIMARY KEY,                          -- On-chain data set ID from PDPVerifier
    prev_challenge_request_epoch BIGINT,            -- Last challenge request
    challenge_request_task_id BIGINT,               -- FK to harmony_task
    challenge_request_msg_hash TEXT,                -- Pending nextProvingPeriod tx
    proving_period BIGINT,                          -- Epochs between proofs (e.g., 2880 = 1 day)
    challenge_window BIGINT,                        -- Epochs to submit proof
    prove_at_epoch BIGINT,                          -- Next challenge window start
    init_ready BOOLEAN DEFAULT FALSE,               -- First piece added, can init proving
    create_message_hash TEXT NOT NULL,              -- Data set creation tx
    service TEXT NOT NULL,                          -- FK to pdp_services.service_label
    -- Added by 20260110-pdp-termination-handling.sql:
    terminated_at_epoch BIGINT,                     -- Block when termination detected (NULL = active)
    consecutive_prove_failures INT DEFAULT 0,       -- Backoff counter
    next_prove_attempt_at BIGINT                    -- Backoff delay (don't prove before this epoch)
);

-- Key query: Active data sets for proving
SELECT * FROM pdp_data_sets
WHERE terminated_at_epoch IS NULL
  AND init_ready = TRUE
  AND (next_prove_attempt_at IS NULL OR next_prove_attempt_at <= $current_epoch);

pdp_data_set_pieces (Pieces in data sets)

CREATE TABLE pdp_data_set_pieces (
    data_set BIGINT NOT NULL,                       -- FK to pdp_data_sets.id
    piece TEXT NOT NULL,                            -- PieceCID v2 string
    add_message_hash TEXT NOT NULL,                 -- TX that added this piece
    add_message_index BIGINT NOT NULL,              -- Index in batch add
    piece_id BIGINT NOT NULL,                       -- On-chain piece ID within data set
    sub_piece TEXT NOT NULL,                        -- Sub-piece CID (same as piece if no aggregation)
    sub_piece_offset BIGINT NOT NULL,               -- Byte offset in aggregated piece
    sub_piece_size BIGINT NOT NULL,                 -- Padded size in bytes
    pdp_pieceref BIGINT NOT NULL,                   -- FK to pdp_piecerefs.id
    rm_message_hash TEXT DEFAULT NULL,              -- Removal tx hash (if scheduled)
    removed BOOLEAN DEFAULT FALSE,                  -- Scheduled for removal

    PRIMARY KEY (data_set, piece_id, sub_piece_offset)
);

-- Key query: Pieces in a data set
SELECT * FROM pdp_data_set_pieces WHERE data_set = $data_set_id AND removed = FALSE;

-- Key query: Pieces pending removal
SELECT * FROM pdp_data_set_pieces WHERE removed = TRUE AND rm_message_hash IS NOT NULL;

pdp_piecerefs (Local piece storage references)

CREATE TABLE pdp_piecerefs (
    id BIGSERIAL PRIMARY KEY,
    service TEXT NOT NULL,                          -- FK to pdp_services
    piece_cid TEXT NOT NULL,                        -- PieceCID v2
    piece_ref BIGINT NOT NULL,                      -- FK to parked_piece_refs.ref_id (actual storage)
    created_at TIMESTAMP WITH TIME ZONE,
    data_set_refcount BIGINT DEFAULT 0,             -- Maintained by triggers (how many data sets use this)
    -- Added by 20251004-pdp-indexing.sql:
    indexing_task_id BIGINT DEFAULT NULL,           -- Current indexing task
    needs_indexing BOOLEAN DEFAULT FALSE,           -- Queued for indexing
    ipni_task_id BIGINT DEFAULT NULL,               -- Current IPNI announcement task
    needs_ipni BOOLEAN DEFAULT FALSE                -- Queued for IPNI
);

-- Key query: Orphaned pieces (refcount 0, can be cleaned)
SELECT * FROM pdp_piecerefs WHERE data_set_refcount = 0;

-- Key query: Pieces needing indexing
SELECT * FROM pdp_piecerefs WHERE needs_indexing = TRUE AND indexing_task_id IS NULL;

pdp_delete_data_set (Cleanup workflow state)

CREATE TABLE pdp_delete_data_set (
    id BIGINT PRIMARY KEY,                          -- Data set ID being deleted

    terminate_service_task_id BIGINT,               -- FK to harmony_task for TerminateFWSSTask
    after_terminate_service BOOLEAN DEFAULT FALSE,  -- FWSS.terminateService() completed
    terminate_tx_hash TEXT,                         -- Termination tx

    service_termination_epoch BIGINT,               -- When termination happened (pdpEndEpoch)

    delete_data_set_task_id BIGINT NOT NULL,        -- FK to harmony_task for DeleteDataSetTask
    after_delete_data_set BOOLEAN DEFAULT FALSE,    -- PDPVerifier.deleteDataSet() completed
    delete_tx_hash TEXT,                            -- Deletion tx

    terminated BOOLEAN DEFAULT FALSE                -- Full cleanup done (cascade delete triggered)
);

-- Lifecycle states:
-- 1. Entry created: after_terminate_service=FALSE, after_delete_data_set=FALSE
-- 2. FWSS terminated: after_terminate_service=TRUE (or skipped if pdpEndEpoch != 0)
-- 3. PDPVerifier deleted: after_delete_data_set=TRUE, delete_tx_hash set
-- 4. Cascade complete: terminated=TRUE (then row can be deleted)

-- Key query: Stuck in termination phase
SELECT * FROM pdp_delete_data_set
WHERE after_terminate_service = FALSE
  AND terminate_service_task_id IS NULL;

-- Key query: Stuck in deletion phase
SELECT * FROM pdp_delete_data_set
WHERE after_terminate_service = TRUE
  AND after_delete_data_set = FALSE
  AND delete_data_set_task_id IS NULL;

message_waits_eth (Transaction tracking)

-- Tracks all ETH transactions sent by Curio
CREATE TABLE message_waits_eth (
    signed_tx_hash TEXT PRIMARY KEY,
    tx_status TEXT,                                 -- 'pending', 'confirmed', 'failed'
    tx_success BOOLEAN,                             -- Did execution succeed
    -- ... other fields
);

-- Key query: Stale pending transactions
SELECT * FROM message_waits_eth
WHERE tx_status = 'pending'
  AND created_at < NOW() - INTERVAL '1 hour';

On-Chain: PDPVerifier

Contract binding: pdp/contract/PDPVerifier.go (generated from ABI)

// Instantiation
pdpAddr := contract.ContractAddresses().PDPVerifier
verifier, err := contract.NewPDPVerifier(pdpAddr, ethClient)

// Key read methods (all use &bind.CallOpts{Context: ctx})
verifier.DataSetLive(opts, big.NewInt(setId))           // bool - Is data set active
verifier.GetDataSetLeafCount(opts, big.NewInt(setId))   // *big.Int - Total 32-byte leaves
verifier.GetNextPieceId(opts, big.NewInt(setId))        // *big.Int - Next piece slot
verifier.GetNextChallengeEpoch(opts, big.NewInt(setId)) // *big.Int - When to prove next
verifier.GetDataSetListener(opts, big.NewInt(setId))    // common.Address - Listener (FWSS?)
verifier.GetDataSetStorageProvider(opts, big.NewInt(setId)) // (address, address) - SP, proposed
verifier.GetDataSetLastProvenEpoch(opts, big.NewInt(setId)) // *big.Int - Last proven epoch
verifier.GetPieceCid(opts, big.NewInt(setId), big.NewInt(pieceId)) // Cid struct
verifier.GetPieceLeafCount(opts, big.NewInt(setId), big.NewInt(pieceId)) // *big.Int - Leaves in piece
verifier.PieceLive(opts, big.NewInt(setId), big.NewInt(pieceId)) // bool - Is piece active
verifier.GetScheduledRemovals(opts, big.NewInt(setId))  // []*big.Int - Pending piece removals

Leaf count to bytes: Each leaf is 32 bytes, so leafCount * 32 = bytes.
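
For the doctor, a tiny conversion helper keeps that comparison against DB byte sizes explicit (the name here is illustrative, not existing curio code):

// leavesToBytes converts an on-chain leaf count to padded bytes (32 bytes per leaf).
func leavesToBytes(leaves *big.Int) *big.Int {
    return new(big.Int).Mul(leaves, big.NewInt(32))
}

// e.g. getPieceLeafCount(setId, pieceId) converted this way should equal
// sub_piece_size in pdp_data_set_pieces.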

On-Chain: FWSS (if listener is FWSS)

Contract binding: pdp/contract/FilecoinWarmStorageService.go (main) and FilecoinWarmStorageServiceStateView.go (read ops)

// Instantiation (use StateView for read operations)
fwssAddr := contract.ContractAddresses().AllowedPublicRecordKeepers.FWSService
fwssv, err := contract.NewFilecoinWarmStorageServiceStateView(fwssAddr, ethClient)

// Key read methods
fwssv.ProvenThisPeriod(opts, big.NewInt(dataSetId))    // bool - Proven current period
fwssv.ProvenPeriods(opts, big.NewInt(dataSetId), big.NewInt(periodId)) // bool - Proven specific
fwssv.ProvingActivationEpoch(opts, big.NewInt(dataSetId)) // *big.Int - When proving started
fwssv.ProvingDeadline(opts, big.NewInt(setId))         // *big.Int - Deadline for current period
fwssv.RailToDataSet(opts, big.NewInt(railId))          // *big.Int - Reverse lookup

// Get proving configuration (don't hardcode!)
config, _ := fwssv.GetPDPConfig(opts)
// Returns: maxProvingPeriod, challengeWindow, challengesPerProof

// Check if listener is FWSS
listener := verifier.GetDataSetListener(opts, big.NewInt(setId))
isFWSS := listener == fwssAddr

Detecting FWSS: Compare listener address to known FWSS address from contract.ContractAddresses().

On-Chain: FilecoinPay

Contract binding: lib/filecoinpayment/Payments.go (generated from ABI)

// Instantiation
paymentAddr, err := filecoinpayment.PaymentContractAddress()
payment, err := filecoinpayment.NewPayments(paymentAddr, ethClient)

// Get payment contract address from FWSS (if needed)
// paymentsAddr := fwss.PaymentsContractAddress(opts)

// Key read methods
payment.GetRail(opts, big.NewInt(railId)) // RailView struct

// RailView struct fields:
type RailView struct {
    Token        common.Address  // ERC20 token (USDFC)
    From         common.Address  // Payer (client)
    To           common.Address  // Payee (SP)
    Rate         *big.Int        // Tokens per epoch
    SettledUpTo  *big.Int        // Last settled epoch
    EndEpoch     *big.Int        // 0 if active, >0 if terminated
    LockupPeriod *big.Int        // Lockup duration in epochs
    LockupFixed  *big.Int        // Fixed lockup amount
    LockupRate   *big.Int        // Rate-based lockup
    Operator     common.Address  // Who controls the rail (FWSS)
    Validator    common.Address  // Payment validator (FWSS or 0x0)
}

// Get rails for payee
tokenAddr, _ := contract.USDFCAddress()
rails, err := payment.GetRailsForPayeeAndToken(opts, payeeAddr, tokenAddr, big.NewInt(0), big.NewInt(0))
// Returns: RailInfo[] with {RailId, IsTerminated, EndEpoch}

Rail states:

  • EndEpoch == 0: Active rail (streaming payments)
  • EndEpoch > 0 && SettledUpTo < EndEpoch: Terminated but not fully settled
  • EndEpoch > 0 && SettledUpTo >= EndEpoch: Terminated and fully settled (ready for cleanup)
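
A small classifier over the RailView struct above captures these three states (the status names are the doctor's own, not contract values):

// railStatus classifies a rail by EndEpoch / SettledUpTo as described above.
func railStatus(rail RailView) string {
    switch {
    case rail.EndEpoch.Sign() == 0:
        return "ACTIVE" // streaming payments
    case rail.SettledUpTo.Cmp(rail.EndEpoch) < 0:
        return "TERMINATED_UNSETTLED" // needs settlement
    default:
        return "TERMINATED_SETTLED" // ready for cleanup
    }
}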

Doctor Checks

1. Data Set Liveness

Local vs On-Chain Liveness:

For each local data set in pdp_data_sets:
  - If dataSetLive(id) = false AND terminated_at_epoch IS NULL:
    → STALE: On-chain dead but local doesn't know
  - If dataSetLive(id) = true AND terminated_at_epoch IS NOT NULL:
    → PREMATURE: Local thinks terminated but on-chain is live
  - If dataSetLive(id) = false AND not in pdp_delete_data_set:
    → STUCK: Dead on-chain but cleanup not triggered

Orphaned On-Chain Data Sets:

Query PDPVerifier for all data sets owned by SP
Compare against pdp_data_sets
Any on-chain data set not in local DB → ORPHAN (may be from another node)
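
A rough Go sketch of check 1, assuming the `verifier` binding and a pgx-style `db` handle as in the other snippets here (`report()` is a hypothetical helper that records findings):

// Cross-check local pdp_data_sets rows against PDPVerifier.dataSetLive().
rows, _ := db.Query(ctx, `SELECT id, terminated_at_epoch FROM pdp_data_sets`)
defer rows.Close()
for rows.Next() {
    var id int64
    var terminatedAt *int64
    _ = rows.Scan(&id, &terminatedAt)

    live, err := verifier.DataSetLive(&bind.CallOpts{Context: ctx}, big.NewInt(id))
    if err != nil {
        continue // surface RPC errors separately
    }
    switch {
    case !live && terminatedAt == nil:
        report("STALE", id) // dead on-chain, local DB unaware
    case live && terminatedAt != nil:
        report("PREMATURE", id) // local thinks terminated, chain says live
    }
}
// Orphan detection is the inverse: enumerate the SP's on-chain data sets and
// flag any ID missing from pdp_data_sets.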

2. Piece Consistency

Piece-by-Piece Verification:

For each piece in pdp_data_set_pieces:
  - pieceLive(setId, pieceId) should match (not removed)
  - getPieceCid(setId, pieceId) should match local piece column
  - getPieceLeafCount(setId, pieceId) should match sub_piece_size / 32

Scheduled Removals:

getScheduledRemovals(setId) from chain
Compare against local pieces with removed = TRUE
- On-chain has removal not in local → LOCAL_BEHIND
- Local has removal not on-chain → TX_PENDING or FAILED

Reference Count Integrity:

For each pdp_pieceref:
  Actual count = SELECT COUNT(*) FROM pdp_data_set_pieces WHERE pdp_pieceref = ?
  If data_set_refcount != actual count → REFCOUNT_MISMATCH
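
The refcount check can be a single query rather than a per-row loop; a sketch (whether removed pieces should count depends on how the maintaining trigger is defined, so adjust the filter accordingly):

// Find pdp_piecerefs whose stored refcount disagrees with the actual piece rows.
rows, _ := db.Query(ctx, `
    SELECT pr.id, pr.data_set_refcount, COUNT(p.pdp_pieceref) AS actual
    FROM pdp_piecerefs pr
    LEFT JOIN pdp_data_set_pieces p ON p.pdp_pieceref = pr.id
    GROUP BY pr.id, pr.data_set_refcount
    HAVING pr.data_set_refcount != COUNT(p.pdp_pieceref)`)
for rows.Next() {
    var id, expected, actual int64
    _ = rows.Scan(&id, &expected, &actual)
    report("REFCOUNT_MISMATCH", id) // expected vs actual counts
}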

3. Proving State

Challenge Window Sync:

For each active data set:
  on_chain_next = getNextChallengeEpoch(setId)
  local_next = prove_at_epoch
  If significantly different → CHALLENGE_DESYNC

Proving History (if FWSS):

last_proven = getDataSetLastProvenEpoch(setId)
current_deadline = provingDeadline(setId)
proven_this_period = provenThisPeriod(dataSetId)

If current_epoch > current_deadline AND !proven_this_period:
  → MISSED_DEADLINE (data set at risk)

If consecutive_prove_failures > 0 but proven_this_period:
  → STALE_FAILURE_COUNT (should have been reset)
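
A sketch of the deadline check for a single FWSS data set (`currentEpoch` is the chain head height as a *big.Int; `consecutiveProveFailures` comes from pdp_data_sets; `report()` is a hypothetical helper):

// Flag missed proving deadlines and stale failure counters.
deadline, _ := fwssv.ProvingDeadline(opts, big.NewInt(setId))
provenNow, _ := fwssv.ProvenThisPeriod(opts, big.NewInt(setId))

if currentEpoch.Cmp(deadline) > 0 && !provenNow {
    report("MISSED_DEADLINE", setId) // data set at risk
}
if consecutiveProveFailures > 0 && provenNow {
    report("STALE_FAILURE_COUNT", setId) // backoff counter should have been reset
}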

Recent Proving History Check (FWSS stores proven periods as bitmap):

Implementation Note: The provenPeriods mapping is marked private in FilecoinWarmStorageService.sol, but IS accessible via FilecoinWarmStorageServiceStateView. The StateView uses extsload (external storage load) to read directly from the main contract's storage slots, bypassing Solidity's visibility restrictions. See FilecoinWarmStorageServiceStateLibrary.sol:161-173 for the implementation.

// Get config from FWSS (don't hardcode!)
config, _ := fwssv.GetPDPConfig(opts)
maxProvingPeriod := config.MaxProvingPeriod  // e.g., 2880 mainnet, 480 calibnet

// Period ID calculation
activationEpoch, _ := fwssv.ProvingActivationEpoch(opts, big.NewInt(dataSetId))
currentPeriod := (currentBlock - activationEpoch) / maxProvingPeriod

// Check last N periods (via StateView.provenPeriods)
for periodId := currentPeriod - 5; periodId < currentPeriod; periodId++ {
    proven, _ := fwssv.ProvenPeriods(opts, big.NewInt(dataSetId), big.NewInt(periodId))
    if !proven {
        → MISSED_PERIOD (periodId)
    }
}

Period Deadline: activationEpoch + (periodId + 1) * maxProvingPeriod

4. Settlement Health (FWSS data sets)

Background: Lockup period is 30 days (DEFAULT_LOCKUP_PERIOD = 86400 epochs). Settlement task (tasks/pay/settle_task.go) runs every 12 hours (time.Hour*12), but only settles rails that need it.

Settlement Logic (from lib/filecoinpayment/utils.go:SettleLockupPeriod()):

// Active rails: settle if approaching lockup limit
threshold := settledUpTo + lockupPeriod
thresholdWithGrace := threshold - EpochsInDay  // 1 day buffer
if thresholdWithGrace < currentBlock {
    toSettle = append(toSettle, rail)
}

// Terminated rails: settle if not fully settled
if rail.endEpoch > 0 && settledUpTo < endEpoch {
    toSettle = append(toSettle, rail)
}

What This Means: Active rails get settled when ~29 days behind (lockup - 1day buffer). This is normal and by design - no point settling more frequently.

Rail Settlement Status (for doctor):

rail = getRail(railId)
lockup_period = 30 days = 86400 epochs

If rail.endEpoch == 0 (active rail):
  settlement_lag = current_epoch - rail.settledUpTo
  threshold = lockup_period - EpochsInDay  # ~29 days

  # Expected: lag can be up to ~29 days before settlement triggers
  If settlement_lag > lockup_period → SETTLEMENT_CRITICAL (should have been settled)
  If settlement_lag > lockup_period - EpochsInDay → SETTLEMENT_DUE (should settle soon)
  Otherwise → OK (lag is normal)

If rail.endEpoch != 0 (terminated rail):
  If rail.settledUpTo < rail.endEpoch → UNSETTLED_TERMINATED (needs settlement)
  If rail.settledUpTo >= rail.endEpoch → READY_FOR_CLEANUP
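
In Go terms, this classification might look like the following sketch, using the RailView fields and lockup logic already described (`report()` is a hypothetical helper; 2880 epochs = 1 day):

// Classify settlement health for one rail.
rail, _ := payment.GetRail(opts, big.NewInt(railId))
epochsInDay := big.NewInt(2880)

if rail.EndEpoch.Sign() == 0 { // active rail
    lag := new(big.Int).Sub(currentEpoch, rail.SettledUpTo)
    switch {
    case lag.Cmp(rail.LockupPeriod) > 0:
        report("SETTLEMENT_CRITICAL", railId) // should already have been settled
    case lag.Cmp(new(big.Int).Sub(rail.LockupPeriod, epochsInDay)) > 0:
        report("SETTLEMENT_DUE", railId) // settlement task should pick it up soon
    }
} else if rail.SettledUpTo.Cmp(rail.EndEpoch) < 0 {
    report("UNSETTLED_TERMINATED", railId)
} else {
    report("READY_FOR_CLEANUP", railId)
}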

Settlement Batching: Uses Multicall3 to batch up to 10 settleRail() calls per transaction for gas efficiency. If one rail in batch fails, entire batch reverts (the terminated dataset problem).

5. IPNI Indexing Consistency

How Metadata Works (from pdp/indexing.go and pdp/contract/utils.go):

// Check if data set has withIPFSIndexing metadata
// Uses CheckIfIndexingNeeded() from pdp/indexing.go
func CheckIfIndexingNeeded(ethClient *ethclient.Client, dataSetId uint64) (bool, error) {
    pdpVerifier, _ := contract.NewPDPVerifier(contract.ContractAddresses().PDPVerifier, ethClient)
    listenerAddr, _ := pdpVerifier.GetDataSetListener(nil, big.NewInt(dataSetId))
    // GetDataSetMetadataAtKey handles FWSS StateView contract
    mustIndex, _, _ := contract.GetDataSetMetadataAtKey(listenerAddr, ethClient, dataSetId, "withIPFSIndexing")
    return mustIndex, nil
}

Metadata Storage: Stored on-chain in listener contract (FWSS). Retrieved via getDataSetMetadata(dataSetId, key) on the StateView contract.

Metadata Key: "withIPFSIndexing" (from synapse-sdk/src/utils/constants.ts:METADATA_KEYS.WITH_IPFS_INDEXING)

Doctor Checks:

For each data set:
  has_indexing_flag = CheckIfIndexingNeeded(ethClient, dataSetId)

  For each piece in data set (pdp_data_set_pieces):
    Get pdp_pieceref via pdp_pieceref FK

    If has_indexing_flag AND needs_indexing = FALSE AND indexing_task_id IS NULL:
      → INDEXING_NOT_TRIGGERED
    If needs_indexing = TRUE for > N hours:
      → INDEXING_STALLED
    If needs_ipni = TRUE for > N hours:
      → IPNI_ANNOUNCEMENT_STALLED
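
A sketch of the stalled-indexing scan. pdp_piecerefs has no timestamp recording when the flags were set, so created_at is used as a rough proxy here; that is an approximation, not curio behaviour:

// Flag piece refs whose indexing/IPNI flags have been pending for too long.
rows, _ := db.Query(ctx, `
    SELECT id, piece_cid, needs_indexing, needs_ipni
    FROM pdp_piecerefs
    WHERE (needs_indexing = TRUE OR needs_ipni = TRUE)
      AND created_at < NOW() - INTERVAL '48 hours'`)
for rows.Next() {
    var id int64
    var pieceCid string
    var needsIndexing, needsIPNI bool
    _ = rows.Scan(&id, &pieceCid, &needsIndexing, &needsIPNI)
    if needsIndexing {
        report("INDEXING_STALLED", pieceCid)
    }
    if needsIPNI {
        report("IPNI_ANNOUNCEMENT_STALLED", pieceCid)
    }
}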

Future Enhancement: Query IPNI infrastructure directly to verify content is actually discoverable. IPNI announcements are dual:

  1. PieceCIDs - the piece identifiers themselves
  2. IPLD blocks - content blocks extracted from CAR files within pieces (when withIPFSIndexing is set)

Verification would query IPNI endpoints (e.g., cid.contact) to confirm both PieceCIDs and extracted block CIDs appear in the index. More complex but would catch cases where announcements succeed locally but fail to propagate.

6. Cleanup Flow Health

Stuck Deletions:

For each entry in pdp_delete_data_set:
  If after_terminate_service = FALSE:
    If terminate_service_task_id IS NULL:
      → TERMINATE_NOT_SCHEDULED
    If task exists but old:
      → TERMINATE_TASK_STALLED

  If after_terminate_service = TRUE AND after_delete_data_set = FALSE:
    If delete_data_set_task_id IS NULL:
      → DELETE_NOT_SCHEDULED
    If delete_tx_hash IS NOT NULL but after_delete_data_set = FALSE:
      → DELETE_TX_PENDING (check message_waits_eth)

  If after_delete_data_set = TRUE AND terminated = FALSE:
    → CASCADE_DELETE_PENDING

  If dataSetLive(id) = false AND terminated = FALSE:
    elapsed = current_epoch - service_termination_epoch
    If elapsed > EXPECTED_CLEANUP_TIME:
      → CLEANUP_STALLED
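
A sketch of the stalled-cleanup check; EXPECTED_CLEANUP_TIME is a doctor-side tunable, expressed here as `expectedCleanupEpochs` (int64 epochs, like `currentEpoch`):

// Flag cleanups that have not completed long after on-chain termination.
rows, _ := db.Query(ctx, `
    SELECT id, service_termination_epoch
    FROM pdp_delete_data_set
    WHERE terminated = FALSE AND service_termination_epoch IS NOT NULL`)
for rows.Next() {
    var id, termEpoch int64
    _ = rows.Scan(&id, &termEpoch)

    live, _ := verifier.DataSetLive(opts, big.NewInt(id))
    if !live && currentEpoch-termEpoch > expectedCleanupEpochs {
        report("CLEANUP_STALLED", id)
    }
}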

7. Transaction State

Pending Transactions:

SELECT * FROM message_waits_eth
WHERE tx_status = 'pending'
AND created_at < NOW() - INTERVAL '1 hour'

For each:
  → STALE_PENDING_TX (may need retry or investigation)

Failed Create/Add Messages:

SELECT * FROM pdp_data_set_creates WHERE ok = FALSE
SELECT * FROM pdp_data_set_piece_adds WHERE add_message_ok = FALSE

For each:
  → FAILED_TX (needs attention)

8. Wallet/Gas Health

PDP Wallet Balance:

// Get PDP wallet address from DB
var privateKey []byte
db.QueryRow(ctx, `SELECT private_key FROM eth_keys WHERE role = 'pdp'`).Scan(&privateKey)
key, _ := crypto.ToECDSA(privateKey)
wallet := crypto.PubkeyToAddress(key.PublicKey)

// Check balance
balance, _ := ethClient.BalanceAt(ctx, wallet, nil)

// Warn if balance is getting low
// Note: CreateDataSet has sybil fee (0.1 FIL), plus gas for proving, settlements, etc.
// Don't hardcode threshold - make configurable or just report balance
if balance.Cmp(configuredThreshold) < 0 {
    → LOW_GAS_BALANCE (balance: {balance}, may affect operations)
}

Pending Nonce Gap:

pendingNonce, _ := ethClient.PendingNonceAt(ctx, wallet)
confirmedNonce, _ := ethClient.NonceAt(ctx, wallet, nil)

if pendingNonce-confirmedNonce > 10 {
    → NONCE_GAP (many pending transactions, possible stuck queue)
}

9. Harmony Task Health

Schema Reference (harmony_task_history):

-- Columns: id, task_id, name, posted, work_start, work_end, result, err, completed_by_host_and_port
-- result = TRUE means success, FALSE means failure
-- err contains error message on failure

Stuck Tasks (in harmony_task, not yet completed):

SELECT name, COUNT(*) as stuck_count,
       MIN(posted_time) as oldest_stuck
FROM harmony_task
WHERE owner_id IS NOT NULL
AND posted_time < NOW() - INTERVAL '2 hours'
GROUP BY name

-- Tasks stuck for hours may indicate node crash or deadlock
For each with stuck_count > 0:
  → STUCK_TASK (task type: {name}, count: {stuck_count}, oldest: {oldest_stuck})

Recent Failure Summary (from harmony_task_history):

-- PDP-related task failure counts in last 24h
SELECT name,
       COUNT(*) FILTER (WHERE result = FALSE) as failures,
       COUNT(*) FILTER (WHERE result = TRUE) as successes,
       COUNT(*) as total
FROM harmony_task_history
WHERE name IN ('Prove', 'InitPP', 'NextPP', 'TermFWSS', 'DelDS')
AND work_end > NOW() - INTERVAL '24 hours'
GROUP BY name

-- High failure rate is concerning
For each where failures > successes:
  → HIGH_FAILURE_RATE (task: {name}, failures: {failures}/{total})

Repeated Error Patterns:

SELECT name, err, COUNT(*) as fail_count,
       MAX(work_end) as last_failure
FROM harmony_task_history
WHERE result = FALSE
AND err IS NOT NULL
AND work_end > NOW() - INTERVAL '24 hours'
GROUP BY name, err
HAVING COUNT(*) > 3
ORDER BY fail_count DESC

For each:
  → REPEATED_FAILURE (task: {name}, count: {fail_count}, error: {err})

Task Timing Analysis (spot slow tasks):

SELECT name,
       AVG(EXTRACT(EPOCH FROM (work_end - work_start))) as avg_duration_sec,
       MAX(EXTRACT(EPOCH FROM (work_end - work_start))) as max_duration_sec
FROM harmony_task_history
WHERE result = TRUE
AND work_end > NOW() - INTERVAL '24 hours'
AND name IN ('Prove', 'InitPP', 'NextPP')
GROUP BY name

-- Prove tasks taking >5 min may indicate storage issues
For each where avg_duration_sec > 300:
  → SLOW_TASK (task: {name}, avg: {avg_duration_sec}s)

Per-Node Health (if multi-node cluster):

SELECT completed_by_host_and_port as node,
       COUNT(*) FILTER (WHERE result = FALSE) as failures,
       COUNT(*) as total
FROM harmony_task_history
WHERE work_end > NOW() - INTERVAL '24 hours'
GROUP BY completed_by_host_and_port

-- One node with many more failures may have local issues

10. Provider Registration Health

Registry Status (optional but useful):

registryAddr, _ := contract.ServiceRegistryAddress()
registry, _ := contract.NewServiceProviderRegistry(registryAddr, ethClient)

// Get provider info for our wallet
providerInfo, err := registry.GetProviderByOwner(opts, wallet)
if err != nil {
    → NOT_REGISTERED (SP not registered in ServiceProviderRegistry)
}

// Check capabilities match expectations
capabilities := registry.GetProductCapabilities(opts, providerInfo.ProviderId, 0) // ProductType 0 = PDP
// Compare serviceURL, pricing, etc. against local config

11. Piece Storage Spot Check

Verify Random Pieces Readable (optional, slower):

// Sample a few pieces to verify storage backend health
pieces := db.Query(`SELECT piece_cid, piece_ref FROM pdp_piecerefs
                    ORDER BY RANDOM() LIMIT 5`)

for each piece:
    // Try to open piece from storage
    reader, err := pieceStore.GetReader(pieceCid)
    if err != nil {
        → PIECE_UNREADABLE (cid: {piece_cid}, error: {err})
    }
    // Optionally verify first/last bytes or CommP

Output Format

PDP Doctor Report
=================
Network: calibnet (chain ID 314159)
PDP Wallet: 0x1234...5678
Wallet Balance: 4.2 FIL

Data Sets Checked: 42
Pieces Checked: 1,847
Rails Checked: 38

CRITICAL (requires immediate attention):
  - WALLET: LOW_GAS_BALANCE - 0.3 FIL remaining (operations may fail)
  - Data Set 123: MISSED_DEADLINE - proving deadline passed, not proven
  - Data Set 456: STALE - dead on-chain but local DB unaware
  - TASK: STUCK_TASK - ProveTask stuck for 3 hours (count: 2)

WARNINGS:
  - Data Set 789: SETTLEMENT_DUE - approaching lockup limit (28 days unsettled)
  - Data Set 555: MISSED_PERIOD - period 42 not proven
  - Piece xyz (Data Set 123): INDEXING_STALLED - needs_indexing=true for 48h
  - Rail 42: UNSETTLED_TERMINATED - terminated but not fully settled
  - TASK: REPEATED_FAILURE - NextProvingPeriodTask failed 7x in 24h

INFO:
  - Data Set 111: Cleanup in progress (after_terminate_service=true)
  - 3 pieces pending IPNI announcement
  - Provider registered: ID 42, serviceURL matches config

Reference Count Issues:
  - pdp_pieceref 999: expected 2, actual 1

Summary:
  Critical: 4
  Warnings: 5
  Info: 3

Implementation Notes

Contract Calls

Simplest approach: Goroutines with bounded concurrency. For a diagnostic tool, this is sufficient:

sem := make(chan struct{}, 10) // limit concurrent RPC calls
var wg sync.WaitGroup
for _, setId := range dataSetIds {
    wg.Add(1)
    go func(id uint64) {
        defer wg.Done()
        sem <- struct{}{}
        defer func() { <-sem }()

        live, _ := pdpVerifier.DataSetLive(opts, big.NewInt(int64(id)))
        // ... process result
    }(setId)
}
wg.Wait()

Alternative: go-ethereum's rpc.Client supports BatchCall() for JSON-RPC level batching (multiple calls in one HTTP request) without Multicall3 complexity. The underlying *rpc.Client is available via the ethclient.Client's Client() method.
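
A sketch of that JSON-RPC batching approach for dataSetLive(), assuming the generated binding exposes its parsed ABI via PDPVerifierMetaData.GetAbi() (adjust if the binding only ships the raw ABI string):

// Batch many dataSetLive() reads into one HTTP round trip.
parsedABI, _ := contract.PDPVerifierMetaData.GetAbi()
rpcClient := ethClient.Client() // underlying *rpc.Client

batch := make([]rpc.BatchElem, 0, len(dataSetIds))
results := make([]hexutil.Bytes, len(dataSetIds))
for i, id := range dataSetIds {
    calldata, _ := parsedABI.Pack("dataSetLive", big.NewInt(int64(id)))
    batch = append(batch, rpc.BatchElem{
        Method: "eth_call",
        Args: []interface{}{
            map[string]interface{}{"to": pdpAddr.Hex(), "data": hexutil.Encode(calldata)},
            "latest",
        },
        Result: &results[i],
    })
}
_ = rpcClient.BatchCallContext(ctx, batch)

for i, elem := range batch {
    if elem.Error != nil {
        continue // report per-call errors separately
    }
    out, _ := parsedABI.Unpack("dataSetLive", results[i])
    live, _ := out[0].(bool)
    _ = live // feed into the liveness checks
}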

FWSS Detection and StateView Resolution

The FWSS main contract exposes read-only state via a separate StateView contract. To access StateView methods like provenPeriods(), getDataSet(), etc., you must first resolve the StateView address.

Pattern (from pdp/contract/utils.go:GetProvingScheduleFromListener()):

import "github.com/filecoin-project/curio/pdp/contract"

// Get listener address for a data set
listenerAddr, _ := pdpVerifier.GetDataSetListener(opts, big.NewInt(setId))

// Try to get the view contract address from the listener
viewAddr := listenerAddr  // fallback

// Check if listener supports viewContractAddress method (FWSS does)
listenerService, err := contract.NewListenerServiceWithViewContract(listenerAddr, ethClient)
if err == nil {
    va, err := listenerService.ViewContractAddress(nil)
    if err == nil && va != (common.Address{}) {
        viewAddr = va  // Use the StateView contract
    }
}

// Now use viewAddr for read operations
fwssv, _ := contract.NewFilecoinWarmStorageServiceStateView(viewAddr, ethClient)
ds, _ := fwssv.GetDataSet(opts, big.NewInt(setId))
proven, _ := fwssv.ProvenPeriods(opts, big.NewInt(setId), big.NewInt(periodId))

Helper functions in curio (from pdp/contract/utils.go):

  • GetProvingScheduleFromListener() - Returns IPDPProvingSchedule interface bound to correct address
  • GetDataSetMetadataAtKey() - Resolves view contract and queries metadata

Detecting FWSS vs other listeners:

knownFWSSAddr := contract.ContractAddresses().AllowedPublicRecordKeepers.FWSService
listener, _ := pdpVerifier.GetDataSetListener(opts, big.NewInt(setId))

if listener == knownFWSSAddr {
    // FWSS - can use StateView for provenPeriods, pricing, etc.
} else {
    // Generic PDPListener - skip FWSS-specific checks
}

Rail Discovery

FWSS doesn't expose direct rail lookup by data set. Options:

  1. Store rail IDs locally when created
  2. Query FilecoinPay getRailsForPayerAndToken() and filter
  3. Add dataSetToRail() view to FWSS contract
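
A sketch of option 2, pairing getRailsForPayeeAndToken() (as in the FilecoinPay section above) with the StateView's railToDataSet() reverse lookup; `spWallet` is the SP's payee address:

// Discover rails paying this SP, then map each back to its FWSS data set.
tokenAddr, _ := contract.USDFCAddress()
rails, _ := payment.GetRailsForPayeeAndToken(opts, spWallet, tokenAddr, big.NewInt(0), big.NewInt(0))
for _, ri := range rails {
    dsID, err := fwssv.RailToDataSet(opts, ri.RailId)
    if err != nil || dsID.Sign() == 0 {
        continue // not an FWSS-managed rail
    }
    // record dsID → ri.RailId for the settlement checks
}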

CLI Interface

Standalone version (initial implementation):

# Full check (requires DB and RPC connection)
pdp-doctor --db "postgres://..." --rpc "https://api.calibration.node.glif.io/rpc/v1"

# Check specific data set
pdp-doctor --db "..." --rpc "..." --data-set 123

# JSON output for automation
pdp-doctor --db "..." --rpc "..." --json

# Only show problems
pdp-doctor --db "..." --rpc "..." --problems-only

# Check specific subsystem
pdp-doctor --db "..." --rpc "..." --check settlement
pdp-doctor --db "..." --rpc "..." --check proving
pdp-doctor --db "..." --rpc "..." --check cleanup

Integrated version (future, if merged into curio):

# Uses curio's existing config for DB/RPC
curio pdp doctor

# Same flags as standalone
curio pdp doctor --data-set 123
curio pdp doctor --json
curio pdp doctor --problems-only
curio pdp doctor --check settlement

Configuration:

  • DB connection: Same Yugabyte/PostgreSQL used by curio
  • RPC endpoint: Filecoin RPC (Glif, local lotus, etc.)
  • Network auto-detected from chain ID (314159 = calibnet, 314 = mainnet)
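
A sketch of the chain-ID auto-detection:

// Detect network from the connected RPC endpoint.
chainID, err := ethClient.ChainID(ctx)
if err != nil {
    return fmt.Errorf("querying chain ID: %w", err)
}
network := "devnet" // default: fall back to env vars / contract_addresses.json
switch chainID.Int64() {
case 314:
    network = "mainnet"
case 314159:
    network = "calibnet"
}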

Priority Order

  1. Wallet/gas health - no gas = nothing works
  2. Data set liveness - most critical, affects proving
  3. Proving state - missed deadlines mean payment issues
  4. Harmony task health - stuck tasks block everything
  5. Settlement health - stale settlement means delayed payment
  6. Cleanup flow - stuck cleanups waste resources
  7. Provider registration - affects discoverability
  8. IPNI indexing - affects discoverability but not payments
  9. Piece storage - spot check, slow but catches storage issues
  10. Reference counts - internal consistency, less urgent
