
feat: add more scan chains plus CEX-only pull with reliable multi-chain browser priming #116

Open
elitex45 wants to merge 2 commits into dawsbot:v1 from elitex45:cex-pull


elitex45 commented Mar 6, 2026

Summary

This PR adds a CEX-focused scraping flow and improves browser/origin handling for reliable multi-chain pulls.

What Changed

  • Added CEX-only pull script:

    • scripts/fetch-cex.ts
    • pulls account labels for centralized exchanges (CEXs)
    • supports DefiLlama CEX keyword enrichment with fallback static keywords
    • excludes risky labels (exploit/hack/phish/scam/etc.) and known false positives
    • supports --no-10k-limit pagination mode
    • compact progress output with estimated totals
  • Added scripts in package.json:

    • pull:cex
    • pull:cex:all
    • smoke:labelcloud
  • Added scripts/smoke-labelcloud.ts for chain labelcloud reachability checks.

  • Added new chain integrations:

    • Avalanche, Polygon, Mantle, Scroll, Linea, Blast, Berachain, World Chain
    • updated chainId mapping
    • removed non-working Mode and zkSync Era from active config
  • Improved browser fetching reliability:

    • per-origin priming via setActiveOrigin()
    • navigation-based HTML fetch for cross-origin reliability
    • origin-aware POST retry behavior
  • Updated pull loops (fetch-all.ts, fetch-cex.ts) to prime origin per chain.
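
For reviewers, the per-origin priming and re-prime retry listed above can be pictured with a minimal sketch. Only the name setActiveOrigin() comes from this PR; the OriginFetcher interface, its method signatures, and postWithReprime() are hypothetical and may not match the actual BrowserFetcher code.

```typescript
// Hypothetical interface: only the name setActiveOrigin() appears in this PR.
// The parameter and return types are assumptions made for illustration.
interface OriginFetcher {
  setActiveOrigin(origin: string): Promise<void>;
  post(url: string, body: string): Promise<string>;
}

// Prime the explorer origin, POST, and re-prime before retrying on failure.
async function postWithReprime(
  fetcher: OriginFetcher,
  url: string,
  body: string,
  maxAttempts = 2,
): Promise<string> {
  const origin = new URL(url).origin;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Priming before each attempt replaces a stale origin left over
      // from the previous chain in a multi-chain run.
      await fetcher.setActiveOrigin(origin);
      return await fetcher.post(url, body);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError ?? new Error("maxAttempts must be at least 1");
}
```

A pull loop would call this once per request while iterating over chains, so each explorer is primed before its first request.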

Data Artifacts Included

  • Updated:
    • data/db.sqlite3
    • data/json/accounts.json
    • data/csv/accounts.csv

Validation

  • Typecheck passed with:
    • npx tsc -p tsconfig.json

elitex45 added 2 commits March 7, 2026 02:21
Add a dedicated CEX scraping pipeline and improve explorer switching reliability across chains.

Changes in this commit:

- Add new CEX-only pull script with:
  - DefiLlama CEX keyword ingestion (with fallback to built-in list)
  - keyword normalization/boundary matching
  - label/name-tag exclusions for known false positives
  - risk-term exclusion filter (exploit/hack/phish/scam/etc.)
  - optional --no-10k-limit pagination mode
  - compact percent progress output based on estimated totals
  - fetch retries + timeout handling

- Add npm scripts:
  - pull:cex
  - pull:cex:all
  - smoke:labelcloud

- Add smoke-labelcloud preflight script for explorer availability checks

- Add missing chain integrations and chainId mappings:
  Avalanche, Polygon, Mantle, Scroll, Linea, Blast, Berachain, World Chain
  (remove non-working Mode and zkSync Era entries)

- Improve BrowserFetcher for multi-chain runs:
  - per-origin Cloudflare priming via setActiveOrigin()
  - navigation-based HTML fetch for cross-origin reliability
  - origin-aware POST behavior with re-prime retry

- Prime active origin per chain in fetch-all and fetch-cex loops
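
The keyword normalization, boundary matching, and risk-term exclusion named in the commit message could look roughly like the sketch below. It is not taken from scripts/fetch-cex.ts: the function names and both keyword lists are hypothetical, and the risk terms are limited to the four the commit message spells out.

```typescript
// Illustrative lists only; the real script reportedly pulls CEX keywords
// from DefiLlama and falls back to a built-in list.
const FALLBACK_CEX_KEYWORDS = ["binance", "coinbase", "kraken", "okx"];
const RISK_TERMS = ["exploit", "hack", "phish", "scam"];

// Lowercase and collapse every run of non-alphanumerics into one space.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// True when `keyword` occurs in `text` as a whole word or phrase,
// so "kraken" matches "Kraken 4" but not "Krakenfx".
function matchesKeyword(text: string, keyword: string): boolean {
  const needle = normalize(keyword);
  if (needle.length === 0) return false;
  return ` ${normalize(text)} `.includes(` ${needle} `);
}

// Reject risky labels first (substring match, so "phishing" and "hacked"
// are caught), then require at least one CEX keyword.
function isCexLabel(
  label: string,
  keywords: string[] = FALLBACK_CEX_KEYWORDS,
): boolean {
  const normalized = normalize(label);
  if (RISK_TERMS.some((term) => normalized.includes(term))) return false;
  return keywords.some((keyword) => matchesKeyword(label, keyword));
}
```

Matching on padded, normalized strings instead of a hand-built regular expression avoids having to escape keywords that contain punctuation.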
elitex45 changed the title from "feat: add CEX-only pull flow and reliable multi-chain browser priming" to "feat: add more scan chains plus CEX-only pull with reliable multi-chain browser priming" on Mar 6, 2026
Author

elitex45 commented Mar 6, 2026

@dawsbot Great repo btw, very much needed :)

Owner

dawsbot commented Mar 17, 2026

Thanks for the PR, @elitex45! Sorry for the delay. What method do you recommend for testing and verifying that this all works?

@elitex45
Author

It's exactly the same as before: start a Chrome instance with remote debugging enabled, then run the command to pull CEX addresses.

Start the Chrome instance with:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-devtools-eth-labels

Then run pull:cex. It follows the same steps as before but fetches only CEX-related addresses:
npm run pull:cex

The other fix: when multiple explorers were selected at once, the code didn't prompt for a login on each explorer, which caused the run to fail. It now opens each explorer's link so you can log in, after which the script can crawl it.

I also added links for the other supported explorers to extend the scope.

@dawsbot @claude
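
Before running the pull, it may help to confirm that Chrome is actually listening on the debugging port from the command above. The sketch below is not part of this PR. It assumes Node 18+ (for the global fetch) and relies on Chrome's /json/version DevTools HTTP endpoint; the function names are hypothetical.

```typescript
// Builds the DevTools HTTP endpoint that reports browser version info.
function devtoolsVersionUrl(port = 9222, host = "127.0.0.1"): string {
  return `http://${host}:${port}/json/version`;
}

// Resolves to true when Chrome answers on the debugging port in time.
async function isChromeDebuggerUp(port = 9222, timeoutMs = 2000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(devtoolsVersionUrl(port), { signal: controller.signal });
    return res.ok;
  } catch {
    return false; // connection refused, aborted, or other network error
  } finally {
    clearTimeout(timer);
  }
}
```

If isChromeDebuggerUp() resolves to false, the Chrome instance is probably not running with --remote-debugging-port=9222, and npm run pull:cex would be expected to fail to connect.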
