
make --upstream_sync sync gap analysis progressively (#534) #755

Open
PRAteek-singHWY wants to merge 2 commits into OWASP:main from
PRAteek-singHWY:issue-534-upstream-sync-gap-analysis-progressive

PRAteek-singHWY commented Feb 22, 2026

Fixes

fixes #534

Summary

This PR makes --upstream_sync progressively sync gap-analysis data from upstream and store it in the local database, so local Gap Analysis can run without requiring a massive standalone graph download.

It also fixes a method-shadowing bug in local gap-analysis execution that caused much more aggressive pruning than intended.

Problem

Issue #534 reports that local gap-analysis data is too large (~9GB) and difficult to use reliably for local development.
Expected behavior in the issue: upstream sync should opportunistically fetch the data needed to run Gap Analysis locally and persist it.

What Changed

1) Progressive map-analysis sync during upstream sync

In application/cmd/cre_main.py:

  • Added _progressively_sync_gap_analysis_from_upstream(...)
  • During download_graph_from_upstream(...):
    • fetches upstream standards from /standards
    • iterates ordered standard pairs
    • skips pairs already cached locally
    • requests upstream /map_analysis?standard=A&standard=B
    • persists only successful payloads containing result
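The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in cre_main.py: `sync_pairs`, `cache_key`, and `fake_fetch` are hypothetical names, the dictionary stands in for the local database, and the `A >> B` key format is an assumption modelled on the key used in the validation steps.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional


def cache_key(a: str, b: str) -> str:
    # Assumed key format, modelled on the 'SAMM >> ASVS' key used for cache resets.
    return f"{a} >> {b}"


def sync_pairs(
    standards: List[str],
    cache: Dict[str, dict],
    fetch: Callable[[str, str], Optional[dict]],
) -> int:
    """Fill `cache` with upstream results for pairs that are not cached yet."""
    synced = 0
    for a, b in permutations(standards, 2):  # ordered pairs: (A, B) and (B, A)
        key = cache_key(a, b)
        if key in cache:
            continue  # already cached locally
        payload = fetch(a, b)  # stands in for GET /map_analysis?standard=A&standard=B
        if not payload or "result" not in payload:
            continue  # e.g. a job_id-only response: the pair stays on-demand
        cache[key] = payload["result"]
        synced += 1
    return synced


def fake_fetch(a: str, b: str) -> Optional[dict]:
    # Hypothetical upstream: only SAMM -> ASVS has a precomputed result.
    if (a, b) == ("SAMM", "ASVS"):
        return {"result": {"row-1": {"extra": 0}}}
    return {"job_id": "abc"}


cache: Dict[str, dict] = {"ASVS >> SAMM": {}}  # reverse pair is already cached
synced_count = sync_pairs(["SAMM", "ASVS"], cache, fake_fetch)
```

Passing the fetch function in keeps the sketch runnable without network access; a real implementation would issue the HTTP request there.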

2) Progressive weak-link sync

  • Added _progressively_sync_weak_links_for_pair(...)
  • For map-analysis rows with extra > 0, sync requests:
    • /map_analysis_weak_links?standard=A&standard=B&key=<row-key>
  • Persists weak-link result payloads locally under subresource cache keys.
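As a rough sketch of the weak-link step, assuming the map-analysis result is a mapping from row key to a row dict carrying an `extra` count: `BASE_URL`, `weak_links_url`, `sync_weak_links`, and the `A >> B->key` subresource key format are all hypothetical and only illustrate the idea.

```python
from typing import Callable, Dict, List, Optional
from urllib.parse import urlencode

# Hypothetical base URL; the PR resolves the real one via _upstream_api_url().
BASE_URL = "https://upstream.example/rest/v1"


def weak_links_url(a: str, b: str, row_key: str) -> str:
    # Repeated `standard` parameters plus the row key, as in the endpoint above.
    query = urlencode([("standard", a), ("standard", b), ("key", row_key)])
    return f"{BASE_URL}/map_analysis_weak_links?{query}"


def sync_weak_links(
    a: str,
    b: str,
    result: Dict[str, dict],
    cache: Dict[str, dict],
    fetch: Callable[[str], Optional[dict]],
) -> int:
    """Cache weak-link payloads for rows that advertise extra detail."""
    stored = 0
    for row_key, row in result.items():
        if row.get("extra", 0) <= 0:
            continue  # nothing extra to fetch for this row
        payload = fetch(weak_links_url(a, b, row_key))
        if payload and "result" in payload:
            # Assumed subresource key format: pair key plus row key.
            cache[f"{a} >> {b}->{row_key}"] = payload["result"]
            stored += 1
    return stored


requested: List[str] = []


def fake_fetch(url: str) -> Optional[dict]:
    # Records the URL instead of performing a real HTTP request.
    requested.append(url)
    return {"result": {"paths": []}}


rows = {"r1": {"extra": 2}, "r2": {"extra": 0}}
weak_cache: Dict[str, dict] = {}
stored_count = sync_weak_links("SAMM", "ASVS", rows, weak_cache, fake_fetch)
```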

3) Config, URL, timeout, and safety controls

  • Added constants:
    • DEFAULT_UPSTREAM_API_URL
    • UPSTREAM_SYNC_REQUEST_TIMEOUT_SECONDS
    • UPSTREAM_SYNC_MAP_ANALYSIS_MAX_PAIRS_ENV
  • Added _upstream_api_url() to centralize upstream URL resolution.
  • Added CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS env cap:
    • 0 => no limit (sync all missing pairs)
    • positive integer => stop after syncing that many pairs
  • Handles non-200 responses, invalid JSON, and request errors gracefully and continues the sync.
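The cap semantics above could be parsed along these lines. `max_pairs_from_env` is a hypothetical helper, and treating unparsable or negative values as "no limit" is an assumption of this sketch, not something the PR states.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS"


def max_pairs_from_env(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return None for 'no limit', otherwise the positive pair cap."""
    raw = environ.get(ENV_NAME, "0")  # unset behaves like 0: sync all missing pairs
    try:
        value = int(raw)
    except ValueError:
        return None  # assumption: unparsable values fall back to no limit
    return value if value > 0 else None  # assumption: negatives also mean no limit


limit = max_pairs_from_env({ENV_NAME: "5"})
```

The sync loop would then stop once the number of synced pairs reaches `limit`, and never stop early when `limit` is `None`.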

4) Neo4j population call in upstream sync flow

download_graph_from_upstream(...) now attempts populate_neo4j_db(...) (unless CRE_NO_NEO4J is set), then runs progressive gap-analysis sync.

5) Gap-analysis parity fix in local runtime

In application/database/db.py:

  • Removed the later, duplicate gap_analysis(...) definition that was unintentionally overriding (shadowing) the feature-toggle entrypoint.
  • Result: runtime now correctly uses the intended implementation:
    • default: original exhaustive mode
    • optimized mode only when explicitly enabled.
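For readers unfamiliar with this class of bug, here is a toy illustration of how a later definition with the same name silently replaces an earlier one in Python. The class and return values are made up for the example and do not mirror the real code in db.py.

```python
class Database:
    def gap_analysis(self, optimized: bool = False) -> str:
        # Intended entrypoint: chooses the mode from a feature toggle.
        return "optimized" if optimized else "exhaustive"

    def gap_analysis(self, optimized: bool = False) -> str:  # noqa: F811
        # A later definition with the same name rebinds the attribute,
        # so the toggle-aware method above is never called.
        return "optimized"


shadowed_mode = Database().gap_analysis()  # toggle is off, yet the pruned path runs
```

Python raises no error here, which is why the override went unnoticed; linters such as pyflakes report it as a redefinition (F811).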

Tests Added/Updated

  • application/tests/cre_main_test.py
    • test_download_graph_from_upstream_syncs_gap_analysis_progressively
    • test_download_graph_from_upstream_skips_cached_gap_analysis_pairs
    • test_download_graph_from_upstream_syncs_weak_links_results
    • test_download_graph_from_upstream_respects_max_pairs_limit
  • application/tests/gap_analysis_db_test.py
    • pruning test now explicitly patches Config.GAP_ANALYSIS_OPTIMIZED=True
  • application/tests/spreadsheet_test.py
    • gspread OAuth-dependent test now skips when local credentials are unavailable

Screenshot

[image]

How To Validate Locally

A) Start dependencies

make start-containers

B) Run upstream sync with progressive gap-analysis backfill

unset CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS
make upstream-sync

C) Start app services (separate terminals)

REDIS_URL=redis://localhost:6379 REDIS_NO_SSL=1 make start-worker
REDIS_URL=redis://localhost:6379 REDIS_NO_SSL=1 make dev-flask
yarn start

D) Use UI

Open:

  • http://localhost:9001/map_analysis

or a direct pair:

  • http://localhost:9001/map_analysis?base=SAMM&compare=ASVS

E) API spot-check

curl -sS "http://127.0.0.1:5000/rest/v1/map_analysis?standard=SAMM&standard=ASVS" | jq 'keys'
  • ["result"] => cached result served locally
  • ["job_id"] => on-demand computation started (worker path), poll via /ma_job_results
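The same check can be scripted. A minimal sketch, where `classify` is a hypothetical helper and the payloads are illustrative only:

```python
def classify(payload: dict) -> str:
    """Map a /map_analysis response body to how it was served."""
    if "result" in payload:
        return "cached"  # result served from the local cache
    if "job_id" in payload:
        return "pending"  # worker job started; poll /ma_job_results
    return "unknown"


cached_state = classify({"result": {}})
pending_state = classify({"job_id": "abc"})
```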

F) Optional stale-pair reset for deterministic retest

python cre.py --delete_map_analysis_for 'SAMM >> ASVS'
docker exec cre-redis-stack redis-cli DEL 'SAMM >> ASVS'

Operational Notes

  • Progressive sync only stores upstream responses that already contain result.
  • If upstream returns only job_id for a pair, that pair is not prefilled and remains on-demand locally.

  • Prefer make upstream-sync (the repo-standard entry point for contributors). Equivalent CLI: python cre.py --upstream_sync


PRAteek-singHWY commented Feb 22, 2026

Hi @northdpole 👋

This PR is primarily based on the implementation approach I shared earlier in discussion #534 (comment). I went ahead and raised it to demonstrate the same idea concretely in code and make it easier to evaluate. I’m completely open to iterating on it or restructuring it if you would prefer a different direction.

Approach (Analogy + Technical Summary)

Before this change, running Gap Analysis locally was like needing one route on a map but being forced to download the entire atlas first (~9GB graph payload).

With this PR, we switch to a progressive “fetch only what is needed” model.

What this PR does:
1. make upstream-sync still syncs the core CRE graph structure.
2. It then progressively requests upstream map-analysis results per standard pair (/map_analysis) and stores returned result payloads in local cache.
3. For entries that advertise extra detail (extra > 0), it also syncs weak-link subresults (/map_analysis_weak_links) and caches them locally.
4. Already-cached pairs are skipped, and sync can be bounded via CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS.
5. If a pair is not prefilled, existing on-demand local worker computation remains as fallback.

Net effect:
• No requirement to download a monolithic 9GB gap-analysis dataset up front.
• Local gap analysis becomes immediately practical for synced pairs.
• Data locality improves over time as more pairs are cached.
• Existing fallback behavior is preserved.

I’d really appreciate your feedback on whether this aligns with the intended long-term direction for upstream sync and gap analysis optimization. Happy to refine the structure, abstraction boundaries, or sync semantics if needed.

Development

Successfully merging this pull request may close these issues.

make --upstream_sync also sync the gap analysis graph progressively