make --upstream_sync sync gap analysis progressively (#534) #755
PRAteek-singHWY wants to merge 2 commits into OWASP:main
…alysis method shadowing (OWASP#534)
Hi @northdpole 👋

This PR is primarily based on the implementation approach I shared earlier in discussion #534 (comment). I raised it to demonstrate the same idea concretely in code and make it easier to evaluate. I’m completely open to iterating on or restructuring it if you would prefer a different direction.

Approach (Analogy + Technical Summary)

Before this change, running Gap Analysis locally was like needing one route on a map but being forced to download the entire atlas first (~9GB graph payload). With this PR, we switch to a progressive “fetch only what is needed” model.

What this PR does:

Net effect:

I’d really appreciate your feedback on whether this aligns with the intended long-term direction for upstream sync and gap analysis optimization. Happy to refine the structure, abstraction boundaries, or sync semantics if needed.
Fixes
fixes #534
Summary
This PR makes `--upstream_sync` progressively sync gap-analysis data from upstream and store it in the local database, so local Gap Analysis can run without requiring a massive standalone graph download.

It also fixes a method-shadowing bug in local gap-analysis execution that caused much more aggressive pruning than intended.
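For readers unfamiliar with the failure mode, here is a minimal illustration of method shadowing in a Python class. The class name `ExampleDB` and both method bodies are hypothetical and are not the code in `application/database/db.py`; the sketch only shows the language behavior, where a later `def` with the same name silently replaces the earlier one.

```python
class ExampleDB:
    """Hypothetical class; only the duplicate method name matters here."""

    def gap_analysis(self, optimized: bool) -> str:
        # Intended entrypoint: dispatch on a feature toggle.
        return "optimized path" if optimized else "original path"

    def gap_analysis(self, optimized: bool) -> str:  # noqa: F811
        # A second definition with the same name rebinds the attribute,
        # so the toggle-aware version above is never called.
        return "shadowing definition"


db = ExampleDB()
print(db.gap_analysis(optimized=True))  # prints "shadowing definition"
```

Python raises no error for the duplicate definition, so a bug of this kind tends to surface only as a behavioral difference, although linters such as pyflakes report it as a redefinition.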
Problem
Issue #534 reports that local gap-analysis data is too large (~9GB) and difficult to use reliably for local development.
Expected behavior in the issue: upstream sync should opportunistically fetch the data needed to run Gap Analysis locally and persist it.
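The expected behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `sync_missing_pairs`, `fake_fetch`, and the in-memory `dict` cache are hypothetical stand-ins, pairs are treated as ordered, and the cap is assumed to count upstream requests, with `0` meaning no limit.

```python
from itertools import permutations
from typing import Callable, Dict, Iterable, Optional, Tuple

Pair = Tuple[str, str]


def sync_missing_pairs(
    standards: Iterable[str],
    cache: Dict[Pair, dict],
    fetch_pair: Callable[[str, str], Optional[dict]],
    max_pairs: int = 0,
) -> int:
    """Fetch and store results for pairs missing from the cache.

    max_pairs == 0 means no limit. Returns the number of pairs stored.
    """
    stored = 0
    requested = 0
    for a, b in permutations(sorted(set(standards)), 2):
        if (a, b) in cache:
            continue  # already stored locally, no request needed
        if max_pairs and requested >= max_pairs:
            break  # assumed semantics: the cap limits upstream requests
        requested += 1
        payload = fetch_pair(a, b)
        if payload and "result" in payload:
            cache[(a, b)] = payload["result"]
            stored += 1
        # A payload without "result" (for example only "job_id") is skipped,
        # so that pair stays on-demand locally.
    return stored


def fake_fetch(a: str, b: str) -> dict:
    """Hypothetical stand-in for an upstream HTTP call."""
    if a == "SAMM":
        return {"job_id": "hypothetical-job"}
    return {"result": {"base": a, "compare": b}}


local_cache: Dict[Pair, dict] = {}
print(sync_missing_pairs(["ASVS", "SAMM", "CWE"], local_cache, fake_fetch))  # prints 4
```

Passing the fetch function as a parameter keeps the loop testable without network access; a second run re-requests only the pairs that are still missing.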
What Changed
1) Progressive map-analysis sync during upstream sync
In `application/cmd/cre_main.py`:
- `_progressively_sync_gap_analysis_from_upstream(...)`
- `download_graph_from_upstream(...)`:
- `/standards`
- `/map_analysis?standard=A&standard=B`
- `result`
2) Progressive weak-link sync
- `_progressively_sync_weak_links_for_pair(...)`
- `extra > 0`, sync requests: `/map_analysis_weak_links?standard=A&standard=B&key=<row-key>`
- `result` payloads locally under subresource cache keys.
3) Config, URL, timeout, and safety controls
- `DEFAULT_UPSTREAM_API_URL`
- `UPSTREAM_SYNC_REQUEST_TIMEOUT_SECONDS`
- `UPSTREAM_SYNC_MAP_ANALYSIS_MAX_PAIRS_ENV`
- `_upstream_api_url()` to centralize upstream URL resolution.
- `CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS` env cap:
- `0` => no limit (sync all missing pairs)
4) Neo4j population call in upstream sync flow
`download_graph_from_upstream(...)` now attempts `populate_neo4j_db(...)` (unless `CRE_NO_NEO4J` is set), then runs progressive gap-analysis sync.
5) Gap-analysis parity fix in local runtime
In `application/database/db.py`:
- `gap_analysis(...)` definition that was unintentionally overriding the feature-toggle entrypoint.
Tests Added/Updated
- `application/tests/cre_main_test.py`
- `test_download_graph_from_upstream_syncs_gap_analysis_progressively`
- `test_download_graph_from_upstream_skips_cached_gap_analysis_pairs`
- `test_download_graph_from_upstream_syncs_weak_links_results`
- `test_download_graph_from_upstream_respects_max_pairs_limit`
- `application/tests/gap_analysis_db_test.py`
- `Config.GAP_ANALYSIS_OPTIMIZED=True`
- `application/tests/spreadsheet_test.py`
Screenshot

How To Validate Locally
A) Start dependencies
B) Run upstream sync with progressive gap-analysis backfill
    unset CRE_UPSTREAM_SYNC_MAX_MAP_ANALYSIS_PAIRS
    make upstream-sync

C) Start app services (separate terminals)
D) Use UI
Open:
http://localhost:9001/map_analysis
or a direct pair:
http://localhost:9001/map_analysis?base=SAMM&compare=ASVS
E) API spot-check
- `["result"]` => cached result served locally
- `["job_id"]` => on-demand computation started (worker path), poll via `/ma_job_results`
F) Optional stale-pair reset for deterministic retest
Operational Notes
- `result.job_id` for a pair, that pair is not prefilled and remains on-demand locally.
- Prefer `make upstream-sync` (repo-standard for contributors). Equivalent CLI: `python cre.py --upstream_sync`
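
To make the two response shapes from the API spot-check concrete, here is a small sketch of building a `map_analysis` request URL and classifying the JSON payload. The function names and the `https://example.org/rest/v1` base URL are assumptions for illustration only; the query format and the `result` / `job_id` keys come from the description above.

```python
from urllib.parse import urlencode


def map_analysis_url(base_url: str, standard_a: str, standard_b: str) -> str:
    """Build a map_analysis URL with a repeated `standard` query parameter."""
    query = urlencode([("standard", standard_a), ("standard", standard_b)])
    return f"{base_url.rstrip('/')}/map_analysis?{query}"


def classify_response(payload: dict) -> str:
    """Name the path a response payload corresponds to."""
    if "result" in payload:
        return "cached"  # result is served directly
    if "job_id" in payload:
        return "on-demand"  # computation started, poll for the result
    return "unknown"


# Hypothetical base URL, used only to show the resulting shape.
print(map_analysis_url("https://example.org/rest/v1/", "SAMM", "ASVS"))
print(classify_response({"job_id": "hypothetical-job"}))  # prints "on-demand"
```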