Remove synthetic/structured data generation from diskann-providers#963

Merged
JordanMaples merged 5 commits into main from jordanmaples/synthetic_data
Apr 22, 2026

@JordanMaples
Contributor

This is a follow-up PR to #904, which has an outdated baseline and can be closed now that this PR supersedes it.
This PR does essentially the same work, just without the provider/async_/caching changes. This closes #903.

When #953 merges, the from_dim definition will need to be updated to support the '2' case.

Replace generate_structured_data.rs and generate_synthetic_labels_utils.rs with diskann::graph::test::synthetic::Grid and diskann-tools' own label generator respectively.

  • Rewrite GenerateGrid trait impls to delegate to Grid::data/data_as
  • Replace all adj list generation calls with Grid::neighbors
  • Rewire generate_synthetic_labels binary to diskann-tools version
  • Delete both redundant utility files and clean up exports

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JordanMaples JordanMaples requested review from a team and Copilot April 21, 2026 17:07
@JordanMaples JordanMaples changed the title Jordanmaples/synthetic data Remove synthetic/structured data generation from diskann-providers Apr 21, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR removes redundant synthetic/structured data generation utilities from diskann-providers by migrating grid generation to diskann::graph::test::synthetic::Grid and rewiring the synthetic label generation CLI to use diskann-tools’ label generator.

Changes:

  • Add Grid::from_dim() helper (plus tests) to construct a supported Grid from a dimension count.
  • Update diskann-providers async/caching tests to generate grid vectors + adjacency lists via Grid::{data,data_as,neighbors}.
  • Rewire diskann-tools generate_synthetic_labels binary to call diskann_tools::utils::generate_labels and delete the old diskann-providers label/grid generator utilities + exports.
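To make the shape of the migrated helpers concrete, here is a minimal, self-contained sketch of a grid type with `from_dim`, `data`, and `neighbors`. It is not the real `diskann::graph::test::synthetic::Grid`: the `SketchGrid` type, its method signatures, the `size` parameter, and the set of accepted dimensions are all assumptions made for illustration. Rejecting `2` only mirrors the note in the PR description that `from_dim` does not yet handle that case.

```rust
/// Hypothetical stand-in for a synthetic grid. NOT the real `Grid` type;
/// every name and signature here is an assumption for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SketchGrid {
    dim: usize,
}

impl SketchGrid {
    /// Build a grid from a dimension count, returning `None` for
    /// unsupported dimensions. The accepted set (1, 3, 4) is invented.
    fn from_dim(dim: usize) -> Option<Self> {
        match dim {
            1 | 3 | 4 => Some(Self { dim }),
            _ => None,
        }
    }

    /// Total number of points for `size` points along each axis.
    fn num_points(&self, size: usize) -> usize {
        size.pow(self.dim as u32)
    }

    /// Decode a row-major point id into per-axis coordinates
    /// (the last axis varies fastest).
    fn coords(&self, mut id: usize, size: usize) -> Vec<usize> {
        let mut c = vec![0; self.dim];
        for axis in (0..self.dim).rev() {
            c[axis] = id % size;
            id /= size;
        }
        c
    }

    /// Coordinates of every grid point as `f32` vectors, in id order.
    fn data(&self, size: usize) -> Vec<Vec<f32>> {
        (0..self.num_points(size))
            .map(|id| self.coords(id, size).iter().map(|&c| c as f32).collect())
            .collect()
    }

    /// Adjacency lists: each point is linked to its axis-aligned
    /// neighbors at distance one, without wrapping at the borders.
    fn neighbors(&self, size: usize) -> Vec<Vec<usize>> {
        (0..self.num_points(size))
            .map(|id| {
                let c = self.coords(id, size);
                let mut out = Vec::new();
                let mut stride = 1; // id step for the last axis
                for axis in (0..self.dim).rev() {
                    if c[axis] > 0 {
                        out.push(id - stride);
                    }
                    if c[axis] + 1 < size {
                        out.push(id + stride);
                    }
                    stride *= size;
                }
                out
            })
            .collect()
    }
}
```

With a type along these lines, a test that previously called a bespoke generator could build the grid once from its dimension and then ask it for both the vectors and the adjacency lists, which is the delegation pattern the summary above describes.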

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| diskann/src/graph/test/synthetic.rs | Adds Grid::from_dim() and tests to support dimension→Grid construction for migrated callers. |
| diskann-tools/src/bin/generate_synthetic_labels.rs | Switches the CLI to diskann-tools label generation util and passes a storage provider. |
| diskann-providers/src/utils/mod.rs | Removes public exports for the deleted structured-data + synthetic-label utils. |
| diskann-providers/src/utils/generate_synthetic_labels_utils.rs | Deletes the old label generator implementation (now sourced from diskann-tools). |
| diskann-providers/src/utils/generate_structured_data.rs | Deletes the old grid/circle structured data generator implementation (now sourced from diskann::graph::test::synthetic::Grid). |
| diskann-providers/src/model/graph/provider/async_/caching/example.rs | Updates adjacency generation to use Grid::neighbors via async_tests::grid_from_dim. |
| diskann-providers/src/index/diskann_async.rs | Rewrites GenerateGrid impls + adjacency generation to delegate to Grid::data/data_as/neighbors. |

Comment thread diskann-providers/src/model/graph/provider/async_/caching/example.rs Outdated
Comment thread diskann-providers/src/index/diskann_async.rs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@JordanMaples JordanMaples force-pushed the jordanmaples/synthetic_data branch from b6846d0 to 0aa62ba Compare April 21, 2026 17:15
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov-commenter

codecov-commenter commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 97.29730% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 90.47%. Comparing base (a0dfde6) to head (6fa8471).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| diskann-tools/src/bin/generate_synthetic_labels.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #963      +/-   ##
==========================================
+ Coverage   89.31%   90.47%   +1.16%     
==========================================
  Files         448      446       -2     
  Lines       83329    83076     -253     
==========================================
+ Hits        74422    75160     +738     
+ Misses       8907     7916     -991     
| Flag | Coverage Δ |
| --- | --- |
| miri | 90.47% <97.29%> (+1.16%) ⬆️ |
| unittests | 90.43% <97.29%> (+1.28%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| diskann-providers/src/index/diskann_async.rs | 96.42% <100.00%> (+0.12%) ⬆️ |
| diskann/src/graph/test/synthetic.rs | 100.00% <100.00%> (ø) |
| diskann-tools/src/bin/generate_synthetic_labels.rs | 0.00% <0.00%> (ø) |

... and 39 files with indirect coverage changes

@JordanMaples JordanMaples enabled auto-merge (squash) April 21, 2026 19:51
Comment thread diskann-providers/src/index/diskann_async.rs
@JordanMaples JordanMaples merged commit 443146e into main Apr 22, 2026
26 checks passed
@JordanMaples JordanMaples deleted the jordanmaples/synthetic_data branch April 22, 2026 16:35
@arkrishn94 arkrishn94 mentioned this pull request Apr 22, 2026
arkrishn94 added a commit that referenced this pull request Apr 22, 2026
Bumping to 0.50.1 to propagate changes to consumers.

Changes since previous bump: 

## What's Changed
* Add more agentic guard rails by @hildebrandmw in #871
* Cleanup `diskann-benchmark-runner` and friends. by @hildebrandmw in #865
* Use `--all-targets` for the no-default-features CI run. by @hildebrandmw in #874
* Remove unused `normalizing_util.rs` from `diskann-providers` by @Copilot in #902
* Benchmark Support for A/B Tests by @hildebrandmw in #900
* [diskann-garnet] Bump diskann-garnet to 1.0.26 by @tiagonapoli in #925
* Remove the `AdjacencyList` from `diskann-providers` by @hildebrandmw in #915
* [PQ cleanup] Part 1: Move pq_scratch, quantizer_preprocess and pq_dataset to `diskann-disk` by @arkrishn94 in #930
* Forbid Debug in diskann-benchmark by @arrayka in #914
* Remove DebugProvider by @JordanMaples in #923
* [diskann-garnet] Create workflow to publish to nuget by @tiagonapoli in #926
* Move k-means implementation from diskann-providers to diskann-disk by @Copilot in #933
* Inline minmax distance evaluations by @arkrishn94 in #935
* Use `rust-toolchain.toml` in CI by @hildebrandmw in #934
* Add a globally blocking CI gate. by @hildebrandmw in #932
* Remove `utils/math_util.rs` from `diskann-providers` by @Copilot in #921
* Bump rand from 0.9.2 to 0.9.3 by @dependabot[bot] in #945
* Remove OPQ and friends by @arkrishn94 in #947
* Migrate test_flaky_consolidate from diskann_providers to diskann by @JordanMaples in #942
* Remove GraphDataType from diskann-providers by @wuw92 in #950
* Remove unused method extract_best_l_candidates in NeighborPriorityQueue by @doliawu in #951
* Add `Debug` bounds to `VectorRepr`'s distance GATs. by @hildebrandmw in #948
* Add benchmark pipeline with Rust-native A/B validation by @YuanyuanTian-hh in #912
* Remove unnecessary `Default` bound from `Neighbor`'s `VectorIdType` by @doliawu in #956
* Replace `AlignedBoxWithSlice` with plain `Vec` / `Matrix` where alignment is unused by @wuw92 in #955
* [minmax] 8-bit benchmark by @arkrishn94 in #959
* Add `MultiInsertStrategy` implementations for `BfTreeProvider` by @hildebrandmw in #949
* Replace `AlignedBoxWithSlice` with `Vec` in PQScratch and disk fp vector caches by @wuw92 in #960
* Adding unit tests for paged_search by @JordanMaples in #962
* Remove AlignedBoxWithSlice wrapper and add alias to Poly<[T], AlignedAllocator> by @JordanMaples in #965
* Remove synthetic/structured data generation from diskann-providers by @JordanMaples in #963
* added tests and some baselines for range_search by @JordanMaples in #961

## New Contributors
* @JordanMaples made their first contribution in #923
* @wuw92 made their first contribution in #950
* @doliawu made their first contribution in #951
* @YuanyuanTian-hh made their first contribution in #912

**Full Changelog**: v0.50.0...v0.50.1
Development

Successfully merging this pull request may close these issues.

move/Remove Synthetic/structured data from diskann-providers

5 participants