[BUG] Disk-index benchmark with PQ fails due to misread dimension field #966

@deniz-dilaverler

Description

Expected Behavior

The benchmark should build a disk index and then run the search benchmark on my sift1M dataset.

Actual Behavior

The run fails with the following output:

######################
# Running Job 1 of 1 #
######################

Disk Index Build
         Data Type: float32
         Data File: /home/deniz/dev/thesis/datasets/sift/sift_base.fvecs
          Distance: squared_l2
               Dim: 128
        Max Degree: 70
           L Build: 125
     Build Threads: 32
Build RAM Limit GB: 48
         PQ Chunks: 33
      Quantization: pq, chunks 32
         Save Path: data.index

Error: Dimension must be greater than zero

This doesn't make sense, since the dimension shown in the job summary is visibly 128.

Example Code

Create a run.json file as follows:

{
  "search_directories": [
    "<path to the directory with sift1M dataset>"
  ],

  "jobs": [
    {
      "type": "disk-index",
      "content": {
        "source": {
          "disk-index-source": "Build",

          "dim": 128,
          "data_type": "float32",
          "data": "sift_base.fvecs",
          "distance": "squared_l2",
          "max_degree": 70,
          "alpha": 2,
          "l_build": 125,
          "num_threads": 32,
          "build_ram_limit_gb": 48,
          "quantization_type": "PQ_32",
          "num_pq_chunks": 32,
          "save_path": "data.index"
        },
        "search_phase": {
          "queries": "sift_query.fvecs",
          "groundtruth": "sift_groundtruth.ivecs",
          "search_list": [10, 20, 40],
          "beam_width": 4,
          "recall_at": 10,
          "num_threads": 1,
          "is_flat_search": false,
          "distance": "squared_l2",
          "vector_filters_file": null
        }
      }
    }
  ]
}

Then run:

cargo run --release --package diskann-benchmark --features "disk-index product-quantization" -- run --input-file ./run.json --output-file output.json

Dataset Description

Please tell us about the shape and datatype of your data, (e.g. 128 dimensions, 12.3 billion points, floats)

  • Dimensions: 128
  • Number of Points: 1M
  • Data type: float32
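
As a sanity check on the shape above, the expected on-disk size can be computed from the file layout. The sketch below assumes the standard .fvecs layout (each vector stored as a 4-byte little-endian dimension followed by `dim` float32 values) and, for contrast, a .bin-style layout with a single 8-byte header (point count, then dimension) followed by the raw floats. Whether the benchmark expects the latter is an assumption, not something confirmed in this report. The function names are hypothetical.

```rust
/// Size in bytes of an .fvecs file: every vector carries its own
/// 4-byte dimension prefix followed by `dim` 4-byte floats.
fn fvecs_file_size(num_points: u64, dim: u64) -> u64 {
    num_points * (4 + 4 * dim)
}

/// Size in bytes of a .bin-style file (assumed layout): one 8-byte header
/// (point count + dimension) followed by `num_points * dim` 4-byte floats.
fn bin_file_size(num_points: u64, dim: u64) -> u64 {
    8 + num_points * 4 * dim
}
```

For 1M points of 128 dimensions this gives 516,000,000 bytes for .fvecs and 512,000,008 bytes for the assumed .bin-style layout, so comparing against the actual file size shows which layout a given file uses.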

Error

######################
# Running Job 1 of 1 #
######################

Disk Index Build
         Data Type: float32
         Data File: $$REDACTED$$/datasets/sift/sift_base.fvecs
          Distance: squared_l2
               Dim: 128
        Max Degree: 70
           L Build: 125
     Build Threads: 32
Build RAM Limit GB: 48
         PQ Chunks: 33
      Quantization: pq, chunks 32
         Save Path: data.index

Error: Dimension must be greater than zero

Your Environment

  • Fedora 42 Work Station
  • commit: 779977c

Additional Details

I believe the program is trying to read the dimension value from the metadata of my vector data file at diskann-providers/src/utils/file_util.rs:32, but for some reason it reads the dimension as 0. The base vector file I used can be downloaded from: ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
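
One way to test this hypothesis is to look at the first 8 bytes of the data file under both candidate layouts. The sketch below is a standalone diagnostic, not code from the repository. It assumes the loader may be expecting a .bin-style header (two little-endian 32-bit integers: point count, then dimension), whereas an .fvecs file starts with the dimension of the first vector followed directly by that vector's first float. The names `HeaderGuess`, `guess_header`, and `read_header` are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read};

/// The first 8 bytes of a vector file, interpreted under two layouts.
#[derive(Debug, PartialEq)]
struct HeaderGuess {
    /// .fvecs reading: bytes 0..4 are the dimension of the first vector.
    fvecs_dim: i32,
    /// .bin-style reading (assumed layout): bytes 0..4 are the point count.
    bin_npts: u32,
    /// .bin-style reading (assumed layout): bytes 4..8 are the dimension.
    bin_dim: u32,
}

/// Decode the two little-endian 32-bit words at the start of the file.
fn guess_header(first8: [u8; 8]) -> HeaderGuess {
    let word0 = [first8[0], first8[1], first8[2], first8[3]];
    let word1 = [first8[4], first8[5], first8[6], first8[7]];
    HeaderGuess {
        fvecs_dim: i32::from_le_bytes(word0),
        bin_npts: u32::from_le_bytes(word0),
        bin_dim: u32::from_le_bytes(word1),
    }
}

/// Read the first 8 bytes of `path` and decode them with `guess_header`.
fn read_header(path: &str) -> io::Result<HeaderGuess> {
    let mut buf = [0u8; 8];
    File::open(path)?.read_exact(&mut buf)?;
    Ok(guess_header(buf))
}
```

Calling `read_header` on sift_base.fvecs and printing the result with `{:?}` shows both readings side by side. If the file is a valid .fvecs file, `fvecs_dim` should be 128. If the first component of the first vector happens to be 0.0, the .bin-style reading would yield a point count of 128 and a dimension of 0, which would be consistent with the "Dimension must be greater than zero" error. That would point to a file-format mismatch rather than a problem with the "dim" field in run.json.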

Labels: bug
Status: Done