VectorSimilarityFunction.HAMMING and KnnBitVectorField for bit vectors #16211

Description

@iprithv

#13288 added codec-level support for bit vectors via HnswBitVectorsFormat and FlatBitVectorsScorer in the codecs module. However, there is still no document field for bit vectors and no similarity function that honestly represents Hamming distance.

To index binary embeddings today, we have to:

  1. Manually pack bits into byte[]
  2. Use KnnByteVectorField with an arbitrary VectorSimilarityFunction (e.g. DOT_PRODUCT) that the codec silently ignores
  3. Configure HnswBitVectorsFormat which internally uses FlatBitVectorsScorer regardless of what similarity the field declares

This works mechanically, but the API is misleading: the declared similarity doesn't match the actual scoring.
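As a rough illustration of step 1, the packing can be done in plain Java. `BitPacking` and `packBits` are hypothetical names, not part of Lucene, and the MSB-first bit order is an assumption; any order works for Hamming distance as long as index-time and query-time vectors are packed the same way.

```java
/** Hypothetical helper (not part of Lucene): packs booleans into bytes, 8 bits per byte. */
final class BitPacking {
  private BitPacking() {}

  /** Packs bits MSB-first (an assumption); the bit count must be a multiple of 8. */
  static byte[] packBits(boolean[] bits) {
    if (bits.length % 8 != 0) {
      throw new IllegalArgumentException(
          "bit count must be a multiple of 8, got " + bits.length);
    }
    byte[] packed = new byte[bits.length / 8];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        // i >> 3 selects the byte, i & 7 selects the bit within that byte
        packed[i >> 3] |= (byte) (0x80 >>> (i & 7));
      }
    }
    return packed;
  }
}
```

The resulting byte[] is what would then be handed to KnnByteVectorField in step 2, so a 128-bit embedding becomes a 16-byte vector.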

Can we add:

  1. VectorSimilarityFunction.HAMMING - Hamming distance for byte-encoded bit vectors. Score = (totalBits - hammingDistance) / totalBits, producing [0, 1]. Float vectors throw UnsupportedOperationException.

  2. KnnBitVectorField - a document field for packed bit vectors that uses HAMMING similarity.

  3. Validation in FieldInfo rejecting HAMMING + FLOAT32 (nonsensical combination).
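The scoring formula in item 1 could look roughly like the following. This is a minimal sketch in plain Java, assuming packed byte[] inputs of equal length; `HammingScore` and its methods are hypothetical names and not the proposed Lucene implementation.

```java
/** Hypothetical sketch of the proposed HAMMING score; not Lucene code. */
final class HammingScore {
  private HammingScore() {}

  /** Counts the differing bits between two packed bit vectors of equal length. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector lengths differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of the byte does not add spurious set bits
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Score = (totalBits - hammingDistance) / totalBits, which lies in [0, 1]. */
  static float score(byte[] a, byte[] b) {
    if (a.length == 0) {
      throw new IllegalArgumentException("vectors must not be empty");
    }
    int totalBits = a.length * Byte.SIZE;
    return (totalBits - hammingDistance(a, b)) / (float) totalBits;
  }
}
```

With this definition, identical vectors score 1 and bitwise complements score 0, so a higher score always means a closer match, as with the existing similarity functions.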

Open questions:

  1. Should HAMMING live in VectorSimilarityFunction (core enum) or be handled differently?
    Adding to the core enum means every FlatVectorsScorer, quantized scorer, and memory-segment scorer needs to handle it (even if just to reject it). It also affects any external code with exhaustive switches over the enum. The alternative would be a separate mechanism, but that seems like it would create a parallel API for what is fundamentally the same concept.

  2. Dimension semantics: bytes or bits?
    KnnByteVectorField reports vectorDimension() as the number of bytes. For bit vectors, each byte packs 8 bits, so a 128-bit embedding would report dimension=16. This is confusing but consistent with how the codec layer works. Should KnnBitVectorField report the bit count instead, or keep the byte count to match the storage layer?

  3. Should KnnBitVectorField extend KnnByteVectorField?
    Extending it gives free compatibility with IndexingChain and KnnByteVectorQuery. But the superclass has byte-vector-specific assumptions (e.g. cosine zero-vector checks, Javadoc saying "each byte represents a vector dimension"). A standalone class extending Field would be cleaner semantically but requires more plumbing.

  4. Backward compatibility: the similarity ordinal is persisted in two places
    The similarity function is written as an ordinal in both:
    The field infos file (.fnm) via Lucene94FieldInfosFormat - written as a byte via distFuncToOrd()
    The HNSW metadata file (.vem) via Lucene99HnswVectorsWriter - written as an int via distFuncToOrd()
    An older reader seeing ordinal 4 (HAMMING) in either file will throw IllegalArgumentException("invalid distance function: 4").
    So should we:
    (a) bump Lucene99HnswVectorsFormat.VERSION_CURRENT so old readers get a clean "unsupported version" error
    (b) bump Lucene94FieldInfosFormat's format version (like FORMAT_PARENT_FIELD was added)
    (c) do both of the above?
    (d) let old readers hit the "invalid distance function" error?
    What's the preferred approach?

  5. Test framework impact
    BaseKnnVectorsFormatTestCase.randomSimilarity() currently returns any similarity. With HAMMING, float vector tests would randomly hit the FLOAT32+HAMMING rejection. I'd propose excluding HAMMING from randomSimilarity() and testing it separately in bit-vector-specific tests. Is this acceptable, or should the base test framework be made encoding-aware?
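For question 5, the encoding-aware alternative might look something like this. It is a standalone sketch: the `Encoding` and `Similarity` enums are simplified stand-ins for Lucene's VectorEncoding and VectorSimilarityFunction, and the two-argument `randomSimilarity` is a hypothetical signature, not the existing test-framework method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of an encoding-aware similarity picker; not Lucene code. */
final class RandomSimilaritySketch {
  private RandomSimilaritySketch() {}

  /** Simplified stand-in for the vector encoding. */
  enum Encoding { BYTE, FLOAT32 }

  /** Simplified stand-in for the similarity enum with the proposed HAMMING constant. */
  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** Mirrors the proposed validation: HAMMING is only meaningful for byte-encoded vectors. */
  static boolean isCompatible(Similarity similarity, Encoding encoding) {
    return similarity != Similarity.HAMMING || encoding == Encoding.BYTE;
  }

  /** Picks uniformly among the similarities that are valid for the given encoding. */
  static Similarity randomSimilarity(Random random, Encoding encoding) {
    List<Similarity> candidates = new ArrayList<>();
    for (Similarity s : Similarity.values()) {
      if (isCompatible(s, encoding)) {
        candidates.add(s);
      }
    }
    return candidates.get(random.nextInt(candidates.size()));
  }
}
```

The simpler option from the question (dropping HAMMING from the random pool entirely) avoids changing the method signature, at the cost of HAMMING never being exercised by the shared base tests.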
