VectorSimilarityFunction.HAMMING and KnnBitVectorField for bit vectors

#13288 added codec-level support for bit vectors via `HnswBitVectorsFormat` and `FlatBitVectorsScorer` in the codecs module. However, there is still no document field for bit vectors and no similarity function that honestly represents Hamming distance.

To index binary embeddings today, we have to:
1. Manually pack bits into `byte[]`
2. Use `KnnByteVectorField` with an arbitrary `VectorSimilarityFunction` (e.g. `DOT_PRODUCT`) that the codec silently ignores
3. Configure `HnswBitVectorsFormat` which internally uses `FlatBitVectorsScorer` regardless of what similarity the field declares

This works mechanically, but the API is misleading: the declared similarity doesn't match the actual scoring.
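For reference, the manual packing in step 1 might look like the sketch below. `BitPacking` and `pack` are hypothetical names, and the most-significant-bit-first order is an assumption; any order works for Hamming distance as long as indexed and query vectors use the same one.

```java
// Hypothetical helper, not part of Lucene.
final class BitPacking {

  /**
   * Packs one boolean per bit into a byte array, most significant bit first
   * (assumed convention). The bit count must be a multiple of 8.
   */
  static byte[] pack(boolean[] bits) {
    if (bits.length % Byte.SIZE != 0) {
      throw new IllegalArgumentException(
          "bit count must be a multiple of 8, got " + bits.length);
    }
    byte[] packed = new byte[bits.length / Byte.SIZE];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        // Set bit (7 - i % 8) of byte i / 8.
        packed[i / Byte.SIZE] |= (byte) (0x80 >>> (i % Byte.SIZE));
      }
    }
    return packed;
  }
}
```

The resulting `byte[]` is what then gets handed to `KnnByteVectorField` in step 2.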

Can we add:

1. `VectorSimilarityFunction.HAMMING` - Hamming distance for byte-encoded bit vectors. Score = `(totalBits - hammingDistance) / totalBits`, which always falls in [0, 1]. Calling it with float vectors throws `UnsupportedOperationException`.

2. `KnnBitVectorField` - a document field for packed bit vectors that uses HAMMING similarity.

3. Validation in `FieldInfo` rejecting HAMMING + FLOAT32 (nonsensical combination).
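To make items 1 and 3 concrete, here is a minimal standalone sketch of the proposed score and the encoding check. `HammingSketch`, its nested enums, and the method names are hypothetical stand-ins, not Lucene classes, and the exception type used for the validation is an assumption. A real implementation would presumably reuse Lucene's existing bit-count utilities rather than this loop.

```java
// Hypothetical stand-in for the proposed behavior; not Lucene code.
final class HammingSketch {

  enum Encoding { BYTE, FLOAT32 }

  // Mirrors the existing four functions, with HAMMING appended at ordinal 4.
  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** Number of differing bits between two packed bit vectors of equal length. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector lengths differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension does not add spurious set bits.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Score = (totalBits - hammingDistance) / totalBits, in [0, 1]. */
  static float score(byte[] a, byte[] b) {
    int distance = hammingDistance(a, b);
    int totalBits = a.length * Byte.SIZE;
    if (totalBits == 0) {
      throw new IllegalArgumentException("vectors must not be empty");
    }
    return (totalBits - distance) / (float) totalBits;
  }

  /** Rejects the HAMMING + FLOAT32 combination (item 3). */
  static void checkCompatible(Encoding encoding, Similarity similarity) {
    if (similarity == Similarity.HAMMING && encoding == Encoding.FLOAT32) {
      throw new IllegalArgumentException(
          "HAMMING similarity requires BYTE encoding, got FLOAT32");
    }
  }
}
```

Identical vectors score 1, vectors that differ in every bit score 0, and the score decreases linearly with the number of differing bits in between.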

Open questions:
1. Should HAMMING live in `VectorSimilarityFunction` (core enum) or be handled differently?
Adding to the core enum means every `FlatVectorsScorer`, quantized scorer, and memory-segment scorer needs to handle it (even if just to reject it). It also affects any external code with exhaustive switches over the enum. The alternative would be a separate mechanism, but that seems like it would create a parallel API for what is fundamentally the same concept.

2. Dimension semantics: bytes or bits?
`KnnByteVectorField` reports `vectorDimension()` as the number of bytes. For bit vectors, each byte packs 8 bits, so a 128-bit embedding would report dimension=16. This is confusing but consistent with how the codec layer works. Should `KnnBitVectorField` report the bit count instead, or keep the byte count to match the storage layer?
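The two candidate meanings of "dimension" differ only by a factor of 8. The sketch below spells out the conversion; `BitVectorDimensions` and both method names are hypothetical.

```java
// Hypothetical helper illustrating the two dimension conventions.
final class BitVectorDimensions {

  /** Storage-layer view: number of bytes needed for a packed bit vector. */
  static int toByteDimension(int bitCount) {
    if (bitCount <= 0 || bitCount % Byte.SIZE != 0) {
      throw new IllegalArgumentException(
          "bit count must be a positive multiple of 8, got " + bitCount);
    }
    return bitCount / Byte.SIZE;
  }

  /** User-facing view: number of bits held by a packed byte vector. */
  static int toBitDimension(int byteCount) {
    if (byteCount <= 0) {
      throw new IllegalArgumentException("byte count must be positive, got " + byteCount);
    }
    // multiplyExact guards against int overflow for very large inputs.
    return Math.multiplyExact(byteCount, Byte.SIZE);
  }
}
```

With this, a 128-bit embedding maps to a byte dimension of 16, which is the value the current storage layer would report.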

3. Should `KnnBitVectorField` extend `KnnByteVectorField`?
Extending it gives free compatibility with `IndexingChain` and `KnnByteVectorQuery`. But the superclass has byte-vector-specific assumptions (e.g. cosine zero-vector checks, Javadoc saying "each byte represents a vector dimension"). A standalone class extending `Field` would be cleaner semantically but requires more plumbing.

4. Backward compatibility: the similarity ordinal is persisted in two places
The similarity function is written as an ordinal in both:
- The field infos file (.fnm) via `Lucene94FieldInfosFormat`, written as a byte via `distFuncToOrd()`
- The HNSW metadata file (.vem) via `Lucene99HnswVectorsWriter`, written as an int via `distFuncToOrd()`

An older reader seeing ordinal 4 (HAMMING) in either file will throw `IllegalArgumentException("invalid distance function: 4")`.
So should we:
(a) bump `Lucene99HnswVectorsFormat.VERSION_CURRENT` so old readers get a clean "unsupported version" error,
(b) bump `Lucene94FieldInfosFormat`'s format version (the way `FORMAT_PARENT_FIELD` was added),
(c) do both of the above, or
(d) let old readers hit the "invalid distance function" error?
What's the preferred approach?
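The failure mode in option (d) can be illustrated with a standalone sketch of ordinal decoding. `OrdinalDecoding`, `LegacySimilarity`, and `fromOrdinal` are hypothetical names; only the error message is taken from the behavior described above, and the real reader code may be structured differently.

```java
// Hypothetical model of how an older reader maps a persisted ordinal
// back to a similarity function; not the actual Lucene reader code.
final class OrdinalDecoding {

  // The four functions an older reader knows about (ordinals 0..3).
  enum LegacySimilarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

  static LegacySimilarity fromOrdinal(int ord) {
    LegacySimilarity[] known = LegacySimilarity.values();
    if (ord < 0 || ord >= known.length) {
      // Ordinal 4 (HAMMING) written by a newer version ends up here.
      throw new IllegalArgumentException("invalid distance function: " + ord);
    }
    return known[ord];
  }
}
```

Options (a) through (c) would instead make the reader fail earlier, at the format version check, before any ordinal is decoded.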

5. Test framework impact
`BaseKnnVectorsFormatTestCase.randomSimilarity()` currently returns any similarity. With HAMMING in the enum, float vector tests would randomly hit the FLOAT32 + HAMMING rejection. The proposal is to exclude HAMMING from `randomSimilarity()` and test it separately in bit-vector-specific tests. Is this acceptable, or should the base test framework be made encoding-aware?
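The two alternatives could look roughly like this. Everything here is a hypothetical standalone model (`RandomSimilarityPicker`, the nested enums, both method signatures); the real test base class uses Lucene's own randomization utilities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical model of the two options; not the actual test framework code.
final class RandomSimilarityPicker {

  enum Encoding { BYTE, FLOAT32 }

  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** Option 1: never return HAMMING; bit-vector tests pick it explicitly. */
  static Similarity randomSimilarity(Random random) {
    return pick(random, false);
  }

  /** Option 2: encoding-aware; HAMMING is a candidate only for BYTE vectors. */
  static Similarity randomSimilarity(Random random, Encoding encoding) {
    return pick(random, encoding == Encoding.BYTE);
  }

  private static Similarity pick(Random random, boolean allowHamming) {
    List<Similarity> candidates = new ArrayList<>();
    for (Similarity s : Similarity.values()) {
      if (allowHamming || s != Similarity.HAMMING) {
        candidates.add(s);
      }
    }
    return candidates.get(random.nextInt(candidates.size()));
  }
}
```

Option 1 keeps existing tests untouched, while option 2 would require callers to pass the encoding they are about to index.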

VectorSimilarityFunction.HAMMING and KnnBitVectorField for bit vectors #16211