#13288 added codec-level support for bit vectors via `HnswBitVectorsFormat` and `FlatBitVectorsScorer` in the codecs module. However, there is still no document field for bit vectors and no similarity function that honestly represents Hamming distance.
Today, if we want to index binary embeddings, we have to:
- Manually pack bits into `byte[]`
- Use `KnnByteVectorField` with an arbitrary `VectorSimilarityFunction` (e.g. `DOT_PRODUCT`) that the codec silently ignores
- Configure `HnswBitVectorsFormat`, which internally uses `FlatBitVectorsScorer` regardless of what similarity the field declares

This works mechanically, but the API is misleading: the declared similarity doesn't match the actual scoring.
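
To make the manual packing step concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class name `BitPackingSketch` and the most-significant-bit-first layout are assumptions for illustration; any layout works for Hamming distance as long as indexing and querying use the same one.

```java
/** Hypothetical helper: packs a bit array into a byte[], 8 bits per byte. */
final class BitPackingSketch {

  /** Packs bits MSB-first (an assumed convention, not something Lucene mandates). */
  static byte[] pack(boolean[] bits) {
    byte[] packed = new byte[(bits.length + 7) / 8]; // round up to whole bytes
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        // i >> 3 selects the byte, i & 7 selects the bit within that byte
        packed[i >> 3] |= (byte) (0x80 >>> (i & 7));
      }
    }
    return packed;
  }

  public static void main(String[] args) {
    boolean[] bits = new boolean[128]; // a 128-bit embedding, initially all zeros
    bits[0] = true;
    bits[9] = true;
    byte[] packed = pack(bits);
    System.out.println(packed.length); // 16 bytes for 128 bits
  }
}
```

The resulting `byte[]` is what currently gets handed to `KnnByteVectorField`, together with a similarity the bit-vector codec never consults.
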
Can we add:
- `VectorSimilarityFunction.HAMMING` - Hamming distance for byte-encoded bit vectors. Score = `(totalBits - hammingDistance) / totalBits`, producing values in [0, 1]. Float vectors throw `UnsupportedOperationException`.
- `KnnBitVectorField` - a document field for packed bit vectors that uses `HAMMING` similarity.
- Validation in `FieldInfo` rejecting `HAMMING` + `FLOAT32` (a nonsensical combination).
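
The proposed score can be sketched in plain Java as follows. This is an illustration of the formula only, not the proposed Lucene implementation; `HammingScoreSketch` and its method names are hypothetical.

```java
/** Hypothetical sketch of the proposed HAMMING score for packed bit vectors. */
final class HammingScoreSketch {

  /** Counts the bit positions in which the two packed vectors differ. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length || a.length == 0) {
      throw new IllegalArgumentException("vectors must be non-empty and of equal length");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // XOR leaves a 1 wherever the bits differ; mask to 8 bits to drop sign extension
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Score = (totalBits - hammingDistance) / totalBits, so identical vectors score 1. */
  static float score(byte[] a, byte[] b) {
    int totalBits = a.length * Byte.SIZE;
    return (totalBits - hammingDistance(a, b)) / (float) totalBits;
  }

  public static void main(String[] args) {
    byte[] a = {(byte) 0x0F};
    byte[] b = {(byte) 0x00};
    System.out.println(score(a, b)); // 4 of 8 bits differ -> 0.5
  }
}
```
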
Questions:
- Should `HAMMING` live in `VectorSimilarityFunction` (core enum) or be handled differently?
  Adding it to the core enum means every `FlatVectorsScorer`, quantized scorer, and memory-segment scorer needs to handle it (even if just to reject it). It also affects any external code with exhaustive switches over the enum. The alternative would be a separate mechanism, but that seems like it would create a parallel API for what is fundamentally the same concept.
- Dimension semantics: bytes or bits?
  `KnnByteVectorField` reports `vectorDimension()` as the number of bytes. For bit vectors, each byte packs 8 bits, so a 128-bit embedding would report dimension=16. This is confusing but consistent with how the codec layer works. Should `KnnBitVectorField` report the bit count instead, or keep the byte count to match the storage layer?
- Should `KnnBitVectorField` extend `KnnByteVectorField`?
  Extending it gives free compatibility with `IndexingChain` and `KnnByteVectorQuery`. But the superclass has byte-vector-specific assumptions (e.g. cosine zero-vector checks, Javadoc saying "each byte represents a vector dimension"). A standalone class extending `Field` would be cleaner semantically but requires more plumbing.
- Backward compatibility: the similarity ordinal is persisted in two places.
  The similarity function is written as an ordinal in both:
  - The field infos file (`.fnm`) via `Lucene94FieldInfosFormat`, written as a byte via `distFuncToOrd()`
  - The HNSW metadata file (`.vem`) via `Lucene99HnswVectorsWriter`, written as an int via `distFuncToOrd()`

  An older reader seeing ordinal 4 (`HAMMING`) in either file will throw `IllegalArgumentException("invalid distance function: 4")`.
  So should we:
  - (a) bump `Lucene99HnswVectorsFormat.VERSION_CURRENT` so old readers get a clean "unsupported version" error,
  - (b) bump `Lucene94FieldInfosFormat`'s format version (as was done when `FORMAT_PARENT_FIELD` was added),
  - (c) do both of the above, or
  - (d) let old readers hit the "invalid distance function" error?

  What's the preferred approach?
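
For context on option (d), the failure an older reader would hit can be modelled with a simplified stand-in. The enum and `fromOrd` below are hypothetical and only mimic the ordinal lookup described above; they are not Lucene's actual code.

```java
import java.util.List;

/** Hypothetical stand-in for an older reader that only knows ordinals 0..3. */
final class SimilarityOrdinalSketch {

  /** The similarity functions an older release knows about (no HAMMING). */
  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

  static final List<Similarity> KNOWN = List.of(Similarity.values());

  /** Decodes a persisted ordinal, rejecting anything outside the known range. */
  static Similarity fromOrd(int ord) {
    if (ord < 0 || ord >= KNOWN.size()) {
      throw new IllegalArgumentException("invalid distance function: " + ord);
    }
    return KNOWN.get(ord);
  }

  public static void main(String[] args) {
    System.out.println(fromOrd(1)); // DOT_PRODUCT
    try {
      fromOrd(4); // the ordinal a newer writer would emit for HAMMING
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // invalid distance function: 4
    }
  }
}
```

The error is accurate but says nothing about the index having been written by a newer version, which is the argument for a version bump in (a), (b), or (c).
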
- Test framework impact
  `BaseKnnVectorsFormatTestCase.randomSimilarity()` currently returns any similarity. With `HAMMING`, float vector tests would randomly hit the `FLOAT32` + `HAMMING` rejection. My plan is to exclude `HAMMING` from `randomSimilarity()` and test it separately in bit-vector-specific tests. Is this acceptable, or should the base test framework be made encoding-aware?
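
The exclusion could look roughly like the sketch below. It uses a hypothetical stand-in enum and takes a `java.util.Random` as a parameter to stay self-contained; the real test base class would draw from the test framework's own random source instead.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a randomSimilarity() that never returns HAMMING. */
final class RandomSimilaritySketch {

  /** Stand-in for the enum after HAMMING has been added. */
  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** Everything except HAMMING, i.e. the values that are valid for float vectors. */
  static final List<Similarity> FLOAT_COMPATIBLE =
      List.copyOf(EnumSet.complementOf(EnumSet.of(Similarity.HAMMING)));

  static Similarity randomSimilarity(Random random) {
    return FLOAT_COMPATIBLE.get(random.nextInt(FLOAT_COMPATIBLE.size()));
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    for (int i = 0; i < 5; i++) {
      System.out.println(randomSimilarity(random)); // never HAMMING
    }
  }
}
```

An encoding-aware variant would instead take the vector encoding as an argument and choose from the similarities valid for it, which is the alternative raised in the question above.
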