feat: multi-modal embeddings + cross-modal retrieval (#117) #119
Merged
dcfocus merged 2 commits on Jun 28, 2026
Reads previously hard-loaded `binary_payload` (and `text_payload` /
`embedding`) for every row on every `list` / `search` / `get`
(`batch_to_records` required the columns; the scanners didn't project). For
multi-modal records that means a metadata `list` or a vector `search` pulled
every image's bytes into memory.
This adds column projection on the read path plus an on-demand blob fetch:
- `ReadProjection { text, binary, embedding }` (default = load all, backward
compatible) with `metadata_only()` / `without_binary()` helpers.
- `batch_to_records` now tolerates those columns being projected out
(omitted fields decode as `None`).
- `ContextStore::list_filtered_projected` and `search_filtered_projected`
read only the requested columns; the existing `*_with_options` methods
delegate with the default (full) projection, so behavior is unchanged.
Search always reads the embedding internally to score, then drops it from
results if `projection.embedding` is false.
- `ContextStore::get_blob(id)` fetches a single record's `binary_payload` on
demand via a projected, id-filtered scan.
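The projection semantics listed above can be modelled in a few lines. This is a Python sketch for illustration, not the Rust implementation: the field names and the `metadata_only()` / `without_binary()` helpers come from the description, while `decode_row`, the dict-based row, and the assumption that `metadata_only()` turns off all three columns are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ReadProjection:
    """Python model of the projection flags; the default loads every column."""

    text: bool = True
    binary: bool = True
    embedding: bool = True

    @classmethod
    def metadata_only(cls) -> "ReadProjection":
        # Assumption: "metadata only" drops all three heavy columns.
        return cls(text=False, binary=False, embedding=False)

    @classmethod
    def without_binary(cls) -> "ReadProjection":
        return cls(binary=False)


def decode_row(row: dict[str, Any], projection: ReadProjection) -> dict[str, Optional[Any]]:
    """Hypothetical decoder: projected-out columns decode as None."""
    wanted = {
        "text_payload": projection.text,
        "binary_payload": projection.binary,
        "embedding": projection.embedding,
    }
    # Metadata columns are always kept.
    out = {key: value for key, value in row.items() if key not in wanted}
    for column, keep in wanted.items():
        out[column] = row.get(column) if keep else None
    return out


row = {
    "id": "r1",
    "content_type": "image/png",
    "text_payload": None,
    "binary_payload": b"\x89PNG",
    "embedding": [0.1, 0.2],
}
slim = decode_row(row, ReadProjection.without_binary())
```

With the default projection the decoded row equals the input, which is the backward-compatible behavior the description calls out.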
Python:
- `Context.list(..., include_binary=True, include_embedding=True)` and
`Context.search(..., include_binary=..., include_embedding=...)` — set
`include_binary=False` to query metadata/search without materializing media
bytes (omitted payloads come back as `None`).
- `Context.get_blob(id) -> bytes | None` for on-demand fetch.
- The new pyo3 args default to `true`, so existing callers are unaffected.
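A sketch of how the Python flags are meant to be used. To keep the snippet runnable without the package installed, it uses `StubContext`, a hypothetical in-memory stand-in; only the `include_binary` / `include_embedding` keywords and `get_blob(id)` are taken from the description, and the `add` signature here is invented for the example.

```python
from typing import Optional


class StubContext:
    """Hypothetical stand-in for `Context`; mirrors only the new read flags."""

    def __init__(self):
        self._rows = {}

    def add(self, record_id, payload, content_type, embedding):
        # Invented signature, just enough to populate the stub.
        self._rows[record_id] = {
            "id": record_id,
            "content_type": content_type,
            "binary_payload": payload,
            "embedding": embedding,
        }

    def get_blob(self, record_id) -> Optional[bytes]:
        # On-demand fetch of a single record's bytes.
        row = self._rows.get(record_id)
        return row["binary_payload"] if row else None

    def list(self, include_binary: bool = True, include_embedding: bool = True):
        records = []
        for row in self._rows.values():
            record = dict(row)
            if not include_binary:
                record["binary_payload"] = None  # omitted payloads come back as None
            if not include_embedding:
                record["embedding"] = None
            records.append(record)
        return records


ctx = StubContext()
ctx.add("img-1", b"\x89PNG...", "image/png", [0.1, 0.9])

rows = ctx.list(include_binary=False)  # metadata and embedding, no media bytes
blob = ctx.get_blob("img-1")           # bytes fetched only when needed
```

The intended pattern is to list or search with `include_binary=False`, pick the records of interest, and call `get_blob` only for those.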
Tests: core (projection drops binary/embedding but keeps metadata; `get_blob`
fetches on demand; search projection keeps ranking) and Python
(`test_lazy_payload.py`). README gains a multi-modal read example.
Independent of lance-format#115 (operates on the existing inline payload columns).
Follow-up (noted, deferred): REST `include_binary` params + a blob GET route,
and Lance `take_blob`/`BlobFile` zero-copy lazy reads for blob-encoded
columns.
Closes lance-format#116
Embedding providers were text-only (`embed_texts`), so non-text payloads
(images/audio) got no embedding unless the caller supplied a vector, and
there was no way to retrieve an image with a text query.
This adds a multi-modal embedding path (no models bundled — the encoder is
user-supplied):
- `MultiModalEmbeddingProvider` protocol: extends `EmbeddingProvider` with
`embed_media(items: list[tuple[bytes, str]]) -> list[list[float]]`, which
embeds `(payload_bytes, content_type)` into the **same** vector space as
`embed_texts`. Plus a `supports_media(provider)` helper.
- Auto-embed for media: `add` / `upsert` / `add_many` / `upsert_many` now
embed `bytes` payloads via `embed_media` when the provider supports it
(text payloads still use `embed_texts`).
- Cross-modal retrieval falls out of the shared space for free: a text query
is embedded with `embed_texts` and matched against image embeddings via the
existing vector search — `ctx.search("a photo of a cat")` returns image
records. `search` also accepts a `bytes` query (embedded via `embed_media`).
Because the two encoders share one space, no core/Rust change is needed —
this is entirely in the Python embedding layer.
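One way the protocol and helper could be expressed with `typing.Protocol`. This is a sketch under assumptions: the `embed_media` signature is quoted from the description, but the `embed_texts` signature and the use of `runtime_checkable` are guesses about the actual code, and the two provider classes are hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    # Assumed signature; the description only names the method.
    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...


@runtime_checkable
class MultiModalEmbeddingProvider(EmbeddingProvider, Protocol):
    # Each item is (payload_bytes, content_type); vectors must live in the
    # same space as embed_texts for cross-modal search to work.
    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]: ...


def supports_media(provider: object) -> bool:
    """True when the provider exposes embed_media in addition to embed_texts."""
    return isinstance(provider, MultiModalEmbeddingProvider)


class TextOnlyProvider:
    """Hypothetical text-only provider."""

    def embed_texts(self, texts):
        return [[float(len(t))] for t in texts]


class ToyMultiModalProvider(TextOnlyProvider):
    """Hypothetical provider that also handles media."""

    def embed_media(self, items):
        return [[float(len(payload))] for payload, _content_type in items]
```

A runtime-checkable protocol only verifies that the methods exist, not that the two encoders actually share a vector space; that contract remains the provider author's responsibility.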
Tests (`test_multimodal_embeddings.py`, deterministic CLIP-style stub):
protocol/`supports_media` check, image auto-embed via `embed_media`,
cross-modal text→image retrieval, batch auto-embed, and that a text-only
provider does not embed images.
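The behaviors under test can be illustrated with a toy version of a deterministic CLIP-style stub. Everything below is hypothetical (the stub encoder, `embed_payload`, and the brute-force cosine ranking standing in for the real vector search); it only shows why a shared space makes text-to-image retrieval work and why a text-only provider leaves images without an embedding.

```python
import math


class StubClipProvider:
    """Deterministic toy encoder: text and media map into one 2-D space."""

    _CONCEPTS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

    def _encode(self, label):
        for concept, vector in self._CONCEPTS.items():
            if concept in label:
                return list(vector)
        return [0.5, 0.5]

    def embed_texts(self, texts):
        return [self._encode(t.lower()) for t in texts]

    def embed_media(self, items):
        # The stub "recognizes" an image by a label hidden in its bytes.
        return [self._encode(p.decode("utf-8", "ignore").lower()) for p, _ctype in items]


class TextOnlyProvider:
    def embed_texts(self, texts):
        return [[1.0, 0.0] for _ in texts]


def embed_payload(provider, payload, content_type="application/octet-stream"):
    """Dispatch on payload type, as the auto-embed path is described to do."""
    if isinstance(payload, str):
        return provider.embed_texts([payload])[0]
    if isinstance(payload, bytes) and hasattr(provider, "embed_media"):
        return provider.embed_media([(payload, content_type)])[0]
    return None  # text-only provider: media gets no embedding


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


provider = StubClipProvider()
index = {
    "img-cat": embed_payload(provider, b"fake-png-of-cat", "image/png"),
    "img-dog": embed_payload(provider, b"fake-png-of-dog", "image/png"),
}
query = embed_payload(provider, "a photo of a cat")
best = max(index, key=lambda record_id: cosine(query, index[record_id]))
```

Here the text query and the cat image land on the same vector, so the cat record ranks first.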
Stacked on lance-format#116 (lazy/projected reads); independent of lance-format#115.
Note: the 8 failures in the pre-existing `test_embeddings.py` are unrelated
mock drift — its hand-written `_DummyInner.add()` accepts 10–18 positional
args while the real binding now passes 22 (TypeError at `api.py` `inner.add`,
predating this change). The new tests here use a real `Context`, not that
mock.
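The failure mode is ordinary Python arity checking. A scaled-down sketch, with hypothetical function names and 5 positional arguments standing in for the 22 the binding passes:

```python
def stale_mock_add(a, b, c=None):
    """Hypothetical hand-written mock: accepts 2 to 3 positional args."""
    return "ok"


def tolerant_mock_add(*args, **kwargs):
    """A variadic mock does not break when the call grows new arguments."""
    return len(args)


# The caller now passes more positionals than the stale mock declares.
call_args = tuple(range(5))

try:
    stale_mock_add(*call_args)
    error = None
except TypeError as exc:
    error = str(exc)
```

The stale mock raises `TypeError` before any test logic runs, which matches the reported failures occurring at the `inner.add` call rather than inside the embedding code.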
Closes lance-format#117
Summary
Closes #117. Adds multi-modal embeddings + cross-modal retrieval (text query → image results). lance-context bundles no models — the encoder is user-supplied.
Motivation
Embedding providers were text-only (`embed_texts`), so non-text payloads got no embedding unless the caller computed one, and an image could never be retrieved by a text query (`embeddings.py`; `api.py` auto-embed gated on `isinstance(str)`).
What changed (Python embedding layer only; no core/Rust change)
- `MultiModalEmbeddingProvider` protocol: extends `EmbeddingProvider` with `embed_media(items: list[tuple[bytes, str]]) -> list[list[float]]`, embedding `(payload_bytes, content_type)` into the same space as `embed_texts`. Plus a `supports_media(provider)` helper. Both exported from the package.
- `add` / `upsert` / `add_many` / `upsert_many` embed `bytes` payloads via `embed_media` when the provider supports it (text still uses `embed_texts`).
- `ctx.search("a photo of a cat")` embeds the text with `embed_texts` and matches image embeddings through the existing vector search. `search` also accepts a `bytes` query (embedded via `embed_media`).
Tests (`test_multimodal_embeddings.py`, deterministic CLIP-style stub)
- `supports_media` runtime check
- image auto-embed via `embed_media`
Checks
- `ruff format --check`, `ruff check`, `pyright`: clean
- `test_lazy_payload.py` (from "Multi-modal: lazy + projected payload reads" #116) pass
Acceptance criteria (#117)
- … `embed_media`), no bundled models
Note on pre-existing failures
The 8 failures in `test_embeddings.py` are pre-existing and unrelated to this change: its hand-written `_DummyInner.add()` mock accepts 10–18 positional args while the real binding now passes 22 (`TypeError` at `api.py` `inner.add`, predating this PR; not caught by the path-filtered CI job). The new tests here use a real `Context`, not that mock. Fixing the stale mock is a separate cleanup.
Closes #117