feat: multi-modal embeddings + cross-modal retrieval (#117) by dcfocus · Pull Request #119 · lance-format/lance-context

dcfocus · 2026-06-27T23:21:54Z

Summary

Closes #117. Adds multi-modal embeddings + cross-modal retrieval (text query → image results). lance-context bundles no models — the encoder is user-supplied.

Stacked on #118 (#116). Review/merge #116 first; until then this PR shows both commits (the multi-modal commit is last).

Motivation

Embedding providers were text-only (embed_texts), so non-text payloads got no embedding unless the caller computed one, and an image could never be retrieved by a text query (embeddings.py; api.py auto-embed gated on isinstance(str)).

What changed (Python embedding layer only — no core/Rust change)

MultiModalEmbeddingProvider protocol: extends EmbeddingProvider with embed_media(items: list[tuple[bytes, str]]) -> list[list[float]], which embeds (payload_bytes, content_type) pairs into the same vector space as embed_texts. Also adds a supports_media(provider) helper. Both are exported from the package.
Auto-embed media — add / upsert / add_many / upsert_many embed bytes payloads via embed_media when the provider supports it (text still uses embed_texts).
Cross-modal retrieval for free — because the two encoders share one space, ctx.search("a photo of a cat") embeds the text with embed_texts and matches image embeddings through the existing vector search. search also accepts a bytes query (embedded via embed_media).

clip_ctx = Context.create("multimodal.lance", embedding_dim=512,
                          embedding_provider=my_clip_provider)
clip_ctx.add("user", image_bytes, content_type="image/png", external_id="img-1")
results = clip_ctx.search("a photo of a cat")   # text -> image
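
For readers wondering what `my_clip_provider` has to look like, here is a minimal sketch of the protocol shape described above. The `embed_media` signature is taken from this description; the `embed_texts` signature, the locally defined protocol classes, and the `supports_media` body are assumptions made so the example is self-contained (the real definitions live in the package). `HashProvider` and `TextOnlyProvider` are hypothetical stand-ins: the hash-based vectors only demonstrate the interface and are not a semantically meaningful shared space like a real CLIP encoder would provide.

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    # Assumed shape of the existing text-only protocol.
    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...


@runtime_checkable
class MultiModalEmbeddingProvider(EmbeddingProvider, Protocol):
    # Signature as stated in the PR description.
    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]: ...


def supports_media(provider: object) -> bool:
    # Hypothetical stand-in for the exported helper: a structural check.
    return callable(getattr(provider, "embed_media", None))


class HashProvider:
    """Toy provider: hashes bytes into a fixed-size vector (dim <= 32)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def _vec(self, data: bytes) -> list[float]:
        digest = hashlib.sha256(data).digest()
        return [digest[i] / 255.0 for i in range(self.dim)]

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [self._vec(t.encode("utf-8")) for t in texts]

    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]:
        # content_type is ignored here; a real encoder would dispatch on it.
        return [self._vec(payload) for payload, _content_type in items]


class TextOnlyProvider:
    """Toy provider that implements only the text half."""

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]
```

A provider like `HashProvider` passes both the structural `isinstance` check and `supports_media`, while `TextOnlyProvider` passes neither media check.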

Tests (`test_multimodal_embeddings.py`, deterministic CLIP-style stub)

protocol / supports_media runtime check
image auto-embedded via embed_media
cross-modal text→image retrieval
batch auto-embed of images
text-only provider does not embed images
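
The cross-modal case in this list can be illustrated without lance-context at all. The sketch below is not the stub from `test_multimodal_embeddings.py` (which is not shown here); `StubClip`, `CONCEPTS`, `cosine`, and `search` are hypothetical names, and the brute-force ranking stands in for the existing vector search. It only shows why a shared space makes text→image retrieval work: both encoders map "cat" to the same axis.

```python
import math

# Hypothetical concept axes shared by the text and media encoders.
CONCEPTS = ("cat", "dog", "car")


class StubClip:
    """Deterministic CLIP-style stub: one axis per concept, L2-normalized."""

    def _encode(self, content: str) -> list[float]:
        lowered = content.lower()
        vec = [1.0 if concept in lowered else 0.0 for concept in CONCEPTS]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [self._encode(t) for t in texts]

    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]:
        # The fake "image" bytes carry their label so the stub stays deterministic.
        return [self._encode(p.decode("utf-8", errors="ignore")) for p, _ in items]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(provider: StubClip, query: str, index: dict[str, list[float]], k: int = 1) -> list[str]:
    # Embed the text query, then rank stored media embeddings by cosine similarity.
    qv = provider.embed_texts([query])[0]
    ranked = sorted(index.items(), key=lambda kv: cosine(qv, kv[1]), reverse=True)
    return [external_id for external_id, _ in ranked[:k]]


provider = StubClip()
images = {"img-cat": b"fake-png:cat", "img-dog": b"fake-png:dog"}
index = {
    eid: provider.embed_media([(payload, "image/png")])[0]
    for eid, payload in images.items()
}
top = search(provider, "a photo of a cat", index)
```

With this stub, `top` is `["img-cat"]`: the text query and the cat image land on the same axis, while the dog image is orthogonal to it.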

Checks

ruff format --check, ruff check, pyright — clean
new tests + test_lazy_payload.py (from Multi-modal: lazy + projected payload reads — stop materializing binary/text/embedding on every read #116) pass

Acceptance criteria (#117)

Pluggable multi-modal embedder (embed_media), no bundled models
Auto-embed image/bytes payloads
Cross-modal text→image retrieval via the shared space
Existing text retrieval unchanged; tests with a stub provider

Note on pre-existing failures

The 8 failures in test_embeddings.py are pre-existing and unrelated to this change: its hand-written _DummyInner.add() mock accepts 10–18 positional args while the real binding now passes 22 (TypeError at api.py inner.add, predating this PR; not caught by the path-filtered CI job). The new tests here use a real Context, not that mock. Fixing the stale mock is a separate cleanup.
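
The kind of mock drift described above can be reproduced in a few lines. This is an illustrative sketch only: `real_add`, `StaleMock`, and `call_fails` are hypothetical, and the four-argument signature is shortened for readability (the real binding reportedly passes 22 positional arguments). It also shows one possible guard for the later cleanup, `unittest.mock.create_autospec`, which derives the mock's signature from the real callable instead of hand-writing it.

```python
from unittest.mock import create_autospec


def real_add(role, payload, content_type=None, external_id=None):
    """Stand-in for the real binding's add(); the signature is illustrative."""
    return None


class StaleMock:
    # Hand-written mock frozen at an older, shorter signature.
    def add(self, role, payload):
        return None


def call_fails(fn, *args) -> bool:
    """Return True if calling fn(*args) raises TypeError."""
    try:
        fn(*args)
    except TypeError:
        return True
    return False


# The caller now passes more positional args than the stale mock accepts.
stale_breaks = call_fails(StaleMock().add, "user", b"\x89PNG", "image/png", "img-1")

# An autospec'd mock tracks the real signature, so the same call succeeds,
# while a call with too many arguments is still rejected.
spec_add = create_autospec(real_add)
spec_ok = not call_fails(spec_add, "user", b"\x89PNG", "image/png", "img-1")
spec_rejects_extra = call_fails(spec_add, 1, 2, 3, 4, 5)
```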

Closes #117

Reads previously hard-loaded `binary_payload` (and `text_payload` / `embedding`) for every row on every `list` / `search` / `get` (`batch_to_records` required the columns; the scanners didn't project). For multi-modal records that means a metadata `list` or a vector `search` pulled every image's bytes into memory.

This adds column projection on the read path plus an on-demand blob fetch:

- `ReadProjection { text, binary, embedding }` (default = load all, backward compatible) with `metadata_only()` / `without_binary()` helpers.
- `batch_to_records` now tolerates those columns being projected out (omitted fields decode as `None`).
- `ContextStore::list_filtered_projected` and `search_filtered_projected` read only the requested columns; the existing `*_with_options` methods delegate with the default (full) projection, so behavior is unchanged. Search always reads the embedding internally to score, then drops it from results if `projection.embedding` is false.
- `ContextStore::get_blob(id)` fetches a single record's `binary_payload` on demand via a projected, id-filtered scan.

Python:

- `Context.list(..., include_binary=True, include_embedding=True)` and `Context.search(..., include_binary=..., include_embedding=...)`: set `include_binary=False` to query metadata/search without materializing media bytes (omitted payloads come back as `None`).
- `Context.get_blob(id) -> bytes | None` for on-demand fetch.
- The new pyo3 args default to `true`, so existing callers are unaffected.

Tests: core (projection drops binary/embedding but keeps metadata; `get_blob` fetches on demand; search projection keeps ranking) and Python (`test_lazy_payload.py`). README gains a multi-modal read example.

Independent of lance-format#115 (operates on the existing inline payload columns).

Follow-up (noted, deferred): REST `include_binary` params + a blob GET route, and Lance `take_blob`/`BlobFile` zero-copy lazy reads for blob-encoded columns.

Closes lance-format#116
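
To make the projection semantics in this commit message concrete, here is a small Python analogy. The real `ReadProjection` is a Rust struct and is not reproduced here; the dataclass, the `project` helper, and the record layout below are hypothetical, and the assumption that `metadata_only()` drops all three payload columns is inferred from its name rather than confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadProjection:
    # Default loads everything, matching the backward-compatible behavior.
    text: bool = True
    binary: bool = True
    embedding: bool = True

    @classmethod
    def metadata_only(cls) -> "ReadProjection":
        # Assumption: drops all three payload columns.
        return cls(text=False, binary=False, embedding=False)

    @classmethod
    def without_binary(cls) -> "ReadProjection":
        return cls(binary=False)


def project(record: dict, projection: ReadProjection) -> dict:
    """Return a copy of record where projected-out columns decode as None."""
    out = dict(record)
    if not projection.text:
        out["text_payload"] = None
    if not projection.binary:
        out["binary_payload"] = None
    if not projection.embedding:
        out["embedding"] = None
    return out


record = {
    "id": "img-1",
    "content_type": "image/png",
    "text_payload": None,
    "binary_payload": b"\x89PNG...",
    "embedding": [0.1, 0.2, 0.3],
}
light = project(record, ReadProjection.without_binary())
meta = project(record, ReadProjection.metadata_only())
```

Here `light` keeps the embedding but has `binary_payload` set to `None`, and `meta` keeps only the metadata fields (`id`, `content_type`), which mirrors what `include_binary=False` is described as doing on the Python side.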

Embedding providers were text-only (`embed_texts`), so non-text payloads (images/audio) got no embedding unless the caller supplied a vector, and there was no way to retrieve an image with a text query.

This adds a multi-modal embedding path (no models bundled; the encoder is user-supplied):

- `MultiModalEmbeddingProvider` protocol: extends `EmbeddingProvider` with `embed_media(items: list[tuple[bytes, str]]) -> list[list[float]]`, which embeds `(payload_bytes, content_type)` into the **same** vector space as `embed_texts`. Plus a `supports_media(provider)` helper.
- Auto-embed for media: `add` / `upsert` / `add_many` / `upsert_many` now embed `bytes` payloads via `embed_media` when the provider supports it (text payloads still use `embed_texts`).
- Cross-modal retrieval falls out of the shared space for free: a text query is embedded with `embed_texts` and matched against image embeddings via the existing vector search, so `ctx.search("a photo of a cat")` returns image records. `search` also accepts a `bytes` query (embedded via `embed_media`).

Because the two encoders share one space, no core/Rust change is needed; this is entirely in the Python embedding layer.

Tests (`test_multimodal_embeddings.py`, deterministic CLIP-style stub): protocol/`supports_media` check, image auto-embed via `embed_media`, cross-modal text→image retrieval, batch auto-embed, and that a text-only provider does not embed images.

Stacked on lance-format#116 (lazy/projected reads); independent of lance-format#115.

Note: the 8 failures in the pre-existing `test_embeddings.py` are unrelated mock drift. Its hand-written `_DummyInner.add()` accepts 10–18 positional args while the real binding now passes 22 (TypeError at `api.py` `inner.add`, predating this change). The new tests here use a real `Context`, not that mock.

Closes lance-format#117
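
The auto-embed rule described in this commit message (text goes through `embed_texts`, bytes go through `embed_media` only when the provider supports it) can be sketched as a small dispatch function. This is not the code in `api.py`; `auto_embed`, the two toy providers, and the fallback content type `application/octet-stream` are all assumptions made for illustration.

```python
from typing import Optional, Union


class TextOnly:
    """Toy text-only provider: one-dimensional 'embedding' from text length."""

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


class TextAndMedia(TextOnly):
    """Toy multi-modal provider: same space, also accepts (bytes, content_type)."""

    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]:
        return [[float(len(payload))] for payload, _content_type in items]


def auto_embed(
    provider: Optional[object],
    payload: Union[str, bytes],
    content_type: Optional[str] = None,
) -> Optional[list[float]]:
    """Pick the encoder by payload type; return None when nothing applies."""
    if provider is None:
        return None
    if isinstance(payload, str):
        return provider.embed_texts([payload])[0]
    has_media = callable(getattr(provider, "embed_media", None))
    if isinstance(payload, (bytes, bytearray)) and has_media:
        # Assumed fallback content type when the caller gives none.
        ctype = content_type or "application/octet-stream"
        return provider.embed_media([(bytes(payload), ctype)])[0]
    # Bytes payload with a text-only provider: left without an embedding.
    return None
```

Under these toy providers, `auto_embed(TextOnly(), b"abc", "image/png")` returns `None`, which corresponds to the "text-only provider does not embed images" test case, while the same call with `TextAndMedia()` returns a vector.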

dcfocus added 2 commits June 27, 2026 18:58

dcfocus force-pushed the feat/issue-117-multimodal-embeddings branch from 86d6198 to f268719 Compare June 28, 2026 01:58

dcfocus merged commit 15e931e into lance-format:main Jun 28, 2026
9 checks passed

dcfocus merged 2 commits into lance-format:main from dcfocus:feat/issue-117-multimodal-embeddings
