feat: multi-modal embeddings + cross-modal retrieval (#117)#119

Merged
dcfocus merged 2 commits into
lance-format:mainfrom
dcfocus:feat/issue-117-multimodal-embeddings
Jun 28, 2026

@dcfocus dcfocus commented Jun 27, 2026

Collaborator

Summary

Closes #117. Adds multi-modal embeddings + cross-modal retrieval (text query → image results). lance-context bundles no models — the encoder is user-supplied.

Stacked on #118 (the PR for #116). Review and merge that PR first; until then this PR shows both commits (the multi-modal commit is the last one).

Motivation

Embedding providers were text-only (embed_texts in embeddings.py), and auto-embed in api.py was gated on isinstance(str). As a result, non-text payloads got no embedding unless the caller computed one, and an image could never be retrieved by a text query.

What changed (Python embedding layer only — no core/Rust change)

  • MultiModalEmbeddingProvider protocol: extends EmbeddingProvider with embed_media(items: list[tuple[bytes, str]]) -> list[list[float]], which embeds (payload_bytes, content_type) pairs into the same vector space as embed_texts. A supports_media(provider) helper is added alongside it; both are exported from the package.
  • Auto-embed media: add / upsert / add_many / upsert_many embed bytes payloads via embed_media when the provider supports it (text still uses embed_texts).
  • Cross-modal retrieval for free — because the two encoders share one space, ctx.search("a photo of a cat") embeds the text with embed_texts and matches image embeddings through the existing vector search. search also accepts a bytes query (embedded via embed_media).
clip_ctx = Context.create("multimodal.lance", embedding_dim=512,
                          embedding_provider=my_clip_provider)
clip_ctx.add("user", image_bytes, content_type="image/png", external_id="img-1")
results = clip_ctx.search("a photo of a cat")   # text -> image
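
For reviewers who want the shape of the protocol without opening the diff, here is a minimal sketch of how such a protocol and helper could be written. It is an illustration, not the code in this PR: the method bodies, the TextOnly / TextAndMedia classes, and the choice to implement supports_media as a callable check are all assumptions.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingProvider(Protocol):
    # Assumed minimal surface; the real protocol may declare more members.
    def embed_texts(self, texts: list[str]) -> list[list[float]]: ...


@runtime_checkable
class MultiModalEmbeddingProvider(EmbeddingProvider, Protocol):
    # Each item is (payload_bytes, content_type), e.g. (png_bytes, "image/png").
    def embed_media(self, items: list[tuple[bytes, str]]) -> list[list[float]]: ...


def supports_media(provider: object) -> bool:
    """Return True when the provider exposes a callable embed_media."""
    return callable(getattr(provider, "embed_media", None))


class TextOnly:
    """Hypothetical text-only provider (one-dimensional toy vectors)."""

    def embed_texts(self, texts):
        return [[float(len(t))] for t in texts]


class TextAndMedia(TextOnly):
    """Hypothetical provider that also embeds bytes payloads."""

    def embed_media(self, items):
        return [[float(len(payload))] for payload, _content_type in items]
```

With this shape, a text-only provider keeps working unchanged, and callers can branch on supports_media(provider) before handing it bytes.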

Tests (test_multimodal_embeddings.py, deterministic CLIP-style stub)

  • protocol / supports_media runtime check
  • image auto-embedded via embed_media
  • cross-modal text→image retrieval
  • batch auto-embed of images
  • text-only provider does not embed images
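
To make "deterministic CLIP-style stub" concrete, the sketch below shows one way such a stub could work. It is not the stub in test_multimodal_embeddings.py: the vocabulary, the StubClipProvider class, and the rank_by_cosine helper are hypothetical. The stub puts text and bytes in one space by counting the same concept words in both, so a text query lands next to the matching "image".

```python
import math

# Hypothetical concept vocabulary; each word is one axis of the shared space.
VOCAB = ("cat", "dog", "car")


def _unit(vec):
    """Scale a vector to unit length; leave an all-zero vector untouched."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class StubClipProvider:
    """Deterministic stand-in for a CLIP-like encoder pair."""

    def embed_texts(self, texts):
        return [_unit([float(t.lower().count(w)) for w in VOCAB]) for t in texts]

    def embed_media(self, items):
        # The fake "image" bytes carry a label the stub can count.
        return [
            _unit([float(payload.count(w.encode())) for w in VOCAB])
            for payload, _content_type in items
        ]


def rank_by_cosine(query_vec, stored):
    """Return record ids ordered by dot product of unit vectors (cosine)."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), rid)
        for rid, vec in stored.items()
    ]
    return [rid for _score, rid in sorted(scored, reverse=True)]


provider = StubClipProvider()
images = {"img-cat": b"\x89PNG cat", "img-dog": b"\x89PNG dog"}
vectors = provider.embed_media([(data, "image/png") for data in images.values()])
stored = dict(zip(images, vectors))

query = provider.embed_texts(["a photo of a cat"])[0]
ranking = rank_by_cosine(query, stored)  # "img-cat" ranks ahead of "img-dog"
```

Because the stub is a pure function of its input, assertions on the ranking are stable across runs and need no model download.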

Checks

Acceptance criteria (#117)

  • Pluggable multi-modal embedder (embed_media), no bundled models
  • Auto-embed image/bytes payloads
  • Cross-modal text→image retrieval via the shared space
  • Existing text retrieval unchanged; tests with a stub provider

Note on pre-existing failures

The 8 failures in test_embeddings.py are pre-existing and unrelated to this change: its hand-written _DummyInner.add() mock accepts 10–18 positional args while the real binding now passes 22 (TypeError at api.py inner.add, predating this PR; not caught by the path-filtered CI job). The new tests here use a real Context, not that mock. Fixing the stale mock is a separate cleanup.
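
For context on the failure mode, the sketch below reproduces mock drift in miniature and shows one possible direction for the later cleanup. RealInner, StaleDummyInner, and call_like_api are hypothetical and use far fewer arguments than the real binding. Whether create_autospec can follow the actual native binding depends on that binding exposing an introspectable signature, which is not verified here.

```python
from unittest.mock import create_autospec


class RealInner:
    """Hypothetical stand-in for the binding; the real signature is longer."""

    def add(self, role, payload, content_type=None, embedding=None, external_id=None):
        return "ok"


class StaleDummyInner:
    """Hand-written mock frozen at an older, shorter signature."""

    def add(self, role, payload, content_type=None):
        return "ok"


def call_like_api(inner):
    # The caller passes every positional argument the current binding expects.
    return inner.add("user", b"bytes", "image/png", None, "img-1")


# The stale mock rejects the extra positional arguments.
try:
    call_like_api(StaleDummyInner())
    stale_failed = False
except TypeError:
    stale_failed = True

# An autospec'd mock derives its signature from the real class, so it
# accepts the current call shape and still rejects malformed calls.
auto_inner = create_autospec(RealInner, instance=True)
call_like_api(auto_inner)
```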

Closes #117

dcfocus added 2 commits June 27, 2026 18:58
Reads previously hard-loaded `binary_payload` (and `text_payload` /
`embedding`) for every row on every `list` / `search` / `get`
(`batch_to_records` required the columns; the scanners didn't project). For
multi-modal records that means a metadata `list` or a vector `search` pulled
every image's bytes into memory.

This adds column projection on the read path plus an on-demand blob fetch:

- `ReadProjection { text, binary, embedding }` (default = load all, backward
  compatible) with `metadata_only()` / `without_binary()` helpers.
- `batch_to_records` now tolerates those columns being projected out
  (omitted fields decode as `None`).
- `ContextStore::list_filtered_projected` and `search_filtered_projected`
  read only the requested columns; the existing `*_with_options` methods
  delegate with the default (full) projection, so behavior is unchanged.
  Search always reads the embedding internally to score, then drops it from
  results if `projection.embedding` is false.
- `ContextStore::get_blob(id)` fetches a single record's `binary_payload` on
  demand via a projected, id-filtered scan.
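
The projection semantics above can be modeled in a few lines of Python. This is an illustrative model only, not the Rust code in the commit: the dict-based rows, the project and search helpers, and the assumption that metadata_only() turns off all three columns are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadProjection:
    """Python model of the projection flags; the default loads every column."""

    text: bool = True
    binary: bool = True
    embedding: bool = True

    @classmethod
    def metadata_only(cls) -> "ReadProjection":
        # Assumption: drops all three payload columns.
        return cls(text=False, binary=False, embedding=False)

    @classmethod
    def without_binary(cls) -> "ReadProjection":
        return cls(binary=False)


def project(row: dict, projection: ReadProjection) -> dict:
    """Copy a row, replacing projected-out columns with None."""
    out = dict(row)
    if not projection.text:
        out["text_payload"] = None
    if not projection.binary:
        out["binary_payload"] = None
    if not projection.embedding:
        out["embedding"] = None
    return out


def search(rows: list[dict], query: list[float], projection: ReadProjection) -> list[dict]:
    """Score on the embedding first, then drop what the caller did not request."""

    def score(row: dict) -> float:
        return sum(q * v for q, v in zip(query, row["embedding"]))

    ranked = sorted(rows, key=score, reverse=True)
    return [project(row, projection) for row in ranked]


rows = [
    {"id": "a", "text_payload": None, "binary_payload": b"png-a", "embedding": [0.0, 1.0]},
    {"id": "b", "text_payload": None, "binary_payload": b"png-b", "embedding": [1.0, 0.0]},
]
hits = search(rows, [1.0, 0.0], ReadProjection.metadata_only())
```

The point of the model is the ordering inside search: ranking needs the embedding, so it is read regardless of the projection and only removed from the returned records afterwards.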

Python:
- `Context.list(..., include_binary=True, include_embedding=True)` and
  `Context.search(..., include_binary=..., include_embedding=...)` — set
  `include_binary=False` to query metadata/search without materializing media
  bytes (omitted payloads come back as `None`).
- `Context.get_blob(id) -> bytes | None` for on-demand fetch.
- The new pyo3 args default to `true`, so existing callers are unaffected.

Tests: core (projection drops binary/embedding but keeps metadata; `get_blob`
fetches on demand; search projection keeps ranking) and Python
(`test_lazy_payload.py`). README gains a multi-modal read example.

Independent of lance-format#115 (operates on the existing inline payload columns).

Follow-up (noted, deferred): REST `include_binary` params + a blob GET route,
and Lance `take_blob`/`BlobFile` zero-copy lazy reads for blob-encoded
columns.

Closes lance-format#116
Embedding providers were text-only (`embed_texts`), so non-text payloads
(images/audio) got no embedding unless the caller supplied a vector, and
there was no way to retrieve an image with a text query.

This adds a multi-modal embedding path (no models bundled — the encoder is
user-supplied):

- `MultiModalEmbeddingProvider` protocol: extends `EmbeddingProvider` with
  `embed_media(items: list[tuple[bytes, str]]) -> list[list[float]]`, which
  embeds `(payload_bytes, content_type)` into the **same** vector space as
  `embed_texts`. Plus a `supports_media(provider)` helper.
- Auto-embed for media: `add` / `upsert` / `add_many` / `upsert_many` now
  embed `bytes` payloads via `embed_media` when the provider supports it
  (text payloads still use `embed_texts`).
- Cross-modal retrieval falls out of the shared space for free: a text query
  is embedded with `embed_texts` and matched against image embeddings via the
  existing vector search — `ctx.search("a photo of a cat")` returns image
  records. `search` also accepts a `bytes` query (embedded via `embed_media`).

Because the two encoders share one space, no core/Rust change is needed —
this is entirely in the Python embedding layer.
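
The dispatch described above (text goes to `embed_texts`, bytes go to `embed_media` only when the provider has it) can be sketched as follows. The `auto_embed` function, its fallback content type, and the two toy providers are hypothetical; the actual logic in `api.py` may be organized differently.

```python
def auto_embed(provider, payload, content_type=None):
    """Return an embedding for payload, or None if the provider cannot embed it."""
    if provider is None:
        return None
    if isinstance(payload, str):
        return provider.embed_texts([payload])[0]
    can_embed_media = callable(getattr(provider, "embed_media", None))
    if isinstance(payload, bytes) and can_embed_media:
        # Assumption: fall back to a generic content type when none is given.
        item = (payload, content_type or "application/octet-stream")
        return provider.embed_media([item])[0]
    # Text-only provider with a bytes payload: leave the record unembedded.
    return None


class TextOnlyProvider:
    """Toy provider: embeds text as [length, 0.0]."""

    def embed_texts(self, texts):
        return [[float(len(t)), 0.0] for t in texts]


class MediaProvider(TextOnlyProvider):
    """Toy provider: also embeds bytes as [0.0, length]."""

    def embed_media(self, items):
        return [[0.0, float(len(payload))] for payload, _content_type in items]
```

Returning `None` for the unsupported case mirrors the stated test expectation that a text-only provider does not embed images.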

Tests (`test_multimodal_embeddings.py`, deterministic CLIP-style stub):
protocol/`supports_media` check, image auto-embed via `embed_media`,
cross-modal text→image retrieval, batch auto-embed, and that a text-only
provider does not embed images.

Stacked on lance-format#116 (lazy/projected reads); independent of lance-format#115.

Note: the 8 failures in the pre-existing `test_embeddings.py` are unrelated
mock drift — its hand-written `_DummyInner.add()` accepts 10–18 positional
args while the real binding now passes 22 (TypeError at `api.py` `inner.add`,
predating this change). The new tests here use a real `Context`, not that
mock.

Closes lance-format#117
@dcfocus dcfocus force-pushed the feat/issue-117-multimodal-embeddings branch from 86d6198 to f268719 on June 28, 2026 01:58
@dcfocus dcfocus merged commit 15e931e into lance-format:main Jun 28, 2026
9 checks passed
Development

Successfully merging this pull request may close these issues.

Multi-modal: image/multi-modal embeddings + cross-modal retrieval (text → image)
