Multi-modal encoders

Multi-modal encoders generate embeddings from both text and image features. This lets you search by image content or train ranking models on combined image and text content.

Supported model types

  • Sentence Transformers: Text-only models from Hugging Face. Use for text-based search.
  • CLIP: Models that support both text and images from Hugging Face. Use for cross-modal search (e.g., text query over image catalog).
  • Custom backbones: Models from providers such as Nomic AI and Jina AI are also supported.

Configuration

Configure embeddings in the index block. Each embedding specifies an encoder and the item fields to encode. Use schema_override in the data block to declare Image and Text types for columns that hold images or unstructured text.

Text-only embedding

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: description
            type: Text

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
        - description

Image-only embedding

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: image_url
            type: Image

index:
  embeddings:
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url

Text and image (multi-modal)

Use separate embeddings for text and images, or a single CLIP model for both. For separate embeddings, define one per modality:

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: image_url
            type: Image

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url

CLIP models encode both text and images into a shared embedding space, enabling cross-modal search (e.g., searching images with a text query).
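Conceptually, cross-modal search compares the query's CLIP text embedding against precomputed image embeddings in the same space. A minimal sketch of that comparison, using small placeholder vectors standing in for real CLIP embeddings (the names and dimensions here are illustrative, not produced by any actual encoder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_images(text_vec: np.ndarray, image_vecs: dict) -> list:
    """Rank image ids by similarity to the text query embedding."""
    scored = [(item_id, cosine_similarity(text_vec, v))
              for item_id, v in image_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Placeholder 4-d vectors standing in for 512-d CLIP embeddings.
text_query = np.array([1.0, 0.0, 0.0, 0.0])   # e.g., encoded "red shoe"
catalog = {
    "red_shoe.jpg":  np.array([0.9, 0.1, 0.0, 0.0]),  # near the query
    "blue_sofa.jpg": np.array([0.0, 0.2, 0.9, 0.1]),  # far from the query
}
ranking = rank_images(text_query, catalog)
```

Because text and images share one space, the same `cosine_similarity` call works regardless of which modality produced each vector.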

Encoder parameters

  • type: hugging_face for pre-trained Hugging Face models
  • model_name: Hugging Face model URI (e.g., sentence-transformers/all-MiniLM-L6-v2)
  • batch_size: Items per batch during encoding. Use larger values (e.g., 256) for text and smaller values (e.g., 32) for images
  • item_fields: Columns to encode. Must match feature names in schema_override
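The batch_size trade-off follows from memory: text encoders handle many short sequences per batch cheaply, while image encoders process far more data per item, so smaller batches avoid memory pressure. A sketch of the chunking an encoder applies (the helper below is illustrative, not part of the configuration API):

```python
def batches(items: list, batch_size: int):
    """Split items into consecutive chunks of at most batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f"product {i}" for i in range(1000)]
text_batches = list(batches(titles, 256))   # 4 batches: 256 + 256 + 256 + 232
image_batches = list(batches(titles, 32))   # 32 batches: 31 full + 1 of 8
```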

Usage in queries

Reference embeddings by name in retrieval and scoring. For text search:

retrieve:
  - type: text_search
    input_text_query: $parameters.query_text
    mode:
      type: vector
      text_embedding_ref: text_embedding
    limit: 20
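In vector mode, the query text is encoded with the same model that produced text_embedding, and the limit nearest items are returned. A sketch of that top-k step, assuming precomputed unit-normalized item vectors (the variable names are illustrative):

```python
import numpy as np

def top_k(query_vec: np.ndarray, item_matrix: np.ndarray,
          item_ids: list, k: int) -> list:
    """Return the k item ids with the highest dot-product similarity.
    Assumes query and item vectors are unit-normalized, so the dot
    product equals cosine similarity."""
    scores = item_matrix @ query_vec
    order = np.argsort(scores)[::-1][:k]
    return [item_ids[i] for i in order]

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 8))
items /= np.linalg.norm(items, axis=1, keepdims=True)
ids = [f"item_{i}" for i in range(100)]

query = items[7]                      # a query identical to item_7's vector
result = top_k(query, items, ids, k=20)
```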

For similarity by item attributes:

retrieve:
  - type: similarity
    embedding_ref: text_embedding
    query_encoder:
      type: item_attribute_pooling
      input_item_id: $parameters.item_id
    limit: 20
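With item_attribute_pooling, the query vector is built from the target item's own encoded fields rather than from query text. Mean pooling is a common way to combine per-field embeddings; the exact pooling used here is an assumption, so treat the sketch below as illustrative:

```python
import numpy as np

def pool_item_attributes(field_vectors: np.ndarray) -> np.ndarray:
    """Mean-pool an item's per-field embeddings into one query vector,
    then unit-normalize it for cosine comparisons."""
    pooled = np.mean(field_vectors, axis=0)
    return pooled / np.linalg.norm(pooled)

# Hypothetical per-field embeddings for one item (e.g., title, description).
title_vec = np.array([1.0, 0.0, 1.0, 0.0])
desc_vec = np.array([0.0, 1.0, 1.0, 0.0])
query_vec = pool_item_attributes(np.stack([title_vec, desc_vec]))
```

The pooled vector is then used exactly like an encoded text query: it is compared against every item's stored embedding, and the limit most similar items are returned.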