# Multi-modal encoders

Multi-modal encoders generate embeddings from both text and image features. This lets you search by image content or train ranking models on image and text content together.
## Supported model types
- Sentence Transformers: Text-only models from Hugging Face. Use for text-based search.
- CLIP: Models that support both text and images from Hugging Face. Use for cross-modal search (e.g., text query over image catalog).
- Custom backbones: Models from providers such as Nomic AI and Jina AI are also supported.
## Configuration
Configure embeddings in the `index` block. Each embedding specifies an encoder and the item fields to encode. Use `schema_override` in the `data` block to declare `Image` and `Text` types for columns that hold images or unstructured text.
### Text-only embedding
```yaml
data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: description
            type: Text

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
        - description
```
### Image-only embedding
```yaml
data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: image_url
            type: Image

index:
  embeddings:
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url
```
### Text and image (multi-modal)
Use separate embeddings for text and images, or a single CLIP model for both. For separate embeddings, define one per modality:
```yaml
data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: image_url
            type: Image

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url
```
CLIP models encode both text and images into a shared embedding space, enabling cross-modal search (e.g., searching images with a text query).
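The value of a shared embedding space can be illustrated with plain NumPy. The vectors below are toy stand-ins for CLIP outputs (real CLIP vectors have hundreds of dimensions); the point is that text and image vectors live in the same space, so a text query can rank images directly by cosine similarity:

```python
import numpy as np

# Toy stand-ins for CLIP outputs: in a shared space, text and image
# vectors are directly comparable. These are illustrative values only.
text_query = np.array([0.9, 0.1, 0.0])   # e.g., the query "red sneakers"
image_vecs = np.array([
    [0.8, 0.2, 0.1],   # image 0: close to the query
    [0.0, 0.1, 0.9],   # image 1: unrelated
    [0.7, 0.3, 0.0],   # image 2: somewhat close
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = np.array([cosine(text_query, v) for v in image_vecs])
ranked = np.argsort(-scores)       # best match first
print(ranked.tolist())             # → [0, 2, 1]
```

With two separate embeddings (the configuration above), text queries match `text_embedding` and image similarity uses `image_embedding`; only a CLIP-style shared space supports comparing a text query against image vectors as shown here.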
## Encoder parameters
| Parameter | Description |
|---|---|
| `type` | `hugging_face` for pre-trained models |
| `model_name` | Hugging Face model URI (e.g., `sentence-transformers/all-MiniLM-L6-v2`) |
| `batch_size` | Items per batch during encoding. Use larger values (e.g., 256) for text and smaller values (e.g., 32) for images |
| `item_fields` | Columns to encode. Must match feature names in `schema_override` |
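The `batch_size` trade-off is ordinary chunking: larger batches amortize per-batch overhead, but every item in a batch is held in memory at once, which is why image models (much larger inputs per item) get smaller batches. A minimal sketch of the chunking itself:

```python
def batches(items, batch_size):
    """Yield successive fixed-size chunks; the last batch may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"item {i}" for i in range(1000)]
text_batches = list(batches(texts, 256))    # 4 batches: 256 + 256 + 256 + 232
image_batches = list(batches(texts, 32))    # 32 batches of at most 32 items
print(len(text_batches), len(image_batches))  # → 4 32
```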
## Usage in queries
Reference embeddings by name in retrieval and scoring. For text search:
```yaml
retrieve:
  - type: text_search
    input_text_query: $parameters.query_text
    mode:
      type: vector
      text_embedding_ref: text_embedding
    limit: 20
```
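Conceptually, vector-mode text search encodes the query with the same model that produced `text_embedding`, then returns the `limit` nearest items. A NumPy sketch with toy vectors (the query vector here is a stand-in for the encoder's output, not a call to the platform's API):

```python
import numpy as np

# Pretend these were produced by the text_embedding encoder at index time.
item_ids = ["p1", "p2", "p3", "p4"]
item_matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])

query_vec = np.array([1.0, 0.1])   # stand-in for the encoded query_text
limit = 2

# Cosine similarity of the query against every item embedding.
norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(query_vec)
scores = item_matrix @ query_vec / norms
top = np.argsort(-scores)[:limit]
print([item_ids[i] for i in top])   # → ['p2', 'p1']
```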
For similarity by item attributes:
```yaml
retrieve:
  - type: similarity
    embedding_ref: text_embedding
    query_encoder:
      type: item_attribute_pooling
      input_item_id: $parameters.item_id
    limit: 20
```
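Attribute pooling builds the query vector from the anchor item's own stored field embeddings. The sketch below uses mean pooling, which is one plausible choice and an assumption here, not documented platform behavior; the field vectors and catalog are toy values:

```python
import numpy as np

# Stored per-field embeddings for the anchor item (toy values).
field_vecs = {
    "title": np.array([1.0, 0.0]),
    "description": np.array([0.0, 1.0]),
}

# Pool the anchor item's field vectors into one query vector.
# Mean pooling is an assumption; the platform may pool differently.
query_vec = np.mean(list(field_vecs.values()), axis=0)   # [0.5, 0.5]

# Rank other catalog items by cosine similarity to the pooled vector.
catalog = {"p2": np.array([0.6, 0.4]), "p3": np.array([0.0, 1.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = {k: cosine(query_vec, v) for k, v in catalog.items()}
best = max(scores, key=scores.get)
print(best)   # → p2
```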