Multi-modal encoders

Multi-modal encoders generate embeddings from both text and image features. This lets you search by image content or train ranking models on combined image and text content.

Supported model types

  • Sentence Transformers: Text-only models from Hugging Face. Use for text-based search.
  • CLIP: Models that support both text and images from Hugging Face. Use for cross-modal search (e.g., text query over image catalog).
  • Custom backbones: Models from providers such as Nomic AI and Jina AI are also supported.

Configuration

Configure embeddings in the index block. Each embedding specifies an encoder and the item fields to encode. Use schema_override in the data block to declare Image and Text types for columns that hold images or unstructured text.

Text-only embedding

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: description
            type: Text

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
        - description

Image-only embedding

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: image_url
            type: Image

index:
  embeddings:
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url

Text and image (multi-modal)

Use separate embeddings for text and images, or a single CLIP model for both. For separate embeddings, define one per modality:

data:
  item_table:
    name: products
    type: table
    schema_override:
      item:
        id: item_id
        features:
          - name: title
            type: Text
          - name: image_url
            type: Image

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        batch_size: 256
      item_fields:
        - title
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url

CLIP models encode both text and images into a shared embedding space, enabling cross-modal search (e.g., searching images with a text query).
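Conceptually, cross-modal search compares the query's CLIP text embedding against precomputed image embeddings in the same space. A minimal sketch of that comparison, using small placeholder vectors standing in for real CLIP embeddings (the names and dimensions here are illustrative, not produced by any actual encoder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_images(text_vec: np.ndarray, image_vecs: dict) -> list:
    """Rank image ids by similarity to the text query embedding."""
    scored = [(item_id, cosine_similarity(text_vec, v))
              for item_id, v in image_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Placeholder 4-d vectors standing in for 512-d CLIP embeddings.
text_query = np.array([1.0, 0.0, 0.0, 0.0])   # e.g., encoded "red shoe"
catalog = {
    "red_shoe.jpg":  np.array([0.9, 0.1, 0.0, 0.0]),  # near the query
    "blue_sofa.jpg": np.array([0.0, 0.2, 0.9, 0.1]),  # far from the query
}
ranking = rank_images(text_query, catalog)
```

Because text and images share one space, the same `cosine_similarity` call works regardless of which modality produced each vector.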

Encoder parameters

  • type: hugging_face for pre-trained Hugging Face models
  • model_name: Hugging Face model URI (e.g., sentence-transformers/all-MiniLM-L6-v2)
  • batch_size: Items per batch during encoding. Use larger values (e.g., 256) for text and smaller values (e.g., 32) for images
  • item_fields: Columns to encode. Must match feature names in schema_override
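The batch_size trade-off follows from memory: text encoders handle many short sequences per batch cheaply, while image encoders process far more data per item, so smaller batches avoid memory pressure. A sketch of the chunking an encoder applies (the helper below is illustrative, not part of the configuration API):

```python
def batches(items: list, batch_size: int):
    """Split items into consecutive chunks of at most batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

titles = [f"product {i}" for i in range(1000)]
text_batches = list(batches(titles, 256))   # 4 batches: 256 + 256 + 256 + 232
image_batches = list(batches(titles, 32))   # 32 batches: 31 full + 1 of 8
```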

Usage in queries

Reference embeddings by name in retrieval and scoring. For text search:

retrieve:
  - type: text_search
    input_text_query: $parameters.query_text
    mode:
      type: vector
      text_embedding_ref: text_embedding
    limit: 20
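In vector mode, the query text is encoded with the same model that produced text_embedding, and the limit nearest items are returned. A sketch of that top-k step, assuming precomputed unit-normalized item vectors (the variable names are illustrative):

```python
import numpy as np

def top_k(query_vec: np.ndarray, item_matrix: np.ndarray,
          item_ids: list, k: int) -> list:
    """Return the k item ids with the highest dot-product similarity.
    Assumes query and item vectors are unit-normalized, so the dot
    product equals cosine similarity."""
    scores = item_matrix @ query_vec
    order = np.argsort(scores)[::-1][:k]
    return [item_ids[i] for i in order]

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 8))
items /= np.linalg.norm(items, axis=1, keepdims=True)
ids = [f"item_{i}" for i in range(100)]

query = items[7]                      # a query identical to item_7's vector
result = top_k(query, items, ids, k=20)
```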

For similarity by item attributes:

retrieve:
  - type: similarity
    embedding_ref: text_embedding
    query_encoder:
      type: item_attribute_pooling
      input_item_id: $parameters.item_id
    limit: 20
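With item_attribute_pooling, the query vector is built from the target item's own encoded fields rather than from query text. Mean pooling is a common way to combine per-field embeddings; the exact pooling used here is an assumption, so treat the sketch below as illustrative:

```python
import numpy as np

def pool_item_attributes(field_vectors: np.ndarray) -> np.ndarray:
    """Mean-pool an item's per-field embeddings into one query vector,
    then unit-normalize it for cosine comparisons."""
    pooled = np.mean(field_vectors, axis=0)
    return pooled / np.linalg.norm(pooled)

# Hypothetical per-field embeddings for one item (e.g., title, description).
title_vec = np.array([1.0, 0.0, 1.0, 0.0])
desc_vec = np.array([0.0, 1.0, 1.0, 0.0])
query_vec = pool_item_attributes(np.stack([title_vec, desc_vec]))
```

The pooled vector is then used exactly like an encoded text query: it is compared against every item's stored embedding, and the limit most similar items are returned.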