Add capabilities to your engine

The key to efficient development in Shaped is to work iteratively and incrementally. Train a small model successfully, and then add sophistication as you go along.

This guide covers six core engine configurations from simple to complex that will help you build engines iteratively and quickly.

The first model you should train is a basic BM25 lexical search engine. Use the index block to choose which columns are indexed, and keep the defaults for tokenization and other settings.

```yaml
version: v2
name: lexical_search_engine
data:
  item_table:
    name: movielens_items
    type: table
index:
  lexical_search: # enables BM25 lexical search on item fields
    item_fields:
      - movie_title
      - description
      - cast
      - genres
      - writers
      - directors
      - interests
```
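For intuition, here is a minimal pure-Python sketch of the BM25 scoring that a lexical index performs. The function name, tokenization, and the `k1`/`b` defaults are illustrative assumptions, not Shaped's implementation:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()  # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document matching rarer query terms scores higher, which is the behavior the `lexical_search` block enables across the listed `item_fields`.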

Next, create item vectors using a small embedding model from Hugging Face. The sentence-transformers/all-MiniLM-L6-v2 model is quick to train and gets us results fast:

```yaml
version: v2
name: basic_engine
data:
  item_table:
    name: movielens_items # change this to your table name
    type: table
index:
  embeddings:
    - name: content_embedding # enables vector search
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        item_fields: # update these to your table's text columns
          - movie_title
          - description
```
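Conceptually, vector search ranks items by the similarity between a query embedding and each item embedding. The sketch below uses toy hand-made vectors and brute-force cosine similarity; a real engine uses the encoder's output and an approximate nearest-neighbor index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, item_vecs, k=2):
    """Return the k item ids whose vectors are most similar to the query."""
    ranked = sorted(item_vecs,
                    key=lambda name: cosine(query_vec, item_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

The same retrieval step works whatever encoder produced the vectors, which is why swapping `model_name` later requires no other changes.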

Improve semantic search with multiple embeddings and larger models

Next, improve the search model by using larger models and by creating multiple embeddings over different features. We can also train multimodal embeddings that can search image data.

```yaml
version: v2
name: basic_engine_with_embedding_models
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
```

Once the semantic and image search configuration is to our liking, we can combine it with BM25 into a hybrid search engine:

```yaml
version: v2
name: hybrid_search_engine
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        item_fields:
          - movie_title
          - description
  lexical_search: # enables BM25 lexical search on item fields
    item_fields:
      - movie_title
      - description
      - cast
      - genres
      - writers
      - directors
      - interests
```
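A hybrid engine has to merge the BM25 ranking and the vector-search ranking into a single result list. One common technique is reciprocal rank fusion, sketched below; this is an illustrative example of the idea, and Shaped's actual fusion strategy may differ:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of item ids. Items that rank well
    in multiple lists accumulate the largest scores and float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because fusion works on ranks rather than raw scores, BM25 and cosine similarities never need to be put on a common scale.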

Collaborative filtering embedding model

We can further improve our hybrid search model by training an embedding on user interactions. This produces user and item vectors that let us search for similar items based on user history rather than item content.

This configuration trains an ALS model and references it in the embedding block:

```yaml
version: v2
name: collaborative_filtering_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
    - name: people_also_liked
      encoder:
        model_ref: als # refer to a model in the training block
        type: trained_model
training:
  models:
    - policy_type: als # configuration for the ALS embedding model
      name: als
```
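For intuition, collaborative filtering factorizes the user-item rating matrix into user and item vectors whose dot product approximates a rating. The toy sketch below uses SGD for brevity; ALS instead alternates exact least-squares solves for the same kind of factors, and the hyperparameters here are illustrative:

```python
import random

def factorize(ratings, n_users, n_items, dim=2, lr=0.05, reg=0.02, epochs=200):
    """Learn user/item vectors so dot(U[u], V[i]) approximates rating r.
    `ratings` is a list of (user_id, item_id, rating) triples."""
    random.seed(0)
    U = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    V = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(dim))
            err = r - pred
            for f in range(dim):  # gradient step with L2 regularization
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

The learned item vectors are what power the `people_also_liked` embedding: items liked by similar users end up close together, regardless of their content.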

Scoring and ranking model

Finally, we can train a scoring model using LightGBM to rank items. LightGBM is often used to predict the click-through rate of items in a catalog.

```yaml
version: v2
name: lightgbm_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
training:
  models:
    - policy_type: lightgbm
      name: predicted_ctr
```
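To illustrate the score-then-rank idea, here is a toy click-through-rate model. It uses logistic regression as a stand-in for LightGBM (which fits gradient-boosted trees to the same click/no-click signal); the feature choice and hyperparameters are assumptions for the example:

```python
import math

def train_ctr(samples, lr=0.1, epochs=300):
    """Fit w, b of a logistic model p(click) = sigmoid(w . x + b).
    `samples` is a list of (feature_vector, clicked) pairs with clicked in {0, 1}."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, clicked in samples:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = clicked - p  # gradient of the log-likelihood
            b += lr * g
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict_ctr(w, b, x):
    """Predicted click probability for one candidate's feature vector."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

At serve time, candidates retrieved from the index are re-ordered by their predicted CTR, so the scoring model sits on top of whichever retrieval blocks the engine defines.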

These six engine configurations lay the foundation for building more complex and performant retrieval systems in Shaped.