Add capabilities to your engine

The key to efficient development in Shaped is to work iteratively and incrementally. Train a small model successfully, and then add sophistication as you go along.

This guide covers six core engine configurations from simple to complex that will help you build engines iteratively and quickly.

The first model you should train is a basic BM25 lexical search engine. Use the index block to choose which columns are indexed, and keep the defaults for tokenization and other settings.

```yaml
version: v2
name: lexical_search_engine
data:
  item_table:
    name: movielens_items
    type: table
index:
  lexical_search: # enables BM25 lexical search on item fields
    item_fields:
      - movie_title
      - description
      - cast
      - genres
      - writers
      - directors
      - interests
```
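For intuition, here is a minimal pure-Python sketch of the BM25 scoring that a lexical index performs. The function name, tokenization, and the `k1`/`b` defaults are illustrative assumptions, not Shaped's implementation:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()  # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document matching rarer query terms scores higher, which is the behavior the `lexical_search` block enables across the listed `item_fields`.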

Next, create item vectors using a small embedding model from Hugging Face. The sentence-transformers/all-MiniLM-L6-v2 model is quick to train and gets us results fast:

```yaml
version: v2
name: basic_engine
data:
  item_table:
    name: movielens_items # change this to your table name
    type: table
index:
  embeddings:
    - name: content_embedding # enables vector search
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        item_fields: # update these to your table's text columns
          - movie_title
          - description
```
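Conceptually, vector search ranks items by the similarity between a query embedding and each item embedding. The sketch below uses toy hand-made vectors and brute-force cosine similarity; a real engine uses the encoder's output and an approximate nearest-neighbor index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, item_vecs, k=2):
    """Return the k item ids whose vectors are most similar to the query."""
    ranked = sorted(item_vecs,
                    key=lambda name: cosine(query_vec, item_vecs[name]),
                    reverse=True)
    return ranked[:k]
```

The same retrieval step works whatever encoder produced the vectors, which is why swapping `model_name` later requires no other changes.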

Improve semantic search with multiple embeddings and larger models

Next, improve the search model by using larger models and by creating multiple embeddings over different features. We can also train multimodal embeddings that can search image data.

```yaml
version: v2
name: basic_engine_with_embedding_models
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
```

Once the semantic and image search configuration is to our liking, we can combine it with BM25 into a hybrid search engine:

```yaml
version: v2
name: hybrid_search_engine
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        item_fields:
          - movie_title
          - description
  lexical_search: # enables BM25 lexical search on item fields
    item_fields:
      - movie_title
      - description
      - cast
      - genres
      - writers
      - directors
      - interests
```
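A hybrid engine has to merge the BM25 ranking and the vector-search ranking into a single result list. One common technique is reciprocal rank fusion, sketched below; this is an illustrative example of the idea, and Shaped's actual fusion strategy may differ:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of item ids. Items that rank well
    in multiple lists accumulate the largest scores and float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because fusion works on ranks rather than raw scores, BM25 and cosine similarities never need to be put on a common scale.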

Collaborative filtering embedding model

We can further improve our hybrid search model by training an embedding on user interactions. This produces user and item vectors that let us search for similar items based on user history rather than item content.

This configuration trains an ALS model and references it in the embedding block:

```yaml
version: v2
name: collaborative_filtering_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
    - name: people_also_liked
      encoder:
        model_ref: als # refer to a model in the training block
        type: trained_model
training:
  models:
    - policy_type: als # configuration for the ALS embedding model
      name: als
```
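For intuition, collaborative filtering factorizes the user-item rating matrix into user and item vectors whose dot product approximates a rating. The toy sketch below uses SGD for brevity; ALS instead alternates exact least-squares solves for the same kind of factors, and the hyperparameters here are illustrative:

```python
import random

def factorize(ratings, n_users, n_items, dim=2, lr=0.05, reg=0.02, epochs=200):
    """Learn user/item vectors so dot(U[u], V[i]) approximates rating r.
    `ratings` is a list of (user_id, item_id, rating) triples."""
    random.seed(0)
    U = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    V = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(dim))
            err = r - pred
            for f in range(dim):  # gradient step with L2 regularization
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V
```

The learned item vectors are what power the `people_also_liked` embedding: items liked by similar users end up close together, regardless of their content.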

Scoring and ranking model

Finally, we can train a scoring model using LightGBM to rank items. LightGBM is often used to predict the click-through rate of items in a catalog.

```yaml
version: v2
name: lightgbm_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
        item_fields:
          - movie_title
          - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - poster_url
training:
  models:
    - policy_type: lightgbm
      name: predicted_ctr
```
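To illustrate the score-then-rank idea, here is a toy click-through-rate model. It uses logistic regression as a stand-in for LightGBM (which fits gradient-boosted trees to the same click/no-click signal); the feature choice and hyperparameters are assumptions for the example:

```python
import math

def train_ctr(samples, lr=0.1, epochs=300):
    """Fit w, b of a logistic model p(click) = sigmoid(w . x + b).
    `samples` is a list of (feature_vector, clicked) pairs with clicked in {0, 1}."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, clicked in samples:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = clicked - p  # gradient of the log-likelihood
            b += lr * g
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict_ctr(w, b, x):
    """Predicted click probability for one candidate's feature vector."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

At serve time, candidates retrieved from the index are re-ordered by their predicted CTR, so the scoring model sits on top of whichever retrieval blocks the engine defines.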

These six engine configurations lay the foundation for building more complex and performant retrieval systems in Shaped.