Iterate on your engine config
This guide walks through five engine configurations, from simple to complex, to help you build engines iteratively and quickly.
Use the Shaped Playground to try the examples in this article against a live engine.
One of the most common mistakes when building retrieval systems is starting with too much complexity.
Focus first on identifying high-quality features and removing noisy features that pollute results. Then build a retrieval-only engine and experiment with embeddings; embeddings alone are often enough to generate excellent results. Once you have a strong baseline, you can begin experimenting with scoring policies and more sophisticated configurations.
Stage 1: Basic semantic search
Start with a basic semantic search model. Use these results to understand which text features affect relevance the most.
In this example, we create item vectors using a small embedding model from Hugging Face. The model sentence-transformers/all-MiniLM-L6-v2 is quick to train and gets us results fast.
Example config
version: v2
name: basic_engine
data:
item_table:
name: movielens_items # change this to your table name
type: table
index:
embeddings:
- name: content_embedding # enables vector search
encoder:
type: hugging_face
model_name: sentence-transformers/all-MiniLM-L6-v2
item_fields: # update these to your table's text columns
- movie_title
- description
Example query
Use the text_search retriever to do semantic search:
SELECT *
FROM retrieve(
text_search(
query='space adventure with aliens',
mode='vector',
text_embedding_ref='content_embedding',
)
)
Popular items
If your engine has interaction data, it can rank by recency and popularity without additional configuration.
SELECT *
FROM retrieve(
column_order(
columns='_derived_popular_rank',
)
)
New items
SELECT *
FROM retrieve(
column_order(
columns='created_at desc',
)
)
Stage 2: Advanced embedding model
Next, we improve the search model by using larger embedding models and creating multiple embeddings on different features. We can also train multimodal embeddings that can search image data.
Example config
version: v2
name: basic_engine_with_embedding_models
data:
item_table:
name: movielens_items
type: table
index:
embeddings:
- name: content_embedding # vectors for movie description
encoder:
type: hugging_face
model_name: Alibaba-NLP/gte-modernbert-base
item_fields:
- movie_title
- description
- name: poster_embedding # vectors for poster
encoder:
type: hugging_face
model_name: openai/clip-vit-base-patch32
item_fields:
- poster_url
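With the CLIP encoder in place, text queries can be matched against poster images. The sketch below follows the text_search pattern from Stage 1; it assumes the poster_embedding can be passed as the text_embedding_ref in the same way as a text embedding — verify against your engine before relying on it.

```sql
-- Sketch: search poster images by text description.
-- Assumes the CLIP poster_embedding is referenced like any
-- other embedding via text_embedding_ref.
SELECT *
FROM retrieve(
  text_search(
    query='retro sci-fi poster with spaceships',
    mode='vector',
    text_embedding_ref='poster_embedding',
  )
)
```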
Personalization with text embeddings
The following query searches for space movies using text_search and orders the results by similarity to what the user has interacted with (cosine_similarity against the user's recent interactions).
The pooled_text_encoding function converts the user's most recent interactions into a single vector for similarity ranking.
SELECT *
FROM text_search(
name='text_match',
query='space movies',
mode='lexical',
fuzziness=0,
limit=200)
ORDER BY score(expression='cosine_similarity(\
pooled_text_encoding(\
user.recent_interactions, \
pool_fn=''mean'', \
embedding_ref="content_embedding"\
), \
text_encoding(\
item, \
embedding_ref="content_embedding"\
)\
)', input_user_id='55')
LIMIT 50
Check out the Query Playground to try this query live.
Stage 3: Hybrid search
Once the semantic and image search configuration is to our liking, we can add a lexical search configuration. This enables hybrid search:
Example config
version: v2
name: hybrid_search_engine
data:
item_table:
name: movielens_items
type: table
index:
embeddings:
- name: content_embedding
encoder:
type: hugging_face
model_name: sentence-transformers/all-MiniLM-L6-v2
item_fields:
- movie_title
- description
lexical_search: # enables BM25 lexical search on item fields
item_fields:
- movie_title
- description
- cast
- genres
- writers
- directors
- interests
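With lexical_search enabled, keyword queries can use BM25 matching across the configured item fields. A sketch using the lexical mode shown in the Stage 2 query (the fuzziness and limit values are illustrative):

```sql
-- Sketch: BM25 lexical search over the configured item fields.
-- fuzziness=1 tolerates small typos; tune for your catalog.
SELECT *
FROM text_search(
  query='tom hanks comedy',
  mode='lexical',
  fuzziness=1,
  limit=100)
LIMIT 50
```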
Stage 4: Add collaborative filtering model
We can further improve our hybrid search model by training an embedding on user interactions. This will produce user-item vectors that we can use to search for similar items based on user history rather than item content.
This config trains an ALS (alternating least squares) model and references it in the embeddings block.
Example config
version: v2
name: collaborative_filtering_engine
data:
item_table:
name: movielens_items
type: table
interaction_table:
name: movielens_ratings
type: table
index:
embeddings:
- name: content_embedding
encoder:
type: hugging_face
model_name: Alibaba-NLP/gte-modernbert-base
item_fields:
- movie_title
- description
- name: poster_embedding
encoder:
type: hugging_face
model_name: openai/clip-vit-base-patch32
item_fields:
- poster_url
- name: people_also_liked
encoder:
model_ref: als # refer to a model in the training block
type: trained_model
training:
models:
- policy_type: als # configuration for the ALS embedding model
name: als
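The people_also_liked embedding can then drive similarity ranking the same way as the text embeddings above. The sketch below mirrors the Stage 2 personalization query; it assumes pooled_text_encoding and text_encoding accept a trained-model embedding_ref just like an encoder-based one — check this against your engine.

```sql
-- Sketch: rank candidates by similarity to the user's history
-- using the trained ALS item vectors. Assumes trained embeddings
-- are referenced via embedding_ref like encoder-based embeddings.
SELECT *
FROM retrieve(column_order(columns='_derived_popular_rank'))
ORDER BY score(expression='cosine_similarity(\
pooled_text_encoding(\
user.recent_interactions, \
pool_fn=''mean'', \
embedding_ref="people_also_liked"\
), \
text_encoding(\
item, \
embedding_ref="people_also_liked"\
)\
)', input_user_id='55')
LIMIT 50
```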
Stage 5: Add ranking
Finally, we can train a scoring model using LightGBM to rank retrieved items. LightGBM is often used to predict the click-through rate of items in a catalog.
Example config
version: v2
name: lightgbm_engine
data:
item_table:
name: movielens_items
type: table
interaction_table:
name: movielens_ratings
type: table
index:
embeddings:
- name: content_embedding # vectors for movie description
encoder:
type: hugging_face
model_name: Alibaba-NLP/gte-modernbert-base
item_fields:
- movie_title
- description
- name: poster_embedding # vectors for poster
encoder:
type: hugging_face
model_name: openai/clip-vit-base-patch32
item_fields:
- poster_url
training:
models:
- policy_type: lightgbm
name: predicted_ctr
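Once the predicted_ctr model is trained, its predictions can be used to order candidates. The exact syntax for referencing a trained scoring model in a query is not shown in this guide; the model_ref parameter below is a hypothetical placeholder, so consult the Shaped query reference for the exact form.

```sql
-- Hypothetical sketch: order retrieved candidates by the trained
-- predicted_ctr model. The model_ref parameter is a placeholder;
-- verify the real syntax in the Shaped query reference.
SELECT *
FROM retrieve(
  text_search(
    query='space movies',
    mode='lexical',
  )
)
ORDER BY score(model_ref='predicted_ctr', input_user_id='55')
LIMIT 50
```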
These five configurations lay the foundation for building more complex and performant retrieval systems in Shaped.