Iterate on your engine config

This guide covers five engine configurations from simple to complex that will help you build engines iteratively and quickly.

tip

Use the Shaped Playground to try the examples in this article against a live engine.

One of the most common mistakes when building retrieval systems is starting with too much complexity.

Focus first on identifying high-quality features and removing noisy features that pollute results. Then, build a retrieval-only engine and experiment with embeddings. Embeddings alone are often enough to generate excellent results. Once you have a strong baseline, you can begin experimenting with scoring policies and more sophistication.

Stage 1: Basic semantic search

Start with a basic semantic search model. Use its results to understand which text features affect relevance the most.

In this example, we create item vectors using a small embedding model from Hugging Face. The sentence-transformers/all-MiniLM-L6-v2 model is quick to train and gets us results fast.

Example config

version: v2
name: basic_engine
data:
  item_table:
    name: movielens_items # change this to your table name
    type: table
index:
  embeddings:
    - name: content_embedding # enables vector search
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
      item_fields: # update these to your table's text columns
        - movie_title
        - description

Example query

Use the text_search retriever to do semantic search:

SELECT *
FROM retrieve(
  text_search(
    query='red socks with blue soles',
    mode='vector',
    text_embedding_ref='content_embedding',
  )
)
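Under the hood, vector search like this embeds the query and ranks items by cosine similarity to their stored vectors. The sketch below illustrates that ranking step with toy 4-dimensional vectors standing in for real MiniLM embeddings; the function name and data are illustrative, not part of Shaped's API.

```python
import numpy as np

def cosine_search(query_vec, item_vecs, top_k=3):
    """Rank items by cosine similarity between the query and item vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 4-dimensional "embeddings" standing in for MiniLM vectors.
item_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],  # closest to the query direction
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
])
query_vec = np.array([1.0, 0.0, 0.0, 0.0])

results = cosine_search(query_vec, item_vecs)
print(results)  # item 0 ranks first
```

A production engine does the same ranking with an approximate nearest-neighbor index instead of a brute-force matrix product.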

If your engine has interaction data, it can rank by recency and popularity without additional configuration.

Popular items

SELECT *
FROM retrieve(
  column_order(
    columns='_derived_popular_rank',
  )
)

New items

SELECT *
FROM retrieve(
  column_order(
    columns='created_at desc',
  )
)

Stage 2: Advanced embedding model

Next, we improve the search model by using a larger model and creating multiple embeddings on different features. We can also train multimodal embeddings that can search image data.

Example config

version: v2
name: basic_engine_with_embedding_models
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
      item_fields:
        - movie_title
        - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
      item_fields:
        - poster_url

Personalization with text embeddings

The following query searches for space movies using text_search and orders the results by similarity to what the user has interacted with (cosine_similarity against the user's recent interactions).

The pooled_text_encoding function converts the user's most recent interactions into a single vector for similarity ranking.

SELECT *
FROM text_search(
  name='text_match',
  query='space movies',
  mode='lexical',
  fuzziness=0,
  limit=200)
ORDER BY score(expression='cosine_similarity(\
    pooled_text_encoding(\
      user.recent_interactions, \
      pool_fn=''mean'', \
      embedding_ref="content_embedding"\
    ), \
    text_encoding(\
      item, \
      embedding_ref="content_embedding"\
    )\
  )', input_user_id='55')
LIMIT 50

Check out the Query Playground to try this query live.
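The pooling-and-scoring logic above can be sketched in a few lines: average the user's recent interaction vectors into one profile vector, then rank candidates by cosine similarity to it. The Python function names and toy 2-dimensional vectors below are illustrative, not Shaped's actual implementations.

```python
import numpy as np

def pooled_encoding(interaction_vecs, pool_fn="mean"):
    """Collapse a user's recent interaction vectors into one profile vector."""
    if pool_fn == "mean":
        return np.mean(interaction_vecs, axis=0)
    raise ValueError(f"unsupported pool_fn: {pool_fn}")

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embeddings of the user's two most recent interactions.
recent = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
profile = pooled_encoding(recent)  # -> [0.5, 0.5]

# Candidate item embeddings, ranked by similarity to the profile.
candidates = {"space_opera": np.array([1.0, 1.0]),
              "rom_com": np.array([1.0, -1.0])}
ranked = sorted(candidates,
                key=lambda k: cosine_similarity(profile, candidates[k]),
                reverse=True)
print(ranked)  # space_opera first: its vector points the same way as the profile
```

Mean pooling is a simple, robust default; weighting recent interactions more heavily is a common refinement.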

Stage 3: Add lexical search

Once we have the semantic and image search configuration to our liking, we can add a lexical search configuration. This enables hybrid search:

Example config

version: v2
name: hybrid_search_engine
data:
  item_table:
    name: movielens_items
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
      item_fields:
        - movie_title
        - description
  lexical_search: # enables BM25 lexical search on item fields
    item_fields:
      - movie_title
      - description
      - cast
      - genres
      - writers
      - directors
      - interests
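Hybrid search needs a way to merge the BM25 ranking with the vector ranking. One common fusion technique (not necessarily the one Shaped uses internally) is reciprocal rank fusion, sketched below with hypothetical movie IDs:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists into one.

    Each document's fused score is the sum of 1 / (k + rank) across the
    lists it appears in, so items ranked well by multiple retrievers rise.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["dune", "alien", "arrival"]          # BM25 order
vector = ["arrival", "dune", "interstellar"]    # embedding order
fused = rrf([lexical, vector])
print(fused)  # dune and arrival lead: both appear high in both lists
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales.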

Stage 4: Add collaborative filtering model

We can further improve our hybrid search model by training an embedding on user interactions. This produces user and item vectors that we can use to find similar items based on user history rather than item content.

This config trains an ALS (alternating least squares) model in the training block and references it from the embedding config.

Example config

version: v2
name: collaborative_filtering_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
      item_fields:
        - movie_title
        - description
    - name: poster_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
      item_fields:
        - poster_url
    - name: people_also_liked
      encoder:
        model_ref: als # refer to a model in the training block
        type: trained_model
training:
  models:
    - policy_type: als # configuration for the ALS embedding model
      name: als
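To see what the ALS policy learns, here is a minimal dense-matrix sketch of alternating least squares: it factorizes a user-item ratings matrix into user and item vectors whose dot products approximate the observed ratings. It treats zeros as observed ratings for simplicity; Shaped's production ALS handles implicit feedback and missing entries properly.

```python
import numpy as np

def als(R, factors=2, reg=0.1, iters=20, seed=0):
    """Minimal alternating least squares on a dense ratings matrix R.

    Alternately solves the regularized least-squares problem for the user
    factors U (holding item factors V fixed) and for V (holding U fixed).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    I = np.eye(factors)
    for _ in range(iters):
        U = R @ V @ np.linalg.inv(V.T @ V + reg * I)
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg * I)
    return U, V

# 3 users x 4 movies; users 0-1 like movies 0-1, user 2 likes movies 2-3.
R = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 5.0, 4.0]])
U, V = als(R)
pred = U @ V.T
print(np.round(pred, 1))
# Item vectors V can power "people also liked": movies 0 and 1 were
# co-rated by the same users, so their vectors end up close together.
```

The reconstructed matrix predicts high scores where a user's taste cluster rated highly and near-zero elsewhere.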

Stage 5: Add ranking

Finally, we can train a scoring model using LightGBM to rank items by predicted engagement. LightGBM is often used to predict the click-through rate of items in a catalog.

Example config

version: v2
name: lightgbm_engine
data:
  item_table:
    name: movielens_items
    type: table
  interaction_table:
    name: movielens_ratings
    type: table
index:
  embeddings:
    - name: content_embedding # vectors for movie description
      encoder:
        type: hugging_face
        model_name: Alibaba-NLP/gte-modernbert-base
      item_fields:
        - movie_title
        - description
    - name: poster_embedding # vectors for poster
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
      item_fields:
        - poster_url
training:
  models:
    - policy_type: lightgbm
      name: predicted_ctr
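Conceptually, a CTR scoring model learns from labeled click data and then rescores retrieved candidates by predicted click probability. The sketch below uses a tiny hand-rolled logistic regression as a stand-in for LightGBM (to stay dependency-free); the feature names and data are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ctr_model(X, y, lr=0.5, epochs=500):
    """Logistic-regression stand-in for a LightGBM CTR model,
    fit by plain gradient descent on log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Features per (user, item) candidate: [item_popularity, content_similarity]
X = np.array([[0.9, 0.8],
              [0.2, 0.1],
              [0.8, 0.7],
              [0.1, 0.3]])
y = np.array([1, 0, 1, 0])  # 1 = clicked
w, b = train_ctr_model(X, y)

# Rescore two retrieved candidates by predicted click probability.
candidates = np.array([[0.85, 0.75],
                       [0.15, 0.20]])
ctr = sigmoid(candidates @ w + b)
print(np.round(ctr, 2))  # the first candidate is predicted far more clickable
```

In practice a gradient-boosted tree model like LightGBM captures non-linear feature interactions that this linear sketch cannot.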

These five engine configurations lay the foundation as you build more complex and performant retrieval systems in Shaped.