
Personalized search

In this guide, you will learn how to implement personalized search that combines hybrid search with a gradient-boosted decision tree (GBDT) scoring model. Shaped lets you build personalized search engines without writing any embedding or ranking model logic yourself.

Traditional search uses a search index to find relevant items, and returns them ranked by similarity to the input query.

Example: The query Red shoes would return Red Trainers, Red Boots, Red Heels. These are all types of shoes, with Red in the name.

Personalized search uses the same search index to get a set of candidate items, but then uses a trained ranking model to determine which items are most relevant to the user. For example, you can optimize for the items most likely to be purchased.

Example: The query Red shoes would return Maroon Pumps, Candy Red Boots, Crimson Mules. These are weaker textual matches for the query (two of them don't contain "Red" at all), but they are more relevant to the user.

Personalized search is useful in verticals like e-commerce, where a single query can return many similar items and you want to show the items most likely to convert rather than the closest textual match.
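To make the contrast concrete, here is a toy sketch in plain Python. The item names come from the examples above; the similarity scores and purchase probabilities are invented for illustration and do not come from any real model or from Shaped.

```python
# Toy candidates: (item name, similarity to the query "Red shoes",
# hypothetical probability that this user purchases the item).
# All numbers are made up for illustration.
candidates = [
    ("Red Trainers", 0.95, 0.02),
    ("Red Boots", 0.93, 0.05),
    ("Red Heels", 0.90, 0.03),
    ("Maroon Pumps", 0.70, 0.30),
    ("Candy Red Boots", 0.80, 0.25),
    ("Crimson Mules", 0.65, 0.20),
]


def traditional_search(items, k=3):
    """Rank purely by similarity to the query."""
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    return [name for name, _, _ in ranked[:k]]


def personalized_search(items, k=3):
    """Use the same candidates, but rank by predicted purchase probability."""
    ranked = sorted(items, key=lambda item: item[2], reverse=True)
    return [name for name, _, _ in ranked[:k]]


traditional = traditional_search(candidates)
personalized = personalized_search(candidates)
print(traditional)
print(personalized)
```

Both functions see the same candidate set; only the sort key changes. That is the core idea of personalized search: retrieval stays the same, and the ranking objective is swapped.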

For hybrid personalized search, you need three components in your engine:

  1. BM25 index for lexical search
  2. Vector embeddings for semantic search
  3. GBDT model to rank the most relevant items for the final result set

The engine contains the basic infrastructure to handle the personalized search query. At query time, we retrieve candidates from the BM25 index and the vector embeddings together. The query in this guide retrieves up to 50 candidates from each retriever, then uses the GBDT model to pick the 20 most relevant ones.
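The two-stage flow described above (retrieve a candidate pool, then re-rank it) can be sketched in plain Python. This is a conceptual illustration only: the function names, item IDs, and scores are hypothetical, and a fixed lookup table stands in for the trained GBDT model.

```python
def merge_candidates(lexical_ids, vector_ids):
    """Union of both retrievers' results, de-duplicated, first-seen order kept."""
    seen = set()
    merged = []
    for item_id in list(lexical_ids) + list(vector_ids):
        if item_id not in seen:
            seen.add(item_id)
            merged.append(item_id)
    return merged


def rerank(candidate_ids, score_fn, k):
    """Score every candidate with the ranking model and keep the top k."""
    return sorted(candidate_ids, key=score_fn, reverse=True)[:k]


# Hypothetical retriever outputs (item IDs); item 103 is found by both.
lexical_hits = [101, 102, 103]
vector_hits = [103, 104, 105]

# Stand-in for the trained ranking model: a lookup of made-up scores.
fake_scores = {101: 0.1, 102: 0.7, 103: 0.4, 104: 0.9, 105: 0.2}

pool = merge_candidates(lexical_hits, vector_hits)
top_items = rerank(pool, fake_scores.get, k=2)
print(pool)
print(top_items)
```

Because the ranking model only scores the merged pool, the retriever limits bound how much work the second stage does per query.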

Setting up the Shaped client SDK

First, instantiate the Shaped client:

from shaped import Client

client = Client(api_key="YOUR_KEY_HERE")

Upload your data to Shaped

Traditional search requires an item table, which holds the documents you will search over.

For personalized search, you need to add an interaction table as well. The interaction table tracks user behavior and is used to train the personalization model. It should contain one record for each event your users generate (impression, view, purchase), plus a label that assigns a weight to each event.
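One way to produce the label column is to map each event type to a weight before uploading. The sketch below assumes a hypothetical weighting scheme; the weights and the helper name add_labels are illustrative choices, not values prescribed by Shaped.

```python
# Hypothetical weights per event type; tune these for your own objective.
EVENT_WEIGHTS = {"impression": 0.0, "view": 0.2, "click": 0.5, "purchase": 1.0}


def add_labels(events, weights=EVENT_WEIGHTS, default=0.0):
    """Return copies of the events with a numeric 'label' field attached."""
    return [
        {**event, "label": weights.get(event["event_type"], default)}
        for event in events
    ]


raw_events = [
    {"user_id": "user1", "item_id": 187541, "event_type": "click"},
    {"user_id": "user1", "item_id": 1, "event_type": "purchase"},
]
labeled_events = add_labels(raw_events)
print(labeled_events)
```

Stronger signals such as purchases get higher labels, so a model trained on them leans toward items that are likely to convert.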

In this example, we will upload a custom table. You can use a table imported via a data connector instead.

First, declare your item table.

item_table_config = {
    "schema_type": "CUSTOM",
    "name": "pixar_movies",
    "column_schema": {
        "item_id": "Int64",
        "movie_title": "String",
        "poster_url": "String",
        "description": "String",
        "release_date": "String",
        "cast": "Array(String)",
    },
}

client.create_table(item_table_config)

Next, upload your item data:

records = [
{"item_id": 187541, "movie_title": "Incredibles 2 (2018)", "poster_url": "https://m.media-amazon.com/images/M/MV5BMTEzNzY0OTg0NTdeQTJeQWpwZ15BbWU4MDU3OTg3MjUz._V1_QL75_UX380_CR0,0,380,562_.jpg", "description": "The Incredibles family takes on a new mission which involves a change in family roles: Bob Parr (Mr. Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world.", "release_date": "2018-06-15", "cast": ["Craig T. Nelson", "Holly Hunter", "Sarah Vowell", "Huck Milner", "Catherine Keener", "Eli Fucile", "Bob Odenkirk", "Samuel L. Jackson", "Michael Bird", "Sophia Bush", "Brad Bird", "Brad Bird", "Nicole Paradis Grindle", "John Walker", "Michael Giacchino", "Stephen Schaffer", "Natalie Lyon", "Kevin Reher", "Ralph Eggleston"]},
{"item_id": 177765, "movie_title": "Coco (2017)", "poster_url": "https://m.media-amazon.com/images/M/MV5BMDIyM2E2NTAtMzlhNy00ZGUxLWI1NjgtZDY5MzhiMDc5NGU3XkEyXkFqcGc@._V1_QL75_UY562_CR7,0,380,562_.jpg", "description": "Aspiring musician Miguel, confronted with his family's ancestral ban on music, enters the Land of the Dead to find his great-great-grandfather, a legendary singer.", "release_date": "2017-11-22", "cast": ["Anthony Gonzalez", "Gael García Bernal", "Benjamin Bratt", "Alanna Ubach", "Renee Victor", "Jaime Camil", "Alfonso Arau", "Herbert Siguenza", "Gabriel Iglesias", "Lombardo Boyar", "Lee Unkrich", "Lee Unkrich", "Jason Katz", "Matthew Aldrich", "Adrian Molina", "Darla K. Anderson", "Michael Giacchino", "Steve Bloom", "Lee Unkrich", "Carla Hool", "Natalie Lyon", "Kevin Reher", "Harley Jessup"]},
]

client.insert_table_rows("pixar_movies", records)

Now do the same for the interaction table:

interaction_table_config = {
    "schema_type": "CUSTOM",
    "name": "user_interactions",
    "column_schema": {
        "user_id": "String",
        "item_id": "Int64",
        "event_type": "String",
        "timestamp": "String",
        "created_at": "DateTime",
        "label": "Double",
    },
}

client.create_table(interaction_table_config)

Upload sample interaction data:

interactions = [
    {"user_id": "user1", "item_id": 187541, "event_type": "click", "timestamp": "2024-01-15T10:00:00Z"},
    {"user_id": "user1", "item_id": 177765, "event_type": "click", "timestamp": "2024-01-16T14:30:00Z"},
    {"user_id": "user1", "item_id": 1, "event_type": "purchase", "timestamp": "2024-01-17T09:15:00Z"},
    {"user_id": "user2", "item_id": 134853, "event_type": "click", "timestamp": "2024-01-15T11:20:00Z"},
    {"user_id": "user2", "item_id": 170957, "event_type": "click", "timestamp": "2024-01-16T16:45:00Z"},
]

client.insert_table_rows("user_interactions", interactions)
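Before uploading, it can help to compare each row's keys against the declared column schema. The sample rows above, for instance, do not include the created_at and label columns from the schema, and this guide does not say how Shaped treats omitted columns. The helper below is a hypothetical local check in plain Python, not part of the Shaped SDK.

```python
def check_columns(rows, column_schema):
    """Map row index -> (missing columns, unexpected columns) for mismatched rows."""
    expected = set(column_schema)
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(expected - set(row))
        unexpected = sorted(set(row) - expected)
        if missing or unexpected:
            problems[index] = (missing, unexpected)
    return problems


# Same column names as the interaction table schema in this guide.
schema = {
    "user_id": "String",
    "item_id": "Int64",
    "event_type": "String",
    "timestamp": "String",
    "created_at": "DateTime",
    "label": "Double",
}
sample_rows = [
    {"user_id": "user1", "item_id": 187541, "event_type": "click",
     "timestamp": "2024-01-15T10:00:00Z"},
]
problems = check_columns(sample_rows, schema)
print(problems)
```

An empty result means every row has exactly the declared columns. The check only compares column names; it does not validate value types.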

Once you have your data tables, you can start configuring your engine.

Configure your engine with GBDT ranking model

Now you will configure the personalized search engine with hybrid search and GBDT ranking.

Tip: The example below splits each configuration step into its own code block for clarity. In production, you can combine these steps into a single Create Engine API call.

Instantiate the engine object

Start by instantiating the engine configuration class:

from shaped.autogen.models.engine_config_v2 import EngineConfigV2
from shaped.autogen.models.data_config import DataConfig

personalized_search_engine = EngineConfigV2(
    name="personalized_search",
    data=DataConfig(),
)

Connect engine to data

Connect both the item table and interaction table to your engine:

from shaped.autogen.models.data_config_interaction_table import DataConfigInteractionTable
from shaped.autogen.models.reference_table_config import ReferenceTableConfig

personalized_search_engine.data = DataConfig(
    item_table=DataConfigInteractionTable(
        ReferenceTableConfig(name="pixar_movies")
    ),
    interaction_table=DataConfigInteractionTable(
        ReferenceTableConfig(name="user_interactions")
    ),
)

Configure both lexical and vector search as in the hybrid search example:

from shaped.autogen.models.index_config import IndexConfig
from shaped.autogen.models.search_config import SearchConfig
from shaped.autogen.models.embedding_config import EmbeddingConfig
from shaped.autogen.models.encoder import Encoder
from shaped.autogen.models.hugging_face_encoder import HuggingFaceEncoder

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

personalized_search_engine.index = IndexConfig(
    # BM25 lexical index over the text fields
    lexical_search=SearchConfig(
        item_fields=["movie_title", "description"],
    ),
    # Vector embeddings for semantic search over the same fields
    embeddings=[
        EmbeddingConfig(
            name="movie_text_embedding",
            encoder=Encoder(
                HuggingFaceEncoder(
                    model_name=embedding_model,
                    item_fields=["movie_title", "description"],
                )
            ),
        )
    ],
)

Configure GBDT training

Note: GBDT model training requires a Standard plan or above. See Shaped pricing for details.

Configure a GBDT model to learn from user interactions and personalize search results:

from shaped.autogen.models.training_config import TrainingConfig
from shaped.autogen.models.models_inner import ModelsInner
from shaped.autogen.models.shaped_internal_recsys_policies_gbdt_gbdt_policy_config import ShapedInternalRecsysPoliciesGbdtGBDTPolicyConfig

personalized_search_engine.training = TrainingConfig(
    models=[
        ModelsInner(
            ShapedInternalRecsysPoliciesGbdtGBDTPolicyConfig(
                policy_type="gbdt",
                # This name is referenced later as value_model in the query
                name="click_through_rate",
            )
        )
    ],
)
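For intuition about what a GBDT learns, here is a minimal gradient-boosting sketch in plain Python that uses one-split trees (stumps) and squared error. It is a teaching example under simplifying assumptions (a single numeric feature, made-up labels) and does not reflect how Shaped trains its model.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, left_value, right_value)


def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and every stump's shrunken contribution."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


# Made-up data: feature = number of past views of similar items,
# label = interaction weight for the item.
views = [0, 1, 2, 5, 8, 10]
labels = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
model = fit_gbdt(views, labels)
print(predict(model, 0), predict(model, 10))
```

Each round corrects part of the error left by the previous rounds, which is why the ensemble can fit the three label levels even though every individual tree makes only one split.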

Start indexing and training

After configuring your engine's data, index, and training, use the create engine method to start both indexing and model training:

client.create_engine(engine_config=personalized_search_engine)

Make a personalized search query

After the engine has finished indexing and training, you can search with personalization.

Use hybrid search retrievers combined with a score expression that weights results by the trained GBDT model:

from shaped import RankQueryBuilder, TextSearch

query = (
    RankQueryBuilder()
    .from_entity('item')
    # Stage 1: retrieve candidates from both the lexical and vector indexes
    .retrieve([
        TextSearch(
            input_text_query='$query',
            mode={'type': 'lexical'},
            limit=50,
            name='lexical_search'
        ),
        TextSearch(
            input_text_query='$query',
            mode={'type': 'vector', 'text_embedding_ref': 'movie_text_embedding'},
            limit=50,
            name='vector_search'
        )
    ])
    # Stage 2: score candidates with the trained GBDT model
    .score(
        value_model='click_through_rate',
        input_user_id='$user_id',
        input_interactions_item_ids='$interaction_item_ids'
    )
    .limit(20)
    .build()
)

results = client.execute_query(
    engine_name="personalized_search",
    query=query,
    parameters={
        "query": "Incredibles",
        "user_id": "user1",
        "interaction_item_ids": ["187541", "177765", "1"]
    },
    return_metadata=True,
)
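The interaction_item_ids values in the call above match user1's rows in the sample interaction data. A small helper can build that list from your own interaction records instead of hard-coding it. The helper name, the oldest-first ordering, and the limit are assumptions made for this sketch; the string IDs follow the example above.

```python
def recent_item_ids(interactions, user_id, limit=10):
    """Return the user's most recent item IDs as strings, oldest first."""
    events = [e for e in interactions if e["user_id"] == user_id]
    # ISO 8601 timestamps in one consistent format sort correctly as strings.
    events.sort(key=lambda e: e["timestamp"])
    return [str(e["item_id"]) for e in events[-limit:]]


# Sample interaction rows from this guide.
sample_interactions = [
    {"user_id": "user1", "item_id": 187541, "event_type": "click", "timestamp": "2024-01-15T10:00:00Z"},
    {"user_id": "user1", "item_id": 177765, "event_type": "click", "timestamp": "2024-01-16T14:30:00Z"},
    {"user_id": "user1", "item_id": 1, "event_type": "purchase", "timestamp": "2024-01-17T09:15:00Z"},
    {"user_id": "user2", "item_id": 134853, "event_type": "click", "timestamp": "2024-01-15T11:20:00Z"},
    {"user_id": "user2", "item_id": 170957, "event_type": "click", "timestamp": "2024-01-16T16:45:00Z"},
]

parameters = {
    "query": "Incredibles",
    "user_id": "user1",
    "interaction_item_ids": recent_item_ids(sample_interactions, "user1"),
}
print(parameters)
```

The resulting dictionary has the same shape as the parameters argument passed to execute_query above.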