In this tutorial we'll show you how to set up a recommendation model for the MovieLens 100k dataset using Shaped. This dataset contains 100,000 ratings from ~1000 users on ~1700 movies. With Shaped we'll be able to train a recommendation model that predicts the movies each user is most likely to want to watch.
This tutorial is written in Python and uses Shaped's File data connector, but you can easily translate it to your favorite language or data stack.
Let's get started! 🚀
Prepare
This tutorial requires you to install the following packages:
- requests
- pandas
- boto3
- pyarrow
Download public dataset
To start off, let's fetch the publicly hosted MovieLens dataset we'll be training our model with.
from urllib.request import urlretrieve
import zipfile
# Download the MovieLens 100k archive and extract it into ./ml-100k
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
with zipfile.ZipFile("movielens.zip", "r") as zip_ref:
    zip_ref.extractall()
Taking a look at the downloaded dataset, there are three tables of interest:
- ratings, which are stored in ml-100k/u.data
- users, which are stored in ml-100k/u.user
- movies, which are stored in ml-100k/u.item
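The ratings table is the only one this tutorial trains on, but it can help to peek at the other two. The following is a minimal sketch, assuming the pipe-separated, header-less layout described in the dataset's README; the column names, the helper names `read_users` and `read_movies`, and the `latin-1` encoding for u.item are our own assumptions rather than anything Shaped requires. It runs on a few illustrative in-memory rows so it works without the download.

```python
import io

import pandas as pd

# Column names are our own labels; the raw files have no header row.
USER_COLUMNS = ["user_id", "age", "gender", "occupation", "zip_code"]
MOVIE_COLUMNS = ["movie_id", "title", "release_date"]


def read_users(source):
    """Read a u.user-style file: pipe-separated, one user per line."""
    return pd.read_csv(source, sep="|", header=None, names=USER_COLUMNS,
                       dtype={"zip_code": str})


def read_movies(source):
    """Read a u.item-style file, keeping only the first three fields."""
    return pd.read_csv(source, sep="|", header=None, usecols=[0, 1, 2],
                       names=MOVIE_COLUMNS, encoding="latin-1")


# Illustrative rows in the same layout, so the sketch runs offline.
sample_users = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")
sample_movies = io.StringIO(
    "1|Example Movie A (1995)|01-Jan-1995||http://example.com/a|0|1|0\n"
    "2|Example Movie B (1996)|01-Jan-1996||http://example.com/b|1|0|0\n"
)
users = read_users(sample_users)
movies = read_movies(sample_movies)
print(users)
print(movies)
```

With the real download in place, the same helpers can be pointed at the file paths instead, for example `read_users("ml-100k/u.user")`.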
Prepare dataset
Today we're going to show you how to create a very simple collaborative filtering model using the MovieLens rating data. In the future we'll show you how to incorporate the user and item features to improve the model with more context.
The ratings table needs some preparation to format it in a way Shaped understands. We need to:
- Convert the unix_timestamp column to a timezone-aware timestamp.
- Store the input file as a Parquet file.
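import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# u.data is tab-separated with no header row: user id, item id, rating, timestamp
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id', 'movie_id', 'rating', 'unix_timestamp'])
ratings['unix_timestamp'] = pd.to_datetime(ratings['unix_timestamp'], unit='s', utc=True)
print(ratings)
ratings_parquet_filename = 'ratings.parquet'
ratings_table = pa.Table.from_pandas(ratings)
pq.write_table(ratings_table, where=ratings_parquet_filename)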
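Before uploading, it can be worth confirming that the frame really has the shape described above. The check below is only a sketch: `check_ratings` and `REQUIRED_COLUMNS` are hypothetical names of ours, the column names follow the ones used in this tutorial's schema, and the sample rows are illustrative rather than taken from the real file.

```python
import pandas as pd

REQUIRED_COLUMNS = ["user_id", "movie_id", "rating", "unix_timestamp"]


def check_ratings(frame):
    """Raise ValueError if the frame is not in the shape this tutorial uploads."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # A timezone-aware datetime column has a DatetimeTZDtype dtype in pandas.
    if not isinstance(frame["unix_timestamp"].dtype, pd.DatetimeTZDtype):
        raise ValueError("unix_timestamp is not timezone aware")
    if frame[REQUIRED_COLUMNS].isna().any().any():
        raise ValueError("found empty values in required columns")


# Illustrative rows only, not taken from the real file.
sample = pd.DataFrame({
    "user_id": [1, 2],
    "movie_id": [10, 20],
    "rating": [3, 5],
    "unix_timestamp": [881250949, 891717742],
})
sample["unix_timestamp"] = pd.to_datetime(sample["unix_timestamp"], unit="s", utc=True)
check_ratings(sample)  # passes silently when the frame looks right
```

Calling `check_ratings(ratings)` on the real frame just before writing the Parquet file would catch a forgotten timestamp conversion early.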
Create, train and deploy your model
Now that our data is prepared we can create a Shaped model. All we need to do is send a POST request to the <https://api.dev.shaped.ai/v0/models> endpoint with the model_name, connector_configs and schema mapping.
The connector_configs is set to a single File type connector because, in this tutorial, we'll be uploading the data from a local file. For your use case you may want to use a database or data warehouse connection, or some combination of many connectors, to retrieve the data directly from your data store.
Within schema, we map the user_id column to the user object and the movie_id column to the item object. We're not using context features for the user or item, so in the request below their source fields simply point at the same ratings file.
Within the interaction object we set the corresponding label column details (i.e. the rating column with the Rating type), the created_at field to the unix_timestamp column, and the source object to the file connector, referenced by id, along with the local ratings Parquet file path.
For further details about this endpoint please refer to the Create Model API reference.
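The mapping described above can also be sketched as a small helper that assembles the request body. This is an illustration only: `build_model_payload` is a hypothetical name of ours, and the field names and values simply mirror the request used in this tutorial rather than documenting the full API.

```python
import json


def build_model_payload(model_name, data_path, connector_id="file",
                        exploitation_factor=0.95):
    """Assemble the JSON body for the create-model request in this tutorial."""
    def source():
        # Each schema entry reads from the same file through the same connector.
        return {"connector_id": connector_id, "path": data_path}

    return {
        "model_name": model_name,
        "connector_configs": [{"id": connector_id, "type": "File"}],
        "schema": {
            "user": {"source": source(), "id": "user_id"},
            "item": {"source": source(), "id": "movie_id"},
            "interaction": {
                "source": source(),
                "created_at": "unix_timestamp",
                "label": {"name": "rating", "type": "Rating"},
            },
        },
        "exploitation_factor": exploitation_factor,
    }


payload = build_model_payload("rating_events", "ratings.parquet")
print(json.dumps(payload, indent=2))
```

Building the body in one place makes it easier to see that the three source entries share a single connector id and file path.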
import json
import requests

# YOUR_API_KEY is a placeholder for your Shaped API key.
model_name = "rating_events"
response = requests.post(
    "https://api.dev.shaped.ai/v0/models",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type": "application/json"
    },
    json={
        "model_name": model_name,
        "connector_configs": [{
            "id": "file",
            "type": "File"
        }],
        "schema": {
            "user": {
                "source": {
                    "connector_id": "file",
                    "path": ratings_parquet_filename
                },
                "id": "user_id",
            },
            "item": {
                "source": {
                    "connector_id": "file",
                    "path": ratings_parquet_filename
                },
                "id": "movie_id",
            },
            "interaction": {
                "source": {
                    "connector_id": "file",
                    "path": ratings_parquet_filename,
                },
                "created_at": "unix_timestamp",
                "label": {
                    "name": "rating",
                    "type": "Rating"
                }
            }
        },
        "exploitation_factor": 0.95
    }
)
upload_request = json.loads(response.content)
When using a local File connection, the response from the Create Model endpoint is a JSON object containing a signed URL that you can upload your local data to.
Note that this step isn't needed if you connect directly to your data stack.
with open(ratings_parquet_filename, 'rb') as file:
    files = {'file': (ratings_parquet_filename, file)}
    upload_response = requests.post(upload_request['url'], data=upload_request['fields'], files=files)
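The upload fails quietly if the response is never inspected. Below is a minimal sketch of the same step with basic checks added. `upload_local_file` is a hypothetical helper of ours; it assumes the response carries the `url` and `fields` keys used above, and it takes the HTTP function as a parameter (any callable with the signature of `requests.post`) so it can be exercised without a network connection.

```python
def upload_local_file(upload_request, file_path, post):
    """Upload file_path to the signed URL and raise if the upload fails.

    `post` is any callable shaped like requests.post; its return value
    must offer raise_for_status(), as requests.Response does.
    """
    missing = [key for key in ("url", "fields") if key not in upload_request]
    if missing:
        raise KeyError(f"upload response is missing {missing}")
    # Keep the file open for the duration of the request.
    with open(file_path, "rb") as handle:
        response = post(
            upload_request["url"],
            data=upload_request["fields"],
            files={"file": (file_path, handle)},
        )
    response.raise_for_status()
    return response
```

In this tutorial it would be called as `upload_local_file(upload_request, ratings_parquet_filename, requests.post)`.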
Inspect your model
Your recommendation model can take up to a few hours to train depending on:
- How many interactions the model trains on
- How many features it uses
To view the status of your model you can send a GET request to the <https://api.dev.shaped.ai/v0/models> endpoint.
response = requests.get(
    "https://api.dev.shaped.ai/v0/models",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type": "application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))
"""
{
  "models": [
    {
      "created_at": "2022-09-29T13:53:23 UTC",
      "input_schema": {
        "interaction": {
          "created_at": "unix_timestamp",
          "label": {
            "name": "rating",
            "type": "Rating"
          },
          "source": {
            "connector_id": "file",
            "path": "s3://magnusstackdevrankingservic-featurerepos3f51007ca-3sdmvv2hntl5/3741/tenant_data/ratings.parquet"
          }
        },
        "item": {
          "id": "movie_id",
          "source": {
            "connector_id": "file",
            "path": "s3://magnusstackdevrankingservic-featurerepos3f51007ca-3sdmvv2hntl5/3741/tenant_data/ratings.parquet"
          }
        },
        "user": {
          "id": "user_id",
          "source": {
            "connector_id": "file",
            "path": "s3://magnusstackdevrankingservic-featurerepos3f51007ca-3sdmvv2hntl5/3741/tenant_data/ratings.parquet"
          }
        }
      },
      "label": "Rating",
      "model_name": "rating_events",
      "status": "ACTIVE"
    }
  ]
}
"""
You'll notice the status of the model you just created starts in the PREPARING state. This means that the initial training job hasn't completed yet. When it is ready, the status field will change to ACTIVE.
For further details about this endpoint please refer to the List Models API reference.
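Since training can take a while, you may prefer to poll for the status rather than check by hand. The loop below is a sketch under stated assumptions: `wait_until_active` is a hypothetical helper of ours, `fetch_models` is any callable returning the parsed JSON of the list endpoint in the shape shown above, and only the PREPARING and ACTIVE states mentioned in this tutorial are handled. Any other state is simply waited on until the timeout.

```python
import time


def wait_until_active(fetch_models, model_name, poll_seconds=60,
                      timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll until the named model reports ACTIVE, or raise on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        models = fetch_models()["models"]
        status = next(
            (m["status"] for m in models if m["model_name"] == model_name),
            None,
        )
        if status is None:
            raise LookupError(f"model {model_name!r} not found")
        if status == "ACTIVE":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model {model_name!r} still {status}")
        sleep(poll_seconds)
```

Here `fetch_models` could be a small function that wraps the GET request from this section and returns `json.loads(response.content)`.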
Fetch ranked results
Once your model is ready, you can send a GET request to the <https://api.dev.shaped.ai/v0/models/{model_name}/rank?user_id={user_id}> endpoint to fetch recommendations for a user.
The {user_id} is the id of the user you want to fetch rankings for. You can also add an optional query param, limit, which sets how many results to return (the default is 5).
For further details about this endpoint please refer to the Rank API reference.
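To avoid hand-assembling query strings, the rank URL can be built with the standard library. This is a sketch only: `rank_url` and `BASE_URL` are hypothetical names of ours, and the path follows the one used in this section's request code, with limit added only when it is given.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.dev.shaped.ai/v0"


def rank_url(model_name, user_id, limit=None):
    """Build the rank endpoint URL for one user, with an optional limit."""
    params = {"user_id": user_id}
    if limit is not None:
        params["limit"] = limit
    # quote() guards against model names that contain reserved characters.
    path = f"/models/{quote(str(model_name), safe='')}/rank"
    return f"{BASE_URL}{path}?{urlencode(params)}"


print(rank_url("rating_events", 1))
print(rank_url("rating_events", 1, limit=10))
```

Alternatively, `requests.get` accepts a `params` dictionary and will encode the query string for you.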
response = requests.get(
    f"https://api.dev.shaped.ai/v0/models/{model_name}/rank?user_id=1",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type": "application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))
"""
[
  134,
  408,
  484,
  483,
  86
]
"""
The returned ids are item ids for your item entity. You can use the ordering to decide how to display them in your product.
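For display, the ranked ids can be joined back to movie titles from the movies table. The sketch below assumes the pipe-separated u.item layout with the id in the first field and the title in the second; `load_titles` and `titles_for` are hypothetical helpers of ours, and the sample rows are made up so the example runs without the download.

```python
import csv
import io


def load_titles(lines):
    """Map movie id -> title from u.item-style lines (id|title|...)."""
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    return {int(row[0]): row[1] for row in reader if row}


def titles_for(ranked_ids, titles):
    """Return titles in ranked order, flagging ids with no known title."""
    return [titles.get(item_id, f"<unknown {item_id}>") for item_id in ranked_ids]


# Made-up rows in the u.item layout; real titles come from ml-100k/u.item.
sample_items = io.StringIO(
    "134|Example Movie A|01-Jan-1990||http://example.com/a|0|1\n"
    "408|Example Movie B|01-Jan-1991||http://example.com/b|1|0\n"
)
titles = load_titles(sample_items)
print(titles_for([408, 134, 7], titles))
```

With the real file, something like `load_titles(open("ml-100k/u.item", encoding="latin-1"))` should work, with the encoding again being an assumption about the dataset.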
Clean up
Don't forget to clean up your model once you're finished with it. You can send a DELETE request to the <https://api.dev.shaped.ai/v0/models/{model_name}> endpoint to delete it.
requests.delete(
    f"https://api.dev.shaped.ai/v0/models/{model_name}",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type": "application/json"
    }
)