
Movie Recommendations (MovieLens)

In this tutorial we'll show you how to set up a recommendation model for the MovieLens 100k dataset using Shaped. This dataset contains 100,000 ratings from ~1000 users on ~1700 movies. With Shaped we'll train a recommendation model that predicts which movies each user is most likely to want to watch.

This tutorial uses Shaped's local dataset connector, but the same steps translate easily to any of the data stores or real-time connectors we support.

Let's get started! 🚀

You can follow along in our accompanying notebook!

Shaped CLI Setup

Installing the Shaped CLI

You'll need to install the Shaped CLI if you haven't already. You can do this with the following command:

pip install shaped

Shaped supports Python 3.8+. Take a look at the installation instructions if you need to install pip.

Initialize the CLI

You can then initialize the shaped client with your API key. If you don't have an API key yet, check out the How to get an API key page.

shaped init --api-key <YOUR_API_KEY>

Dataset Preparation

Download public dataset

To start off, let's fetch the publicly hosted MovieLens dataset we'll be training our model with.

CLI
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip --no-check-certificate
unzip ml-100k.zip

Taking a look at the downloaded dataset, there are three tab-separated files (TSVs) of interest:

  • ratings which are stored in ml-100k/u.data
  • users which are stored in ml-100k/u.user
  • movies which are stored in ml-100k/u.item


Unfortunately, none of these files has a header row, which Shaped requires. To address this for the ratings file, we can prepend a header with the following command:

(printf 'user_id\titem_id\trating\ttimestamp\n'; cat ml-100k/u.data) > ml-100k/u.data_with_header
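If you'd rather do this step from Python (for example inside the notebook), the sketch below does the same thing as the command above and also checks that every row has four fields. The function name prepend_header and the RATING_COLUMNS constant are illustrative names chosen for this example, not part of Shaped.

```python
import csv

# Column names for ml-100k/u.data, in file order (same as the header above).
RATING_COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def prepend_header(src, dst, columns=RATING_COLUMNS):
    """Copy a headerless TSV to dst with a header line; return the data row count."""
    rows = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        for row in reader:
            if not row:
                continue  # skip blank lines
            if len(row) != len(columns):
                raise ValueError(
                    f"line {reader.line_num}: expected {len(columns)} fields, got {len(row)}"
                )
            writer.writerow(row)
            rows += 1
    return rows
```

Calling prepend_header("ml-100k/u.data", "ml-100k/u.data_with_header") should return 100000 for the full dataset, which is a quick sanity check before uploading.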

To keep things as simple as possible, this tutorial only uses events to create the model. If you want to use the user and item data as well, just carry out the steps below in the same way. You can see how that's done in the notebook for this tutorial.

Create MovieLens Shaped dataset

For this tutorial we're going to be creating a Shaped Dataset and inserting the ratings records into it. To create this dataset, you first need to create a dataset definition which includes the schema as follows:

movielens_dataset.yaml
dataset_name: movielens_ratings
schema_type: CUSTOM
schema:
  rating: Int32
  user_id: String
  item_id: String
  timestamp: DateTime
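To make the schema types concrete, here is a minimal sketch of how one line of u.data maps onto them. MovieLens stores timestamps as Unix seconds; how Shaped itself parses them on insert isn't covered in this tutorial, so treat the conversion below as an illustration only. parse_rating_line is a hypothetical helper, not a Shaped function.

```python
from datetime import datetime, timezone


def parse_rating_line(line):
    """Turn one tab-separated u.data line into a record typed like the schema."""
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return {
        "user_id": user_id,                  # String
        "item_id": item_id,                  # String
        "rating": int(rating),               # Int32
        # DateTime: u.data stores Unix seconds, interpreted here as UTC.
        "timestamp": datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    }
```

For example, parse_rating_line("196\t242\t3\t881250949") yields a rating of 3 and a timestamp in December 1997.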

You can use this definition to create the ratings dataset with the create-dataset command using Shaped's CLI:

CLI
shaped create-dataset --file movielens_dataset.yaml

Insert events

We now want to insert the movielens ratings into the dataset, which we can do with the dataset-insert command.

CLI
shaped dataset-insert --dataset-name movielens_ratings --file ml-100k/u.data_with_header --type 'tsv'

You'll see the records uploading in batches of 1000. Once the upload has reached 100k records, you can move forward.

Create your model

We're now ready to create your Shaped model! To keep things simple, we're using only the ratings records to build a collaborative filtering model. Shaped will use these ratings to determine which users like which movies, on the assumption that the higher the rating, the more the user liked the rated movie.

Here's the create model definition we'll be using, and the corresponding create-model command.

movielens_movie_recommendations.yaml
model:
  name: movielens_movie_recommendation
connectors:
  - type: Dataset
    id: movielens_ratings
    name: movielens_ratings
fetch:
  events: |
    SELECT user_id, item_id, timestamp AS created_at, rating AS label
    FROM movielens_ratings

CLI
shaped create-model --file movielens_movie_recommendations.yaml

For further details about creating models please refer to the Create Model API reference.

Inspect your model

It can take up to a few hours for Shaped to provision your infrastructure and train your recommendation model on your historic events. The time mostly depends on how large your dataset is, i.e. the number of users, items and interactions, and the number of attributes you're providing.

While the model is being set up, you can view its status with either the List Models or View Model endpoints. For example, with the CLI:

shaped list-models

Response:

{
  "models": [
    {
      "created_at": "2023-03-18T19:17:51 UTC",
      "model_name": "movielens_movie_recommendation",
      "model_uri": "https://api.prod.shaped.ai/v1/models/movielens_movie_recommendation",
      "status": "FETCHING"
    }
  ]
}

As you can see, the model is currently fetching the data. The initial model creation pipeline goes through the following stages in order:

  1. SCHEDULING
  2. FETCHING
  3. TRAINING
  4. DEPLOYING
  5. ACTIVE

You can periodically poll Shaped to inspect these status changes. Once the model is in the ACTIVE state, you can move on to the next step and use it to make rank requests.
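The polling described above can be sketched as a small loop. This is only an outline under a few assumptions: get_status is a callable you supply (for example, one that runs shaped list-models or calls the View Model endpoint and extracts the "status" field), and the poll interval, the timeout and the handling of unknown statuses are choices made for this example rather than documented Shaped behaviour.

```python
import time

# Stages of the initial model creation pipeline, in order.
STAGES = ["SCHEDULING", "FETCHING", "TRAINING", "DEPLOYING", "ACTIVE"]


def wait_until_active(get_status, poll_seconds=60.0, timeout_seconds=6 * 3600,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "ACTIVE".

    get_status is a caller-supplied function returning the current status string.
    Raises RuntimeError on a status outside STAGES (assumed here to signal a
    failure) and TimeoutError if the model is not ACTIVE in time.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "ACTIVE":
            return status
        if status not in STAGES:
            raise RuntimeError(f"unexpected model status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"model still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

Passing sleep and clock as parameters keeps the loop easy to test without actually waiting.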

Fetch your recommendations

You're now ready to fetch your movie recommendations. You can do this with the Rank endpoint: just provide the user_id you want recommendations for and the number of recommendations you want returned.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

shaped rank --model-name movielens_movie_recommendation --user-id 1 --limit 5

Response:

{
  "ids": [
    "427010",
    "182094",
    "332874",
    "827918",
    "403528"
  ],
  "scores": [
    0.9,
    0.8,
    0.7,
    0.3,
    0.2
  ]
}

The response contains two parallel arrays: the ids of the movies that Shaped estimates are most interesting to the given user, and their ranking scores. The score at each position belongs to the id at the same position.
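Since the two arrays line up by position, a consumer will usually want to zip them together. A minimal sketch, assuming the response has already been parsed into a Python dict (pair_recommendations is an illustrative helper name):

```python
def pair_recommendations(response):
    """Combine the parallel "ids" and "scores" arrays into (id, score) pairs."""
    ids = response["ids"]
    scores = response["scores"]
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have the same length")
    return list(zip(ids, scores))
```

With the example response above, the first pair is ("427010", 0.9).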

If you want to integrate this endpoint into your website or application you can use the Rank POST REST endpoint directly with the following request:

curl -X POST https://api.prod.shaped.ai/v1/models/movielens_movie_recommendation/rank \
  -H "x-api-key: <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "1",
    "limit": 5
  }'
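For an application backend, the same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call above (same URL, headers and JSON body); build_rank_request and fetch_rank are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_BASE = "https://api.prod.shaped.ai/v1"


def build_rank_request(model_name, user_id, limit, api_key):
    """Build the POST request for the Rank endpoint without sending it."""
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/models/{model_name}/rank",
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_rank(request, timeout=10.0):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A call would look like fetch_rank(build_rank_request("movielens_movie_recommendation", "1", 5, "<API_KEY>")). Keeping request construction separate from sending makes it possible to inspect or log the request before it goes out.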

Clean Up

Don't forget to delete your model once you've finished with it. You can do this with the following CLI command:

shaped delete-model --model-name movielens_movie_recommendation