In this tutorial we'll show you how to set up a recommendation model for the MovieLens 100k dataset using Shaped. This dataset contains 100,000 ratings from ~1000 users on ~1700 movies. With Shaped we'll be able to train a recommendation model that predicts the movies each user is most likely to want to watch.

This tutorial uses Python and Shaped's File data connector, but you can easily translate it to your favorite language or data stack.

Let's get started! 🚀

Prepare

This tutorial requires you to install the following packages:

  • requests
  • pandas
  • boto3
  • pyarrow

Download public dataset

To start off, let's fetch the publicly hosted MovieLens dataset we'll be training our model with.

from urllib.request import urlretrieve
import zipfile

urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
# Use a context manager so the archive is closed after extraction.
with zipfile.ZipFile("movielens.zip", "r") as zip_ref:
    zip_ref.extractall()

Taking a look at the downloaded dataset, there are three tables of interest:

  • ratings which are stored in ml-100k/u.data
  • users which are stored in ml-100k/u.user
  • movies which are stored in ml-100k/u.item
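As a quick look at the ratings file format, here is a minimal sketch that parses a few lines in the layout of ml-100k/u.data using only the standard library. It assumes the file is tab-separated with no header row, in the order user id, item id, rating, Unix timestamp. The column names and the `parse_ratings` helper are illustrative choices, not part of the dataset or of Shaped.

```python
import csv
import io

# Assumed column order of ml-100k/u.data (tab-separated, no header row).
COLUMNS = ["user_id", "movie_id", "rating", "unix_timestamp"]

def parse_ratings(text):
    """Parse tab-separated rating lines into a list of dicts with int values."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, map(int, row))) for row in reader if row]

# Two sample lines in the u.data layout.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
rows = parse_ratings(sample)
print(rows[0])
```

For the real file, an open file handle on ml-100k/u.data could be passed to `csv.reader` in place of the in-memory sample.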

Prepare dataset

Today we're going to show you how to create a very simple collaborative filtering model using the MovieLens rating data. In the future we'll show you how to incorporate the user and items features to improve the model with more context.

The ratings table needs some preparation to format it in a way Shaped understands. We need to:

  1. Convert the unix_timestamp column to be a timezone aware timestamp.
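  2. Store the input file as a Parquet file.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# u.data is tab-separated with no header row; these column names match the schema used below.
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id', 'movie_id', 'rating', 'unix_timestamp'])
ratings['unix_timestamp'] = pd.to_datetime(ratings['unix_timestamp'], unit='s', utc=True)
print(ratings)
ratings_parquet_filename = 'ratings.parquet'
ratings_table = pa.Table.from_pandas(ratings)
pq.write_table(ratings_table, where=ratings_parquet_filename)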
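To make the timestamp step concrete, here is a small standard-library sketch of what converting a Unix timestamp to a timezone-aware value means for a single rating. The `to_utc_datetime` helper is a hypothetical name used for illustration; the pandas call above does the equivalent for the whole column.

```python
from datetime import datetime, timezone

def to_utc_datetime(unix_seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

ts = to_utc_datetime(881250949)
print(ts.isoformat())  # 1997-12-04T15:55:49+00:00
```

Passing `tz=timezone.utc` is what makes the result timezone-aware; without it, `fromtimestamp` returns a naive datetime in the machine's local time.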

Create, train and deploy your model

Now that our data is prepared we can create a Shaped model. All we need to do is send a POST request to the https://api.dev.shaped.ai/v0/models endpoint with the model_name, connector_configs and schema mapping.

The connector_configs is set to a single File type connector because, in this tutorial, we'll be uploading the data from a local file. For your use case you may want to use a database or data warehouse connection, or a combination of several connectors, to retrieve the data directly from your data store.

Within schema, we map the user_id column to the user object and the movie_id column to the item object. We're not using the context features for the user and item, so their source fields can be left unset.

Within the interaction object we set three things: the label column details (i.e. the rating column with the Rating type), the created_at field, which maps to the unix_timestamp column, and the source object, which references the file connector by id and gives the local ratings Parquet file path.

For further details about this endpoint please refer to the Create Model API reference.

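import json
import requests

YOUR_API_KEY = "<your-api-key>"  # placeholder: replace with your Shaped API key

model_name = "rating_events"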
response = requests.post(
  "https://api.dev.shaped.ai/v0/models",
  headers={
      "x-api-key": YOUR_API_KEY,
      "Content-Type":"application/json"
  },
  json={
    "model_name": model_name,
    "connector_configs": [{
      "id": "file",
      "type": "File"
    }],
    "schema": {
      "user": {
        "id": "user_id",
      },
      "item": {
        "id": "movie_id",
      },
      "interaction": {
        "created_at": "unix_timestamp",
        "source": {
          "connector_id": "file",
          "path": f"./{ratings_parquet_filename}",
        },
        "label": {
          "name": "rating",
          "type": "Rating"
        }
      },
    }
  }
)
# Parse the JSON body; it describes where to upload the local file.
upload_request = response.json()

When using a local File connection, the Create Model endpoint responds with a JSON object containing a signed URL that you can upload your local data to.

Note that this step isn't needed if you connect directly to your data stack.

with open(ratings_parquet_filename, 'rb') as file:
    files = {'file': (ratings_parquet_filename, file)}
    upload_response = requests.post(upload_request['url'], data=upload_request['fields'], files=files)
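A more defensive variant of the step above might first confirm that the parsed response actually contains the `url` and `fields` keys before posting the file. The sketch below is only an illustration: `extract_upload_target` is a hypothetical helper, and the example dictionary is a made-up stand-in, since the full response shape is not documented here.

```python
def extract_upload_target(upload_request):
    """Return (url, fields) from the parsed response, or raise ValueError."""
    missing = [key for key in ("url", "fields") if key not in upload_request]
    if missing:
        raise ValueError(f"response is missing expected keys: {missing}")
    return upload_request["url"], upload_request["fields"]

# Hypothetical response shape, for illustration only.
example = {"url": "https://example.com/upload", "fields": {"key": "ratings.parquet"}}
url, fields = extract_upload_target(example)
print(url)
```

Failing early with a clear message is usually easier to debug than a `KeyError` raised in the middle of the upload call.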

Inspect your model

Your recommendation model can take up to a few hours to train depending on:

  • How many interactions the model trains on
  • How many features it uses

To view the status of your model you can send a GET request to the https://api.dev.shaped.ai/v0/models endpoint.

response = requests.get(
    "https://api.dev.shaped.ai/v0/models",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type":"application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))
"""
{
  "models": [
    {
      "created_at": "2022-01-01 11:46:13",
      "label": "Rating",
      "model_name": "rating_events",
      "status": "PREPARING"
    }
  ]
}
"""

You'll notice the status of the model you just created starts in the PREPARING state. This means that the initial training job hasn't completed yet. When it is ready, the status field will change to ACTIVE.
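Rather than re-running the request by hand, the status check can be wrapped in a polling loop. The following is a minimal sketch, assuming the response shape shown above (a `models` list whose entries carry `model_name` and `status`). The `wait_until_active` helper and its parameters are hypothetical, and it only looks for the ACTIVE status, since any failure statuses are not described here.

```python
import time

def wait_until_active(list_models, model_name, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll list_models() until the named model reports status ACTIVE.

    list_models is any callable returning the parsed List Models response,
    for example a function that wraps the GET request shown above.
    Returns True once ACTIVE, False if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        models = list_models().get("models", [])
        status = next((m["status"] for m in models if m["model_name"] == model_name), None)
        if status == "ACTIVE":
            return True
        sleep(poll_seconds)
    return False

# Simulated responses standing in for real API calls.
responses = iter([
    {"models": [{"model_name": "rating_events", "status": "PREPARING"}]},
    {"models": [{"model_name": "rating_events", "status": "ACTIVE"}]},
])
print(wait_until_active(lambda: next(responses), "rating_events", sleep=lambda s: None))
```

Passing the fetch function and the sleep function as arguments keeps the loop easy to try out without network access or real waiting.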

For further details about this endpoint please refer to the List Models API reference.

Fetch ranked results

Once your model is ready, you can send a GET request to the https://api.dev.shaped.ai/v0/models/{model_name}/rank?user_id={user_id} endpoint to fetch recommendations for a user.

The {user_id} is the id of the user you want to fetch rankings for. You can also add an optional query param, limit, which sets how many results to return (the default is 5).

For further details about this endpoint please refer to the Rank API reference.
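As an illustration of how the optional limit fits into the request, here is a small sketch that builds the rank URL with the standard library. The `rank_url` helper is hypothetical, and it assumes limit is passed as a plain query parameter named `limit`, as described above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.dev.shaped.ai/v0/models"

def rank_url(model_name, user_id, limit=None):
    """Build the rank endpoint URL, adding limit only when it is given."""
    params = {"user_id": user_id}
    if limit is not None:
        params["limit"] = limit
    return f"{BASE_URL}/{quote(model_name, safe='')}/rank?{urlencode(params)}"

print(rank_url("rating_events", 1, limit=10))
```

Using `urlencode` rather than string concatenation keeps the query string valid if a user id ever contains characters that need escaping.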

response = requests.get(
    f"https://api.dev.shaped.ai/v0/models/{model_name}/rank?user_id=1",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type":"application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))
"""
[
  134,
  408,
  484,
  483,
  86
]
"""

The returned ids are item ids for your item entity. You can use the ordering to decide how to display them in your product.
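For display purposes, the ranked ids can be joined back to movie titles. The sketch below assumes the pipe-separated layout of ml-100k/u.item, where the first field is the movie id and the second is the title. The helper names, the abbreviated sample lines, and the ranked list are all illustrative rather than real API output.

```python
def load_titles(lines):
    """Map movie id -> title from pipe-separated lines in the u.item layout."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2:
            titles[int(fields[0])] = fields[1]
    return titles

def titles_in_rank_order(ranked_ids, titles):
    """Return titles in the order given by ranked_ids, skipping unknown ids."""
    return [titles[i] for i in ranked_ids if i in titles]

# Abbreviated sample lines in the u.item layout (remaining fields omitted).
sample_items = ["1|Toy Story (1995)|01-Jan-1995", "2|GoldenEye (1995)|01-Jan-1995"]
ranked = [2, 1]  # hypothetical rank response
print(titles_in_rank_order(ranked, load_titles(sample_items)))
```

When reading the real u.item file, opening it with `encoding="latin-1"` is a common choice, since some titles contain non-ASCII characters.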

Clean up

Don't forget to clean up your model once you're finished with it. You can send a DELETE request to the https://api.dev.shaped.ai/v0/models/{model_name} endpoint to delete it.

requests.delete(
    f"https://api.dev.shaped.ai/v0/models/{model_name}",
    headers={
        "x-api-key": YOUR_API_KEY,
        "Content-Type":"application/json"
    }
)