
Game Recommendations Using Steam Reviews

In this tutorial, we'll show you how to set up a recommendation model with Shaped using the Steam Australian Reviews dataset. We'll focus on the australian_user_reviews.json.gz file, which contains reviews from Australian Steam users, including whether they recommended the games they reviewed.

With Shaped, we'll learn a recommendation model that can predict which games each user is most likely to enjoy based on their review history.

This tutorial uses Shaped's local dataset connector, but you can easily adapt it to any of the data stores or real-time connectors we support.

Let's get started! 🚀

You can follow along in our accompanying notebook!

Shaped CLI Setup

Installing the Shaped CLI

You'll need to install the Shaped CLI if you haven't already. You can do this with the following command:

pip install shaped

Shaped supports Python 3.8+. Take a look at the installation instructions if you need to install pip.

Initialize the CLI

You can then initialize the Shaped client with your API key. If you don't have an API key yet, check out the How to get an API key page.

shaped init --api-key <YOUR_API_KEY>

Dataset Preparation

Download public dataset

To start off, let's fetch the Steam Australian User Reviews dataset we'll be training our model with.

wget https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_user_reviews.json.gz --no-check-certificate

Parse and prepare the data

The Steam reviews dataset is stored in a gzip-compressed JSON file (.json.gz) with a nested structure. Each record contains a user_id and an array of the reviews that user has written for different games. We need to flatten this so that each row represents a single user-game review.

Let's parse this data and create a clean TSV file that we can use with Shaped:

Python
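import ast
import gzip
import re
from datetime import datetime

import pandas as pd

def parse(path):
    """Parse each line of the compressed file into a dict."""
    # Each line is parsed as a Python literal; ast.literal_eval does this
    # safely, without executing arbitrary code as eval would.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

def read_data(path):
    """Read all data from the compressed JSON file."""
    return list(parse(path))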

# Read the compressed dataset
users_reviews = read_data('australian_user_reviews.json.gz')

def parse_reviews_data(json_data):
    """Extract structured data from the reviews JSON."""
    cleaned_reviews = []

    for user_data in json_data:
        user_id = user_data.get('user_id')

        # Process each review for this user
        if 'reviews' in user_data and isinstance(user_data['reviews'], list):
            for review in user_data['reviews']:
                # Extract needed fields
                item_id = review.get('item_id')
                recommend = 1 if review.get('recommend', False) else 0

                # Parse the posted date if available
                posted_date = review.get('posted', '')
                # Extract date from string like 'Posted November 5, 2011.'
                date_match = re.search(r'Posted (\w+ \d+, \d{4})', posted_date)

                if date_match:
                    try:
                        # Parse the date string to a datetime object
                        date_str = date_match.group(1)
                        date_obj = datetime.strptime(date_str, '%B %d, %Y')
                        # Convert to YYYY-MM-DD format
                        created_at = date_obj.strftime('%Y-%m-%d')
                    except ValueError:
                        # Use a default date if parsing fails
                        created_at = '2000-01-01'
                else:
                    # No full date found in the string; use the default date
                    created_at = '2000-01-01'

                # Create clean review record
                clean_review = {
                    'user_id': user_id,
                    'item_id': item_id,
                    'created_at': created_at,
                    'recommend': recommend
                }

                cleaned_reviews.append(clean_review)

    return cleaned_reviews

# Process the reviews data
cleaned_reviews = parse_reviews_data(users_reviews)

# Convert cleaned reviews to a DataFrame
df = pd.DataFrame(cleaned_reviews)
print(f"Dataset shape: {df.shape}")
print("Sample data:")
print(df.head())

# Save as tab-separated values (TSV); the .csv file name is kept
# because the upload command below refers to it
csv_file_path = 'user_reviews.csv'
df.to_csv(csv_file_path, sep='\t', index=False)
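
Because reviews without a parseable date fall back to '2000-01-01', it can be worth checking how many rows carry that placeholder before uploading. The following is a minimal sketch that works on a list of dicts shaped like cleaned_reviews. The helper name summarize_reviews and the toy rows are illustrative assumptions, not part of the dataset or of Shaped.

```python
PLACEHOLDER_DATE = '2000-01-01'  # fallback date used by the parsing code above

def summarize_reviews(reviews):
    """Return basic quality counts for flattened review dicts (hypothetical helper)."""
    rows = len(reviews)
    missing_ids = sum(
        1 for r in reviews if r['user_id'] is None or r['item_id'] is None
    )
    placeholder_dates = sum(1 for r in reviews if r['created_at'] == PLACEHOLDER_DATE)
    recommended = sum(r['recommend'] for r in reviews)
    return {
        'rows': rows,
        'missing_ids': missing_ids,
        'placeholder_dates': placeholder_dates,
        'recommend_rate': recommended / rows if rows else 0.0,
    }

# Toy rows for illustration only; in the tutorial you would pass cleaned_reviews.
toy_reviews = [
    {'user_id': 'u1', 'item_id': '10', 'created_at': '2011-11-05', 'recommend': 1},
    {'user_id': 'u1', 'item_id': '20', 'created_at': '2000-01-01', 'recommend': 0},
    {'user_id': 'u2', 'item_id': '10', 'created_at': '2014-06-24', 'recommend': 1},
    {'user_id': None, 'item_id': '30', 'created_at': '2000-01-01', 'recommend': 1},
]
summary = summarize_reviews(toy_reviews)
print(summary)
```

A large share of placeholder dates would mean the created_at column carries little real ordering information for those rows.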

Create Steam Reviews Dataset in Shaped

Now that we have our data prepared, we'll create a Shaped Dataset and upload our processed data to it. First, let's define the schema and save it into a YAML file:

steam_review_events_schema.yaml
name: steam_review_events
schema_type: CUSTOM
column_schema:
  user_id: String
  item_id: String
  created_at: DateTime
  recommend: Int32
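
The schema's column names are the same as the header of the TSV file written earlier. A quick way to confirm this before uploading is sketched below. The helper name header_matches_schema is hypothetical, and the sketch only compares column names, since whether column order matters to Shaped is not stated here.

```python
import csv
import io

# Column names copied from the schema above.
SCHEMA_COLUMNS = ['user_id', 'item_id', 'created_at', 'recommend']

def header_matches_schema(tsv_file, expected=SCHEMA_COLUMNS):
    """Return True if the first row of a TSV file has exactly the expected names."""
    reader = csv.reader(tsv_file, delimiter='\t')
    header = next(reader, [])
    return set(header) == set(expected) and len(header) == len(expected)

# In-memory sample standing in for user_reviews.csv.
sample = io.StringIO('user_id\titem_id\tcreated_at\trecommend\nu1\t10\t2011-11-05\t1\n')
matches = header_matches_schema(sample)
print(matches)

# For the real file (not run here):
# with open('user_reviews.csv', newline='') as f:
#     print(header_matches_schema(f))
```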

Next, create the dataset using the Shaped CLI:

shaped create-dataset --file steam_review_events_schema.yaml

Then, insert the data into the dataset:

shaped dataset-insert --dataset-name steam_review_events --file user_reviews.csv --type 'tsv'

You can check if the dataset was created successfully:

shaped list-datasets

Create your model

We're now ready to create our recommendation model! Shaped will use the review data to learn which users like which games, based on whether they recommended each game in their review. By default, Shaped automatically selects the policy and hyperparameters for your model.

Here's the model definition we'll be using. Again, we save it to a YAML file:

steam_review_model_schema.yaml
model:
  name: steam_review_game_recommendations
connectors:
  - type: Dataset
    id: steam_review_events
    name: steam_review_events
fetch:
  events: SELECT user_id, item_id, created_at, recommend AS label FROM steam_review_events

Create the model using the Shaped CLI:

shaped create-model --file steam_review_model_schema.yaml
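
The events query in the model definition renames the recommend column to label. To see what that aliasing produces, here is a small sketch that runs the same SELECT against an in-memory SQLite table with made-up rows. SQLite is used only for illustration and is an assumption of this sketch; it says nothing about how Shaped executes the query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE steam_review_events '
    '(user_id TEXT, item_id TEXT, created_at TEXT, recommend INTEGER)'
)
# Made-up rows for illustration only.
conn.executemany(
    'INSERT INTO steam_review_events VALUES (?, ?, ?, ?)',
    [('u1', '10', '2011-11-05', 1), ('u1', '20', '2014-06-24', 0)],
)

# Same query as in the model definition above.
cursor = conn.execute(
    'SELECT user_id, item_id, created_at, recommend AS label FROM steam_review_events'
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
print(rows)
```

The result has the columns user_id, item_id, created_at, and label, with the 0/1 recommend values carried over unchanged.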

Inspect your model

Check the status of your model:

shaped list-models

Your recommendation model can take up to a few hours to provision its infrastructure and train on your historic events. The time mostly depends on the size of your dataset, i.e., the number of users, items, and interactions, and the number of attributes you're providing.

The initial model creation goes through the following stages in order:

  1. SCHEDULING
  2. FETCHING
  3. TUNING
  4. TRAINING
  5. DEPLOYING
  6. ACTIVE

You can periodically poll Shaped to inspect these status changes. Once it's in the ACTIVE state, you can move to the next step and use it to make rank requests.
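
The polling described above can be sketched as a small loop. This is only an outline under stated assumptions: get_status is a caller-supplied function (hypothetical here) that returns the model's current status string by whatever means you prefer, and the helper name wait_until_active, the polling interval, and the retry budget are illustrative choices.

```python
import time

# Stage names as listed above, in order.
STAGES = ['SCHEDULING', 'FETCHING', 'TUNING', 'TRAINING', 'DEPLOYING', 'ACTIVE']

def wait_until_active(get_status, poll_seconds=60.0, max_polls=240, sleep=time.sleep):
    """Poll get_status() until it returns 'ACTIVE'.

    Returns the distinct statuses observed, in order. Raises RuntimeError on a
    status outside STAGES and TimeoutError if the polling budget runs out.
    """
    seen = []
    for _ in range(max_polls):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == 'ACTIVE':
            return seen
        if status not in STAGES:
            raise RuntimeError(f'Unexpected model status: {status}')
        sleep(poll_seconds)
    raise TimeoutError('Model did not become ACTIVE within the polling budget')

# Demo with a canned status sequence instead of a real status lookup.
fake_statuses = iter(['SCHEDULING', 'FETCHING', 'FETCHING', 'TRAINING', 'ACTIVE'])
history = wait_until_active(
    lambda: next(fake_statuses), poll_seconds=0, sleep=lambda seconds: None
)
print(history)
```

Passing the status lookup in as a function keeps the loop independent of how the status is obtained, and makes it easy to test with canned values as in the demo.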

Fetch your recommendations

You're now ready to fetch your game recommendations! You can do this with the Rank endpoint: provide the user_id you want recommendations for and the number of recommendations you want returned.

Shaped's CLI provides a convenient rank command to quickly retrieve results from the command line:

shaped rank --model-name steam_review_game_recommendations --user-id '76561197970982479' --limit 5

Response:

{
  "ids": [
    "219150",
    "245550",
    "620",
    "440",
    "8930"
  ],
  "scores": [
    0.944791813545478,
    0.9243345560353259,
    0.9136819097511378,
    0.8999791870543353,
    0.8831670564757734
  ]
}

The response returns two parallel arrays containing the ids and ranking scores for the games that Shaped estimates are most interesting to the given user.
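
Since ids and scores are parallel arrays, pairing them by position gives one (id, score) tuple per recommended game. A minimal sketch using the response shown above:

```python
import json

# The response shown above, as a JSON string.
response_text = '''
{
  "ids": ["219150", "245550", "620", "440", "8930"],
  "scores": [0.944791813545478, 0.9243345560353259, 0.9136819097511378,
             0.8999791870543353, 0.8831670564757734]
}
'''
response = json.loads(response_text)

# The arrays are parallel, so they should have the same length.
assert len(response['ids']) == len(response['scores'])

# Pair each game id with its score by position.
recommendations = list(zip(response['ids'], response['scores']))

for rank, (item_id, score) in enumerate(recommendations, start=1):
    print(f'{rank}. game {item_id} (score {score:.3f})')
```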

If you want to integrate this endpoint into your website or application, you can use the Rank POST REST endpoint directly:

curl https://api.shaped.ai/v1/models/steam_review_game_recommendations/rank \
  -X POST \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "76561197970982479", "limit": 5}'
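
If your application is written in Python, the same request can be built with the standard library. The sketch below mirrors the URL, headers, and body of the curl command above. The helper names build_rank_request and fetch_rank are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# URL taken from the curl command above.
API_URL = 'https://api.shaped.ai/v1/models/steam_review_game_recommendations/rank'

def build_rank_request(api_key, user_id, limit=5):
    """Build (but do not send) a POST request equivalent to the curl command."""
    body = json.dumps({'user_id': user_id, 'limit': limit}).encode('utf-8')
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={'x-api-key': api_key, 'Content-Type': 'application/json'},
        method='POST',
    )

def fetch_rank(request, timeout=10):
    """Send the request and decode the JSON response (needs a valid API key)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))

request = build_rank_request('YOUR_API_KEY', '76561197970982479')
print(request.get_method(), request.full_url)

# To actually call the endpoint, replace YOUR_API_KEY and run:
# result = fetch_rank(request)
```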

Clean Up

Don't forget to delete your model (and its assets) and the dataset once you're finished with them. You can do so with the following CLI commands:

shaped delete-model --model-name steam_review_game_recommendations
shaped delete-dataset --dataset-name steam_review_events