Game Recommendations (Steam Reviews)
This tutorial demonstrates how to configure a recommendation engine using the Steam Australian Reviews dataset, specifically the australian_user_reviews.json.gz file. The dataset contains reviews from Australian Steam users, including whether they recommended the games they reviewed.
CLI Setup
Install the CLI
pip install shaped
Shaped supports Python 3.8 to 3.11. See installation instructions if you need to install pip.
Initialize the CLI
shaped init --api-key <YOUR_API_KEY>
If you don't have an API key, see How to get an API key.
Data Preparation
Download the dataset
wget https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_user_reviews.json.gz --no-check-certificate
Parse and prepare the data
The dataset is a gzip-compressed file with one nested record per line. Each record contains a user_id and an array of reviews. Transform this into a flattened format where each row represents a single user-game review:
import ast
import gzip
import re
from datetime import datetime

import pandas as pd

def parse(path):
    """Yield one record per line of the gzip-compressed file."""
    # Each line is a Python dict literal rather than strict JSON.
    # ast.literal_eval parses it without executing arbitrary code, unlike eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

def read_data(path):
    """Read all records from the compressed file into a list."""
    return list(parse(path))

# Read the compressed dataset
users_reviews = read_data('australian_user_reviews.json.gz')

# Placeholder for reviews whose date is missing or cannot be parsed
DEFAULT_DATE = '2000-01-01'

def parse_reviews_data(json_data):
    """Flatten the nested records into one dict per user-game review."""
    cleaned_reviews = []
    for user_data in json_data:
        user_id = user_data.get('user_id')
        reviews = user_data.get('reviews')
        if not isinstance(reviews, list):
            continue
        # Process each review for this user
        for review in reviews:
            item_id = review.get('item_id')
            recommend = 1 if review.get('recommend', False) else 0
            # Extract the date from a string like 'Posted November 5, 2011.'
            posted_date = review.get('posted', '')
            date_match = re.search(r'Posted (\w+ \d+, \d{4})', posted_date)
            created_at = DEFAULT_DATE
            if date_match:
                try:
                    # Convert 'November 5, 2011' to '2011-11-05'
                    date_obj = datetime.strptime(date_match.group(1), '%B %d, %Y')
                    created_at = date_obj.strftime('%Y-%m-%d')
                except ValueError:
                    # Keep the placeholder if the matched text is not a valid date
                    pass
            cleaned_reviews.append({
                'user_id': user_id,
                'item_id': item_id,
                'created_at': created_at,
                'recommend': recommend,
            })
    return cleaned_reviews

# Process the reviews data
cleaned_reviews = parse_reviews_data(users_reviews)

# Convert cleaned reviews to a DataFrame
df = pd.DataFrame(cleaned_reviews)
print(f"Table shape: {df.shape}")
print("Sample data:")
print(df.head())

# Save as tab-separated values. The file keeps the .csv extension used in
# the insert command, but the delimiter is a tab.
csv_file_path = 'user_reviews.csv'
df.to_csv(csv_file_path, sep='\t', index=False)
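To see how the date handling behaves on individual strings, the sketch below applies the same regular expression and format string to a few hand-written inputs. The helper name `posted_to_iso` and the sample strings are assumptions made for illustration; they are not part of the dataset or of Shaped.

```python
import re
from datetime import datetime

DEFAULT_DATE = '2000-01-01'  # same placeholder date the tutorial uses

def posted_to_iso(posted):
    """Convert 'Posted November 5, 2011.' to '2011-11-05' (hypothetical helper)."""
    match = re.search(r'Posted (\w+ \d+, \d{4})', posted)
    if not match:
        # No 'Month day, year' pattern found, e.g. the year is missing
        return DEFAULT_DATE
    try:
        return datetime.strptime(match.group(1), '%B %d, %Y').strftime('%Y-%m-%d')
    except ValueError:
        # Pattern matched but is not a real date (unknown month, day 99, ...)
        return DEFAULT_DATE

# Made-up inputs: a full date, a date without a year, and an empty string
for sample in ['Posted November 5, 2011.', 'Posted July 3.', '']:
    print(repr(sample), '->', posted_to_iso(sample))
```

Every row that falls back to the placeholder shares the same timestamp, so it may be worth counting those rows before upload, for instance with `(df['created_at'] == '2000-01-01').sum()`.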
Create the table
Define the table schema and save it as steam_review_events_schema.yaml:
name: steam_review_events
schema_type: CUSTOM
column_schema:
  user_id: String
  item_id: String
  created_at: DateTime
  recommend: Int32
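Before inserting data, it can help to confirm that the header of the tab-separated file matches the columns in the schema. The following is a minimal sketch using only the standard library. The helper `check_tsv_header` and the in-memory sample row are assumptions for illustration, and this check is not something the Shaped CLI requires.

```python
import csv
import io

# Column names taken from the table schema
EXPECTED_COLUMNS = ['user_id', 'item_id', 'created_at', 'recommend']

def check_tsv_header(handle, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for a tab-separated file."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader, [])  # empty file yields an empty header
    missing = [c for c in expected if c not in header]
    unexpected = [c for c in header if c not in expected]
    return missing, unexpected

# In-memory stand-in for user_reviews.csv (the data row is made up)
sample = io.StringIO(
    'user_id\titem_id\tcreated_at\trecommend\n'
    'u1\t10\t2011-11-05\t1\n'
)
missing, unexpected = check_tsv_header(sample)
print('missing:', missing, 'unexpected:', unexpected)
```

To check the real file, pass `open('user_reviews.csv', newline='')` in place of the in-memory sample.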
Create the table:
shaped create-table --file steam_review_events_schema.yaml
Insert data:
shaped table-insert --table-name steam_review_events --file user_reviews.csv --type 'tsv'
Verify table creation:
shaped list-tables
Create the engine
This example uses review data to build a collaborative filtering engine. The configuration below names an ALS model explicitly and leaves hyperparameter selection to the system. The engine uses recommendation status (whether the user recommended the game) as the interaction signal, exposed to the engine as the label column.
Engine configuration (save as steam_review_engine_schema.yaml):
data:
  interaction_table:
    type: query
    query: |
      SELECT user_id, item_id, created_at, recommend AS label FROM steam_review_events
training:
  models:
    - name: als
      policy_type: als
Create the engine:
shaped create-engine --file steam_review_engine_schema.yaml
Monitor engine status
Check engine status:
shaped list-engines
Engine creation and training can take several hours, depending on data volume and attributes. The engine progresses through these stages:
SCHEDULING, FETCHING, TRAINING, DEPLOYING, ACTIVE
Once the status is ACTIVE, the engine is ready for queries.
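Since training can take hours, a script may want to wait for the ACTIVE status rather than check by hand. The sketch below shows one way to structure such a polling loop. It assumes a caller-supplied `get_status` function (hypothetical; for example, something that wraps `shaped list-engines` and extracts the status string), because the output format of that command is not described here.

```python
import time

def wait_until_active(get_status, poll_seconds=60, timeout_seconds=6 * 3600,
                      sleep=time.sleep):
    """Poll get_status() until it returns 'ACTIVE' or the timeout elapses.

    get_status is a hypothetical callable supplied by the caller; it must
    return the engine's current status as a string.
    """
    waited = 0
    while True:
        status = get_status()
        if status == 'ACTIVE':
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f'engine still {status} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with a canned sequence of stages instead of a real engine
stages = iter(['SCHEDULING', 'FETCHING', 'TRAINING', 'DEPLOYING', 'ACTIVE'])
final_status = wait_until_active(lambda: next(stages), poll_seconds=0)
print(final_status)
```

The `sleep` parameter exists only so the loop can be exercised without real delays; in normal use the default `time.sleep` is enough.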
Query recommendations
Query recommendations using the Query endpoint. Provide a user_id and the number of results to return.
Using the CLI:
shaped query --engine-name steam_review_game_recommendations \
--query "SELECT * FROM similarity(embedding_ref='als', limit=50, encoder='precomputed_user', input_user_id='\$user_id') LIMIT 5" \
--parameters '{"user_id": "76561197970982479"}'
Response:
{
"results": [
{
"id": "219150",
"score": 0.944791813545478
},
{
"id": "245550",
"score": 0.9243345560353259
},
{
"id": "620",
"score": 0.9136819097511378
},
{
"id": "440",
"score": 0.8999791870543353
},
{
"id": "8930",
"score": 0.8831670564757734
}
]
}
The response contains an array of result objects with game IDs and scores.
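As an illustration of consuming this response, the sketch below parses the JSON shown above and keeps only results above a score threshold. The helper `top_items` and the `min_score` cutoff are assumptions for the example, not part of the Shaped API.

```python
import json

# Response body copied from the example response
response_text = '''
{
  "results": [
    {"id": "219150", "score": 0.944791813545478},
    {"id": "245550", "score": 0.9243345560353259},
    {"id": "620", "score": 0.9136819097511378},
    {"id": "440", "score": 0.8999791870543353},
    {"id": "8930", "score": 0.8831670564757734}
  ]
}
'''

def top_items(text, min_score=0.0):
    """Return (item_id, score) pairs with score >= min_score, best first."""
    results = json.loads(text).get('results', [])
    pairs = [(r['id'], r['score']) for r in results if r['score'] >= min_score]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Print the game IDs whose score is at least 0.9
for item_id, score in top_items(response_text, min_score=0.9):
    print(item_id, round(score, 3))
```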
Using the REST API (each single quote inside the query is written as '\'' so that it survives the shell's single-quoted -d argument):
curl https://api.shaped.ai/v2/engines/steam_review_game_recommendations/query \
  -X POST \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SELECT * FROM similarity(embedding_ref='\''als'\'', limit=50, encoder='\''precomputed_user'\'', input_user_id='\''$user_id'\'') LIMIT 5",
    "parameters": {
      "user_id": "76561197970982479"
    }
  }'
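The same request can be assembled from Python, which avoids shell quoting altogether. The sketch below uses only the standard library and reuses the URL, headers, and parameters from the curl example, with the query text following the CLI example. The function name `build_query_request` is an assumption for illustration, and the request is built but not sent.

```python
import json
import urllib.request

API_URL = 'https://api.shaped.ai/v2/engines/steam_review_game_recommendations/query'
QUERY = (
    "SELECT * FROM similarity(embedding_ref='als', limit=50, "
    "encoder='precomputed_user', input_user_id='$user_id') LIMIT 5"
)

def build_query_request(api_key, user_id):
    """Build (but do not send) the POST request for the query endpoint."""
    body = json.dumps({'query': QUERY, 'parameters': {'user_id': user_id}})
    return urllib.request.Request(
        API_URL,
        data=body.encode('utf-8'),
        headers={'x-api-key': api_key, 'Content-Type': 'application/json'},
        method='POST',
    )

request = build_query_request('YOUR_API_KEY', '76561197970982479')
print(request.get_method(), request.full_url)
# With a real API key, urllib.request.urlopen(request) would send it.
```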
Clean up
Delete the engine and table when finished:
shaped delete-engine --engine-name steam_review_game_recommendations
shaped delete-table --table-name steam_review_events