Skip to main content

Using Multiple Datasets

When it comes to recommendation systems (or machine-learning in general) often all the data you need won't be in the same spot. For example, you might have your interaction event data in BigQuery, your user data in Postgres and your item catalog in MySQL. This is usually one of the reasons it's so hard to get started with machine-learning, coercing all this data into one place as a unified schema can be tricky! It only get's harder when you move from the experimental stage to production as continuous ETL or stream pipelines need to be built that are robust to data quality issues and can maintain data consistency constraints.

Shaped makes it easy to handle this problem by providing connectors to all of the common areas you'd store recommendation data. These connectors can be used in the create model queries to flexibly choose what user, item, interaction and filter data is needed for your model. Under-the-hood Shaped parses these DuckDB syntax queries and builds ETL and stream pipelines to ingest your data with the required freshness for your use-case.

This guide runs through how to create a model using more than one connector, and transforming the data from each connector into a unified schema for your model.

Connecting Your Data

To connect your data you'll need to create a dataset in Shaped for each of your data sources, you can visit the BigQuery, Postgres and S3 pages to learn how. Once you've created your datasets you can reference them in your model configuration file by their dataset name.

A Familiar User, Item & Events Model

Recall the video recommendation model we built in Adding User, Item & Event Features guide. If you recall, there were three datasets used in that model, video_ratings, video_catalog and video_users. In this guide, we'll use the same datasets but from different data sources.

Each of the datasets will be connected to a different data source, video_ratings will be connected to BigQuery, video_catalog will be connected to Postgres and video_users will be connected to MySQL.

Creating the Model

video_recommendation_model.yaml
model:
name: video_recommendations
connectors:
- type: Dataset
id: video_ratings
name: video_bigquery_ratings
- type: Dataset
id: video_catalog
name: video_postgres_catalog
- type: Dataset
id: video_users
name: video_mysql_users
fetch:
events: |
SELECT user_id, item_id, created_at, rating as label
FROM video_ratings
users: |
SELECT user_id, created_at, gender, age
FROM video_users
items: |
SELECT item_id, created_at, description, hashtags
FROM video_catalog

```bash
shaped create-model --file video_recommendation_model.yaml

Complex Schema Transformations

In the above model configuration, the item features are still provided to Shaped in a table with many properties that may not always exist on the same source table or collection. In this case, you can add additional dataset connectors to the model configuration to fetch the item, user or event features from different sources, and perform the necessary transformations to unify the schema.

video_recommendation_model.yaml

model:
name: video_recommendations
connectors:
- type: Dataset
id: video_ratings
name: video_bigquery_ratings
- type: Dataset
id: video_catalog
name: video_postgres_catalog
- type: Dataset
id: video_tags
name: video_postgres_video_tags
fetch:
events: |
SELECT user_id, item_id, created_at, rating as label
FROM video_ratings
users: |
SELECT user_id, created_at
FROM video_users
items: |
SELECT vc.item_id, any_value(vc.created_at), any_value(vc.description), array_agg(vt.tag) as tags
FROM video_catalog vc
LEFT JOIN video_tags vt
ON vc.item_id = vt.item_id
GROUP BY vc.item_id

Conclusion

You've just learned how to create a model using multiple connectors from different data sources. This works for any of the connectors we provide and for any of the model fetch queries.