
Using Multiple Connectors

When it comes to recommendation systems (or machine learning in general), the data you need often isn't all in one place. For example, you might have your interaction event data in Google Analytics, your user data in Postgres, and your items in S3. This is one of the reasons it's so hard to get started with machine learning: coercing all of this data into one place, under a unified schema, can be tricky. It only gets harder when you move from the experimental stage to production, where you need continuous ETL or streaming pipelines that are robust to data quality issues and can maintain data consistency constraints.

Shaped makes this problem easy to handle by providing connectors to all of the common places you'd store recommendation data. These connectors can be referenced in your create-model queries to flexibly choose which user, item, interaction, and filter data your model needs. Under the hood, Shaped parses these DuckDB-syntax queries and builds ETL and streaming pipelines that ingest your data with the freshness your use case requires.
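
Concretely, each fetch query addresses a connector's tables as connector_id.table_name (or just the connector id for single-file connectors), using standard DuckDB SQL. The connector id and table name below are placeholders, not part of any real setup:

SELECT user_id, item_id, created_at
FROM my_bigquery_connector.click_events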

This guide runs through how to create a model using more than one connector.

A Familiar User, Item & Events Model

Let's start off with the model from the Adding User, Item & Event Features guide. If you recall, it uses a BigQuery connection to populate the events, users and items queries for a video recommendation model.

video_recommendation_model.yaml
model:
  name: video_recommendation_model
  connectors:
    - type: BigQuery
      id: bigquery_connector
      location: us-west1
      project_id: rocket-ship-234123
      dataset: video_db
  fetch:
    events: |
      SELECT user_id, item_id, created_at, (CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS label
      FROM bigquery_connector.click_events
    users: |
      SELECT user_id, created_at, gender, age
      FROM bigquery_connector.users
    items: |
      SELECT item_id, created_at, description, hashtags
      FROM bigquery_connector.videos

shaped create-model --file video_recommendation_model.yaml
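
Creating the model kicks off ingestion and training for the referenced BigQuery tables. As a quick sanity check you can inspect the model's status; the view-model command below is taken from the other Shaped guides rather than this one, so treat it as an assumption about your CLI version:

shaped view-model --model-name video_recommendation_model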

Using Multiple Connections

Let's imagine that your click events are in BigQuery, but your user data is stored in Postgres and your item data is in S3 (rather than everything being in BigQuery, as in the previous example). To set up a model, you just need to add each of these connectors and reference them in your fetch queries. Here's an example:

multi_connector_video_recommendations.yaml
model:
  name: multi_connector_video_recommendations
  connectors:
    - type: BigQuery
      id: bigquery_connector
      location: us-west1
      project_id: rocket-ship-1
      dataset: video_db
    - type: Postgres
      id: postgres_connector
      user: database_readonly_username
      password: database_readonly_password
      host: your.pg.db.hostname.com
      port: 5432
      database: database_name
    - type: File
      id: items_connector
      path: 's3://your-bucket/items.parquet'
  fetch:
    events: |
      SELECT user_id, item_id, created_at, (CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS label
      FROM bigquery_connector.click_events
    users: |
      SELECT user_id, created_at, gender, age
      FROM postgres_connector.users
    items: |
      SELECT item_id, created_at, description, hashtags
      FROM items_connector

shaped create-model --file multi_connector_video_recommendations.yaml
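
Once the multi-connector model finishes training and becomes active, you can query it like any other model. The rank command, its flags, and the example user id below are assumptions based on the other Shaped guides, not something defined in this one:

shaped rank --model-name multi_connector_video_recommendations --user-id user_123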

Conclusion

You've just learned how to create a model using multiple connectors from different data sources. This works for any of the connectors we provide and for any of the model fetch queries.