Data Unification and Transformation

One of Shaped's core value propositions is the ability to unify heterogeneous data sources into a single, cohesive recommendation system. Even if your data is spread across multiple systems, in different formats, or isn't perfectly cleaned, Shaped can help you bring it all together.

The Data Fragmentation Problem

Many organizations face data fragmentation: multiple data sources (Segment, PostgreSQL, Snowflake, MongoDB), inconsistent schemas, uncleaned data, and different update frequencies. Shaped addresses these challenges through a unified data layer that connects, transforms, and consolidates your data.

The Data Unification Flow

Step 1: Connect Your Data Sources

Connect to your data sources using Shaped connectors. Create each table via POST /v2/tables:

# Real-time connector (Segment)
POST /v2/tables
{
  "name": "user_events",
  "schema_type": "SEGMENT"
}

# Batch connector (PostgreSQL)
POST /v2/tables
{
  "name": "products",
  "schema_type": "POSTGRES",
  "host": "your-db-host.com",
  "port": 5432,
  "user": "your_user",
  "password": "your_password",
  "database": "your_database",
  "table": "products",
  "replication_key": "updated_at"
}

Each connector handles authentication, schema detection, and data synchronization automatically.
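
As a sketch, these table-creation calls can also be issued from Python. The base URL, the `x-api-key` header, and the `table_payload` helper below are assumptions for illustration — substitute the endpoint and authentication your Shaped account actually uses:

```python
import json
from urllib.request import Request, urlopen

# Assumed values -- replace with your real endpoint and credentials.
BASE_URL = "https://api.shaped.ai"  # hypothetical base URL
API_KEY = "YOUR_API_KEY"            # hypothetical credential

def table_payload(name: str, schema_type: str, **connector_config) -> dict:
    """Build the JSON body for POST /v2/tables."""
    return {"name": name, "schema_type": schema_type, **connector_config}

def create_table_request(payload: dict) -> Request:
    """Prepare (but do not send) the POST /v2/tables request."""
    return Request(
        f"{BASE_URL}/v2/tables",
        data=json.dumps(payload).encode(),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )

events = table_payload("user_events", "SEGMENT")
products = table_payload(
    "products", "POSTGRES",
    host="your-db-host.com", port=5432,
    user="your_user", password="your_password",
    database="your_database", table="products",
    replication_key="updated_at",
)
req = create_table_request(products)  # send with urlopen(req) when ready
```

Preparing the request separately from sending it makes the payloads easy to inspect and test before any data leaves your machine.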

Step 2: Transform with Views

Create SQL views via POST /v2/views to transform and unify your data. Views are executed as ClickHouse SQL, so write your queries using ClickHouse syntax and functions.

Join multiple sources:

SELECT p.product_id as item_id, p.name as title, p.description, p.category, p.price,
       i.inventory_count, r.rating_avg
FROM products p
LEFT JOIN inventory i ON p.product_id = i.product_id
LEFT JOIN reviews_summary r ON p.product_id = r.product_id

POST /v2/views
{
  "name": "unified_items",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT p.product_id as item_id, p.name as title, p.description, p.category, p.price, i.inventory_count, r.rating_avg FROM products p LEFT JOIN inventory i ON p.product_id = i.product_id LEFT JOIN reviews_summary r ON p.product_id = r.product_id"
}

Standardize schemas:

SELECT user_id, item_id, event_timestamp as created_at,
       CASE
         WHEN event_type = 'purchase' THEN 1
         WHEN event_type = 'view' THEN 0.5
         ELSE 0
       END as label
FROM user_events
WHERE event_type IN ('purchase', 'view', 'click')

POST /v2/views
{
  "name": "normalized_interactions",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT user_id, item_id, event_timestamp as created_at, CASE WHEN event_type = 'purchase' THEN 1 WHEN event_type = 'view' THEN 0.5 ELSE 0 END as label FROM user_events WHERE event_type IN ('purchase', 'view', 'click')"
}
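
To make the weighting explicit, here is the same CASE logic as plain Python. The 1 / 0.5 weights mirror the view above and are illustrative defaults, not a Shaped requirement:

```python
# Mirror of the CASE expression in the normalized_interactions view.
EVENT_WEIGHTS = {"purchase": 1.0, "view": 0.5}

def label_for(event_type: str) -> float:
    """Return the training label for an event type (0 for anything else)."""
    return EVENT_WEIGHTS.get(event_type, 0.0)
```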

Clean and enrich data:

SELECT item_id, COALESCE(title, 'Untitled') as title, COALESCE(description, '') as description,
       category, price,
       CASE
         WHEN price < 50 THEN 'budget'
         WHEN price < 200 THEN 'mid-range'
         ELSE 'premium'
       END as price_tier,
       toYear(created_at) as year_created
FROM products
WHERE item_id IS NOT NULL

POST /v2/views
{
  "name": "cleaned_items",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT item_id, COALESCE(title, 'Untitled') as title, COALESCE(description, '') as description, category, price, CASE WHEN price < 50 THEN 'budget' WHEN price < 200 THEN 'mid-range' ELSE 'premium' END as price_tier, toYear(created_at) as year_created FROM products WHERE item_id IS NOT NULL"
}

On-demand views:

Use sql_view_type: "VIEW" for data that changes frequently:

SELECT user_id, item_id, event_type, timestamp
FROM user_events
WHERE timestamp > now() - INTERVAL 1 HOUR

POST /v2/views
{
  "name": "recent_interactions",
  "view_type": "SQL",
  "sql_view_type": "VIEW",
  "sql_query": "SELECT user_id, item_id, event_type, timestamp FROM user_events WHERE timestamp > now() - INTERVAL 1 HOUR"
}

Step 2b: Enrich with AI

Enrich your data with AI views to generate semantic content:

Example: Generate product descriptions

POST /v2/views
{
  "name": "items_with_ai_descriptions",
  "view_type": "AI_ENRICHMENT",
  "source_table": "cleaned_items",
  "source_columns": ["title", "category", "price_tier"],
  "source_columns_in_output": ["item_id", "title"],
  "enriched_output_columns": ["ai_description"],
  "prompt": "Generate a concise product description based on the title, category, and price tier. Focus on key features and use cases."
}

Learn more about AI enrichments in the AI Enrichment guide.
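
When you create many enrichment views, building the payload programmatically keeps the field names in one place. A sketch (the field names come from the request above; the helper itself is hypothetical):

```python
def ai_enrichment_view(name, source_table, source_columns,
                       output_columns, enriched_columns, prompt):
    """Build the JSON body for an AI_ENRICHMENT view (POST /v2/views)."""
    return {
        "name": name,
        "view_type": "AI_ENRICHMENT",
        "source_table": source_table,
        "source_columns": source_columns,
        "source_columns_in_output": output_columns,
        "enriched_output_columns": enriched_columns,
        "prompt": prompt,
    }

payload = ai_enrichment_view(
    "items_with_ai_descriptions", "cleaned_items",
    ["title", "category", "price_tier"], ["item_id", "title"],
    ["ai_description"],
    "Generate a concise product description based on the title, "
    "category, and price tier. Focus on key features and use cases.",
)
```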

Step 3: Use Unified Data in Engines

Reference your unified views in engine configurations:

version: v2
name: unified_recommendations
data:
  item_table:
    name: unified_items # Uses the view we created
  user_table:
    name: user_profiles
  interaction_table:
    name: normalized_interactions # Uses the normalized view

Best Practices for Data Unification

1. Start with Raw Tables

Create tables that connect directly to your sources without transformation. This gives you flexibility to experiment with different transformations later.

2. Use Views for Transformation

Views let you join multiple sources, standardize schemas, clean data, and create derived features. Materialized views are pre-computed for fast training and querying.

3. Handle Schema Differences

Standardize column names and data types across connectors:

-- Map column names
SELECT customer_id as user_id, product_sku as item_id FROM transactions

-- Normalize data types (ClickHouse type names)
SELECT CAST(user_id AS String) as user_id, CAST(price AS Float64) as price FROM raw_data
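
When the renames span many source tables, it can help to keep the mapping in one place and generate the SELECT lists (or apply the rename client-side) from it. A sketch with an illustrative mapping:

```python
# One mapping per source table: source column -> canonical column.
COLUMN_MAP = {
    "transactions": {"customer_id": "user_id", "product_sku": "item_id"},
}

def standardize(table: str, row: dict) -> dict:
    """Rename a row's columns to the canonical schema; unmapped columns pass through."""
    mapping = COLUMN_MAP.get(table, {})
    return {mapping.get(col, col): value for col, value in row.items()}

row = standardize("transactions", {"customer_id": "u1", "product_sku": "sku-9"})
```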

Handle different data structures:

JSON data (MongoDB, document stores):

SELECT JSONExtractString(document, 'user_id') as user_id,
       JSONExtractString(document, 'item_id') as item_id,
       parseDateTimeBestEffort(JSONExtractString(document, 'timestamp')) as created_at
FROM user_profiles
WHERE JSONExtractString(document, 'status') = 'active'

POST /v2/views
{
  "name": "normalized_mongodb_data",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT JSONExtractString(document, 'user_id') as user_id, JSONExtractString(document, 'item_id') as item_id, parseDateTimeBestEffort(JSONExtractString(document, 'timestamp')) as created_at FROM user_profiles WHERE JSONExtractString(document, 'status') = 'active'"
}

Nested structures:

SELECT item_id, JSONExtractString(attributes, 'brand') as brand, JSONExtractString(attributes, 'color') as color FROM products

4. Clean Data Incrementally

Don't try to clean everything at once. Start with the most important data and iterate:

  1. Essential columns first: Ensure item_id, user_id, and label columns exist
  2. Handle nulls: Use COALESCE or default values for missing data
  3. Deduplicate: Remove duplicate records using DISTINCT or window functions
  4. Validate: Add WHERE clauses to filter invalid data
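
Step 3 (deduplication) amounts to keeping the most recent record per key, which is what a ROW_NUMBER() window or ClickHouse's argMax does in SQL. The same logic in Python, over illustrative records:

```python
def dedupe_latest(records, key=("user_id", "item_id"), ts="timestamp"):
    """Keep only the most recent record for each (user_id, item_id) key."""
    latest = {}
    for rec in records:
        k = tuple(rec[f] for f in key)
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return list(latest.values())

events = [
    {"user_id": "u1", "item_id": "i1", "timestamp": 100},
    {"user_id": "u1", "item_id": "i1", "timestamp": 200},  # duplicate, newer
    {"user_id": "u2", "item_id": "i1", "timestamp": 150},
]
deduped = dedupe_latest(events)
```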

5. Combine Real-Time and Batch Data

Use real-time connectors for events and batch connectors for catalogs, then join them in a view:

SELECT i.user_id, i.item_id, i.event_type, i.timestamp, p.category, p.price
FROM user_interactions i
LEFT JOIN product_catalog p ON i.item_id = p.item_id

POST /v2/views
{
  "name": "enriched_interactions",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT i.user_id, i.item_id, i.event_type, i.timestamp, p.category, p.price FROM user_interactions i LEFT JOIN product_catalog p ON i.item_id = p.item_id"
}
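
Conceptually this is a left join of an event stream onto a catalog lookup: every interaction is kept, and catalog attributes come back NULL when the item is missing. A Python sketch with illustrative rows:

```python
def enrich(interactions, catalog):
    """Left-join interactions with catalog attributes (mirrors the view above)."""
    by_id = {p["item_id"]: p for p in catalog}
    enriched = []
    for i in interactions:
        p = by_id.get(i["item_id"], {})  # missing item -> empty attributes
        enriched.append({**i, "category": p.get("category"), "price": p.get("price")})
    return enriched

rows = enrich(
    [{"user_id": "u1", "item_id": "i1"}, {"user_id": "u1", "item_id": "i9"}],
    [{"item_id": "i1", "category": "books", "price": 10}],
)
```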

Common Patterns

Pattern 1: Event Enrichment

Enrich real-time events with batch catalog data:

SELECT e.user_id, e.item_id, e.event_type, e.timestamp, p.title, p.category, p.brand,
       u.segment, u.country
FROM events e
LEFT JOIN products p ON e.item_id = p.item_id
LEFT JOIN users u ON e.user_id = u.user_id

POST /v2/views
{
  "name": "enriched_events",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT e.user_id, e.item_id, e.event_type, e.timestamp, p.title, p.category, p.brand, u.segment, u.country FROM events e LEFT JOIN products p ON e.item_id = p.item_id LEFT JOIN users u ON e.user_id = u.user_id"
}

Pattern 2: Multi-Source Aggregation

Aggregate data from multiple sources:

SELECT user_id, COUNT(*) as total_interactions, COUNT(DISTINCT item_id) as unique_items_viewed,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchases,
       MAX(timestamp) as last_activity
FROM (
  SELECT user_id, item_id, event_type, timestamp FROM segment_events
  UNION ALL
  SELECT user_id, item_id, 'transaction' as event_type, timestamp FROM transactions
)
GROUP BY user_id

POST /v2/views
{
  "name": "user_activity_summary",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT user_id, COUNT(*) as total_interactions, COUNT(DISTINCT item_id) as unique_items_viewed, SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchases, MAX(timestamp) as last_activity FROM (SELECT user_id, item_id, event_type, timestamp FROM segment_events UNION ALL SELECT user_id, item_id, 'transaction' as event_type, timestamp FROM transactions) GROUP BY user_id"
}
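
The UNION ALL + GROUP BY logic can be sketched over in-memory rows to make it concrete (sample data is illustrative; note that, as in the SQL, transaction rows get event_type 'transaction' and so do not count as purchases):

```python
def summarize(segment_events, transactions):
    """Union both sources, then aggregate per user (mirrors the view above)."""
    rows = list(segment_events) + [
        {**t, "event_type": "transaction"} for t in transactions
    ]
    out = {}
    for r in rows:
        s = out.setdefault(r["user_id"], {
            "total_interactions": 0, "items": set(),
            "purchases": 0, "last_activity": r["timestamp"],
        })
        s["total_interactions"] += 1
        s["items"].add(r["item_id"])                    # distinct item count
        s["purchases"] += r["event_type"] == "purchase" # True adds 1
        s["last_activity"] = max(s["last_activity"], r["timestamp"])
    return out
```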

Pattern 3: Feature Engineering

Create derived features from raw data:

SELECT item_id, title, category, price,
       CASE
         WHEN price < 50 THEN 'low'
         WHEN price < 200 THEN 'medium'
         ELSE 'high'
       END as price_segment,
       dateDiff('day', created_at, now()) as days_since_creation,
       COALESCE(rating_count, 0) as review_count
FROM products

POST /v2/views
{
  "name": "items_with_features",
  "view_type": "SQL",
  "sql_view_type": "MATERIALIZED_VIEW",
  "sql_query": "SELECT item_id, title, category, price, CASE WHEN price < 50 THEN 'low' WHEN price < 200 THEN 'medium' ELSE 'high' END as price_segment, dateDiff('day', created_at, now()) as days_since_creation, COALESCE(rating_count, 0) as review_count FROM products"
}
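
These derived features translate directly into code; a Python sketch of the same logic (thresholds mirror the CASE expression above, and are illustrative):

```python
from datetime import date

def price_segment(price: float) -> str:
    """Bucket price into the segments used by the items_with_features view."""
    if price < 50:
        return "low"
    if price < 200:
        return "medium"
    return "high"

def item_features(item: dict, today: date) -> dict:
    """Derive the same features as the view for a single catalog row."""
    return {
        "item_id": item["item_id"],
        "price_segment": price_segment(item["price"]),
        "days_since_creation": (today - item["created_at"]).days,
        "review_count": item.get("rating_count") or 0,  # COALESCE(rating_count, 0)
    }
```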

Data Layer Architecture

The Shaped data layer follows this architecture:

+------------------+
|   Data Sources   |
|   (PostgreSQL,   |
|   Segment, etc)  |
+------------------+
         |
         v
+------------------+
|      Tables      |  <- Raw data from connectors
|   (item_table,   |
|    user_table)   |
+------------------+
         |
         v
+------------------+
|      Views       |  <- Transformations and joins
|   (SQL views,    |
|    AI views)     |
+------------------+
         |
         v
+------------------+
|     Engines      |  <- Training and indexing
|   (embeddings,   |
|    models)       |
+------------------+

This layered approach keeps raw data accessible, enables incremental transformations, and lets you reuse views across multiple engines.

Next Steps