Data Unification and Transformation
One of Shaped's core value propositions is the ability to unify heterogeneous data sources into a single, cohesive recommendation system. Even if your data is spread across multiple systems, in different formats, or isn't perfectly cleaned, Shaped can help you bring it all together.
The Data Fragmentation Problem
Many organizations face data fragmentation: multiple data sources (Segment, PostgreSQL, Snowflake, MongoDB), inconsistent schemas, uncleaned data, and different update frequencies. Shaped addresses these challenges through a unified data layer that connects, transforms, and consolidates your data.
The Data Unification Flow
Step 1: Connect Your Data Sources
Connect to your data sources using Shaped connectors. Create each table via POST /v2/tables:
# Real-time connector (Segment)
POST /v2/tables
{
"name": "user_events",
"schema_type": "SEGMENT"
}
# Batch connector (PostgreSQL)
POST /v2/tables
{
"name": "products",
"schema_type": "POSTGRES",
"host": "your-db-host.com",
"port": 5432,
"user": "your_user",
"password": "your_password",
"database": "your_database",
"table": "products",
"replication_key": "updated_at"
}
Each connector handles authentication, schema detection, and data synchronization automatically.
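Before sending a POST /v2/tables request, it can help to validate the payload client-side. A minimal sketch, assuming the required-key sets below (derived only from the two examples above, not from an exhaustive Shaped API reference) and an illustrative `validate_table_payload` helper:

```python
import json

# Required keys per connector type -- illustrative, inferred from the
# examples above, not an exhaustive list from the Shaped API.
REQUIRED_KEYS = {
    "SEGMENT": {"name", "schema_type"},
    "POSTGRES": {"name", "schema_type", "host", "port", "user",
                 "password", "database", "table"},
}

def validate_table_payload(payload: dict) -> dict:
    """Check a POST /v2/tables payload before sending it."""
    schema_type = payload.get("schema_type")
    required = REQUIRED_KEYS.get(schema_type)
    if required is None:
        raise ValueError(f"unknown schema_type: {schema_type!r}")
    missing = required - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return payload

products = validate_table_payload({
    "name": "products",
    "schema_type": "POSTGRES",
    "host": "your-db-host.com",
    "port": 5432,
    "user": "your_user",
    "password": "your_password",
    "database": "your_database",
    "table": "products",
    "replication_key": "updated_at",
})
print(json.dumps(products))  # ready to send as the request body
```

Catching a missing `database` or `table` key locally is cheaper than debugging a rejected request.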
Step 2: Transform with Views
Create SQL views via POST /v2/views to transform and unify your data. Views use ClickHouse SQL, so every query must use ClickHouse functions and syntax.
Join multiple sources:
SELECT p.product_id as item_id, p.name as title, p.description, p.category, p.price,
i.inventory_count, r.rating_avg
FROM products p
LEFT JOIN inventory i ON p.product_id = i.product_id
LEFT JOIN reviews_summary r ON p.product_id = r.product_id
POST /v2/views
{
"name": "unified_items",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT p.product_id as item_id, p.name as title, p.description, p.category, p.price, i.inventory_count, r.rating_avg FROM products p LEFT JOIN inventory i ON p.product_id = i.product_id LEFT JOIN reviews_summary r ON p.product_id = r.product_id"
}
Standardize schemas:
SELECT user_id, item_id, event_timestamp as created_at,
CASE
WHEN event_type = 'purchase' THEN 1
WHEN event_type = 'view' THEN 0.5
ELSE 0
END as label
FROM user_events
WHERE event_type IN ('purchase', 'view', 'click')
POST /v2/views
{
"name": "normalized_interactions",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT user_id, item_id, event_timestamp as created_at, CASE WHEN event_type = 'purchase' THEN 1 WHEN event_type = 'view' THEN 0.5 ELSE 0 END as label FROM user_events WHERE event_type IN ('purchase', 'view', 'click')"
}
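Before creating the view, you can sanity-check the event-to-label mapping offline against a sample of your events. A minimal Python mirror of the CASE expression above (the function name is illustrative):

```python
# Mirror of the CASE expression in normalized_interactions:
# purchase -> 1, view -> 0.5, click kept with label 0.
LABELS = {"purchase": 1.0, "view": 0.5, "click": 0.0}

def to_label(event_type: str):
    """Return the training label for an event type, or None if the
    event would be filtered out by the WHERE clause."""
    return LABELS.get(event_type)  # None == filtered out

events = ["purchase", "view", "click", "page_load"]
labels = [to_label(e) for e in events if to_label(e) is not None]
print(labels)  # -> [1.0, 0.5, 0.0]
```

Running this over a day of raw events quickly shows whether any event types you care about would fall through to the WHERE filter.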
Clean and enrich data:
SELECT item_id, COALESCE(title, 'Untitled') as title, COALESCE(description, '') as description,
category, price,
CASE
WHEN price < 50 THEN 'budget'
WHEN price < 200 THEN 'mid-range'
ELSE 'premium'
END as price_tier,
toYear(created_at) as year_created
FROM products
WHERE item_id IS NOT NULL
POST /v2/views
{
"name": "cleaned_items",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT item_id, COALESCE(title, 'Untitled') as title, COALESCE(description, '') as description, category, price, CASE WHEN price < 50 THEN 'budget' WHEN price < 200 THEN 'mid-range' ELSE 'premium' END as price_tier, toYear(created_at) as year_created FROM products WHERE item_id IS NOT NULL"
}
On-demand views:
Use sql_view_type: "VIEW" for data that changes frequently:
SELECT user_id, item_id, event_type, timestamp
FROM user_events
WHERE timestamp > now() - INTERVAL 1 HOUR
POST /v2/views
{
"name": "recent_interactions",
"view_type": "SQL",
"sql_view_type": "VIEW",
"sql_query": "SELECT user_id, item_id, event_type, timestamp FROM user_events WHERE timestamp > now() - INTERVAL 1 HOUR"
}
Step 2b: Enrich with AI
Enrich your data with AI views to generate semantic content:
Example: Generate product descriptions
POST /v2/views
{
"name": "items_with_ai_descriptions",
"view_type": "AI_ENRICHMENT",
"source_table": "cleaned_items",
"source_columns": ["title", "category", "price_tier"],
"source_columns_in_output": ["item_id", "title"],
"enriched_output_columns": ["ai_description"],
"prompt": "Generate a concise product description based on the title, category, and price tier. Focus on key features and use cases."
}
Learn more about AI enrichments in the AI Enrichment guide.
Step 3: Use Unified Data in Engines
Reference your unified views in engine configurations:
version: v2
name: unified_recommendations
data:
  item_table:
    name: unified_items # Uses the view we created
  user_table:
    name: user_profiles
  interaction_table:
    name: normalized_interactions # Uses the normalized view
Best Practices for Data Unification
1. Start with Raw Tables
Create tables that connect directly to your sources without transformation. This gives you flexibility to experiment with different transformations later.
2. Use Views for Transformation
Views let you join multiple sources, standardize schemas, clean data, and create derived features. Materialized views are pre-computed for fast training and querying.
3. Handle Schema Differences
Standardize column names and data types across connectors:
-- Map column names
SELECT customer_id as user_id, product_sku as item_id FROM transactions
-- Normalize data types
SELECT CAST(user_id AS String) as user_id, CAST(price AS Float64) as price FROM raw_data
Handle different data structures:
JSON data (MongoDB, document stores):
SELECT JSONExtractString(document, 'user_id') as user_id,
       JSONExtractString(document, 'item_id') as item_id,
       parseDateTimeBestEffort(JSONExtractString(document, 'timestamp')) as created_at
FROM user_profiles
WHERE JSONExtractString(document, 'status') = 'active'
POST /v2/views
{
"name": "normalized_mongodb_data",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT document->>'user_id' as user_id, document->>'item_id' as item_id, document->>'timestamp'::timestamp as created_at FROM user_profiles WHERE document->>'status' = 'active'"
}
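When debugging a document-store extraction like this, it can help to spot-check a few raw documents in Python with the same field logic. A sketch using only the standard library (the `extract_profile` helper and the sample document are illustrative):

```python
import json
from datetime import datetime

def extract_profile(raw: str):
    """Pull the fields the normalized_mongodb_data view extracts,
    returning None for profiles the WHERE clause would drop."""
    doc = json.loads(raw)
    if doc.get("status") != "active":
        return None
    return {
        "user_id": doc["user_id"],
        "item_id": doc["item_id"],
        "created_at": datetime.fromisoformat(doc["timestamp"]),
    }

row = extract_profile(
    '{"user_id": "u1", "item_id": "i9", '
    '"timestamp": "2024-05-01T12:00:00", "status": "active"}'
)
print(row["user_id"], row["created_at"].year)  # -> u1 2024
```

If a field is missing or the timestamp format varies across documents, this surfaces the problem before the view silently produces nulls.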
Nested structures:
SELECT item_id, JSONExtractString(attributes, 'brand') as brand, JSONExtractString(attributes, 'color') as color FROM products
4. Clean Data Incrementally
Don't try to clean everything at once. Start with the most important data and iterate:
- Essential columns first: Ensure item_id, user_id, and label columns exist
- Handle nulls: Use COALESCE or default values for missing data
- Deduplicate: Remove duplicate records using DISTINCT or window functions
- Validate: Add WHERE clauses to filter invalid data
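The checklist above applied to a small batch of records, sketched in Python. This is purely illustrative — in practice this logic lives in the view's SQL, not in application code:

```python
def clean(records):
    """Validate, fill nulls, and deduplicate interaction records."""
    seen, out = set(), []
    for r in records:
        # Validate: drop rows missing essential columns.
        if not r.get("user_id") or not r.get("item_id"):
            continue
        # Handle nulls: default label to 0 (COALESCE equivalent).
        r = {**r, "label": r.get("label") or 0}
        # Deduplicate on (user_id, item_id, label).
        key = (r["user_id"], r["item_id"], r["label"])
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

rows = clean([
    {"user_id": "u1", "item_id": "i1", "label": 1},
    {"user_id": "u1", "item_id": "i1", "label": 1},    # duplicate
    {"user_id": None, "item_id": "i2", "label": 1},    # invalid
    {"user_id": "u2", "item_id": "i3", "label": None}, # null label
])
print(len(rows))  # -> 2
```

The same four steps map directly onto a WHERE clause, COALESCE, and DISTINCT in the view's SQL.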
5. Combine Real-Time and Batch Data
Use real-time connectors for events and batch connectors for catalogs, then join them in a view:
SELECT i.user_id, i.item_id, i.event_type, i.timestamp, p.category, p.price
FROM user_interactions i
LEFT JOIN product_catalog p ON i.item_id = p.item_id
POST /v2/views
{
"name": "enriched_interactions",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT i.user_id, i.item_id, i.event_type, i.timestamp, p.category, p.price FROM user_interactions i LEFT JOIN product_catalog p ON i.item_id = p.item_id"
}
Common Patterns
Pattern 1: Event Enrichment
Enrich real-time events with batch catalog data:
SELECT e.user_id, e.item_id, e.event_type, e.timestamp, p.title, p.category, p.brand,
u.segment, u.country
FROM events e
LEFT JOIN products p ON e.item_id = p.item_id
LEFT JOIN users u ON e.user_id = u.user_id
POST /v2/views
{
"name": "enriched_events",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT e.user_id, e.item_id, e.event_type, e.timestamp, p.title, p.category, p.brand, u.segment, u.country FROM events e LEFT JOIN products p ON e.item_id = p.item_id LEFT JOIN users u ON e.user_id = u.user_id"
}
Pattern 2: Multi-Source Aggregation
Aggregate data from multiple sources:
SELECT user_id, COUNT(*) as total_interactions, COUNT(DISTINCT item_id) as unique_items_viewed,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchases,
MAX(timestamp) as last_activity
FROM (
SELECT user_id, item_id, event_type, timestamp FROM segment_events
UNION ALL
SELECT user_id, item_id, 'transaction' as event_type, timestamp FROM transactions
)
GROUP BY user_id
POST /v2/views
{
"name": "user_activity_summary",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT user_id, COUNT(*) as total_interactions, COUNT(DISTINCT item_id) as unique_items_viewed, SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchases, MAX(timestamp) as last_activity FROM (SELECT user_id, item_id, event_type, timestamp FROM segment_events UNION ALL SELECT user_id, item_id, 'transaction' as event_type, timestamp FROM transactions) GROUP BY user_id"
}
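To check your expectations against a sample of unioned events before materializing the view, the same aggregation can be mirrored in Python (the `summarize` helper and sample data are illustrative):

```python
from collections import defaultdict

def summarize(events):
    """Mirror of the user_activity_summary aggregation over
    (user_id, item_id, event_type, timestamp) tuples."""
    stats = defaultdict(lambda: {"total": 0, "items": set(),
                                 "purchases": 0, "last": None})
    for user_id, item_id, event_type, ts in events:
        s = stats[user_id]
        s["total"] += 1                      # COUNT(*)
        s["items"].add(item_id)              # COUNT(DISTINCT item_id)
        s["purchases"] += event_type == "purchase"  # SUM(CASE ...)
        s["last"] = ts if s["last"] is None else max(s["last"], ts)
    return {u: {"total_interactions": s["total"],
                "unique_items_viewed": len(s["items"]),
                "purchases": s["purchases"],
                "last_activity": s["last"]}
            for u, s in stats.items()}

summary = summarize([
    ("u1", "i1", "view", 100),
    ("u1", "i2", "purchase", 200),
    ("u1", "i1", "view", 150),
])
print(summary["u1"])
```

Comparing this against the view's output for a handful of users is a quick way to catch UNION ALL mistakes (duplicated or dropped event streams).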
Pattern 3: Feature Engineering
Create derived features from raw data:
SELECT item_id, title, category, price,
CASE
WHEN price < 50 THEN 'low'
WHEN price < 200 THEN 'medium'
ELSE 'high'
END as price_segment,
dateDiff('day', created_at, now()) as days_since_creation,
COALESCE(rating_count, 0) as review_count
FROM products
POST /v2/views
{
"name": "items_with_features",
"view_type": "SQL",
"sql_view_type": "MATERIALIZED_VIEW",
"sql_query": "SELECT item_id, title, category, price, CASE WHEN price < 50 THEN 'low' WHEN price < 200 THEN 'medium' ELSE 'high' END as price_segment, dateDiff('day', created_at, now()) as days_since_creation, COALESCE(rating_count, 0) as review_count FROM products"
}
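The derived features above are easy to unit-test outside SQL. A Python mirror of the two derivations (thresholds come from the SQL above; the helper names are illustrative):

```python
from datetime import date

def price_segment(price: float) -> str:
    """Mirror of the CASE expression in items_with_features."""
    if price < 50:
        return "low"
    if price < 200:
        return "medium"
    return "high"

def days_since(created_at: date, today: date) -> int:
    """Equivalent of dateDiff('day', created_at, now())."""
    return (today - created_at).days

print(price_segment(19.99),                           # -> low
      price_segment(120),                             # -> medium
      days_since(date(2024, 1, 1), date(2024, 1, 31)))  # -> 30
```

Keeping the thresholds in one tested place makes it easier to keep the SQL and any application-side logic in sync when the tiers change.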
Data Layer Architecture
The Shaped data layer follows this architecture:
┌──────────────────┐
│   Data Sources   │
│   (PostgreSQL,   │
│   Segment, etc)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      Tables      │  ← Raw data from connectors
│   (item_table,   │
│    user_table)   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      Views       │  ← Transformations and joins
│   (SQL views,    │
│    AI views)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Engines      │  ← Training and indexing
│   (embeddings,   │
│    models)       │
└──────────────────┘
This layered approach keeps raw data accessible, enables incremental transformations, and lets you reuse views across multiple engines.
Next Steps
- Learn about Connector Types to understand real-time vs batch
- See Views for more transformation examples
- Learn about AI Enrichments to add semantic content
- Check the Connector Reference for specific connector setup