Cross Features
Beyond Individual Values: The Power of Feature Relationships for Relevance
We've explored engineering features from individual data sources: User profiles, Item metadata, Contextual information (like time or device), and the Language or Images associated with them. While vital, the real magic often happens when we explicitly model the relationships between these sources. This involves creating features that capture interactions (e.g., how a specific User attribute relates to an Item attribute within a given Context) or deriving features by aggregating historical Interactions (like clicks, views, purchases). These advanced techniques allow systems to grasp:
- Cross-Entity Interactions: Does this user's country influence their preference for this item's category when using a mobile device?
- Historical Activity & Popularity: How many times have users from this segment viewed this brand? What's the click-through rate for items under $50 when shown on the homepage?
- Learned Propensities (Target Encoding): Based on past behavior, what's the average conversion rate associated with this user ID interacting with this item’s category?
- Multimodal Affinity: How well does the semantic meaning of a user's profile description match the semantic meaning of an item's description?
Transforming data by explicitly creating interaction terms or calculating statistics over historical Interaction Tables (linking Users, Items, and Context through events) is a potent layer of feature engineering. Get it right, and models gain deep contextual understanding. Neglect it, and crucial predictive signals hidden in the data's relationships remain untapped.
The Standard Approach: Building the Infrastructure for Relational & Aggregate Features
Creating these features requires robust data pipelines, careful consideration of entity relationships, and infrastructure for computation, storage, and serving, while guarding against data leakage.
Core Data Entities Involved:
- User Attribute Table: Contains user characteristics (e.g., `user_id`, `country`, `age_bracket`, `segment`, `profile_text_embedding`).
- Item Attribute Table: Contains item characteristics (e.g., `item_id`, `category`, `brand`, `price`, `description_embedding`, `image_embedding`).
- Context Attribute Table (often implicit/request-time): Contains situational characteristics (e.g., `timestamp`, `device_type`, `location`, `query_text_embedding`).
- Interaction Table (Events Log): The crucial link. Records events like clicks, views, purchases, and ratings, typically containing `user_id`, `item_id`, `timestamp`, `event_type`, and potentially contextual snapshots. This table is the primary source for computing counts, rates, and target encodings.
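To make these relationships concrete, here is a minimal sketch with invented toy rows (pandas stands in for whatever processing framework you use; context attributes are folded into the interaction rows for brevity):

```python
import pandas as pd

# Toy versions of the core entity tables (all rows invented for illustration).
users = pd.DataFrame([
    {"user_id": "u1", "country": "US", "age_bracket": "25-34"},
    {"user_id": "u2", "country": "UK", "age_bracket": "35-44"},
])
items = pd.DataFrame([
    {"item_id": "i1", "category": "Electronics", "brand": "Acme", "price": 49.99},
    {"item_id": "i2", "category": "Fashion", "brand": "Bolt", "price": 19.99},
])
# Interaction (events) table: the crucial link between entities.
interactions = pd.DataFrame([
    {"user_id": "u1", "item_id": "i1", "event_type": "click", "device_type": "mobile"},
    {"user_id": "u2", "item_id": "i2", "event_type": "view", "device_type": "desktop"},
])

# Enrich each event with user and item attributes by joining on the keys;
# counts, rates, and target encodings are computed over this enriched table.
events = interactions.merge(users, on="user_id").merge(items, on="item_id")
print(events[["user_id", "item_id", "country", "category", "event_type"]])
```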
Step 1: Feature Interactions & Combinations (Explicit Crosses)
Create new features representing the combination of attributes, often across different entity tables.
- Categorical Crosses: Combine levels from categorical attributes across entities.
  - Example: Feature `user_country_X_item_category` with values like 'US_Electronics', 'UK_Fashion'. Computed by joining User and Item attributes (often at modeling time or precomputed).
- Multimodal Interactions (Embedding-Based): Calculate the affinity between features from different modalities, typically using embeddings.
  - Example: `user_profile_text_affinity_with_item_description`, computed as `dot_product(user_profile_text_embedding, item_description_embedding)`. This scalar similarity score becomes a new numerical feature.
  - Other examples: `dot_product(user_visual_embedding, item_image_embedding)`, `dot_product(query_text_embedding, item_text_embedding)`.
  - Computation: Requires embeddings to be generated first. The dot product is often calculated just-in-time during model scoring, or precomputed if the embeddings are stable.
- Numerical/Binned Crosses: Combine categorical attributes with numerical or binned numerical ones.
  - Example: `item_category_X_user_age_bracket`.
The Challenge: Combinatorial explosion is a real issue, so meaningful crosses must be selected carefully. Multimodal interactions require managing and aligning embedding spaces.
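As a sketch of both kinds of crosses, the toy snippet below (invented values; `numpy`/`pandas` stand in for your actual framework) builds a categorical cross column and a multimodal affinity score:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_country": ["US", "UK"],
    "item_category": ["Electronics", "Fashion"],
})
# Categorical cross: concatenate the levels into one combined feature.
df["user_country_X_item_category"] = df["user_country"] + "_" + df["item_category"]

# Multimodal affinity: dot product between (pretend) embedding vectors.
user_emb = np.array([0.1, 0.8, 0.3])   # e.g. user profile text embedding
item_emb = np.array([0.2, 0.7, 0.1])   # e.g. item description embedding
affinity = float(np.dot(user_emb, item_emb))  # scalar similarity feature

print(df["user_country_X_item_category"].tolist(), round(affinity, 2))
```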
Step 2: Count & Rate Features (Aggregations over the Interaction Table)
Calculate statistics by aggregating the Interaction Table, grouped by keys derived from User, Item, and/or Context attributes.
- Computation:
- Requires powerful aggregation frameworks (e.g., Apache Spark, Flink, SQL window functions, specialized stream processors).
- Involves filtering the Interaction Table (e.g., by event type, time window) and grouping by attribute values (potentially joined from User/Item tables).
  - Example Count: `SELECT user_id, item_category, COUNT(*) FROM Interactions JOIN Items ON Interactions.item_id = Items.item_id WHERE Interactions.event_type = 'click' AND Interactions.timestamp > '...' GROUP BY user_id, item_category`
  - Example Rate: Build upon counts and add smoothing (e.g., a Bayesian prior: `(count + prior_count) / (total_impressions + prior_impressions)`). Handle division by zero.
- Examples:
  - `count(clicks on item_id in last 7d)`
  - `count(views by user_id on brand in last 30d)`
  - `rate(purchases / clicks for item_category by users in country)`
  - `rate(clicks / views for item_id where context_device='mobile')`
The Challenge: Computationally intensive, requires scalable infrastructure. Defining meaningful groupings and time windows is key. Needs robust pipelines for incremental updates or periodic recalculations.
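A minimal pandas sketch of the smoothed-rate idea, using an illustrative prior CTR of 0.1 weighted as 10 pseudo-impressions (these constants are not prescriptive, just an assumption for the example):

```python
import pandas as pd

# Toy click log: one row per impression, 1 = clicked.
events = pd.DataFrame({
    "item_category": ["Electronics"] * 3 + ["Fashion"] * 2,
    "is_click": [1, 0, 1, 0, 0],
})

# Raw counts per category.
agg = events.groupby("item_category")["is_click"].agg(clicks="sum", impressions="count")

# Smoothed rate with a Bayesian-style prior: a prior CTR of 0.1 weighted as
# 10 pseudo-impressions stabilises low-volume groups and avoids division by
# zero for categories with no impressions in the window.
prior_rate, prior_weight = 0.1, 10
agg["smoothed_ctr"] = (agg["clicks"] + prior_rate * prior_weight) / (
    agg["impressions"] + prior_weight
)
print(agg)
```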
Step 3: Target Encoding (Aggregations over Interaction Table using Target Variable)
Replace categorical feature levels (often combinations of User/Item/Context attributes) with the historical average of the target variable from the Interaction Table.
- Computation:
  - Similar aggregation logic to counts/rates, but the aggregated value is the target variable (e.g., `AVG(is_clicked)`).
  - Important: Use techniques like cross-validation folds or hold-out sets during training computation to prevent target leakage. The encoding for a data point must not be influenced by that data point's own target value.
  - Smoothing is highly recommended.
- Examples (calculated safely):
  - `avg(is_clicked for item_category)`
  - `avg(is_purchased for user_segment X item_brand)`
The Challenge: High risk of target leakage. Complex implementation to ensure safe calculation during training. Needs careful regularization/smoothing.
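One common leakage-safe scheme is out-of-fold encoding: each row's encoded value is computed only from rows in the other folds, so its own target never leaks into its feature. A toy sketch (fold count, seed, and smoothing constants are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "item_category": ["A", "A", "A", "B", "B", "B"],
    "is_clicked":    [1, 1, 0, 0, 0, 1],
})

global_mean = df["is_clicked"].mean()
prior_weight = 5                          # pseudo-counts toward the global mean
fold = rng.integers(0, 3, size=len(df))   # assign each row to one of 3 folds
df["cat_te"] = global_mean                # fallback for unseen categories

for k in range(3):
    # Aggregate the target only over rows OUTSIDE fold k...
    train = df[fold != k]
    stats = train.groupby("item_category")["is_clicked"].agg(["sum", "count"])
    smoothed = (stats["sum"] + global_mean * prior_weight) / (stats["count"] + prior_weight)
    # ...then apply the encoding to the rows INSIDE fold k.
    df.loc[fold == k, "cat_te"] = (
        df.loc[fold == k, "item_category"].map(smoothed).fillna(global_mean)
    )
print(df)
```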
Step 4: Computation, Storage, and Serving Infrastructure
Handling these derived features requires a dedicated infrastructure:
- Computation: Batch frameworks (Spark, Airflow+Python/SQL) for periodic recalculations of historical aggregates and target encodings. Stream processing frameworks (Flink, Spark Streaming, Kafka Streams) for near real-time counts/rates over recent windows.
- Storage:
- Feature Store: The main component. Stores precomputed features (counts, rates, target encodings, potentially some crosses or multimodal dot products). Provides APIs for both training data generation and low-latency online inference lookups. Often uses key-value stores (Redis, DynamoDB) or specialized databases optimized for fast lookups.
- Relational Databases/Data Warehouses: Store the base User, Item, and Interaction tables.
- Serving:
- Online Inference: The ranking model needs features with low latency (< tens of milliseconds). It queries the Feature Store using relevant keys (`user_id`, `item_id`, context attributes).
- Feature Consistency: The Feature Store helps ensure that the exact same feature calculation logic is used during training data generation and online serving, mitigating train-serve skew.
- Dynamic Features: Some features (e.g., interactions involving real-time context, very recent counts) might be computed partially or fully at request time, potentially combining precomputed elements from the Feature Store with live context.
The Challenge: Building and maintaining this infrastructure is complex and resource-intensive. Requires expertise in distributed systems, data engineering, and MLOps. Ensuring feature freshness, consistency, and low-latency serving at scale is hard.
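As an illustration only (a plain dict stands in for a real feature-store client, and all feature names are hypothetical), request-time feature assembly might combine precomputed aggregates with live context like this:

```python
# Stand-in for a Redis/DynamoDB-backed feature store keyed by (feature, entity).
PRECOMPUTED = {
    ("item_ctr_7d", "i1"): 0.042,
    ("user_brand_views_30d", ("u1", "Acme")): 17,
}

def assemble_features(user_id, item_id, brand, context):
    """Build the feature vector for one (user, item) scoring request."""
    return {
        # Low-latency lookups of precomputed aggregates by key.
        "item_ctr_7d": PRECOMPUTED.get(("item_ctr_7d", item_id), 0.0),
        "user_brand_views_30d": PRECOMPUTED.get(
            ("user_brand_views_30d", (user_id, brand)), 0
        ),
        # Dynamic features computed from the live request context.
        "is_mobile": int(context["device_type"] == "mobile"),
        "hour_of_day": context["hour"],
    }

features = assemble_features("u1", "i1", "Acme", {"device_type": "mobile", "hour": 14})
print(features)
```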
Simplifying Advanced Features with Shaped
The standard approach to engineering interactions and derived features is powerful but undeniably complex and resource-intensive. Shaped offers a significantly simplified alternative by abstracting away this intricate process. Instead of manually building these features and the surrounding infrastructure, you provide Shaped with the foundational data, and its internal systems handle the complexity.
Let's illustrate this using the Amazon Product Recommendations tutorial workflow:
1. Provide Foundational Data:
You start by connecting your core data sources to Shaped. In the tutorial, this involves uploading the interaction data (`All_Beauty.json`, containing reviews) and item metadata (`meta_All_Beauty.json`).

```bash
# Upload data to Shaped Datasets
shaped create-dataset-from-uri --name amazon_beauty_ratings --type json --path ./All_Beauty.json
shaped create-dataset-from-uri --name amazon_beauty_products --type json --path ./meta_All_Beauty.json
```
2. Define Data Selection & Target in YAML:
The core of the user configuration is the model YAML file. Here, you use simple SQL-like `fetch` queries to select the necessary columns from your connected datasets and define the prediction target (`label`).
```yaml
model:
  name: amazon_beauty_product_recommendations
connectors:
  - type: Dataset
    name: amazon_beauty_products
    id: amazon_beauty_products
  - type: Dataset
    name: amazon_beauty_ratings
    id: amazon_beauty_ratings
fetch:
  events: |
    SELECT
      CASE WHEN overall >= 4 THEN 1 ELSE 0 END AS label, -- TARGET
      asin AS item_id,                                   -- Item ID
      reviewerID AS user_id,                             -- User ID
      unixReviewTime AS created_at,                      -- Interaction timestamp
      summary                                            -- Raw text feature
    FROM amazon_beauty_ratings
  items: |
    SELECT
      asin AS item_id,
      title,          -- Raw text feature
      price,          -- Numerical
      brand           -- Categorical
    FROM amazon_beauty_products
```
3. Shaped Handles the Complexity Internally:
When you run `shaped create-model --file ...`, Shaped takes over:
- Automated Data Processing: Ingests and cleans the data.
- Implicit Learning via Deep Models: Shaped utilizes sophisticated internal deep learning models to inherently learn complex patterns from the sequence and combination of features presented during training.
- Managed Infrastructure: All the complex computation, internal feature representation, storage, model training, and low-latency serving infrastructure is managed transparently by Shaped.
4. Simple API for Results:
After the model becomes `ACTIVE`, you interact with it via simple API calls, providing the `user_id` to get personalized rankings that already incorporate the implicitly learned advanced features.
Python:

```python
from shaped import Shaped

# Initialize the Shaped client
shaped_client = Shaped()

# Get personalized recommendations
response = shaped_client.rank(
    model_name='amazon_beauty_product_recommendations',
    user_id='USER_123',
    limit=5
)

# Print the recommendations
if response and response.metadata:
    print("Recommended Items:")
    for item in response.metadata:
        print(f"- {item['title']} (Price: {item['price']}, Brand: {item['brand']})")
else:
    print("No recommendations found.")
```
JavaScript:

```javascript
const { Shaped } = require('@shaped/shaped');

// Initialize the Shaped client
const shapedClient = new Shaped();

// Get personalized recommendations
shapedClient.rank({
  modelName: 'amazon_beauty_product_recommendations',
  userId: 'USER_123',
  limit: 5
}).then(response => {
  if (response && response.metadata) {
    console.log("Recommended Items:");
    response.metadata.forEach(item => {
      console.log(`- ${item.title} (Price: ${item.price}, Brand: ${item.brand})`);
    });
  } else {
    console.log("No recommendations found.");
  }
}).catch(error => {
  console.error("Error fetching recommendations:", error);
});
```
Conclusion: Focus on Data, Not Complex Pipelines
Engineering features based on interactions, historical counts/rates, and target encodings unlocks deeper insights and significantly boosts relevance. However, the traditional path demands substantial investment in complex data pipelines, rigorous statistical validation (especially against target leakage), and sophisticated infrastructure like feature stores.
Shaped provides a powerful alternative by abstracting this complexity. By focusing on providing clean, well-structured foundational data (interactions and metadata), users leverage Shaped's advanced internal models and automated processing. Shaped implicitly learns the synergistic signals from interactions and historical patterns, managing the underlying infrastructure and statistical complexities. This allows teams to achieve state-of-the-art results by harnessing the power of these advanced features without bearing the immense engineering burden typically required, freeing them to concentrate on data quality and core business objectives.
Ready to streamline your feature engineering process?
Request a demo today to see Shaped in action for your feature types. Or, start exploring immediately with our free trial sandbox.