
Evaluate engine performance

Building an engine is only the first step toward good results. Once your engine is created, you should train multiple similar engines and test them against each other.

This process is called evaluation.

Evaluation happens in two stages:

  1. Offline evaluation - Test model performance on historical data before deployment
  2. Online evaluation - Measure real-world impact with live users through A/B testing

Use offline evaluation to narrow an initial set of candidate engines (for example, ten) down to the best two or three, then validate the winners with online A/B tests.

Calculating offline evaluation metrics in Shaped

Shaped automatically evaluates models against a chronologically split held-out dataset and displays results in the Offline Metrics tab.
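
Shaped performs this split for you. If you want to reproduce the idea on your own data for a sanity check, a chronological hold-out can be sketched roughly as follows; the pandas usage and the created_at column name are assumptions about your interaction table, not Shaped internals:

import pandas as pd

def chronological_split(interactions: pd.DataFrame, holdout_fraction: float = 0.2):
    # Hold out the most recent interactions for evaluation.
    # Assumes a timestamp column named "created_at"; adjust to your schema.
    ordered = interactions.sort_values("created_at")
    cutoff = int(len(ordered) * (1 - holdout_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]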

Setup

Enable evaluation in your engine configuration:

{
  "training": {
    "evaluation": {
      "enable": true,
      "candidate_source": "batch_iids",
      "filter_seen_items": false,
      "evaluation_top_k": 50
    }
  }
}

Note: Offline evaluation is only supported for engines with trained model policies and an interaction table.

Available Metrics

Shaped calculates these metrics on held-out data:

Relevance Metrics:

  • Recall@k - Proportion of a user's relevant items that appear in the top k recommendations
  • Precision@k - Proportion of top k recommendations that are relevant
  • Hit Ratio@k - Percentage of users with at least one relevant item in top k

Ranking Quality:

  • MAP@k - Mean Average Precision across queries, accounting for item order
  • NDCG@k - Normalized Discounted Cumulative Gain; rewards placing the most relevant items near the top of the list

Diversity Metrics:

  • Coverage@k - Percentage of the item catalog that appears in at least one user's top k recommendations
  • Personalization@k - Dissimilarity of recommendations across users

Popularity:

  • Average Popularity@k - Average popularity score of recommended items

The k parameter (e.g., 10, 20, 50) represents the number of recommendations evaluated. Metrics are calculated across multiple k values.
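
Shaped reports these values automatically, but the definitions are simple enough to reproduce when debugging. A minimal sketch of Precision@k, Recall@k, Hit Ratio@k, and binary-relevance NDCG@k for a single user might look like this, where the recommended list and relevant set are assumed to come from your own held-out data:

import math

def precision_recall_hit_at_k(recommended, relevant, k):
    # recommended: ordered list of item ids; relevant: set of held-out item ids.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit

def ndcg_at_k(recommended, relevant, k):
    # Binary-relevance NDCG@k: gain of 1 when a held-out item is recommended.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

Averaging these per-user values across the held-out users gives aggregate figures comparable in spirit to the numbers shown in the dashboard.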

Segmented Analysis

Metrics are calculated for specific segments:

  • New Users - Users with limited interaction history
  • New Items - Recently added items
  • Power Users - Highly engaged users
  • Power Items - Popular or trending items
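
Shaped's exact segment definitions aren't spelled out here, but the general idea is a bucketing by interaction volume. As a purely illustrative sketch with made-up thresholds:

def segment_users(interaction_counts, new_max=3, power_quantile=0.9):
    # interaction_counts: dict mapping user_id to number of interactions.
    # Thresholds are illustrative assumptions, not Shaped's definitions.
    counts = sorted(interaction_counts.values())
    power_cutoff = counts[int(len(counts) * power_quantile)] if counts else 0
    segments = {}
    for user, n in interaction_counts.items():
        if n <= new_max:
            segments[user] = "new"
        elif n >= power_cutoff:
            segments[user] = "power"
        else:
            segments[user] = "core"
    return segments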

Baselines

Your model is compared against:

  • Popular Baseline - Items ranked by overall popularity
  • Random Baseline - Random recommendations
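
Both baselines are cheap to reproduce for a sanity check outside the dashboard. A rough sketch, assuming interactions are available as (user_id, item_id) pairs:

from collections import Counter
import random

def popular_baseline(interactions, k=50):
    # Rank items by raw interaction count and return the top k.
    counts = Counter(item_id for _, item_id in interactions)
    return [item for item, _ in counts.most_common(k)]

def random_baseline(all_items, k=50, seed=42):
    # Uniform random recommendations, seeded for reproducibility.
    return random.Random(seed).sample(list(all_items), k)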

Viewing Results

Access metrics in the Shaped dashboard:

  • Metric values across different k values
  • Performance by segment
  • Baseline comparisons
  • Historical trends

Offline Evaluation Tips

Common Data Biases

Data Delivery Bias: Interactions reflect your previous recommendation strategy. If you've only shown popular items, the best offline algorithm may just mimic your old system.

Cold-Start Bias: New items and users have fewer interactions and are underweighted in held-out sets.

Observational Bias: Offline evaluation measures fit to logged data, not how recommendations change user behavior in production.

User Drill-Down Analysis

Manually examine recommendations for random users:

  1. Select several random users from different segments
  2. Review their interaction history
  3. Verify recommendations align with their interests

This catches issues that aggregate metrics miss. Don't use employee accounts or non-random users.
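
To keep the sample honest, draw it programmatically instead of hand-picking accounts. A small sketch, assuming you already have a segment label per user and a list of internal accounts to exclude:

import random

def sample_users_for_review(segments, per_segment=5, exclude=frozenset(), seed=7):
    # segments: dict mapping user_id to a segment label.
    # exclude: internal or employee account ids to skip.
    rng = random.Random(seed)
    by_segment = {}
    for user, segment in segments.items():
        if user not in exclude:
            by_segment.setdefault(segment, []).append(user)
    return {segment: rng.sample(users, min(per_segment, len(users)))
            for segment, users in by_segment.items()}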

Interpretation Guidelines

  • Treat offline metrics as directional, not absolute predictions
  • Compare relative performance between models
  • Don't over-optimize for a single metric
  • High Precision with low Coverage may indicate over-recommending popular items
  • Evaluate the full suite of metrics together

Online Evaluation

Online evaluation measures real-world performance by A/B testing models with live users.

Running A/B Tests

  1. Select your top 2-3 models from offline evaluation
  2. Split user traffic between models and a control group
  3. Serve each group recommendations from their assigned model
  4. Track business metrics (clicks, purchases, retention, etc.)
  5. Compare performance across groups

Example traffic split:

  • Control: 40% (current production model)
  • Model A: 30% (top offline performer)
  • Model B: 30% (second best)
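
One common way to implement such a split is a deterministic hash of the user id, so each user always lands in the same group across sessions. A minimal sketch of the 40/30/30 allocation above; the salt and bucket boundaries are illustrative, and this is not a Shaped API:

import hashlib

# Cumulative allocation: control 40%, model A next 30%, model B the rest.
SPLIT = [("control", 0.40), ("model_a", 0.70), ("model_b", 1.00)]

def assign_variant(user_id, salt="engine-ab-test-1"):
    # Hash the user id into [0, 1) so each user always gets the same group.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)
    for variant, upper_bound in SPLIT:
        if bucket < upper_bound:
            return variant
    return SPLIT[-1][0]

Keeping the salt fixed for the lifetime of the test keeps assignments stable; changing it reshuffles every user into a new group.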

Key Considerations

  • Run tests for at least two weeks to capture weekly usage cycles and collect enough data for statistical significance
  • A/B testing isolates model impact from seasonality and other factors
  • Ensure sufficient sample size in each group
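
As a rough illustration of the significance and sample-size points, a two-proportion z-test on a binary metric such as conversion can be computed directly. This is a generic statistical sketch, and the example numbers are made up:

import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    # Two-sided z-test on a binary metric (e.g. converted vs. not).
    p_a, p_b = conversions_a / users_a, conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Illustrative numbers only: 2.0% vs. 2.2% conversion.
z, p = two_proportion_z_test(1000, 50_000, 825, 37_500)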

Common Pitfalls

Single-Metric Optimization: Track multiple metrics:

  • Short-term: clicks, views
  • Medium-term: purchases, conversions
  • Long-term: retention, lifetime value
  • Diversity and discovery

Aggregate-Only Analysis: Break down results by user segment to catch impacts on specific subgroups.
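
A per-segment breakdown can be as simple as grouping the experiment log by segment and variant; the column names below are assumptions about your own logging schema:

import pandas as pd

def conversion_by_segment(events: pd.DataFrame) -> pd.DataFrame:
    # Expects one row per user with "segment", "variant", and a 0/1 "converted" column.
    return (events.groupby(["segment", "variant"])["converted"]
                  .agg(users="count", conversion_rate="mean")
                  .reset_index())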

Short-Term Focus: Consider running longer experiments (4-8 weeks) and tracking delayed metrics. Maintaining a small holdout (around 5%) on the baseline indefinitely lets you keep measuring long-term effects.

Success Criteria

Define before testing:

  • Primary metric - Main business objective
  • Secondary metrics - Supporting indicators
  • Guardrail metrics - Metrics that shouldn't degrade

A successful model improves the primary metric without harming guardrails.
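
Once those metrics are chosen, the launch decision can be written down as an explicit check agreed on before the test starts. A sketch with placeholder thresholds:

def should_launch(primary_lift, guardrail_deltas, min_lift=0.02, max_guardrail_drop=-0.01):
    # primary_lift: relative change vs. control (0.03 means +3%).
    # guardrail_deltas: dict of guardrail metric name -> relative change.
    # Thresholds are placeholders you would fix before the test starts.
    guardrails_ok = all(delta >= max_guardrail_drop
                        for delta in guardrail_deltas.values())
    return primary_lift >= min_lift and guardrails_ok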