
Evaluate engine performance

Building an engine is only the first step toward good results. Once your engine is created, you should train multiple similar engines and test them against each other.

This process is called evaluation.

Evaluation happens in two stages:

  1. Offline evaluation - Test model performance on historical data before deployment
  2. Online evaluation - Measure real-world impact with live users through A/B testing

Use offline evaluation to narrow an initial set of candidate engines (for example, ten) down to the best two or three, then validate the winners with online A/B tests.

Calculating offline evaluation metrics in Shaped

Shaped automatically evaluates models against a chronologically split held-out dataset and displays results in the Offline Metrics tab.
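
Shaped performs this split for you. If you want to reproduce the idea on your own data for a sanity check, a chronological hold-out can be sketched roughly as follows; the pandas usage and the created_at column name are assumptions about your interaction table, not Shaped internals:

import pandas as pd

def chronological_split(interactions: pd.DataFrame, holdout_fraction: float = 0.2):
    # Hold out the most recent interactions for evaluation.
    # Assumes a timestamp column named "created_at"; adjust to your schema.
    ordered = interactions.sort_values("created_at")
    cutoff = int(len(ordered) * (1 - holdout_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]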

Setup

Enable evaluation in your engine configuration:

{
  "training": {
    "evaluation": {
      "enable": true,
      "candidate_source": "batch_iids",
      "filter_seen_items": false,
      "evaluation_top_k": 50
    }
  }
}

Note: Offline evaluation is only supported for engines with trained model policies and an interaction table.

Available Metrics

Shaped calculates these metrics on held-out data:

Relevance Metrics:

  • Recall@k - Proportion of a user's relevant items that appear in the top k recommendations
  • Precision@k - Proportion of top k recommendations that are relevant
  • Hit Ratio@k - Percentage of users with at least one relevant item in top k

Ranking Quality:

  • MAP@k - Mean Average Precision across queries, accounting for item order
  • NDCG@k - Normalized Discounted Cumulative Gain; rewards placing the most relevant items near the top of the list

Diversity Metrics:

  • Coverage@k - Percentage of the item catalog that appears in at least one user's top k recommendations
  • Personalization@k - Dissimilarity of recommendations across users

Popularity:

  • Average Popularity@k - Average popularity score of recommended items

The k parameter (e.g., 10, 20, 50) represents the number of recommendations evaluated. Metrics are calculated across multiple k values.
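
Shaped reports these values automatically, but the definitions are simple enough to reproduce when debugging. A minimal sketch of Precision@k, Recall@k, Hit Ratio@k, and binary-relevance NDCG@k for a single user might look like this, where the recommended list and relevant set are assumed to come from your own held-out data:

import math

def precision_recall_hit_at_k(recommended, relevant, k):
    # recommended: ordered list of item ids; relevant: set of held-out item ids.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit

def ndcg_at_k(recommended, relevant, k):
    # Binary-relevance NDCG@k: gain of 1 when a held-out item is recommended.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

Averaging these per-user values across the held-out users gives aggregate figures comparable in spirit to the numbers shown in the dashboard.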

Segmented Analysis

Metrics are calculated for specific segments:

  • New Users - Users with limited interaction history
  • New Items - Recently added items
  • Power Users - Highly engaged users
  • Power Items - Popular or trending items
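
Shaped's exact segment definitions aren't spelled out here, but the general idea is a bucketing by interaction volume. As a purely illustrative sketch with made-up thresholds:

def segment_users(interaction_counts, new_max=3, power_quantile=0.9):
    # interaction_counts: dict mapping user_id to number of interactions.
    # Thresholds are illustrative assumptions, not Shaped's definitions.
    counts = sorted(interaction_counts.values())
    power_cutoff = counts[int(len(counts) * power_quantile)] if counts else 0
    segments = {}
    for user, n in interaction_counts.items():
        if n <= new_max:
            segments[user] = "new"
        elif n >= power_cutoff:
            segments[user] = "power"
        else:
            segments[user] = "core"
    return segments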

Baselines

Your model is compared against:

  • Popular Baseline - Items ranked by overall popularity
  • Random Baseline - Random recommendations
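
Both baselines are cheap to reproduce for a sanity check outside the dashboard. A rough sketch, assuming interactions are available as (user_id, item_id) pairs:

from collections import Counter
import random

def popular_baseline(interactions, k=50):
    # Rank items by raw interaction count and return the top k.
    counts = Counter(item_id for _, item_id in interactions)
    return [item for item, _ in counts.most_common(k)]

def random_baseline(all_items, k=50, seed=42):
    # Uniform random recommendations, seeded for reproducibility.
    return random.Random(seed).sample(list(all_items), k)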

Viewing Results

Access metrics in the Shaped dashboard:

  • Metric values across different k values
  • Performance by segment
  • Baseline comparisons
  • Historical trends

Offline Evaluation Tips

Common Data Biases

Data Delivery Bias: Interactions reflect your previous recommendation strategy. If you've only shown popular items, the best offline algorithm may just mimic your old system.

Cold-Start Bias: New items and users have fewer interactions and are underweighted in held-out sets.

Observational Bias: Offline evaluation measures fit to logged data, not how recommendations change user behavior in production.

User Drill-Down Analysis

Manually examine recommendations for random users:

  1. Select several random users from different segments
  2. Review their interaction history
  3. Verify recommendations align with their interests

This catches issues that aggregate metrics miss. Don't use employee accounts or non-random users.
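
To keep the sample honest, draw it programmatically instead of hand-picking accounts. A small sketch, assuming you already have a segment label per user and a list of internal accounts to exclude:

import random

def sample_users_for_review(segments, per_segment=5, exclude=frozenset(), seed=7):
    # segments: dict mapping user_id to a segment label.
    # exclude: internal or employee account ids to skip.
    rng = random.Random(seed)
    by_segment = {}
    for user, segment in segments.items():
        if user not in exclude:
            by_segment.setdefault(segment, []).append(user)
    return {segment: rng.sample(users, min(per_segment, len(users)))
            for segment, users in by_segment.items()}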

Interpretation Guidelines

  • Treat offline metrics as directional, not absolute predictions
  • Compare relative performance between models
  • Don't over-optimize for a single metric
  • High Precision with low Coverage may indicate over-recommending popular items
  • Evaluate the full suite of metrics together

Online Evaluation

Online evaluation measures real-world performance by A/B testing models with live users.

Running A/B Tests

  1. Select your top 2-3 models from offline evaluation
  2. Split user traffic between models and a control group
  3. Serve each group recommendations from their assigned model
  4. Track business metrics (clicks, purchases, retention, etc.)
  5. Compare performance across groups

Example traffic split:

  • Control: 40% (current production model)
  • Model A: 30% (top offline performer)
  • Model B: 30% (second best)
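
One common way to implement such a split is a deterministic hash of the user id, so each user always lands in the same group across sessions. A minimal sketch of the 40/30/30 allocation above; the salt and bucket boundaries are illustrative, and this is not a Shaped API:

import hashlib

# Cumulative allocation: control 40%, model A next 30%, model B the rest.
SPLIT = [("control", 0.40), ("model_a", 0.70), ("model_b", 1.00)]

def assign_variant(user_id, salt="engine-ab-test-1"):
    # Hash the user id into [0, 1) so each user always gets the same group.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)
    for variant, upper_bound in SPLIT:
        if bucket < upper_bound:
            return variant
    return SPLIT[-1][0]

Keeping the salt fixed for the lifetime of the test keeps assignments stable; changing it reshuffles every user into a new group.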

Key Considerations

  • Run tests for at least two weeks to capture weekly usage cycles and collect enough data for statistical significance
  • A/B testing isolates model impact from seasonality and other factors
  • Ensure sufficient sample size in each group
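
As a rough illustration of the significance and sample-size points, a two-proportion z-test on a binary metric such as conversion can be computed directly. This is a generic statistical sketch, and the example numbers are made up:

import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    # Two-sided z-test on a binary metric (e.g. converted vs. not).
    p_a, p_b = conversions_a / users_a, conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Illustrative numbers only: 2.0% vs. 2.2% conversion.
z, p = two_proportion_z_test(1000, 50_000, 825, 37_500)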

Common Pitfalls

Single-Metric Optimization: Track multiple metrics:

  • Short-term: clicks, views
  • Medium-term: purchases, conversions
  • Long-term: retention, lifetime value
  • Diversity and discovery

Aggregate-Only Analysis: Break down results by user segment to catch impacts on specific subgroups.
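
A per-segment breakdown can be as simple as grouping the experiment log by segment and variant; the column names below are assumptions about your own logging schema:

import pandas as pd

def conversion_by_segment(events: pd.DataFrame) -> pd.DataFrame:
    # Expects one row per user with "segment", "variant", and a 0/1 "converted" column.
    return (events.groupby(["segment", "variant"])["converted"]
                  .agg(users="count", conversion_rate="mean")
                  .reset_index())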

Short-Term Focus: Consider running longer experiments (4-8 weeks) and tracking delayed metrics. Maintaining a small holdout (around 5%) on the baseline indefinitely lets you keep measuring long-term effects.

Success Criteria

Define before testing:

  • Primary metric - Main business objective
  • Secondary metrics - Supporting indicators
  • Guardrail metrics - Metrics that shouldn't degrade

A successful model improves the primary metric without harming guardrails.
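
Once those metrics are chosen, the launch decision can be written down as an explicit check agreed on before the test starts. A sketch with placeholder thresholds:

def should_launch(primary_lift, guardrail_deltas, min_lift=0.02, max_guardrail_drop=-0.01):
    # primary_lift: relative change vs. control (0.03 means +3%).
    # guardrail_deltas: dict of guardrail metric name -> relative change.
    # Thresholds are placeholders you would fix before the test starts.
    guardrails_ok = all(delta >= max_guardrail_drop
                        for delta in guardrail_deltas.values())
    return primary_lift >= min_lift and guardrails_ok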