Evaluating Your Models
Evaluating recommendation models is notoriously hard, and there isn't a one-size-fits-all approach. This guide walks through different ways to evaluate your model, the common pitfalls when evaluating recommendation models, and how to avoid them. We'll also cover in detail how Shaped evaluates models, including the metrics we use and how we compare models to baselines.
A Typical Evaluation Workflow
Evaluating recommendation models is usually done at two different stages, which we call the offline stage and the online stage.
Offline Evaluation
The offline stage occurs before you've deployed the candidate model or ranking algorithm to live users.
Evaluating Metrics
At this stage, algorithms are typically evaluated quantitatively by measuring how well the model predicts relevant user interactions on a hold-out set of data, for a given set of metrics. We can also evaluate the model qualitatively by looking at descriptive analytics of the recommendations, e.g. the distribution of recommendations in the top-k results and how diverse the recommendations are. We recommend looking at the resources at the end of this guide to understand the common metrics to evaluate.
A hold-out set is a subset of your data that you deliberately avoid training on so that you can test the model's performance on unseen data. For production machine-learning use-cases like recommendation systems, it's important to split this hold-out set chronologically, so that you're testing the model's performance on future data and avoiding time-based data leakage.
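As a concrete illustration, here's a minimal sketch of a chronological split, assuming your interactions live in a pandas DataFrame with a `timestamp` column (the column names and the 20% hold-out fraction are illustrative):

```python
import pandas as pd

def chronological_split(interactions: pd.DataFrame, holdout_fraction: float = 0.2):
    """Split interactions by time so the hold-out set only contains events
    that happen after every event in the training set."""
    interactions = interactions.sort_values("timestamp")
    cutoff_idx = int(len(interactions) * (1 - holdout_fraction))
    cutoff_time = interactions["timestamp"].iloc[cutoff_idx]
    train = interactions[interactions["timestamp"] < cutoff_time]
    holdout = interactions[interactions["timestamp"] >= cutoff_time]
    return train, holdout

# Usage (assuming columns user_id, item_id, timestamp):
# train, holdout = chronological_split(events_df, holdout_fraction=0.2)
```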
The Problems With Offline Evaluation Metrics
Offline evaluation metrics are a great way to get a sense of how well your model is performing; however, in some cases they can be misleading. The biggest problem is that predicting a held-out set of interactions is not the same as predicting what your users will actually interact with. Offline evaluation is observational -- we're evaluating how well we fit logged data -- rather than interventional -- evaluating how changing the recommendation algorithm leads to different outcomes (e.g. purchases). If the logged data is biased in any way, this can lead to misleading results. Here are some examples of bias that we commonly see:
Data delivery bias: Your interactions will be biased towards the historic delivery mechanism used to surface recommendations. For example, if you've only been showing users the most popular items for the last year, then your interactions will have a significant bias towards popular items. In this case, the best algorithm on the held-out set will typically be the same one you're already using to serve recommendations; however, this doesn't mean it's the best algorithm for your users.
Cold-start bias: Related to data delivery bias, but so common it deserves its own point: your interactions will be biased towards older or newer items. For example, new items may have fewer interactions, which means they're not weighted as highly within the held-out set.
Observational bias: Even in a perfect world with no data delivery biases, where all items were historically served completely at random, the algorithms will still be biased towards an environment that isn't affected by the candidate recommendations themselves. Once the algorithm is deployed to production, the way users interact with items will change, and therefore the model's performance will change.
Offline Metric Evaluation as a Compass
Considering all the issues with offline metric evaluation, how do we interpret the results?
We like to think of offline metric evaluation as a compass, rather than a map. You can use it to understand characteristics of the model relative to baseline algorithms, but you can't interpret the metrics too literally. For example, a precision of 10% doesn't mean that 10% of the items in a slate will be relevant in a live test; however, if it's 1% better than a trending baseline, that's a good sign it's worth evaluating in an online setting. Note also that even if it were 1% worse, it might still be worthwhile to evaluate in an online setting if the results are more diverse than the baseline, or if you happen to know the sampled data is severely biased towards the baseline.
User Drill Down Analysis
Within the offline evaluation stage, it's also critical to qualitatively evaluate the candidate model by looking closely at a sample of recommendations for different users. For example, for a book recommendation model, you might find a user that has only interacted with romance books and confirm that the model is recommending mostly romance books to that user.
Evaluating the model in this way helps sanity-check that everything is working as expected. If we see unexpected qualitative results despite good quantitative results, it may mean the objective being used to train or evaluate the model is incorrect.
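As a lightweight example of this kind of spot check, the sketch below prints recent interactions next to the top recommendations for a few randomly sampled users. All names are illustrative; the random sampling also helps with the selection biases discussed next.

```python
import random

def drill_down(recommendations: dict, histories: dict, n_users: int = 5, seed: int = 7):
    """Print each sampled user's recent interactions alongside their top
    recommendations so obvious mismatches stand out."""
    rng = random.Random(seed)
    for user in rng.sample(sorted(recommendations), min(n_users, len(recommendations))):
        print(f"user {user}")
        print("  recent history:  ", histories.get(user, [])[-5:])
        print("  recommendations: ", recommendations[user][:5])

# recommendations: {user_id: ranked list of item_ids}
# histories:       {user_id: chronologically ordered list of item_ids}
```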
The Problems With User Drill Down Analysis
The biggest issue with user drill down is the human biases that come in when evaluating the results. This typically happens in two ways: user-selection biases and product biases.
User-selection biases: Say you're evaluating a recommendation model, and instead of a random user you pick your own internal user. You know your interests best, so it might seem obvious to try yourself first. The problem is that you're likely biased in ways related to being an employee at the company: you might have internal features that result in a different user experience than the average user, and your interactions may not reflect your true interests because you test the product constantly. Even choosing a random 'power user' can be misleading, as these power users are sometimes employees themselves or have some other bias that makes them less useful to evaluate manually. We suggest choosing several random users when evaluating.
Product biases: The other common human bias comes from preconceived product assumptions about what you think users are interested in compared to what they're actually interested in. For example, assuming that a user's demographics are a good predictor of their interests when in fact they're not. Sometimes it's best not to be overly prescriptive about what you expect users to see and, as long as the results aren't majorly wrong, let the online metrics speak for themselves.
Other Ways to Evaluate Offline
There are several other ways to evaluate offline that are out of scope for this guide, including:
- Using model explainability tools to get a better understanding of how the model is making predictions (e.g. which features are most important, and are they what you'd expect?).
- Using counterfactual evaluation to estimate the outcomes of potential A/B tests without actually running them. This solves the observational bias problem mentioned above, but requires a lot of logged data to simulate the outcomes correctly (see the sketch below).
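One common flavour of counterfactual evaluation is inverse propensity scoring (IPS). Here's a rough, minimal sketch of the idea, assuming you logged the probability with which each recommendation was shown; the names and the weight cap are illustrative, not a definitive implementation:

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs, max_weight: float = 10.0):
    """Estimate the average reward a candidate (target) policy would have
    collected, using only data logged under the old (logging) policy.

    rewards       -- observed outcomes for each logged recommendation (e.g. 1 = click)
    logging_probs -- probability the logging policy showed that item
    target_probs  -- probability the candidate policy would show that item
    max_weight    -- cap on importance weights to keep variance under control
    """
    weights = np.asarray(target_probs) / np.asarray(logging_probs)
    weights = np.clip(weights, 0.0, max_weight)
    return float(np.mean(weights * np.asarray(rewards)))

# Toy example:
# ips_estimate(rewards=[1, 0, 0, 1],
#              logging_probs=[0.5, 0.2, 0.1, 0.4],
#              target_probs=[0.6, 0.1, 0.3, 0.5])
```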
Online Evaluation
Online evaluation occurs after you've deployed your model to production and are serving end-users with results from your algorithm. This is the gold-standard of evaluation as you can objectively track the impact of your model on your target business objectives (e.g. clicks, purchases) in an interventional way.
Typically, when first deploying a new algorithm to production, you'll run an A/B test where you serve the new algorithm to a subset of users and compare the results to a control group that's served the old algorithm. This is the best way to understand the impact of the new algorithm on your business objectives relative to the old one, and it removes confounders that might otherwise affect the evaluation metrics (e.g. seasonality may affect purchase rates in a way that's unrelated to the recommendation algorithm).
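For a sense of what a statistically significant result looks like in practice, here's a minimal two-proportion z-test sketch for comparing a conversion-style metric (e.g. purchase rate) between the control and treatment groups. It's a simplified stand-in for a proper experimentation framework, and the numbers in the usage comment are made up:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_control: int, n_control: int,
                          conv_treatment: int, n_treatment: int):
    """Two-sided z-test on the difference in conversion rates.
    Returns (absolute uplift, p-value)."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    p_pool = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value

# e.g. uplift, p = two_proportion_z_test(480, 10_000, 545, 10_000)
```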
The Problems & Pitfalls of Online Evaluation
The main problem with online evaluation is that it's time-consuming. It can take a while to set up correctly, particularly if you don't have a solid experimentation framework, and you have to wait for enough data to be collected to make a statistically significant decision (e.g. more than two weeks). Despite this, as an objective measure of uplift it's nearly always worth it once you feel confident the offline results are at least comparable with a baseline.
That all being said, there are several pitfalls during online evaluation that are worth mentioning:
Looking at only one metric: If you only look at one metric, you may be optimizing for it at the expense of others. For example, if you're optimizing for click-through rate, you might end up recommending the same popular items to everyone, which might not be the best for your business in the long run. We recommend looking at a suite of metrics to understand the full picture.
Looking only at aggregate data: If you only look at data in aggregate, you might miss important sub-populations that are affected by the algorithm in different ways. For example, if you're optimizing for purchases, you might miss that the algorithm is actually decreasing the number of purchases from your most loyal users. We recommend looking at the results of the A/B test across different user segments.
Focusing on short-term signals: If you only look at short-term signals like clicks, you might miss the long-term impact of the algorithm, e.g. on 30-day retention. Even if the algorithm is increasing clicks in the short term, it may be worthwhile to run a long-term hold-back experiment indefinitely, keeping a baseline algorithm in front of a small subset of users (e.g. 5%).
Understanding Shaped's Evaluation Engine
Behind Shaped's powerful recommendation engine lies a rigorous evaluation framework that ensures we're constantly delivering the best possible results. We evaluate multiple model policies on your data and select the top performers to power your live recommendations. This guide explains the key metrics we use to assess and compare model performance, providing insights into how Shaped optimizes for relevance, ranking quality, and overall user experience.
Metrics at a Glance
Shaped calculates the following metrics on a held-out dataset during training:
- Recall@k: Measures the proportion of relevant items that appear within the top k recommendations. A higher recall indicates the model's ability to retrieve a larger proportion of relevant items.
- Precision@k: Measures the proportion of recommendations within the top k that are actually relevant. A higher precision indicates the model's ability to surface relevant items early in the ranking.
- MAP@k (Mean Average Precision@k): Calculates the average precision across multiple queries, considering the order of relevant items within the top k. MAP@k provides a more comprehensive view of ranking quality than precision alone.
- NDCG@k (Normalized Discounted Cumulative Gain@k): Similar to MAP@k, NDCG@k accounts for the position of relevant items but also considers the relevance scores themselves, giving higher weight to more relevant items appearing at the top.
- Hit Ratio@k: Represents the percentage of users for whom at least one relevant item is present within the top k recommendations. A high hit ratio signifies the model's effectiveness in satisfying a broad range of user preferences.
- Coverage@k: Measures the diversity of recommendations by calculating the percentage of unique items recommended across all users within the top k. Higher coverage indicates a wider exploration of your item catalog.
- Personalization@k: Quantifies the degree of personalization by measuring the dissimilarity of recommendations across different users. Higher personalization suggests that the model tailors recommendations to individual user preferences rather than providing generic results.
- Average Popularity@k: Provides insights into the model's tendency to recommend popular items by averaging the popularity scores of items within the top k recommendations.

Understanding k: The k parameter represents the number of recommendations considered (e.g., the top 10, 20, etc.). We calculate these metrics across various values of k to provide a comprehensive view of model performance across different recommendation list sizes.
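To make some of these metrics concrete, here's a minimal reference sketch of how Recall@k, Precision@k, Hit Ratio@k, and Coverage@k can be computed from per-user recommendation lists and a held-out set of interactions. The function and argument names are illustrative, and this isn't Shaped's internal implementation:

```python
def evaluate_at_k(recommended: dict, relevant: dict, k: int, catalog_size: int) -> dict:
    """Compute Recall@k, Precision@k, Hit Ratio@k and Coverage@k.

    recommended  -- {user_id: ranked list of item_ids}
    relevant     -- {user_id: set of held-out item_ids the user interacted with}
    catalog_size -- total number of items, used for Coverage@k
    """
    recalls, precisions, hits = [], [], []
    unique_recommended = set()
    for user, recs in recommended.items():
        top_k = recs[:k]
        unique_recommended.update(top_k)
        rel = relevant.get(user, set())
        if not rel:
            continue  # skip users with no held-out interactions
        n_hits = len(set(top_k) & rel)
        recalls.append(n_hits / len(rel))
        precisions.append(n_hits / k)
        hits.append(1.0 if n_hits else 0.0)
    n = len(recalls)
    return {
        f"recall@{k}": sum(recalls) / n,
        f"precision@{k}": sum(precisions) / n,
        f"hit_ratio@{k}": sum(hits) / n,
        f"coverage@{k}": len(unique_recommended) / catalog_size,
    }
```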
Segmented Analysis for Deeper Insights
We go beyond overall performance by calculating these metrics for various data segments, including:
- New Users: Evaluates how effectively the model recommends to users with limited interaction history.
- New Items: Assesses the model's ability to surface new or less popular items.
- Power Users: Examines performance for users who engage heavily with your platform.
- Power Items: Analyzes how well the model handles highly popular or trending items.
This segmented analysis provides insights into the model's strengths and weaknesses across different user groups and item types, enabling us to fine-tune performance and address specific challenges effectively.
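As an illustration of how such a segmented breakdown can be produced, the sketch below buckets users by interaction count and reuses the `evaluate_at_k` helper from the earlier sketch; the thresholds are arbitrary examples, not Shaped's actual segment definitions:

```python
def segment_users(interaction_counts: dict, new_max: int = 3, power_min: int = 50) -> dict:
    """Bucket users into coarse segments by how many interactions they have."""
    segments = {"new_users": [], "power_users": [], "other": []}
    for user, count in interaction_counts.items():
        if count <= new_max:
            segments["new_users"].append(user)
        elif count >= power_min:
            segments["power_users"].append(user)
        else:
            segments["other"].append(user)
    return segments

# Per-segment metrics, reusing evaluate_at_k from above:
# for name, users in segment_users(counts).items():
#     segment_recs = {u: recommended[u] for u in users if u in recommended}
#     print(name, evaluate_at_k(segment_recs, relevant, k=10, catalog_size=n_items))
```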
Baseline Comparisons: Outperforming The Ordinary
To demonstrate the value of Shaped's approach, we compare our top-performing model policies against two baselines:
- Popular Baseline: Ranks items based solely on their overall popularity. This represents a non-personalized approach.
- Random Baseline: Generates recommendations randomly. This serves as a lower bound for comparison.
Comparing against these baselines highlights the significant uplift in performance achieved by Shaped's sophisticated algorithms and personalized approach.
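For reference, the two baselines are conceptually as simple as the sketch below (an illustrative reconstruction, not Shaped's internal code):

```python
import random
from collections import Counter

def popular_baseline(train_interactions, k: int):
    """Non-personalized baseline: rank items by interaction count in the training data."""
    counts = Counter(item for _, item in train_interactions)
    return [item for item, _ in counts.most_common(k)]

def random_baseline(catalog, k: int, seed: int = 42):
    """Lower-bound baseline: sample k catalog items uniformly at random."""
    return random.Random(seed).sample(list(catalog), k)

# train_interactions is assumed to be an iterable of (user_id, item_id) pairs.
```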
Accessing Evaluation Metrics
You can view detailed evaluation results, including metric values and baseline comparisons, directly within the Shaped dashboard. These insights empower you to:
- Understand Model Performance: Gain a clear understanding of how well Shaped's models are performing on your specific data.
- Track Progress Over Time: Monitor changes in performance as you refine your models, adjust parameters, or incorporate new data.
- Make Informed Decisions: Use these insights to make data-driven decisions about your recommendation strategy and optimize your platform for maximum user engagement.
Conclusion
We've walked through a typical evaluation workflow for recommendation models, notably the two main stages of offline and online evaluation. If you want to dive deeper, take a look at the resources below, where we dig into the specifics of different evaluation metrics and methodologies.
Resources
- Evaluating Recommendation Systems -- Precision@k, Recall@k, and R-Precision
- Evaluating recommendation systems -- mAP, MMR, NDCG
- Evaluating recommendation systems (ROC, AUC, and Precision-Recall)
- Not your average RecSys metrics. Part 1: Serendipity
- Not your average RecSys metrics Part 2: Novelty
- Counterfactual Evaluation for Recommendation Systems