
Evaluating Your Model

Evaluating recommendation models is notoriously hard, and there isn't a one-size-fits-all approach. This guide walks through different ways to evaluate your model, the common pitfalls when evaluating recommendation models, and how to avoid them.

A Typical Evaluation Workflow

Evaluating recommendation models is usually done at two different stages, which we call the offline stage and the online stage.

Offline Evaluation

The offline stage occurs before you've deployed the candidate model or ranking algorithm to live users.

Evaluating Metrics

At this stage, algorithms are typically evaluated quantitatively by looking at how well the model predicts relevant user interactions on a hold-out set of data, for a given set of metrics. We can also evaluate the model qualitatively by looking at descriptive analytics of the recommendations, e.g. the distribution of recommendations in the top-k results and how diverse the recommendations are. We recommend looking at some of the resources at the end of the guide to understand the common metrics to evaluate.
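
To make the quantitative side concrete, here's a minimal sketch of computing two such metrics over a hold-out set: precision@k and catalog coverage. The data, function names (`precision_at_k`, `catalog_coverage`) and the catalog size are all hypothetical; a real evaluation would run over your full hold-out set.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in the user's held-out interactions."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

def catalog_coverage(all_recommendations, catalog_size, k=10):
    """Fraction of the catalog that appears in at least one user's top-k slate."""
    shown = {item for recs in all_recommendations.values() for item in recs[:k]}
    return len(shown) / catalog_size

# Hypothetical top-3 recommendations and held-out interactions per user.
recs = {"u1": ["b3", "b7", "b1"], "u2": ["b2", "b9", "b4"]}
held_out = {"u1": {"b7"}, "u2": {"b5"}}

avg_precision = sum(precision_at_k(recs[u], held_out[u], k=3) for u in recs) / len(recs)
print(f"precision@3 = {avg_precision:.2f}")
print(f"coverage@3  = {catalog_coverage(recs, catalog_size=10, k=3):.2f}")
```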

info

A hold-out set is a subset of your data that you deliberately avoid training on, so that you can test the model's performance on unseen data. For production machine-learning use cases like recommendation systems, it's important to split this hold-out set chronologically, so that you're testing the model's performance on future data and avoiding time-based data leakage.
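
As an illustration, here's a minimal sketch of a chronological split using pandas, assuming an interaction log with a timestamp column (the column names and the 80/20 split are arbitrary):

```python
import pandas as pd

# Hypothetical interaction log; the column names are assumptions.
interactions = pd.DataFrame({
    "user_id":   ["u1", "u2", "u1", "u3", "u2"],
    "item_id":   ["b1", "b2", "b3", "b1", "b9"],
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-01", "2024-03-15", "2024-04-02"]
    ),
})

# Split on time rather than at random, so every held-out interaction
# happened after everything the model is trained on.
interactions = interactions.sort_values("timestamp")
split_idx = int(len(interactions) * 0.8)
train, holdout = interactions.iloc[:split_idx], interactions.iloc[split_idx:]
```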

The Problems With Offline Evaluation Metrics

Offline evaluation metrics are a great way to get a sense of how well your model is performing; however, in some cases they can be misleading. The biggest problem is that predicting a held-out set of interactions is not the same as predicting what your users will actually interact with. Offline evaluation is observational (we're evaluating how well we fit logged data) rather than interventional (evaluating how changing the recommendation algorithm leads to different outcomes, e.g. purchases). If the logged data is biased in any way, this can lead to misleading results. Here are some examples of bias that we commonly see:

  1. Data delivery bias: Your interactions will be biased towards the historic delivery mechanism used to surface recommendations. For example, if you've only been showing users the most popular items for the last year, then your interactions will have a significant bias towards popular items. In this case, the best algorithm on the held-out set will typically be the same one you're already using to serve recommendations, but this doesn't mean it's the best algorithm for your users.

  2. Cold-start bias: Related to data delivery bias, but so common it deserves its own point: your interactions will be biased towards older or newer items. For example, new items may have fewer interactions, which means they're not weighted as highly within the held-out set.

  3. Observational bias: Even in a perfect world with no data delivery biases, where all items were historically served completely at random, the algorithms will still be biased towards an environment that isn't affected by the candidate recommendations themselves. Once you deploy the algorithm to production, the way users interact with items will change, and therefore the model's performance will change.

Offline Metric Evaluation as a Compass

Considering all the issues with offline metric evaluation, how do we interpret the results?

We like to think of offline metric evaluation as a compass rather than a map. You can use it to understand characteristics of the model relative to baseline algorithms, but you can't interpret the metrics too literally: a precision of 10% doesn't mean that 10% of the items within a slate will be relevant in a live test. However, if the candidate is 1% better than a trending baseline, that's a good sign it's worth evaluating in an online setting. Note also that even if it were 1% worse, it might still be worthwhile to evaluate online if its results are more diverse than the baseline's, or if you happen to know the sampled data is severely biased towards the baseline.
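
As a sketch of what "relative to a baseline" can look like in practice, the snippet below compares a candidate model's precision against a simple trending (popularity) baseline on the same hold-out set. All of the data is made up; the point is the comparison, not the absolute numbers.

```python
from collections import Counter

# Hypothetical data: training interactions, held-out interactions, and the
# candidate model's top-2 recommendations per user.
train_items = ["b2", "b2", "b7", "b1", "b2", "b9"]
held_out = {"u1": {"b7", "b2"}, "u2": {"b2"}, "u3": {"b9"}}
candidate_recs = {"u1": ["b7", "b3"], "u2": ["b2", "b4"], "u3": ["b1", "b9"]}

# Trending baseline: everyone gets the globally most popular training items.
trending = [item for item, _ in Counter(train_items).most_common(2)]

def mean_precision_at_k(recs_by_user, k=2):
    per_user = [
        sum(1 for item in recs_by_user[u][:k] if item in held_out[u]) / k
        for u in held_out
    ]
    return sum(per_user) / len(per_user)

# Interpret these relative to each other, not as absolute truth.
print("candidate precision@2:", round(mean_precision_at_k(candidate_recs), 2))
print("trending  precision@2:", round(mean_precision_at_k({u: trending for u in held_out}), 2))
```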

User Drill Down Analysis

Within the offline evaluation stage, it's also critical to qualitatively evaluate the candidate model by looking closely at a sample of recommendations for different users. For example, for a book recommendation model, you might find a user who has only interacted with romance books and confirm that the model is recommending mostly romance books to that user.

Evaluating the model in this way helps sanity check that everything is working as expected. If we see unexpected qualitative results despite good quantitative results, it may mean the objective being used to train or evaluate the model is incorrect.
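
A drill-down doesn't need much tooling. Here's a minimal sketch, assuming you can look up a user's interaction history and the model's recommendations (the data and lookups are hypothetical):

```python
import random

# Hypothetical lookups: past interactions and the model's top recommendations.
user_history = {
    "u1": ["Pride and Prejudice", "Outlander"],
    "u2": ["Dune", "Neuromancer"],
    "u3": ["The Hobbit", "Mistborn"],
}
model_recs = {
    "u1": ["Jane Eyre", "The Notebook"],
    "u2": ["Foundation", "Snow Crash"],
    "u3": ["The Name of the Wind", "Elantris"],
}

# Sample random users (rather than hand-picking them) and eyeball whether
# the recommendations plausibly match each user's history.
for user_id in random.sample(list(user_history), k=2):
    print(f"user {user_id}")
    print("  history:        ", user_history[user_id])
    print("  recommendations:", model_recs[user_id])
```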

The Problems With User Drill Down Analysis

The biggest issue with user drill down is the human biases that come in when evaluating the results. This typically happens in two ways: user-selection biases and product biases.

  1. User-selection biases: Say you're evaluating a recommendation model and, instead of a random user, you pick your own internal user. You know your interests best, so it might seem obvious to try yourself first. The problem is that you're likely biased in ways related to being an employee of the company: you might have internal features that result in a different user experience than the average user, and your interactions may not reflect your true interests because you test the product constantly. Even choosing a random 'power user' can be misleading, as these power users are sometimes actually employees or have some other bias that makes them less useful to manually evaluate. We suggest choosing several random users when evaluating.

  2. Product biases: The other common human bias comes from preconceived product assumptions about what you think users are interested in compared to what they're actually interested in, for example assuming that a user's demographics are a good predictor of their interests when in fact they're not. Sometimes it's best not to be overly prescriptive about what you expect users to see and, as long as the results aren't majorly wrong, let the online metrics speak for themselves.

Other Ways to Evaluate Offline

There are several other ways to evaluate offline that are out of scope for this doc, including:

  1. Using model explainability tools to get a better understanding of how the model is making predictions (e.g. which features are most important, and are they what you'd expect?).
  2. Using counterfactual evaluation to estimate the outcomes of potential A/B tests without actually running them (see the sketch below for a flavour of the idea). This addresses the observational bias problem mentioned above, but requires a lot of logged data to estimate reliably.
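
One common counterfactual technique is inverse propensity scoring (IPS): reweight logged outcomes by how likely the old policy was to show each item, to estimate how a new policy would have performed. Below is a minimal, illustrative sketch, assuming you logged the propensity of every impression; the data and the `new_policy` function are hypothetical and this is nowhere near production-ready.

```python
# Inverse propensity scoring (IPS) estimate of a new policy's click rate
# from logged data: each logged (item shown, click, propensity) record is
# reweighted by whether the new policy would have shown the same item.
logged = [
    # (user, item shown by old policy, clicked?, logging propensity)
    ("u1", "b2", 1, 0.5),
    ("u1", "b7", 0, 0.1),
    ("u2", "b2", 1, 0.5),
    ("u2", "b9", 0, 0.2),
]

def new_policy(user):
    """Hypothetical candidate policy: which item it would show this user."""
    return {"u1": "b2", "u2": "b9"}[user]

ips_estimate = sum(
    click * (1.0 if new_policy(user) == item else 0.0) / propensity
    for user, item, click, propensity in logged
) / len(logged)

print(f"estimated click rate under the new policy: {ips_estimate:.2f}")
```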

Online Evaluation

Online evaluation occurs after you've deployed your model to production and are serving end-users with results from your algorithm. This is the gold standard of evaluation, as you can objectively track the impact of your model on your target business objectives (e.g. clicks, purchases) in an interventional way.

Typically, when first deploying a new algorithm to production, you'll run an A/B test where you serve the new algorithm to a subset of users and compare the results to a control group that's served the old algorithm. This is the best way to understand the impact of the new algorithm on your business objectives relative to the old one, and it removes confounders that might otherwise affect the evaluation metrics (e.g. seasonality may affect purchase rates in a way that's unrelated to the recommendation algorithm).
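
As an illustration of the statistics involved, here's a minimal sketch of a two-proportion z-test on made-up conversion counts for the control and treatment groups; in practice you'd lean on your experimentation framework rather than hand-rolling this.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical A/B results: conversions and users in each group.
control_conv, control_users = 410, 10_000      # old algorithm
treatment_conv, treatment_users = 465, 10_000  # new algorithm

p1 = control_conv / control_users
p2 = treatment_conv / treatment_users

# Two-proportion z-test under the pooled null hypothesis that both
# groups share the same conversion rate.
p_pool = (control_conv + treatment_conv) / (control_users + treatment_users)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_users + 1 / treatment_users))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"control {p1:.2%}, treatment {p2:.2%}, lift {(p2 - p1) / p1:+.1%}")
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```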

The Problems & Pitfalls of Online Evaluation

The main problem with online evaluation is that it's time-consuming. It can take a while to set up correctly, particularly if you don't have a solid experimentation framework, and you have to wait for enough data to be collected to make a statistically significant decision (e.g. more than two weeks). Despite this, as an objective measure of uplift it's nearly always worth it once you feel confident the offline results are at least comparable to a baseline.

That all being said, there are several pitfalls during online evaluation that are worth mentioning:

  1. Looking at only one metric: If you only look at one metric, you may be optimizing for it at the expense of others. For example, if you're optimizing for click-through rate, you might end up recommending the same popular items to everyone, which might not be the best for your business in the long run. We recommend looking at a suite of metrics to understand the full picture.

  2. Looking at only the aggregate of data: If you only look at aggregated data, you might miss important sub-populations that are affected by the algorithm in different ways. For example, if you're optimizing for purchases, you might miss that the algorithm is actually decreasing purchases from your most loyal users. We recommend breaking the A/B test results down across different user segments (see the sketch after this list).

  3. Focusing on short-term signals: If you only look at short-term signals like clicks, you might miss the long-term impact of the algorithm, e.g. on 30-day retention. Even if the algorithm is increasing clicks in the short term, it can be worthwhile to run a long-term holdback indefinitely, keeping a baseline algorithm shown to a small subset of users (e.g. 5%).
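
For the second pitfall, the sketch below breaks a made-up set of A/B results down by user segment so that a drop in one sub-population isn't hidden by the aggregate; the column names and segments are assumptions.

```python
import pandas as pd

# Hypothetical per-user A/B results with a loyalty segment attached.
results = pd.DataFrame({
    "group":     ["control", "treatment"] * 4,
    "segment":   ["new", "new", "new", "new", "loyal", "loyal", "loyal", "loyal"],
    "purchased": [0, 1, 0, 1, 1, 0, 1, 1],
})

# Break the aggregate result down by segment to spot sub-populations
# that react differently to the new algorithm.
by_segment = (
    results.groupby(["segment", "group"])["purchased"]
    .mean()
    .unstack("group")
)
by_segment["lift"] = by_segment["treatment"] - by_segment["control"]
print(by_segment)
```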

Conclusion

We've talked through a typical evaluation workflow for recommendation models, notably the main stages of offline and online evaluation. If you want to dive deeper, take a look at the resources below, where we go into the specifics of different evaluation metrics and methodologies.

Resources