Average Popularity
When evaluating recommendation systems, we often focus intensely on relevance metrics like Precision@K, Recall@K, NDCG, and mAP. These tell us how accurate our recommendations are and how well they are ordered. But do they tell the whole story? Imagine two recommendation algorithms with similar NDCG scores. One consistently recommends the current bestsellers and viral hits, while the other surfaces lesser-known but potentially interesting items from the long tail. Are these systems truly performing equally?
Relevance metrics alone wouldn't capture this difference. We need ways to understand other characteristics of our recommendations, such as their tendency towards popular items versus niche content. This is where a metric like Average Popularity @ K comes into play. It helps diagnose potential biases and shows whether your system is genuinely personalizing or just echoing mainstream trends.
What is Popularity?
Before calculating Average Popularity, we first need to define what "popularity" means for an item. This is context-dependent but usually involves aggregating user interactions across a large portion (or all) of your user base over a specific period. Common ways to measure item popularity include:
- Total number of views or impressions.
- Total number of clicks or interactions.
- Total number of purchases or conversions.
- Total number of times added to cart or wishlist.
The key is that it reflects the item's overall engagement or success level across the platform, independent of any specific user's preference (though aggregate preferences drive it).
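As a concrete sketch, popularity scores can be derived by simply counting interactions per item across all users. The snippet below uses a hypothetical click log of `(user_id, item_id)` pairs; any of the signals listed above (views, purchases, add-to-carts) would work the same way.

```python
from collections import Counter

# Hypothetical click log: (user_id, item_id) pairs from the last month.
click_log = [
    ("u1", "item_1"), ("u2", "item_1"), ("u3", "item_2"),
    ("u1", "item_3"), ("u2", "item_1"), ("u3", "item_3"),
]

# Popularity here = total clicks per item, aggregated across all users.
item_popularity = Counter(item_id for _, item_id in click_log)
print(item_popularity)  # Counter({'item_1': 3, 'item_3': 2, 'item_2': 1})
```

In practice you would compute these counts over your full interaction table (and refresh them periodically), but the principle is the same: one global score per item, independent of any individual user.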
Calculating Average Popularity @ K
Once you have a popularity score for each item in your catalog, calculating Average Popularity @ K for a given recommendation list is simple:
- Identify Top K Items: Look at the top K items recommended by the system for a specific user or context.
- Get Popularity Scores: Retrieve the pre-calculated global popularity score for each of these K items.
- Calculate the Average: Compute the mean of these K popularity scores.
Average Popularity @ K = (Sum of Popularity Scores of top K items) / K
This process is repeated for all lists in your evaluation set, and often the overall average across all lists is reported.
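The three steps above translate directly into a few lines of code. This is a minimal sketch assuming you already have a `popularity` lookup (item ID to global score); the item IDs and scores here are illustrative.

```python
def avg_popularity_at_k(recommended_items, popularity, k):
    """Mean global popularity score of the top-k recommended items."""
    top_k = recommended_items[:k]
    return sum(popularity[item] for item in top_k) / len(top_k)

# Hypothetical global popularity scores (e.g., total clicks last month).
popularity = {"a": 10_000, "b": 8_000, "c": 12_000, "d": 500}

# Score each recommendation list, then average across the evaluation set.
lists = [["a", "b", "c"], ["d", "a", "b"]]
scores = [avg_popularity_at_k(recs, popularity, k=3) for recs in lists]
overall = sum(scores) / len(scores)
```

The per-list scores are what you inspect for individual users; the `overall` mean across all lists is the figure typically reported for the system as a whole.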
Example:
Imagine popularity is measured by total clicks in the last month.
- List A: [Item 1 (10,000 clicks), Item 2 (8,000 clicks), Item 3 (12,000 clicks)]
  - Average Popularity @ 3 = (10,000 + 8,000 + 12,000) / 3 = 10,000
- List B: [Item 4 (500 clicks), Item 5 (1,500 clicks), Item 6 (1,000 clicks)]
  - Average Popularity @ 3 = (500 + 1,500 + 1,000) / 3 = 1,000
List A clearly recommends, on average, much more popular items than List B.
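The arithmetic for both lists can be checked with a quick sketch, using the click counts given in the example:

```python
# Click counts from the example above.
list_a = [10_000, 8_000, 12_000]  # Items 1-3
list_b = [500, 1_500, 1_000]      # Items 4-6

print(sum(list_a) / 3)  # 10000.0
print(sum(list_b) / 3)  # 1000.0
```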
Interpreting Average Popularity
Average Popularity is not inherently "good" or "bad." Its interpretation depends heavily on your goals:
- High Average Popularity: Indicates the algorithm tends to recommend mainstream hits or bestsellers.
  - Potential Pros: Might lead to higher immediate click-through rates (CTR), as popular items are often "safe bets."
  - Potential Cons: Suggests weak personalization, potential filter-bubble effects, lack of discovery for niche items, and missed opportunities to surface relevant long-tail content. Could indicate the model is overfitting to popular items or ignoring user-specific signals.
- Low Average Popularity: Indicates the algorithm recommends more niche, less-viewed, or long-tail items.
  - Potential Pros: Suggests stronger personalization, potential for serendipity and discovery, and exposure to diverse content.
  - Potential Cons: Might lead to lower immediate CTR if items are too obscure, with a risk of showing irrelevant niche items if personalization isn't accurate.
The key is often balance and comparison. You might track Average Popularity alongside relevance metrics during A/B tests. If a new algorithm improves NDCG but drastically increases Average Popularity, it might be achieving relevance simply by recommending obvious hits, potentially at the cost of true personalization or discovery. Conversely, if relevance drops while Average Popularity plummets, the model might be recommending niche items that aren't actually relevant.
Pros and Cons of Average Popularity
Pros:
- Measures Popularity Bias: Directly quantifies the tendency towards recommending popular vs. niche items.
- Diagnoses Personalization Issues: Can help identify if a system is truly personalizing or just relying on global trends.
- Evaluates Serendipity/Novelty: Lower scores can indicate recommendations that might surprise and delight users with less mainstream content.
- Simple Concept: Easy to understand and calculate (once item popularity is defined).
Cons:
- Not a Relevance Metric: A high or low score says nothing about whether the recommendations were correct or useful for the specific user. An algorithm could recommend popular but irrelevant items, or niche but highly relevant ones.
- Dependent on Popularity Definition: The metric's value and interpretation heavily depend on how item popularity is calculated.
- Context is Crucial: Interpretation requires understanding business goals (e.g., maximize immediate clicks vs. foster long-term discovery).
- Can Be Skewed: A single hyper-popular item in a list can significantly inflate the average.
Average Popularity in the Context of Shaped
At Shaped, our primary focus is on optimizing the relevance and personalization of recommendations and search results. We leverage sophisticated machine learning models, often based on user interaction sequences and collaborative filtering principles, which inherently aim to capture individual user preferences beyond mere global popularity. Core metrics like NDCG, mAP, Precision@K, Recall@K, and AUC are central to how we evaluate model performance because they directly measure how well we connect users with items they are likely to engage with and find useful.
While Average Popularity isn't a primary optimization target within Shaped, it can serve as a valuable diagnostic metric. Since Shaped models are trained on interaction data (which can be used to derive popularity scores), this metric can be computed during analysis or A/B testing. Comparing Average Popularity between different models or user segments can provide insights into model behavior, helping ensure that improvements in relevance aren't solely due to an increased reliance on obvious bestsellers, but reflect genuine gains in personalization and the ability to rank relevant long-tail items effectively.
Conclusion: A Diagnostic Lens on Recommendation Bias
Average Popularity @ K offers a valuable lens for evaluating recommendation systems, shifting the focus from pure relevance to the type of items being recommended. It helps quantify the system's bias towards popular hits versus niche discoveries. While not a measure of correctness itself, it serves as an important diagnostic tool. When analyzed alongside core relevance metrics like NDCG or mAP, Average Popularity provides crucial context, helping you understand if your system is truly personalizing user experiences or simply amplifying what's already trending.
Want to build recommendation systems that go beyond popularity and deliver truly personalized relevance?
Request a demo of Shaped today to see how our platform focuses on optimizing core relevance metrics for superior user experiences. Or, start exploring immediately with our free trial sandbox.