mAP
Mean Average Precision (mAP) evaluates the relevance of a ranked list while taking the ordering of the relevant items into account. Consider this scenario:
- Algorithm A shows: Nike sneakers, Adidas shorts, Apple Watch
- Algorithm B shows: Apple Watch, Adidas shorts, Nike sneakers
Both lists contain the exact same items. If a user's purchase history (our ground truth for relevance) shows they bought the Apple Watch and Adidas shorts, both algorithms achieve identical Precision@3 (2/3) and identical Recall@3 (assuming these are the only 2 of, say, 6 total relevant items that appear in the list, Recall@3 would be 2/6 for both).
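To make that concrete, here is a minimal Python sketch (not Shaped's implementation; the item names and the assumed total of 6 relevant items come straight from the example above) showing that Precision@3 and Recall@3 cannot tell the two rankings apart:

```python
# Minimal sketch: Precision@K and Recall@K for the two example rankings.
# The catalog-wide total of 6 relevant items is the assumption made above.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def recall_at_k(ranked, relevant, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(item in relevant for item in ranked[:k]) / total_relevant

algorithm_a = ["Nike sneakers", "Adidas shorts", "Apple Watch"]
algorithm_b = ["Apple Watch", "Adidas shorts", "Nike sneakers"]
purchased = {"Apple Watch", "Adidas shorts"}  # ground truth from purchase history

for name, ranking in [("A", algorithm_a), ("B", algorithm_b)]:
    p = precision_at_k(ranking, purchased, k=3)
    r = recall_at_k(ranking, purchased, k=3, total_relevant=6)
    print(f"Algorithm {name}: Precision@3 = {p:.2f}, Recall@3 = {r:.2f}")
# Both algorithms print Precision@3 = 0.67 and Recall@3 = 0.33, i.e. identical scores.
```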
So, are both algorithms equally good? Intuitively, no. Algorithm B seems better because it placed the relevant items (Apple Watch, Adidas shorts) right at the top (positions 1 and 2), while Algorithm A pushed them down to positions 2 and 3. Users typically engage more with items at the beginning of a list, whether it's scrolling a feed or scanning a product carousel. We need a metric that rewards putting relevant items higher up – this is where Mean Average Precision (mAP) shines.
From Precision@K to Average Precision (AP)
mAP builds directly on the concept of Precision@K but cleverly incorporates rank order. Instead of calculating precision just once at rank K, Average Precision (AP) calculates Precision@k at every position k where a relevant item is found, and then averages these precision scores.
Let's break it down for our example (Relevant items: Apple Watch, Adidas shorts):
- Algorithm A: [Nike sneakers (Irrelevant), Adidas shorts (Relevant), Apple Watch (Relevant)]
  - Relevant item found at rank k=2 (Adidas shorts). Precision@2 = (1 relevant item) / 2 items = 0.5
  - Relevant item found at rank k=3 (Apple Watch). Precision@3 = (2 relevant items) / 3 items ≈ 0.67
  - Average these precision scores: AP for A = (0.5 + 0.67) / 2 ≈ 0.58 (Note: We divide by 2 because there were 2 relevant items found in the list)
- Algorithm B: [Apple Watch (Relevant), Adidas shorts (Relevant), Nike sneakers (Irrelevant)]
  - Relevant item found at rank k=1 (Apple Watch). Precision@1 = (1 relevant item) / 1 item = 1.0
  - Relevant item found at rank k=2 (Adidas shorts). Precision@2 = (2 relevant items) / 2 items = 1.0
  - Average these precision scores: AP for B = (1.0 + 1.0) / 2 = 1.0
As you can see, Algorithm B achieves a perfect AP score of 1.0 because all relevant items were ranked at the very top. Algorithm A is penalized because the irrelevant item (Nike sneakers) at position 1 drags down the precision scores calculated at positions 2 and 3. AP effectively rewards models that place relevant items earlier in the list.
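Here is a minimal Python sketch of the AP calculation described above (dividing by the number of relevant items found in the list, as in the worked example); it is illustrative rather than a production implementation:

```python
# Minimal sketch of Average Precision: take Precision@k at every rank k where
# a relevant item appears, then average over the relevant items found.

def average_precision(ranked, relevant):
    hits = 0
    precisions = []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)  # Precision@k at this relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

purchased = {"Apple Watch", "Adidas shorts"}
print(average_precision(["Nike sneakers", "Adidas shorts", "Apple Watch"], purchased))  # ≈ 0.58
print(average_precision(["Apple Watch", "Adidas shorts", "Nike sneakers"], purchased))  # 1.0
```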
What is Mean Average Precision (mAP)?
Average Precision (AP) gives us the score for a single ranked list (e.g., for one user or one query). To get an overall performance measure across all users or queries in your test set, you simply calculate the AP for each list and then compute the mean of all those AP scores. That's Mean Average Precision (mAP).
mAP = (Sum of AP scores for all lists) / (Number of lists)
(In practice, the term AP is often used when discussing the concept for a single list, while mAP refers to the version averaged across all lists.)
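As a sketch of that averaging step, reusing the average_precision function from the previous snippet (the second user's ranking and ground truth here are purely illustrative):

```python
# Minimal sketch: mAP is simply the mean of the per-list AP scores.

def mean_average_precision(ranked_lists, relevant_sets):
    ap_scores = [
        average_precision(ranked, relevant)
        for ranked, relevant in zip(ranked_lists, relevant_sets)
    ]
    return sum(ap_scores) / len(ap_scores)

# Two users, each with their own ranked list and ground-truth relevant items
rankings = [
    ["Apple Watch", "Adidas shorts", "Nike sneakers"],  # user 1 (from the example)
    ["Nike sneakers", "Adidas shorts", "Apple Watch"],  # user 2 (illustrative)
]
ground_truth = [{"Apple Watch", "Adidas shorts"}, {"Adidas shorts"}]
print(mean_average_precision(rankings, ground_truth))  # (1.0 + 0.5) / 2 = 0.75
```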
Why Use mAP? (Pros)
- Rank-Sensitive: Its primary advantage. Unlike P@K and R@K, mAP heavily rewards placing relevant items higher in the ranking.
- Considers Overall Ranking: It evaluates the ordering of relevant items throughout the list, not just within a fixed top K or only the first relevant item (like MRR).
- Provides a Single Metric: Summarizes ranking quality sensitive to order into one number.
- Widely Used & Understood: A standard metric in information retrieval and recommendation system evaluation.
- Interpretability Link: mAP is related to the area under the precision-recall curve, providing a connection to another common evaluation perspective.
Limitations of mAP (Cons)
- Less Directly Interpretable than P@K: While P@3 = 0.67 clearly means "2 out of the top 3 were relevant", mAP = 0.58 is less intuitive to explain in simple terms.
- Binary Relevance: Standard mAP assumes items are either relevant (1) or not (0). It doesn't naturally handle degrees of relevance (e.g., "highly relevant" vs. "somewhat relevant") as elegantly as metrics like NDCG.
- Can Still Be Influenced by Number of Relevant Items: While averaging helps, AP scores for users with very few relevant items might behave differently than those for users with many.
Evaluating Ranking Order with mAP at Shaped
At Shaped, we know that the order of recommendations and search results is often just as important as the items themselves. Getting relevant content in front of users quickly drives engagement and satisfaction. That's why Mean Average Precision (mAP) is a key metric we compute and monitor for models trained using our platform.
Using mAP allows us and our customers to assess how well models rank relevant items towards the top of the list, going beyond simple hit counts. We use mAP alongside other vital metrics like Precision@K, Recall@K, AUC, and NDCG to provide a comprehensive evaluation suite, ensuring a deep understanding of model performance from multiple angles.
Conclusion: Rewarding the Right Order
When the sequence of recommendations matters, metrics like Precision@K and Recall@K only tell part of the story. Mean Average Precision (mAP) steps in to fill the gap by explicitly evaluating and rewarding the ranking order. By averaging precision scores at each relevant item's position, it provides a powerful, rank-sensitive metric that reflects the quality of the entire ordered list. While often used with other metrics for a complete picture, mAP is an indispensable tool for anyone serious about optimizing the order of their recommendations and search results.
Ready to optimize not just what you recommend, but how you rank it?
Request a demo of Shaped today to learn how we leverage mAP and other advanced metrics to build high-performing, rank-aware discovery models. Or, start exploring immediately with our free trial sandbox.