NDCG
Normalized Discounted Cumulative Gain (NDCG) evaluates the relevance of a ranked list by considering both the ordering of relevant items and the weighted relevance of each item.
Consider these lists of ranking results, where the user bought an Apple Watch and Adidas shorts:
- List A: Nike sneakers, Adidas shorts, Apple Watch
- List B: Apple Watch, Adidas shorts, Nike sneakers
mAP correctly identified List B as superior because it ranked the relevant items higher. But what if the user had bought five pairs of Adidas shorts and only one Apple Watch? Does List B still seem like the best order? Standard mAP treats both items as equally relevant. To handle varying degrees of relevance and properly normalize scores, we turn to Normalized Discounted Cumulative Gain (NDCG).
Building Up to NDCG: CG, DCG, and IDCG
NDCG is best understood by breaking it down into its components:
- Cumulative Gain (CG):
- This is the simplest step. CG@k is just the sum of the relevance scores of the items in the top K positions. For binary relevance (relevant=1, irrelevant=0), CG@k is simply the count of relevant items in the top K (the numerator in Precision@K and Recall@K calculations).
- Limitation: CG ignores the ranking order entirely. List A and List B might have the same CG@3 if they contain the same relevant items within the top 3.
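As a quick sketch in Python (function names here are illustrative, not from any particular library), CG@k is a plain sum, which is exactly why it cannot distinguish the two orderings:

```python
def cg_at_k(relevances, k):
    """Cumulative Gain: sum of the relevance scores in the top k positions."""
    return sum(relevances[:k])

# Binary relevance for the example lists (1 = purchased, 0 = not):
list_a = [0, 1, 1]  # Nike(0), Adidas(1), Apple(1)
list_b = [1, 1, 0]  # Apple(1), Adidas(1), Nike(0)
print(cg_at_k(list_a, 3))  # 2
print(cg_at_k(list_b, 3))  # 2 -- same score despite the better ordering
```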
- Discounted Cumulative Gain (DCG):
- This is where ranking order starts to matter. DCG@k sums the relevance scores but applies a discount based on rank: items ranked lower contribute less to the total score. The standard discount factor is 1 / log2(rank + 1). Items at rank 1 receive their full relevance (log2(1+1) = 1), rank 2 is discounted by log2(3), rank 3 by log2(4), and so on.
- Formula: DCG@k = sum(relevance_i / log2(rank_i + 1)) for items i from 1 to K.
- Example (Binary Relevance):
- List A: [Nike(0), Adidas(1), Apple(1)]
DCG@3 = 0/log2(2) + 1/log2(3) + 1/log2(4) ≈ 0 + 0.631 + 0.5 = 1.131
- List B: [Apple(1), Adidas(1), Nike(0)]
DCG@3 = 1/log2(2) + 1/log2(3) + 0/log2(4) = 1 + 0.631 + 0 = 1.631
- Limitation: DCG incorporates rank, but the score itself isn't easily comparable. A DCG of 1.631 might be excellent for one query but poor for another, depending on the maximum possible score.
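The DCG computation above can be sketched as follows (a minimal illustration; `dcg_at_k` is a hypothetical helper, not a library function):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain with the standard 1/log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

list_a = [0, 1, 1]  # Nike(0), Adidas(1), Apple(1)
list_b = [1, 1, 0]  # Apple(1), Adidas(1), Nike(0)
print(round(dcg_at_k(list_a, 3), 3))  # 1.131
print(round(dcg_at_k(list_b, 3), 3))  # 1.631
```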
- Ideal Discounted Cumulative Gain (IDCG):
- To make DCG scores comparable, we need to know the best possible DCG score for that specific set of relevant items up to rank K. This is the IDCG@k. It's calculated by imagining the perfect ranking: sorting all relevant items by their relevance score (highest first) and placing them at the top of the list, then calculating the DCG for this ideal ordering up to position K.
- Example (Binary Relevance): Suppose the ground truth contains a third relevant item (another Apple product) that neither list predicted. The ideal top-3 ordering is then [Apple(1), Adidas(1), Apple(1)].
IDCG@3 = 1/log2(2) + 1/log2(3) + 1/log2(4) ≈ 1 + 0.631 + 0.5 = 2.131
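In code, IDCG is just DCG applied to the best possible ordering of the relevance scores (again a sketch with illustrative names):

```python
import math

def dcg_at_k(relevances, k):
    """DCG with the standard 1/log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def idcg_at_k(true_relevances, k):
    """Ideal DCG: the DCG of the relevance scores sorted best-first."""
    return dcg_at_k(sorted(true_relevances, reverse=True), k)

# Three relevant items in the ground truth, each with binary relevance 1:
print(round(idcg_at_k([1, 1, 1], 3), 3))  # 2.131
```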
Normalized Discounted Cumulative Gain (NDCG)
Now we can define NDCG:
NDCG@k = DCG@k / IDCG@k
By dividing the actual DCG by the maximum possible (ideal) DCG, we normalize the score to the range 0 to 1. An NDCG of 1 means the algorithm produced the perfect ranking order (at least up to K). An NDCG of 0 means none of the top K items had any relevance.
- Example (Binary Relevance):
- List A: NDCG@3 = DCG_A / IDCG = 1.131 / 2.131 ≈ 0.531
- List B: NDCG@3 = DCG_B / IDCG = 1.631 / 2.131 ≈ 0.765
NDCG clearly shows that List B achieved the better ranking for these items.
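Putting the pieces together, a minimal NDCG sketch reproduces both scores (`ndcg_at_k` is an illustrative name; production systems would typically rely on a library implementation):

```python
import math

def dcg_at_k(relevances, k):
    """DCG with the standard 1/log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, true_relevances, k):
    """NDCG = DCG of the predicted ordering / DCG of the ideal ordering."""
    idcg = dcg_at_k(sorted(true_relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

truth = [1, 1, 1]   # three relevant ground-truth items
list_a = [0, 1, 1]  # Nike(0), Adidas(1), Apple(1)
list_b = [1, 1, 0]  # Apple(1), Adidas(1), Nike(0)
print(round(ndcg_at_k(list_a, truth, 3), 3))  # 0.531
print(round(ndcg_at_k(list_b, truth, 3), 3))  # 0.765
```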
The Power of Graded Relevance
NDCG truly shines when relevance isn't just yes/no. Let's use the example where relevance is based on purchase quantity: Apple Watch (1), Adidas shorts (5), Nike sneakers (3).
- Calculate IDCG@3: The ideal order is Adidas (5), Nike (3), Apple (1).
IDCG@3 = 5/log2(2) + 3/log2(3) + 1/log2(4) ≈ 5/1 + 3/1.585 + 1/2 = 5 + 1.893 + 0.5 = 7.393
- Calculate DCG@3 for List A: [Nike(3), Adidas(5), Apple(1)]
DCG@3 = 3/log2(2) + 5/log2(3) + 1/log2(4) ≈ 3/1 + 5/1.585 + 1/2 = 3 + 3.154 + 0.5 = 6.654
- Calculate DCG@3 for List B: [Apple(1), Adidas(5), Nike(3)]
DCG@3 = 1/log2(2) + 5/log2(3) + 3/log2(4) ≈ 1/1 + 5/1.585 + 3/2 = 1 + 3.154 + 1.5 = 5.654
- Calculate NDCG@3:
- List A: NDCG@3 = 6.654 / 7.393 ≈ 0.90
- List B: NDCG@3 = 5.654 / 7.393 ≈ 0.76
With graded relevance, List A is now considered better by NDCG! Why? Both lists place the highly relevant Adidas shorts (score 5) at rank 2, so the difference comes down to the other two items: List A puts the moderately relevant Nike sneakers (score 3) at rank 1 and the less relevant Apple Watch (score 1) at rank 3, while List B does the reverse. Placing the higher-scored item earlier earns more gain before the discount takes effect, so List A wins. NDCG naturally handles these trade-offs based on the relevance scores provided.
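The same sketch handles graded relevance unchanged; only the scores differ (illustrative code, not a specific library's API):

```python
import math

def ndcg_at_k(relevances, true_relevances, k):
    """NDCG with arbitrary (graded) relevance scores."""
    def dcg(rels):
        return sum(r / math.log2(i + 1)
                   for i, r in enumerate(rels[:k], start=1))
    idcg = dcg(sorted(true_relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

# Graded relevance = purchase quantity:
# Apple Watch = 1, Adidas shorts = 5, Nike sneakers = 3
truth = [5, 3, 1]   # all relevant items, any order
list_a = [3, 5, 1]  # Nike, Adidas, Apple
list_b = [1, 5, 3]  # Apple, Adidas, Nike
print(f"{ndcg_at_k(list_a, truth, 3):.2f}")  # 0.90
print(f"{ndcg_at_k(list_b, truth, 3):.2f}")  # 0.76
```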
(Note: Obtaining accurate, reliable graded relevance scores in real-world scenarios can be challenging, which is why binary relevance is still common.)
Pros and Cons of NDCG
Pros:
- Rank-Sensitive: Heavily weights items at the top of the list.
- Handles Graded Relevance: Its biggest advantage; naturally incorporates different levels of item importance.
- Normalized Score: Allows for fair comparison across different queries, users, or lists of varying difficulty.
- Widely Adopted: A standard metric in information retrieval and recommendation evaluation.
Cons:
- Less Intuitive: Scores like 0.9 or 0.76 don't have a simple, direct explanation like Precision@K.
- Imperfect Discount: The logarithmic discount is standard but not necessarily the best discount function.
- Sensitive to K: The choice of K still impacts the score.
- Requires Relevance Scores: Needs ground truth relevance scores (binary or graded).
Evaluating with NDCG at Shaped
Understanding nuanced ranking quality and handling potential variations in item relevance is crucial for building sophisticated recommendation and search systems. At Shaped, NDCG is a core metric in our evaluation framework. We value its sensitivity to rank order and its ability to incorporate graded relevance signals when available.
By monitoring NDCG alongside mAP@K, Precision@K, Recall@K, and AUC, we provide ourselves and our customers with a robust, multi-faceted view of model performance. This enables fine-tuning strategies that optimize not just for clicks, but for overall ranking quality that reflects true user value.
Conclusion: A Sophisticated Measure of Ranking Quality
NDCG stands out as a powerful metric for evaluating ranked lists. Its core strengths lie in its sensitivity to rank order and its inherent ability to handle graded relevance scores, providing a more nuanced evaluation than metrics relying solely on binary relevance. By normalizing the score using an ideal ranking, NDCG allows for fair comparisons across diverse scenarios. While perhaps less immediately interpretable than simpler metrics, NDCG is an invaluable tool for optimizing the quality and effectiveness of recommendation and search systems where the order and degree of relevance truly matter.
Want to ensure your rankings are optimally ordered, even with varying item relevance?
Request a demo of Shaped today to see how NDCG and our comprehensive evaluation suite help build state-of-the-art discovery experiences. Or, start exploring immediately with our free trial sandbox.