【In short】
Average Precision (AP) measures, over a whole result list, whether the correct answers appear first: an overall measure of the positions (how soon, how early) at which the correct answers show up.
Reciprocal Rank (RR) measures the rank of the first correct answer, taken as a reciprocal: how soon (how early) does the first correct answer appear? The earlier it appears, the higher the RR.
The M (Mean) prefix means the metric is averaged over many queries, and that mean is used as the representative score.

【Example】When Google shows 10 results for a keyword query, the best possible outcome is that all 10 are relevant (correct) answers. But when only some of the results are correct (as is usually the case), a good system should show the correct answers first and push the less relevant ones to the end.
The AP value is exactly the concrete metric for this idea that correct results should come first across the whole list. AP@k is defined over the first k query results, computed from the order in which the correct ones appear.

Rank       1    2     3    4    5    6    7     8     9     10
Result     O    X     O    O    O    O    X     X     X     O
Recall     1/6  1/6   2/6  3/6  4/6  5/6  5/6   5/6   5/6   6/6
Precision  1/1  (1/2) 2/3  3/4  4/5  5/6  (5/7) (5/8) (5/9) 6/10

(Parenthesized precisions fall at wrong answers and are not counted in the AP.)

(Although Rankings #1 and #2 contain the same numbers of correct and wrong answers, the positions differ: in #1 the wrong retrievals appear in the later part of the list, so its AP is higher.)

Rank       1     2    3     4     5    6    7    8     9    10
Result 2   X     O    X     X     O    O    O    X     O    O
Recall     0/6   1/6  1/6   1/6   2/6  3/6  4/6  4/6   5/6  6/6
Precision  (0/1) 1/2  (1/3) (1/4) 2/5  3/6  4/7  (4/8) 5/9  6/10
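The AP of the two rankings above can be computed directly. Here is a minimal sketch (the function name is mine): AP is the mean of the precision values taken at each correct position.

```python
def average_precision(results):
    """AP of a ranked list of booleans (True = correct result)."""
    hits, total = 0, 0.0
    for pos, correct in enumerate(results, start=1):
        if correct:
            hits += 1
            total += hits / pos  # precision at this correct position
    return total / hits if hits else 0.0

ranking1 = [c == "O" for c in "OXOOOOXXXO"]
ranking2 = [c == "O" for c in "XOXXOOOXOO"]
print(average_precision(ranking1))  # ≈ 0.775
print(average_precision(ranking2))  # ≈ 0.521
```

As claimed above, Ranking #1 scores higher because its errors sit later in the list.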

 

The decision-support metric P@N calculates the fraction of the n recommendations that are good. The drawback of this metric is that it does not treat the recommended list as ordered: P@N considers the whole list as a set of items and weighs every error in the list equally. The goal, however, is to cut down errors in the first few elements (push wrong answers as far back as possible) rather than much later in the list.

Accordingly, we need a metric that weights errors at the top of the list heavily, and then gradually decreases the significance of errors as we go down the list.


Computing the precision this way means sub-dividing the recommendation list: we examine a new sub-list every time we reach a relevant item (each time a correct answer appears, the precision of the sub-list up to that point is recomputed). We calculate the precision on that current sub-list, and repeat for every sub-list until we reach the end of our recommendations. Now that we have a set of precisions, we average them to get the average precision for a single user.

The AP metric represents the area under the precision-recall curve. We get the precision-recall curve by computing the precision as a function of recall values. In this excellent lecture, the concept is expanded in great detail. (Recall-precision graphs are the standard way to compare search algorithms. To construct a standard recall-precision graph, we interpolate precision values, and average them over a large set of queries. The standard interpolation strategy is based on the assumption that precision always decreases as recall grows.)

To compare two systems we want the largest possible area under the PR curve. 

In the above example we compare systems A, B and C. We notice that system A is better than system C for all levels of recall. However, systems A and B intersect: system B does better at higher levels of recall. The problem with this scenario is that it is hard to determine which system does better overall; plots are harder to interpret than single metrics. This is why researchers came up with a single metric to approximate the Average Precision (i.e. the area under the precision-recall curve). Here is my annotated approximation I adapted from the wikipedia page that describes this process:
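In symbols, the standard finite-sum approximation is

AP = \sum_{k=1}^{n} P(k)\,\Delta r(k)

where P(k) is the precision at cutoff k and \Delta r(k) is the change in recall from item k-1 to item k (the same sum-of-precision-times-recall-change rule spelled out in the Matlab section below).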


A point worth pondering is what we are actually averaging in MAP.

>> It means averaging noisy signals across many users (queries).

Below is a plot of the noise that is common across many users. (This presentation goes into more detail about the issue.) This concern is useful to keep in mind when interpreting the MAP score. In the plot below, the bright red line is the average PR curve; each of the other curves belongs to one user out of a list of N users. Effectiveness can indeed vary widely across queries, so the MAP averaging will undoubtedly have an effect on the reported performance. Such sample curves can help evaluate the quality of the MAP metric.


MAP Pros

  • Gives a single metric that represents the complex area under the precision-recall curve: the average precision per list.
  • Handles the ranking of recommended lists naturally, in contrast to metrics that treat the retrieved items as sets.
  • Gives more weight to errors that happen high up in the recommended list and, conversely, less weight to errors deeper in the list. This matches the need to show as many relevant items as possible high up the recommended list.

MAP Cons

  • This metric shines for binary (relevant/non-relevant) ratings. However, it is not fit for fine-grained numerical ratings, from which it cannot extract an error measure.
  • With fine-grained ratings, for example on a scale from 1 to 5 stars, the evaluation would first need to threshold the ratings into binary relevance; one option is to consider only ratings bigger than 4 as relevant. The manual threshold introduces bias into the evaluation metric, and we also throw away the fine-grained information: the difference between a 4-star and a 5-star rating, as well as the information in the non-relevant items. Is a 1-star rating really the same as a 3-star rating?

【Reference】https://medium.com/swlh/rank-aware-recsys-evaluation-metrics-5191bba16832

 

---------------MRR------------

Mean Reciprocal Rank (the mean of rank reciprocals) is a measure for evaluating the quality of the answers returned by information-retrieval queries.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1/rank_i

where rank_i is the rank position of the first correct answer for the i-th query.

 

Reciprocal rank captures how early a relevant result appears in the ranked results of a query. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.

• perfect = 1 → worse → 0.

MRR is a measurement of how correct the results are, using the top k returns as the representative sample. A query scores whenever a correct answer appears in the top k returns, counted as the reciprocal of its rank: the earlier it appears, the higher the RR; RR = 1 if it appears in first place, and RR = 0 if no correct answer appears in the top k at all. Averaging the RR over n trials then gives the MRR.

【Example 1】Suppose we want to evaluate a "system" that turns English nouns into their plural forms. The experimental results are as follows:

Query   Top 3 Returns           Correct response   Rank   Reciprocal rank
cat     catten, cati, cats      cats               3      1/3
torus   torii, tori, toruses    tori               2      1/2
virus   viruses, virii, viri    viruses            1      1

Top 3 MRR = (1/3 + 1/2 + 1)/3 ≈ 0.61

【Example 2】
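The computation above is easy to reproduce with a short script (a sketch; the function names are mine, not from the article):

```python
def reciprocal_rank(returns, correct):
    """1/rank of the first correct answer, or 0 if it is absent."""
    for rank, answer in enumerate(returns, start=1):
        if answer == correct:
            return 1.0 / rank
    return 0.0

queries = [
    (["catten", "cati", "cats"], "cats"),       # RR = 1/3
    (["torii", "tori", "toruses"], "tori"),     # RR = 1/2
    (["viruses", "virii", "viri"], "viruses"),  # RR = 1
]
rrs = [reciprocal_rank(returns, correct) for returns, correct in queries]
mrr = sum(rrs) / len(rrs)
print(mrr)  # ≈ 0.611
```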

MRR Pros

  • This method is simple to compute and is easy to interpret.
  • This method puts a high focus on the first relevant element of the list. It is best suited for targeted searches such as users asking for the “best item for me”.
  • Good for known-item search such as navigational queries or looking for a fact.

MRR Cons

  • The MRR metric does not evaluate the rest of the list of recommended items. It focuses on a single item from the list.
  • It gives a list with a single relevant item just as much weight as a list with many relevant items. That is fine if that is the target of the evaluation.
  • This might not be a good evaluation metric for users that want a list of related items to browse. The goal of the users might be to compare multiple related items.

-------------------

 

Normalized Discounted Cumulative Gain

 

The goal of the MAP measure is similar to the goal of the NDCG metric. They both value putting highly relevant documents high up the recommended lists. However, the NDCG further refines the evaluation of recommended lists: it is able to use the fact that some documents are "more" relevant than others. Highly relevant items should come before medium relevant items, which should come before non-relevant items.

I provide the following annotated diagram that shows the stages of calculating the NDCG linearly:

The cumulative gain, CG, is a basic measure that accumulates the graded relevances. This metric does not take into account the position of the elements in the ranked list. For ranking tasks, we need to increase the relative impact of the position of elements in the ranked list. The standard Discounted Cumulative Gain, DCG, adds a logarithmic reduction factor that penalizes the relevance score in proportion to the position of the item. Furthermore, in industrial applications it is common to boost the relevance scores to emphasize retrieving highly relevant documents; this appears in the industrial DCG formula.

We are dealing with dynamic systems. Users will get a variable number of relevant items recommended. This makes the DCG measure not comparable across users. We need to normalize the metric to be between 0 and 1. To this effect, we determine the ideal ranking for a user. Then we use that ranking as the Ideal Discounted Cumulative Gain IDCG. This provides a nice normalization factor. It helps compute the Normalized Discounted Cumulative Gain. As this is a per-user metric, we need to calculate this metric for all users in the test set. Then we average across users to get a single number. This average is then used for comparing recsys systems to each other. To visualize this process, we go through the calculation in the figure below with the predicted and ideal ranking for a single user.
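As a sketch of the per-user calculation just described (using the standard rel/log2(pos+1) discount; the boosted "industrial" variant would put 2**rel - 1 in the numerator instead):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a logarithmic position discount."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the ideal (sorted-descending) ranking's DCG."""
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0  # NDCG = 0 when IDCG = 0

print(ndcg([3, 3, 2, 2, 1, 0]))  # already ideal -> 1.0
print(ndcg([0, 1, 2, 3]))        # relevant items too late -> below 1
```

Averaging this per-user number over the test set gives the single figure used to compare recsys systems.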

 

NDCG Pros

  • The primary advantage of the NDCG is that it takes into account the graded relevance values. When they are available in the dataset, the NDCG is a good fit.
  • Compared to the MAP metric it does a good job at evaluating the position of ranked items. It operates beyond the binary relevant/non-relevant scenario.
  • The smooth logarithmic discounting factor has a good theoretical basis discussed here. The authors of that work show that, for every pair of substantially different ranking recommenders, the NDCG metric is consistently able to determine the better one.

NDCG Cons

  • The NDCG has some issues with partial feedback, which happens when we have incomplete ratings. This is the case for the majority of recommender-system situations; if we had complete ratings there would be no real task to achieve! In this case, the recsys system owner needs to decide how to impute the missing ratings. Setting the missing values to 0 marks them as irrelevant items; other calculated values such as the mean/median rating for a user can also help with this drawback.
  • Next, the user needs to manually handle the case where the IDCG is equal to zero. This occurs when users have no relevant documents. A strategy here is to set the NDCG to 0 as well.
  • Another issue is handling NDCG@K. The size of the ranked list returned by the recsys system can be less than K. To handle this we can consider fixed-size result sets and pad the smaller sets with minimum scores.

The primary advantage of the NDCG is that it takes into account graded relevance values. If your dataset has the right form and you are dealing with graded relevance, then the NDCG measure is your go-to metric.

-----------------------------------------------------------------------------
A few questions to think about:

【Question 1】How large should k be when compiling the experimental statistics?

【Question 2】If the same system is evaluated at Top 3, Top 5 and Top 10, how do the numbers change, and what does the change mean?

【Question 3】How large an MRR counts as good accuracy (under different values of k)?

-----------------------------------------------------------------------------  

 

【Extension】Let us look again at what AP and MAP mean.
Suppose we give the query rose to an "image search system" to find pictures of the flower. We get back a list of images ranked by relevance (correctness). The list is usually not all correct; it also contains results that look related but are in fact wrong. So we mark a correct result (a rose) with 1 and a wrong one with 0. The outcome might be: 0, 0, 1, 1, 1, 1. We then compute the precision at each position: 0/1, 0/2, 1/3, 2/4, 3/5, 4/6.

 

The precision at each position answers: of the results seen so far, what fraction are correct (numerator: the number of correct results seen so far; denominator: the position). Averaging the precisions at the four correct positions gives AP = (1/3 + 2/4 + 3/5 + 4/6)/4 = 0.525.

 

【Think】Although 4 of the 6 results are correct (a 2/3 hit rate by simple count), the AP is only slightly above 0.5. That is because most of the correct results appear near the end of the list, and this is exactly what AP is sensitive to: not only the number of correct answers affects the AP value; the positions at which they appear do too.
【Summary】AP is the concrete metric for the idea that correct results should come first across the whole list. Next, consider a few special cases:
【Example】0, 1, 0, 1, 0, 1, ...: every 2nd result is correct, AP = 0.5.
【Example】0, 0, 1, 0, 0, 1, 0, 0, 1, ...: every 3rd result is correct, AP ≈ 0.33.
【Example】0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...: every 10th result is correct, AP = 0.1.
Seen the other way around, when AP > 0.5 we find more correct results, and fewer wrong answers, toward the front of the list. MAP extends AP by averaging the AP over the individual queries, with the same interpretation.
Think again: does raising MAP from 0.2 (every 5th) to 0.4 (every 2nd/3rd) represent the same degree of improvement as raising it from 0.7 to 0.9?
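The periodic cases above are easy to verify numerically. A small sketch, using the usual definition of AP as the mean of the precision values taken at each hit:

```python
def ap(rels):
    """AP: mean of the precision values at each correct position."""
    hits, total = 0, 0.0
    for pos, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / pos
    return total / hits

print(ap([0, 1] * 50))           # every 2nd correct  -> 0.5
print(ap([0, 0, 1] * 50))        # every 3rd correct  -> ~0.333
print(ap(([0] * 9 + [1]) * 50))  # every 10th correct -> 0.1
```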

 

Take Top 5. Suppose that a correct answer always appears, and that over repeated trials its position is uniformly distributed over ranks 1 to k (one correct answer at each rank, from the best case to the worst). Under this assumption, consider the effect of k:

Top 3:  MRR = (0.33 + 0.5 + 1)/3 = 0.611
Top 5:  MRR = (0.2 + 0.25 + 0.33 + 0.5 + 1)/5 = 0.456
Top 10: MRR = (........)/10 = 0.293

Dragging this out to Top 100 in Excel gives the MRR curve below:
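The Excel curve can be reproduced directly: under the uniform assumption, the expected MRR at cutoff k is the k-th harmonic number divided by k. A sketch (the function name is mine; the exact Top 5 value is 0.457, the article's 0.456 comes from rounding the inputs):

```python
def uniform_mrr(k):
    """Expected MRR when the first correct answer is uniform over ranks 1..k."""
    return sum(1.0 / rank for rank in range(1, k + 1)) / k

for k in (3, 5, 10, 30, 100):
    print(k, round(uniform_mrr(k), 3))  # 0.611, 0.457, 0.293, then flattening
```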
【Note】Do not forget: this curve assumes the answers are uniformly distributed over the system's Top k responses. What does that mean, and how should it be interpreted? First, how large a k is appropriate? It depends on your needs. Take an experience we all share, Google: 10 results per page, and you typically look through 3 to 5 pages. That is an empirical value, varying from person to person and with how strongly you want the answer; in other words, correct answers beyond roughly k > 50 are no longer of much use. Top 100 is about the largest value the interface of the paper-search system I use lets a user set, probably limited by screen size; anything larger would be impractical. Even so, I often scan 300 or 500 entries, although the title alone usually decides, and a borderline case needs only a glance at the abstract before I choose whether to download. From these two experiences, Top 5 and Top 10 both look like sensible settings. Next, think: what distinguishes the MRR at Top 5 from the MRR at Top 10? First we must pin down the assumption behind the figure: what kind of system has its answers uniformly distributed over the Top k responses?

It is not easy to define, but let us try.

1. It is a system with no "personality": the answers are spread uniformly (uniform distribution) over k. On such a curve, going beyond Top 30 adds little, because the slope hardly changes any more. So let us cut at Top 30 or Top 50 (whatever); beyond Top 30 there is little research value, which also matches our shared Google experience.

2. What about a system with "personality"? A good system keeps its MRR high (perfect = 1) even if you take k = 30, 50 or 100: its answers always stay near the front, so the MRR stays high. Since RR = 0 when no correct answer appears in the top k, an MRR that holds up even at large k shows a system that is not only accurate but also stable (reliable). A bad system has a low MRR, 0 at the extreme; "stably bad" means MRR = 0 even when k reaches 100!

After this bit of thinking, here is a notable observation: experiments usually compare two systems at Top 5 and Top 10 and declare the one with the higher MRR better. Is that really so meaningful?

By the above, Top 5 and Top 10 do not actually differ much (or do they???).

Perhaps the procedure should be reversed: derive a roughly required MRR value from the needs first, then run the two systems and see which reaches it with the smaller k. In other words, compare systems at a few specific MRR values (question: which values are appropriate?). Would that be a better approach? Is anything wrong with it?

  

According to reference [1], k = 5 is proposed (the task is precision-oriented: only look at the top 5 answers); I think 10 or 20 would do just as well. What I am really asking is whether an alternative to the traditional way of ranking systems is feasible:

A. 【Traditional】Fix k and compare the MRR to decide which system is better (k is the independent variable, MRR the dependent one).
B. 【New】Fix the MRR and compare which system needs the smaller k (MRR is the independent variable, k the dependent one).

Since no one seems to do this, I am not sure whether the idea is sound. Perhaps it is because an experiment with MRR as the independent variable is difficult and costly:

1. Specifying an MRR value is harder, and more contentious, than specifying k.
2. Reaching a specified MRR in order to obtain k may require many experimental runs, which is expensive.

In difficulty and cost, method B far exceeds the traditional method A. So perhaps this design is simply too much work for too little reward, which is why it is not seen in practice.

【Reference】
Simone Teufel, Lecture 7: Question Answering, Natural Language and Information Processing (NLIP) Group
http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/
http://zh.scribd.com/doc/81208516/L10Evaluation
http://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/

【Matlab】A worked example of computing MAP

The formula is: sum over i = 1:x of (precision at i × change in recall at i). Precision at i is the percentage of correct items among the first i recommendations. The change in recall is 1/x if the item at i is correct (for every correct item), otherwise zero. This assumes the number of relevant items is bigger than or equal to x: r >= x. If not, the change in recall is 1/r for each correct i, not 1/x. You might want to look at some test cases (in Matlab) to check your understanding of how to calculate AP.

 

function testAveragePrecisionAtK()
%TESTAVERAGEPRECISIONATK Test cases for AP@K
%
% Author: Ben Hamner (ben@benhamner.com)
% Note: requires averagePrecisionAtK on the Matlab path (not listed here).

fprintf('Testing averagePrecisionAtK ...');

actual = 1:5;
prediction = 1:10;
score = averagePrecisionAtK(actual, prediction);
assert(abs(1-score) < eps);

test_case(1:5, [6 4 7 1 2], 2, 0.25);
test_case(1:5, [1 1 1 1 1], 5, 0.2);
test_case(1:100, [1:20 200:600], 20, 1);
test_case([1 3], 1:5, 3, 5/6);
test_case([1 2 3], [1 1 1], 3, 1/3);
test_case([1 2 3], [1 2 1], 3, 2/3);

fprintf('tests passed\n');

function test_case(actual, prediction, k, expected_score)
score = averagePrecisionAtK(actual, prediction, k);
assert(abs(expected_score-score) < eps);

 

For example: test_case(1:5, [6 4 7 1 2], 2, 0.25). This means that the actual items were 1:5, that is [1, 2, 3, 4, 5]. We recommend [6, 4, 7, 1, 2], so we get 4, 1 and 2 correct, but with some incorrect guesses in between. It is AP@2, so in fact only the first two predictions matter: 6 and 4. The first is wrong, so precision@1 is 0; the second is right, so precision@2 is 0.5. The change in recall is 0.5 in both cases (that is 1/x), so AP@2 = 0 × 0.5 + 0.5 × 0.5 = 0.25.
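For readers without Matlab, here is an equivalent Python sketch of AP@K (matching the definition above; the min(r, k) denominator implements the 1/x versus 1/r rule, and repeated predictions count only once):

```python
def apk(actual, predicted, k=10):
    """AP@K over the first k predictions; duplicate hits are ignored."""
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for pos, p in enumerate(predicted, start=1):
        if p in actual and p not in predicted[:pos - 1]:
            hits += 1
            score += hits / pos  # precision at this correct position
    return score / min(len(actual), k)

print(apk(list(range(1, 6)), [6, 4, 7, 1, 2], k=2))  # 0.25, as in the example
```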

Posted by Jason (The Dance of Disorder / Fluctuations of Entropy) on 痞客邦.