Recall and Precision
Recall
The fraction of the relevant documents (R) that have been retrieved
Precision
The fraction of the retrieved documents (A) that are relevant
$\text{Recall} = \frac{|R_a|}{|R|}, \qquad \text{Precision} = \frac{|R_a|}{|A|}$
Precision versus recall curve
Usually, the higher the recall, the lower the precision (a tradeoff)
Top-k Precision (Precision at k, P@k)
Precision evaluated on a ranked result list
The precision value of the top-k results
If 2 of the top 5 results are relevant, P@5 = 2/5 = 40% (see the sketch below)
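A minimal P@k sketch in Python, assuming the ranked list is given as boolean relevance flags; `precision_at_k` is an illustrative helper, not a standard library function.

```python
def precision_at_k(relevant_flags, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(relevant_flags[:k]) / k

# 2 of the top 5 results are relevant -> P@5 = 2/5 = 0.4
print(precision_at_k([True, False, True, False, False], k=5))
```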
Average Over Multiple Queries
$\bar{P}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} P_i(r)$
Average precision at the recall level $r$
$N_q$ = number of queries used
$P_i(r)$ = precision at recall level $r$ for the $i$-th query
Interpolated precision
What does interpolation mean?
The (P, R) pairs below are the points after reading the 3rd, 8th, and 15th documents
$P(r_i) = \max_{r_i \le r \le r_{i+1}} P(r)$
Single Value Summaries
Average precision can hide anomalous behavior of an algorithm
We also need to know the performance on specific queries
The single value should be interpreted as a summary of the corresponding precision versus recall curve
Average Precision
The average of the precision values obtained after each new relevant document is observed
e.g. (1 + 0.66 + 0.5 + 0.4 + 0.3) / 5 = 0.57
Mean Average Precision (MAP)
Average of the precision values obtained for the top k documents, taken each time a relevant document is retrieved
Avoids interpolation and the use of fixed recall levels
$\text{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} P(R_{jk})$
R-Precision
break-even point: recall = precision
The precision at the R-th position in the ranking
R : the total number of relevant documents of the current query
Comparison
Ranking 1: R N R N N N N N R R (R = relevant, N = non-relevant)
MAP = (1 + 2/3 + 3/9 + 4/10) / 4 = 0.6
Ranking 2: N R N N R R R N N N
MAP = (1/2 + 2/5 + 3/6 + 4/7) / 4 = 0.49
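A sketch of (non-interpolated) average precision that reproduces the two rankings above; `average_precision` is an illustrative name, and MAP over a query set is just the mean of these per-query values.

```python
def average_precision(ranking, n_relevant):
    """Mean of the precision values taken at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            total += hits / rank
    return total / n_relevant

print(average_precision("RNRNNNNNRR", n_relevant=4))  # (1 + 2/3 + 3/9 + 4/10) / 4 = 0.6
print(average_precision("NRNNRRRNNN", n_relevant=4))  # (1/2 + 2/5 + 3/6 + 4/7) / 4 ≈ 0.49
```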
MRR: Mean Reciprocal Rank
Rank of the first correct answer
$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$
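A minimal MRR sketch; the ranks passed in are illustrative values, not taken from the notes.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR: average of 1/rank_i, where rank_i is the rank of the first correct answer."""
    return sum(1 / r for r in first_correct_ranks) / len(first_correct_ranks)

# First correct answer at ranks 1, 2, and 4 for three queries (illustrative)
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.58
```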
Precision-Recall
Microaveraging
Each instance has equal weight
The largest classes have the most influence
Micro-average precision: $4/9 = 0.44$
Macroaveraging
Each class has equal weight
Orange: $1/5 = 0.20$
Lemon: $1/2 = 0.50$
Apple: $2/2 = 1.00$
Macro-average precision: $(0.20 + 0.50 + 1.00)/3 = 0.57$
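A sketch that reproduces the micro/macro figures above, assuming the per-class counts implied by the example (Orange 1 correct of 5 predicted, Lemon 1 of 2, Apple 2 of 2).

```python
# (TP, number of predictions) per class, taken from the example above
classes = {"Orange": (1, 5), "Lemon": (1, 2), "Apple": (2, 2)}

micro = sum(tp for tp, _ in classes.values()) / sum(n for _, n in classes.values())
macro = sum(tp / n for tp, n in classes.values()) / len(classes)

print(round(micro, 2))  # 4/9 = 0.44  (each instance weighted equally)
print(round(macro, 2))  # (0.20 + 0.50 + 1.00) / 3 = 0.57  (each class weighted equally)
```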
P/R Applicability
Computing maximum recall requires background knowledge of the whole collection (which documents are relevant)
In practice a gold-standard answer set is rarely available, so recall is hard to compute in real cases
Recall and precision are relative measures; they are best used together
F-score
The Harmonic Mean, F-measure
$F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}, \quad F \in [0, 1]$
Example
$\frac{2}{\frac{1}{0.7} + \frac{1}{0.4}} = 0.509, \qquad \frac{2}{\frac{1}{0.7} + \frac{1}{0.5}} = 0.583$ (harmonic means)
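A quick check of the harmonic-mean example; since the harmonic mean is symmetric, it does not matter which argument plays precision and which plays recall.

```python
def f_measure(p, r):
    """F1: harmonic mean of precision and recall."""
    return 2 / (1 / p + 1 / r)

print(round(f_measure(0.7, 0.4), 3))  # 0.509
print(round(f_measure(0.7, 0.5), 3))  # 0.583
```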
User-Oriented Measure
In different domains, recall/precision can be adapted into different scores
e.g. coverage / novelty
Higher coverage: more of the documents the user expected are found
Higher novelty: more of the retrieved relevant documents were previously unknown to the user
Alternative Measures (confusion matrix)
Predictions vs. ground truth are split into four cells: true/false positives and true/false negatives
Recall and precision can then be rewritten as
$\text{Recall} = \frac{|R_a|}{|R|} = \frac{TP}{TP + FN}$
$\text{Precision} = \frac{|R_a|}{|A|} = \frac{TP}{TP + FP}$
Two additional measures:
$\text{Accuracy} = \frac{TP + TN}{N}$
$\text{Specificity} = \frac{TN}{TN + FP}$
Specificity can be seen as recall on the negative class
Not very useful in retrieval, since TN is always very large
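A small helper computing all four measures from confusion-matrix counts; the counts in the example call are illustrative only.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Recall, precision, accuracy, and specificity from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "recall":      tp / (tp + fn),   # sensitivity
        "precision":   tp / (tp + fp),
        "accuracy":    (tp + tn) / n,
        "specificity": tn / (tn + fp),   # "negative recall"
    }

print(confusion_metrics(tp=40, fp=10, fn=20, tn=30))  # illustrative counts
```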
Example
100% sensitivity means every sick person is identified correctly
100% specificity means every healthy person is identified correctly
Limitation of Accuracy
Suppose there are 9990 examples of class 0
and only 10 examples of class 1
A model that always predicts class 0 achieves
Accuracy = 9990/10000 = 99.9%
Cost Matrix
A cost matrix has almost the same layout as a confusion matrix,
but assigns a larger weight (cost) to false negatives
Example
Use the cost matrix above as the scoring standard,
where a false negative (predicted healthy, but actually sick) has cost 100
The left prediction gives
acc = (150 + 250) / (150 + 250 + 60 + 40) = 80%
cost = 150 × (−1) + 40 × 100 + 60 × 1 + 250 × 0 = 3910
The right prediction gives
acc = (250 + 200) / (250 + 200 + 5 + 45) = 90%
cost = 250 × (−1) + 45 × 100 + 5 × 1 + 200 × 0 = 4255
Note that although the right model has higher accuracy, it also has higher cost
In many situations, e.g. in medicine, the higher-cost outcome may be unacceptable
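A sketch reproducing the cost-matrix example; the cell costs (TP = −1, FN = 100, FP = 1, TN = 0) and the two confusion matrices are read off the computations above.

```python
COST = {"TP": -1, "FN": 100, "FP": 1, "TN": 0}  # cost matrix from the example

def accuracy_and_cost(counts):
    """Accuracy and total cost of a prediction, given its confusion-matrix counts."""
    acc = (counts["TP"] + counts["TN"]) / sum(counts.values())
    cost = sum(COST[cell] * n for cell, n in counts.items())
    return acc, cost

left  = {"TP": 150, "FN": 40, "FP": 60, "TN": 250}
right = {"TP": 250, "FN": 45, "FP": 5,  "TN": 200}
print(accuracy_and_cost(left))   # (0.8, 3910)
print(accuracy_and_cost(right))  # (0.9, 4255) -- higher accuracy, but also higher cost
```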
ROC
ROC : Receiver Operating Characteristic
ROC is a plot of sensitivity vs. 1-specificity for a binary classifier
ROC curve
Equivalently, the ROC curve plots the TP fraction (TPR) against the FP fraction (FPR)
The area under the ROC curve is called the "AUC"
Example
If the ranked output is ppppppppppnnnnnnnnnn (all positives ahead of all negatives), the curve has a Γ shape
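A minimal sketch of how the ROC points are traced by sweeping a cutoff down the ranked list; the 'p'/'n' labels and the helper name are illustrative.

```python
def roc_points(ranked_labels):
    """(FPR, TPR) pairs obtained by cutting the ranked list after each position."""
    pos, neg = ranked_labels.count("p"), ranked_labels.count("n")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for label in ranked_labels:
        if label == "p":
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# All positives ranked first: the curve goes straight up, then right (the Γ shape), AUC = 1
print(roc_points("pppppppppp" + "nnnnnnnnnn"))
```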
Issues
ROC curves are insensitive to changes in class distribution
TPR and FPR are both strict column-wise ratios of the confusion matrix
So for imbalanced data the curve looks much the same, because only these ratios matter
ROC measures the ability of a classifier to produce good relative scores.
It only needs to produce relatively accurate scores that discriminate positive from negative instances
Questions
Q : What is the relationship between the value of F1 and the break-even point?
A : At the break-even point, F1 = P = R.
Q : Prove that the F1 is equal to the Dice coefficient of the retrieved and relevant document sets.
$\text{Dice}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}$
A :
$$\begin{aligned}
P &= \frac{TP}{TP + FP} \\
R &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2PR}{P + R} = \frac{2TP}{2TP + FP + FN} \\
|X| &= TP + FP, \quad |Y| = TP + FN, \quad |X \cap Y| = TP \\
\text{Dice}(X, Y) &= \frac{2TP}{2TP + FP + FN} = F_1
\end{aligned}$$
Estimation Methods
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
LOOCV: k = n (train on all data except one instance, which is held out for testing)
Stratified sampling
oversampling vs undersampling
Bootstrap
Sampling with replacement
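A minimal sketch of the k-fold partitioning step only (no shuffling or stratification); `k_fold_indices` is an illustrative helper.

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint, nearly equal folds."""
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

# 5-fold: each fold serves once as the test set, the other folds as training data
for test_fold in k_fold_indices(n=10, k=5):
    train = [i for i in range(10) if i not in test_fold]
    print(test_fold, train)
```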
Evaluating Ranked Lists
The ground truth itself is ranked / expresses partial preferences
Correlation coefficient measurement
Discounted Cumulative Gain (DCG)
measures the usefulness, or gain, of a document based on its position in the result list
The gain is accumulated from the top of the result list downward
CG (cumulative gain) is independent of the result order;
DCG depends on the result order
DCG at position p
$\text{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2(i)}$
The $rel_i$ is the graded relevance of the result at position $i$
Example
The score of the order above (relevances 2, 1, 0, 2, 0) is
$\text{DCG}_5 = 2 + \frac{1}{\log_2(2)} + \frac{0}{\log_2(3)} + \frac{2}{\log_2(4)} + \frac{0}{\log_2(5)} = 2 + 1 + 0 + 1 + 0 = 4$
The ideal order is 2, 2, 1, 0, 0
$\text{IDCG}_5 = 2 + \frac{2}{\log_2(2)} + \frac{1}{\log_2(3)} + \frac{0}{\log_2(4)} + \frac{0}{\log_2(5)} = 2 + 2 + 0.63 + 0 + 0 = 4.63$
NDCG normalizes DCG:
$\text{NDCG}_5 = \frac{\text{DCG}_5}{\text{IDCG}_5} = \frac{4}{4.63} = 0.86$
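A short sketch of DCG/NDCG using the formulation above ($rel_1 + \sum_{i \ge 2} rel_i / \log_2 i$); it reproduces the example values 4, 4.63, and 0.86.

```python
from math import log2

def dcg(rels):
    """DCG as defined above: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return rels[0] + sum(rel / log2(i) for i, rel in enumerate(rels[1:], start=2))

rels  = [2, 1, 0, 2, 0]
ideal = sorted(rels, reverse=True)           # the ideal order 2, 2, 1, 0, 0
print(dcg(rels))                             # DCG_5  = 4.0
print(round(dcg(ideal), 2))                  # IDCG_5 = 4.63
print(round(dcg(rels) / dcg(ideal), 2))      # NDCG_5 = 0.86
```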
Correlation coefficient measurement
Kendall-tau
Measures the association between two measured quantities
$\tau = \frac{\text{concordant} - \text{discordant}}{n(n-1)/2}$
where $\frac{n(n-1)}{2} = \binom{n}{2}$ is the number of item pairs
e.g. Ground Truth = 12345, Result = 21534
Concordant = 7: {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}
Discordant = 3: {1,2}, {3,5}, {4,5}
Kendall-tau = (7 − 3) / 10 = 0.4
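A sketch of Kendall-tau for two orderings of the same items, reproducing the example above; `kendall_tau` here is an illustrative helper, not a library function.

```python
from itertools import combinations

def kendall_tau(ground_truth, result):
    """(concordant - discordant) / C(n, 2), comparing pair orders in the two lists."""
    n = len(ground_truth)
    pos = {item: i for i, item in enumerate(result)}
    concordant = discordant = 0
    for a, b in combinations(ground_truth, 2):   # a comes before b in the ground truth
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau("12345", "21534"))  # (7 - 3) / 10 = 0.4
```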
Cohen's Kappa
Measures the agreement between two raters
$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$
Example 1
For example, with rater A and rater B,
out of 30 votes there are 25 agreements
Agreement: $\Pr(a) = (10 + 15)/30 = 0.83$
This looks high, but it can be inflated by noise (e.g. when the choice is too trivial)
The fix is to subtract $\Pr(e)$, the random agreement
$\Pr(e)$ is the probability that the two raters agree purely by chance
$$\begin{aligned}
P(A=Y) &= 10/30 = 0.33 \\
P(B=Y) &= 15/30 = 0.5 \\
P(A=Y, B=Y) &= 0.33 \times 0.5 = 0.17 \\
P(A=N, B=N) &= 0.66 \times 0.5 = 0.33 \\
\Rightarrow \Pr(e) &= 0.17 + 0.33 = 0.5
\end{aligned}$$
With $\Pr(a)$ and $\Pr(e)$, the kappa value defined above can be computed:
$K = \frac{0.83 - 0.5}{1 - 0.5} = 0.66$
So the agreement drops from 0.83 to 0.66
Example 2
Both $\Pr(a) = \frac{60}{100} = 0.6$
$$\begin{aligned}
\Pr_1(e) &= \frac{60}{100} \times \frac{70}{100} + \frac{40}{100} \times \frac{30}{100} = 0.54 \\
\Pr_2(e) &= \frac{60}{100} \times \frac{30}{100} + \frac{40}{100} \times \frac{70}{100} = 0.46 \\
K_1 &= \frac{0.6 - 0.54}{1 - 0.54} = 0.13 \\
K_2 &= \frac{0.6 - 0.46}{1 - 0.46} = 0.26
\end{aligned}$$
Although both cases have an agreement of 0.6 before computing kappa,
after computing kappa, the two raters on the right show clearly better agreement
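A sketch of Cohen's kappa from a 2x2 agreement table; the cell counts are reconstructed from Example 1 (10 Y-Y and 15 N-N agreements, and 5 disagreements where A = N, B = Y). With exact fractions the result is about 0.67; the 0.66 above comes from rounding $\Pr(a)$ to 0.83.

```python
def cohens_kappa(yy, yn, ny, nn):
    """Kappa from a 2x2 table: rows are rater A (Y/N), columns are rater B (Y/N)."""
    total = yy + yn + ny + nn
    p_a = (yy + nn) / total                     # observed agreement Pr(a)
    a_y, b_y = (yy + yn) / total, (yy + ny) / total
    p_e = a_y * b_y + (1 - a_y) * (1 - b_y)     # chance agreement Pr(e)
    return (p_a - p_e) / (1 - p_e)

# Example 1: A says Y 10 times, B says Y 15 times, 25 agreements out of 30
print(round(cohens_kappa(yy=10, yn=0, ny=5, nn=15), 2))  # 0.67 (0.66 with rounded Pr(a))
```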
Applicability
Each setting calls for a different evaluation measure:
NDCG: ranked output (sorted data)
Recall: when the full ground truth is available
Top-1 precision: recommendation
F1: finding the best trade-off between precision and recall