I was trying to calculate the AUC using MySQL for the data in table like below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
SELECT y, row_number() OVER (ORDER BY p) r
FROM probs
) t
Using pandas this can be done as follows:
temp = df.sort_values(by="p")
temp['r'] = np.arange(1, len(df)+1, 1)
temp['yr'] = temp['y']*temp['r']
print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )
I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?
I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.
Short answer: it is Wilcoxon-Mann-Whitney statistic, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve The page has proof as well.
The bottom part of your formula is identical to the formula in the wiki. The top part is trickier.
fin wiki corresponds topin your data andt_0andt_1are indexes in the data frame. Note that we first sort byp, which makes our life easier.Note that the double sum may be decomposed as
Here
#stands for the total number of such indexes.For each row index
t_1(such thaty(t_1) =1), how manyt_0are such thatp(t_0) < p(t_1)andy(t_0)=0? We know that there are exactlyt_1values ofpthat are less or equal thant_1because values are sorted. We conclude thatNow imagine scrolling down the sorted dataframe. For the first time we meet
y=1,#{t_0: t_0 <= t_1 and y(t_0)=1}=1, for the second time we meety=1, the same quantity is 2, for the third time we meety=1, the quantity is 3, and so on. Therefore, when we sum the equality over all indexest_1wheny=1, we getwhere
nis the total number of ones inycolumn. Now we need to do one more simplification. Note thatIf
y(t_1)is not one, it is zero. Therefore,Plugging this to our formula and using that
finished the proof of the formula you found.
P.S. I think that posting this question on math or stats overflow would make more sense.