How this method or formula for calculating ROC AUC works?

Question

How this method or formula for calculating ROC AUC works?

915 Views Asked by Aditya Jain At 16 November 2025 at 14:50

I was trying to calculate the AUC using MySQL for the data in table like below:

y   p
1   0.872637
0   0.130633
0   0.098054
...
...
1   0.060190
0   0.110938

I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).

 SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
 FROM ( 
   SELECT y, row_number() OVER (ORDER BY p) r
   FROM probs
 ) t

 Using pandas this can be done as follows:

 temp = df.sort_values(by="p")
 temp['r'] = np.arange(1, len(df)+1, 1)
 temp['yr'] = temp['y']*temp['r']
 print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )

I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?

I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.

Original Q&A

There are 1 best solutions below

**Kate Melnykova** · Answer 1

Short answer: it is Wilcoxon-Mann-Whitney statistic, see https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve The page has proof as well.

The bottom part of your formula is identical to the formula in the wiki. The top part is trickier. f in wiki corresponds to p in your data and t_0 and t_1 are indexes in the data frame. Note that we first sort by p, which makes our life easier.

Note that the double sum may be decomposed as

Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}

Here # stands for the total number of such indexes.

For each row index t_1 (such that y(t_1) =1), how many t_0 are such that p(t_0) < p(t_1) and y(t_0)=0? We know that there are exactly t_1 values of p that are less or equal than t_1 because values are sorted. We conclude that

#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}

Now imagine scrolling down the sorted dataframe. For the first time we meet y=1, #{t_0: t_0 <= t_1 and y(t_0)=1}=1, for the second time we meet y=1, the same quantity is 2, for the third time we meet y=1, the quantity is 3, and so on. Therefore, when we sum the equality over all indexes t_1 when y=1, we get

Sum_{t_1: y(t_1)=1}#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),

where n is the total number of ones in y column. Now we need to do one more simplification. Note that

Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)

If y(t_1) is not one, it is zero. Therefore,

Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)

Plugging this to our formula and using that

1 + 2+ 3 + ... + n = n(n+1)/2

finished the proof of the formula you found.

P.S. I think that posting this question on math or stats overflow would make more sense.

How this method or formula for calculating ROC AUC works?

There are 1 best solutions below

Related Questions in MACHINE-LEARNING

Related Questions in METRICS

Related Questions in ROC

Related Questions in AUC

Trending Questions

Popular # Hahtags

Popular Questions