How do you speed up np.corrcoef with numba


My code is simply this:

import numpy as np
from numba import njit

@njit()
def corr(arr: np.ndarray):
    return np.corrcoef(arr)

arr = np.random.random((10000, 10000))
corr_matrix = corr(arr)

It takes around 50 seconds to finish on my computer, but just 18 seconds without @njit. If I increase the size to 30,000, the function takes forever.

Is there a way to improve the performance of np.corrcoef under numba's @njit in this situation, or is np.corrcoef already as fast as it can be? I think I'm not understanding numba correctly here, because the function is much slower with @njit than without.
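
One measurement caveat: @njit compiles the function the first time it is called with a given argument type, so a single timed call includes compilation time. A minimal sketch that times the first (compiling) and second (already compiled) calls separately, using a smaller array purely for illustration; even with compilation excluded, there is no guarantee numba's implementation of np.corrcoef beats NumPy's, which already runs in optimized compiled code.

from time import perf_counter
import numpy as np
from numba import njit

@njit()
def corr(arr: np.ndarray):
    return np.corrcoef(arr)

arr = np.random.random((2000, 2000))   # smaller than 10000x10000, just to keep this quick

tic = perf_counter()
corr(arr)                              # first call: includes JIT compilation
print("first call:", perf_counter() - tic, "s")

tic = perf_counter()
corr(arr)                              # second call: compiled code only
print("second call:", perf_counter() - tic, "s")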


There is 1 answer below.

Answer by Matt Haberland:

Based on comments, the OP seemed interested in a speedup using a GPU, so I tested CuPy's corrcoef on a Colab T4 GPU instance.

from time import perf_counter_ns
import cupy as cp
import numpy as np

rng = np.random.default_rng(65651651684)
x = rng.random((3000, 3000))

%timeit np.corrcoef(x)
# 835 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit cp.asnumpy(cp.corrcoef(x))  # faster
# 273 ms ± 2.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit cp.asnumpy(cp.corrcoef(x, dtype=cp.float32))  # much faster
# 37.8 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
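
A caveat on these GPU timings: CuPy launches kernels asynchronously, and it is the cp.asnumpy call that forces the device-to-host copy and synchronization, so the numbers above include transfer time. A rough sketch of timing just the on-device computation, continuing the session above (a warm-up call first, and cp.cuda.Device().synchronize() as the fence):

x_gpu = cp.asarray(x, dtype=cp.float32)   # one-time host-to-device copy

cp.corrcoef(x_gpu)                        # warm-up so one-time setup isn't timed
cp.cuda.Device().synchronize()

tic = perf_counter_ns()
rho_gpu = cp.corrcoef(x_gpu)              # result stays on the GPU
cp.cuda.Device().synchronize()            # wait for the asynchronous kernels to finish
toc = perf_counter_ns()
print((toc - tic) / 1e6, "ms")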

If we're concerned about the possibility of caching or lazy evaluation, we can generate new data, run once, and even include a print statement for good measure.

rng = np.random.default_rng(23368551581)
x = rng.random((3000, 3000))

tic = perf_counter_ns()
rho0 = np.corrcoef(x)
print(rho0[0])
toc = perf_counter_ns()
print(toc - tic)  # 879117900

tic = perf_counter_ns()
rho = cp.asnumpy(cp.corrcoef(x))
print(rho[0])
toc = perf_counter_ns()
print(toc - tic)  # 445459866

# passes
np.testing.assert_allclose(cp.asnumpy(rho), rho0)

tic = perf_counter_ns()
rho = cp.asnumpy(cp.corrcoef(x, dtype=cp.float32))
print(rho[0])
toc = perf_counter_ns()
print(toc - tic)  # 41231956

# passes, although it may depend on how nice your data is numerically
np.testing.assert_allclose(cp.asnumpy(rho), rho0, atol=1e-6)

I also checked the NumPy calculations on a regular CPU instance with both float64 and float32, and the CuPy calculations on the T4 were faster.
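
For reference, a sketch of the kind of CPU-side check described above (not the exact script; the seed and shape are chosen for illustration). Note that np.corrcoef works in at least float64 precision by default, so float32 has to be requested explicitly through the dtype keyword (NumPy 1.20+):

from time import perf_counter_ns
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for illustration only
x = rng.random((3000, 3000))

for dtype in (np.float64, np.float32):
    tic = perf_counter_ns()
    np.corrcoef(x, dtype=dtype)           # by default np.corrcoef uses at least float64; dtype requests float32
    toc = perf_counter_ns()
    print(dtype.__name__, (toc - tic) / 1e9, "s")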

Results may vary by GPU, but I figured this comparison was fair since anyone can use Colab (for some amount of time) for free. I don't know the constraints of the OP, but using a GPU - especially with float32 arithmetic - seems to be a good way to speed up the correlation coefficient calculation for large arrays.
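
If the correlation matrix feeds further GPU work, a natural pattern (a suggestion, not something benchmarked above) is to move the data to the device once, compute there, and copy back only what is needed:

import numpy as np
import cupy as cp

x = np.random.default_rng(0).random((3000, 3000))

x_gpu = cp.asarray(x, dtype=cp.float32)   # single host-to-device transfer
rho_gpu = cp.corrcoef(x_gpu)              # computed and kept on the device
# ... downstream GPU work can use rho_gpu directly ...
first_row = cp.asnumpy(rho_gpu[0])        # copy back only what is needed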