I have a very large numpy array with entries like:
[['0/1' '2/0']
['3/0' '1/4']]
I want to convert it to a 3D integer array like:
[[[0 1] [2 0]]
[[3 0] [1 4]]]
The array is very wide, so a lot of columns, but not many rows. There are around 100 or so possibilities for the string. This isn't actually a fraction, just a demonstration of what is in the file (it's genomics data, given to me in this format).
I don't want to run in parallel, as I will be running this on a single CPU before moving to a single GPU, so the extra CPUs would be idle while the GPU kernel is running. I have tried numba:
import numpy as np
import itertools
from numba import njit
import time

@njit(nopython=True)
def index_with_numba(data,int_data,indices):
    for pos in indices:
        str_match = str(pos[0])+'/'+str(pos[1])
        for i in range(data.shape[0]):
            for j in range(data.shape[1]):
                if data[i, j] == str_match:
                    int_data[i,j] = pos
    return int_data

def generate_masks():
    masks=[]
    def _2d_array(i,j):
        return np.asarray([i,j],dtype=np.int32)
    for i in range(10):
        for j in range(10):
            masks.append(_2d_array(i,j))
    return masks

rows = 100000
cols = 200
numerators = np.random.randint(0, 10, size=(rows,cols))
denominators = np.random.randint(0, 10, size=(rows,cols))
samples = np.array([f"{numerator}/{denominator}" for numerator, denominator in zip(numerators.flatten(), denominators.flatten())],dtype=str).reshape(rows, cols)
samples_int = np.empty((samples.shape[0],samples.shape[1],2),dtype=np.int32)

# Generate all possible masks
masks = generate_masks()
t0=time.time()
samples_int = index_with_numba(samples,samples_int, masks)
t1=time.time()
print(f"Time to index {t1-t0}")
But it is too slow to be feasible.
Time to index 182.0304057598114
The reason I want this is that I want to write a CUDA kernel to perform an operation based on the original values, so for '0/1' I need 0 and 1 etc., but I cannot handle the strings. I had thought perhaps masks could be used, but they don't seem to be suitable.
Any suggestions appreciated.
Since your integers are all single digits, you can view your input array as a 'U1' array, called arr_u1 here. You then already know the indices of the numbers in the strings: the [:, :, 0] elements of your expected result are in arr_u1[:, ::3], and the [:, :, 1] elements of your expected result are in arr_u1[:, 2::3]. Converting those two slices to integers gives you the expected result.
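The code that went with this explanation is not shown here, so the following is only a minimal sketch of the idea as described, not necessarily the exact code used for the timing. The helper name parse_pairs is made up for illustration, and it assumes a 2D array whose strings are all exactly three characters long ("d/d"), i.e. dtype 'U3'.

```python
import numpy as np

def parse_pairs(samples):
    """Hypothetical helper: turn a 2D array of 'd/d' strings into int32 pairs."""
    # The view trick needs contiguous memory and fixed-width 3-character strings.
    samples = np.ascontiguousarray(samples)
    if samples.ndim != 2 or samples.dtype != np.dtype('U3'):
        raise ValueError("expected a 2D array of dtype 'U3'")

    # Reinterpret each 3-character string as three 1-character strings:
    # shape (rows, cols) becomes (rows, 3 * cols), no data is copied.
    arr_u1 = samples.view('U1')

    out = np.empty(samples.shape + (2,), dtype=np.int32)
    out[..., 0] = arr_u1[:, ::3].astype(np.int32)   # characters before '/'
    out[..., 1] = arr_u1[:, 2::3].astype(np.int32)  # characters after '/'
    return out

example = np.array([['0/1', '2/0'],
                    ['3/0', '1/4']])
result = parse_pairs(example)
print(result)
```

The dtype check is there because the slicing with a step of 3 silently gives wrong answers if any value has more than one digit on either side of the slash.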
Comparing the runtime of your index_with_numba vs. mine shows a ~20x speedup on my computer.
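To time the view-based conversion on your own machine, something along these lines could be used. It is a sketch under assumptions: convert_via_view is a hypothetical name, the test array is deliberately smaller than the 100000 x 200 one in the question so that it runs quickly, and the input strings are built with np.char.add rather than a Python loop. It makes no claim about the speedup you will see.

```python
import time
import numpy as np

def convert_via_view(samples):
    # Hypothetical helper; assumes a C-contiguous 2D 'U3' array of "d/d" strings.
    arr_u1 = samples.view('U1')
    # Stack the two digit planes on a new last axis, then parse them as integers.
    return np.stack((arr_u1[:, ::3], arr_u1[:, 2::3]), axis=-1).astype(np.int32)

rows, cols = 1000, 200  # smaller than in the question, for a quick run
rng = np.random.default_rng(0)
numerators = rng.integers(0, 10, size=(rows, cols))
denominators = rng.integers(0, 10, size=(rows, cols))

# Build the "n/d" strings with vectorised string concatenation.
samples = np.char.add(np.char.add(numerators.astype('U1'), '/'),
                      denominators.astype('U1'))

t0 = time.perf_counter()
samples_int = convert_via_view(samples)
t1 = time.perf_counter()
print(f"Time to convert {t1 - t0}")
```

Because the original digits are kept in numerators and denominators, the output can be checked against them directly before trusting any timing.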