Fast way to convert 2d string numpy array to a 3d int array

I have a very large numpy array with entries like:

[['0/1' '2/0']
['3/0' '1/4']]

I want to convert it to / get a 3d int array like

[[[0 1] [2 0]]
[[3 0] [1 4]]]

The array is very wide: a lot of columns, but not many rows. There are only around 100 or so possible string values. These aren't actually fractions, just a demonstration of what is in the file (it's genomics data, given to me in this format).

I don't want to run in parallel, as I will be running this on a single CPU before moving to a single GPU, so the other CPU cores would sit idle while the GPU kernel is running. I have tried numba:

import numpy as np
import itertools
from numba import njit
import time


@njit
def index_with_numba(data, int_data, indices):
    # For each possible "a/b" string, scan the whole array and write [a, b]
    # wherever that string matches.
    for pos in indices:
        str_match = str(pos[0]) + '/' + str(pos[1])
        for i in range(data.shape[0]):
            for j in range(data.shape[1]):
                if data[i, j] == str_match:
                    int_data[i, j] = pos
    return int_data

def generate_masks():
    masks=[]
    def _2d_array(i,j):
        return np.asarray([i,j],dtype=np.int32)
    for i in range(10):
        for j in range(10):
            masks.append(_2d_array(i,j))
    return masks

rows = 100000
cols = 200

numerators = np.random.randint(0, 10, size=(rows,cols))
denominators = np.random.randint(0, 10, size=(rows,cols))

samples =  np.array([f"{numerator}/{denominator}" for numerator, denominator in zip(numerators.flatten(), denominators.flatten())],dtype=str).reshape(rows, cols)
samples_int = np.empty((samples.shape[0],samples.shape[1],2),dtype=np.int32)


# Generate all possible masks
masks = generate_masks()
t0=time.time()
samples_int = index_with_numba(samples,samples_int, masks)
t1=time.time()
print(f"Time to index {t1-t0}")

But it is too slow to be feasible.

Time to index 182.0304057598114

The reason I want this is that I want to write a CUDA kernel to perform an operation based on the original values - so for '0/1' I need 0 and 1, etc. - but I cannot handle the strings there. I had thought perhaps masks could be used, but they don't seem to be suitable.
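
To illustrate what I mean, here is a rough sketch (not my real kernel, and the per-pair operation is just a placeholder) of the kind of numba.cuda kernel I would run on the (rows, cols, 2) int array:

from numba import cuda

@cuda.jit
def pair_kernel(pairs, out):
    # pairs: (rows, cols, 2) int array; out: (rows, cols) result array
    i, j = cuda.grid(2)
    if i < pairs.shape[0] and j < pairs.shape[1]:
        out[i, j] = pairs[i, j, 0] + pairs[i, j, 1]  # placeholder operation

# launch sketch (block/grid sizes are illustrative):
# threads = (16, 16)
# blocks = ((pairs.shape[0] + 15) // 16, (pairs.shape[1] + 15) // 16)
# pair_kernel[blocks, threads](pairs, out)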

Any suggestions appreciated.

There are 4 answers below

Answer by pho

Since your integers are all single digits, you can view your input array as a 'U1' array:

arr_s = np.array([['0/1', '2/0'],
                  ['3/0', '1/4']])

arr_u1 = arr_s.view('U1')
# array([['0', '/', '1', '2', '/', '0'],
#        ['3', '/', '0', '1', '/', '4']], dtype='<U1')

Now, you already know the indices of the digits within the strings: the [:, :, 0] elements of your expected result are in arr_u1[:, ::3], and the [:, :, 1] elements are in arr_u1[:, 2::3].

nrows, ncols = arr_s.shape
result = np.zeros((nrows, ncols, 2), dtype=int)
result[:, :, 0] = arr_u1[:, ::3].astype(int)
result[:, :, 1] = arr_u1[:, 2::3].astype(int)

This gives you the expected result:

array([[[0, 1],
        [2, 0]],

       [[3, 0],
        [1, 4]]])

Comparing the runtime of your index_with_numba vs. mine shows a ~20x speedup on my computer:

def pho(data):
    arr_u1 = data.view('U1')
    nrows, ncols = data.shape
    result = np.zeros((nrows, ncols, 2), dtype=int)
    result[:, :, 0] = arr_u1[:, ::3].astype(int)
    result[:, :, 1] = arr_u1[:, 2::3].astype(int)
    return result

t0 = time.time()
samples_int = index_with_numba(samples, samples_int, masks)
t1 = time.time() - t0
samples2_int = pho(samples)
t2 = time.time() - t1 - t0
print(f"index_with_numba: {t1}\npho: {t2}")

# index_with_numba: 162.01522159576416
# pho: 8.772466897964478

Answer by Jérôme Richard

Unicode strings are barely supported by Numba and are currently very (very) slow. You can make the Numba code multiple orders of magnitude faster just by converting the unicode strings to byte strings before calling the Numba function. There is no need to use a GPU for such a simple task. Note that the conversion is quite slow even in NumPy (which is not very efficient at this since it is designed for numerical computation, hence its name). If you can, please consider generating the string array directly as a byte-string array, since there is no need for unicode here.

Here is an example:

# ~6 seconds on my machine (i5-9600KF CPU)
bsamples = samples.astype('S3')

# 1 ms instead of ~198 s  --  BUT INCORRECT (see below)
index_with_numba(bsamples, samples_int, masks)

UPDATE:

There is a catch though: the resulting Numba function is not correct anymore due to type mismatch! You also need to adapt the Numba function. In fact, the code can be optimized a bit and parallelized. Here is the resulting optimized function (which is also correct):

import numba as nb

@nb.njit(parallel=True)
def numba_faster(data, int_data, indices):
    for pos in indices:
        # Compare raw byte values instead of building a string for every pair.
        v0 = pos[0] + ord('0')
        v1 = pos[1] + ord('0')
        for i in nb.prange(data.shape[0]):
            for j in range(data.shape[1]):
                if data[i, j][0] == v0 and data[i, j][2] == v1:
                    int_data[i, j] = pos
    return int_data

Faster implementation

The above implementation has the benefit of being very similar to your initial code. However, the indices-based loop is not actually needed, and it makes the code pretty slow: the full data array is traversed len(indices) times, and each pass only sets the cells matching a single string. This is not efficient. We can instead convert the strings on the fly, in parallel.

@nb.njit(parallel=True)
def numba_fastest(data, int_data):
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            # Subtract the ASCII code of '0' from the digit bytes directly.
            int_data[i, j, 0] = data[i, j][0] - ord('0')
            int_data[i, j, 1] = data[i, j][2] - ord('0')
    return int_data

Performance evaluation

Here are performance results on my machine (with a i5-9600KF CPU):

Initial code:                        198.31 s
Pho's code:                           11.88 s
numba_faster (with conversion):        6.37 s
numba_fastest (with conversion):       5.77 s
numba_faster (without conversion):     0.62 s
Andrej Kesely's code (latest):         0.20 s
numba_fastest (without conversion):    0.02 s   <-----

numba_fastest is the fastest implementation. The conversion is still the bottleneck, but it can certainly be removed easily. Also note that the compilation time of the Numba functions is not included in the benchmark (you can eagerly compile a Numba function by adding a signature to it).
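
Here is a minimal sketch of that eager-compilation idea (an illustration, not the exact code benchmarked above): passing the byte-string array as its uint8 view keeps the explicit signature simple, and the function is compiled at definition time rather than on the first call:

import numpy as np
import numba as nb

@nb.njit(nb.int32[:, :, :](nb.uint8[:, :], nb.int32[:, :, :]), parallel=True)
def numba_fastest_eager(codes, int_data):
    # codes is the uint8 view of the 'S3' array: each "d/d" cell spans 3 consecutive bytes
    for i in nb.prange(codes.shape[0]):
        for j in range(int_data.shape[1]):
            int_data[i, j, 0] = codes[i, 3 * j] - 48      # ASCII '0' is 48
            int_data[i, j, 1] = codes[i, 3 * j + 2] - 48
    return int_data

# usage sketch:
# bsamples = samples.astype('S3')
# out = numba_fastest_eager(bsamples.view(np.uint8), np.empty((rows, cols, 2), np.int32))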

In the end, numba_fastest is 34 times faster than the initial code with the string conversion, and about 10 000 times faster without the conversion.

Answer by Andrej Kesely

Another solution: just view the array's underlying bytes as uint8, subtract 48 (the code of '0') from the digit bytes, then reshape:

def get_arr(arr):
    # A '<U...' array stores 4 bytes per character (UTF-32); on a little-endian
    # machine the first of those bytes holds the character's code point.
    out = arr.ravel().view("uint8")[::4]
    # Drop the '/' separators (code 47) and turn digit codes into digits (code - 48).
    out = (out[out != 47] - 48).reshape(arr.shape[0], arr.shape[1], -1)
    return out

arr = np.array([["0/1", "2/0"], ["3/0", "1/4"]])
print(get_arr(arr))

Prints:

[[[0 1]
  [2 0]]

 [[3 0]
  [1 4]]]

Benchmark:

import numpy as np
from timeit import timeit

def get_arr(arr):
    out = arr.ravel().view("uint8")[::4]
    out = (out[out != 47] - 48).reshape(arr.shape[0], arr.shape[1], -1)
    return out


def get_arr_pho(arr):
    nrows, ncols = arr.shape
    arr_u1 = arr.view("U1")
    result = np.zeros((nrows, ncols, 2), dtype=int)
    result[:, :, 0] = arr_u1[:, ::3].astype(int)
    result[:, :, 1] = arr_u1[:, 2::3].astype(int)
    return result

rows = 100000
cols = 200


numerators = np.random.randint(0, 10, size=(rows, cols))
denominators = np.random.randint(1, 10, size=(rows, cols))

arr = np.array(
    [
        f"{numerator}/{denominator}"
        for numerator, denominator in zip(numerators.flatten(), denominators.flatten())
    ],
    dtype=str,
).reshape(rows, cols)

assert np.allclose(get_arr(arr), get_arr_pho(arr))

t1 = timeit("get_arr(arr)", number=1, globals=globals())
t2 = timeit("get_arr_pho(arr)", number=1, globals=globals())
print(t1)
print(t2)

Prints:

0.1379915471188724
4.7036198258865625

Answer by nikhil swami

I want to attack the source of the problem.

People have used impressive solutions, from parallel loops to ASCII math. My hypothesis is that data like this usually arrives as rows and columns in a CSV, so here is a helper function for future readers. It can be generalized to any data source, works for fractions like 13/21221 too, and needs only a single pass, without Numba JITs or ASCII tricks.

import csv
import numpy as np

# File contains
"""
1/2,3/4
5/6,7/8
"""

def load_fractions_from_csv(file_path):
    with open(file_path, 'r') as file:
        reader = csv.reader(file)
        data = []
        for row in reader:
            fractions = []
            for fraction_str in row:
                numerator, denominator = map(int, fraction_str.split('/'))
                fractions.append((numerator, denominator))
            data.append(fractions)

    return np.array(data)


# Example usage
fractions_array = load_fractions_from_csv('data.csv')
print(fractions_array)
"""
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
"""


The best I achieved on the original problem was about 8 seconds, using an enhanced version of the masks idea, because I wanted to keep it as Pythonic as possible without library bloat:


import time
import numpy as np

rows = 10000
cols = 200

numerator_range = np.arange(0, 10)
denominator_range = np.arange(1, 10)
# Map every possible "i/j" string to its integer pair once, up front.
masker = {f"{i}/{j}": np.asarray([i, j], dtype=np.int8) for i in numerator_range for j in denominator_range}

numerators = np.random.randint(0, 10, size=(rows, cols))
denominators = np.random.randint(1, 10, size=(rows, cols))

samples = np.array(
    [f"{numerator}/{denominator}" for numerator, denominator in zip(numerators.flatten(), denominators.flatten())],
    dtype=str,
).reshape(rows, cols)
samples_int = np.empty((*samples.shape, 2), dtype=np.int8)

t1 = time.time()
for row in range(rows):
    for col in range(cols):
        samples_int[row, col] = masker[samples[row, col]]
print(samples_int)

print(time.time() - t1)