Cupyx 3D FFT convolution on GPU slows down after multiple sequential calls

First off, I have never asked a question on Stack Overflow before, and I will do my best to follow the site guidelines, but let me know if I should change anything about my post.

I am attempting to write a function that can quickly extract the pore size distribution from a binary 3D image. I do this by computing the local thickness of the image, similar to the approach implemented in ImageJ's local thickness plugin. Ideally, I need this function to run in under 1 second, as I am calling it ~200,000 times in a simulated annealing process. It is performed partially on the CPU (12th Gen Intel(R) Core(TM) i7-12700KF, 20 cores, 16GB RAM) and partially on the GPU (RTX GeForce 3050, 8GB).

The function works, but there is something happening I think on the backend, which is slowing it down artificially. This may have to do with threading, or GPU to CPU overhead, or some kind of 'cool down' period.

There are three parts of the function:

  1. Euclidean distance transform - performed on the CPU, in parallel, using the edt package. Currently takes ~0.25 seconds on a 250^3 binary image.

  2. 3D skeletonization - performed on the CPU using skimage.morphology.skeletonize_3d, with the image split into chunks via dask (this implementation is provided by porespy.filters.chunked_func). The skeleton is then multiplied by the distance transform, so each skeleton voxel holds the distance to the nearest background voxel. This step takes 0.45 to 0.5 seconds.

  3. Dilation of each skeleton voxel using a spherical structuring element with radius equal to the value of that voxel. This is done in a for loop over structuring element sizes in decreasing order, so larger spheres do not get overwritten by smaller spheres (a minimal sketch of this rule follows this list). The dilations are accomplished with FFT convolution on the GPU using cupyx.scipy.signal.signaltools.convolve, which takes ~0.005 seconds.
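
A minimal sketch of just that insertion rule, using the same variable names (skel, masks, r_max) as the full function further down (dilate_skeleton is only an illustrative name):

import cupy as cp
from cupyx.scipy.signal.signaltools import convolve

def dilate_skeleton(skel, masks, r_max):
    # skel: value-weighted skeleton on the GPU; masks[r]: spherical kernel of radius r
    final = cp.zeros(skel.shape)
    for r in range(r_max, 0, -1):  # largest spheres first
        seeds = ((skel > r - 1) * (skel <= r)).astype(int)
        dilation = (convolve(seeds, masks[r], mode='same', method='fft') > 0.1) * r
        final = final + (final == 0) * dilation  # only fill voxels not already claimed
    return final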

However, less code is required to reproduce the effect I am seeing; the essential part is performing many FFT convolutions in sequence.

A minimum reproducible example is as follows:

import skimage
import time
import cupy as cp
from cupyx.scipy.signal.signaltools import convolve


# Generate a binary image
im = cp.random.random((250,250,250)) > 0.4

# Generate spherical structuring kernels for input to convolution
structuring_kernels = {}
for r in range(1,21):
    structuring_kernels.update({r: cp.array(skimage.morphology.ball(r))})

# run dilation process in loop
for i in range(10):
    s = time.perf_counter()
    for j in range(20,0,-1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    e = time.perf_counter()
    # time.sleep(2)
    print(e-s)

When run as is, after the first couple of loops, each dilation loop takes ~1.8 seconds on my computer. If I uncomment the time.sleep(2) line (i.e. pause for 2 seconds between each loop), then each loop only takes ~0.05 seconds. I suspect this has to do with threading or GPU use, as it takes a couple of loops to reach the 1.8 seconds, after which it stays steady at that value. When I monitor my GPU usage, the 3D utilization graph quickly spikes to 100% and stays close to there.

If I am simply being limited by the capacity of my GPU, why do the first couple of loops run faster? Could a memory leak be happening? Does anyone know why this is happening, and whether there is a way to prevent it, possibly using backend controls in cupy?
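
For what it's worth, one way to check whether GPU memory is actually accumulating between loops would be to watch CuPy's default memory pool; something like this (reusing im, structuring_kernels, and convolve from the example above):

import time
import cupy as cp

pool = cp.get_default_memory_pool()

for i in range(10):
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    e = time.perf_counter()
    # used_bytes() is memory currently in use; total_bytes() includes cached blocks held by the pool
    print(f'loop {i}: {e - s:.3f} s, used {pool.used_bytes() / 1e6:.0f} MB, held {pool.total_bytes() / 1e6:.0f} MB')
    # pool.free_all_blocks()  # optionally release cached blocks back to the driver between loops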

I'm not sure if this is necessary, but my local thickness function in its entirety is as follows:

import porespy as ps
from skimage.morphology import skeletonize_3d
import time
import numpy as np
import cupy as cp
from edt import edt
from cupyx.scipy.signal.signaltools import convolve

def local_thickness_cp(im, masks=None, method='fft'):
    """

    Parameters
    ----------
    im: 3D voxelized image for which the local thickness map is desired
    masks: (optional) A dictionary of the structuring elements to be used
    method: 'fft' or 'direct'

    Returns
    -------
    The local thickness map
    """
    s = time.perf_counter()
    # Calculate the euclidean distance transform using edt package
    dt = cp.array(edt(im, parallel=15))
    e = time.perf_counter()
    # print(f'EDT took {e - s}')
    s = time.perf_counter()
    # Calculate the skeleton of the image and multiply by dt
    skel = cp.array(ps.filters.chunked_func(skeletonize_3d,
                                            overlap=17,
                                            divs=[2, 3, 3],
                                            cores=20,
                                            image=im).astype(bool)) * dt
    e = time.perf_counter()
    # print(f'skeletonization took {e - s} seconds')
    r_max = int(cp.max(skel))
    s = time.perf_counter()
    if not masks:
        masks = {}
        for r in range(int(r_max), 0, -1):
            masks.update({r: cp.array(ps.tools.ps_ball(r))})
    e = time.perf_counter()
    # print(f'mask creation took {e - s} seconds')
    # Initialize the local thickness image
    final = cp.zeros(cp.shape(skel))
    time_in_loop = 0
    s = time.perf_counter()
    for r in range(r_max, 0, -1):
        # Get a mask of where the skeleton has values between r-1 and r
        skel_selected = ((skel > r - 1) * (skel <= r)).astype(int)
        # Perform dilation on the mask using fft convolve method, and multiply by radius of pore size
        dilation = (convolve(skel_selected, masks[r], mode='same', method=method) > 0.1) * r
        # Add dilation to local thickness image, where it is still zero (ie don't overwrite previous inserted values)
        final = final + (final == 0) * dilation

    e = time.perf_counter()
    # print(f'Dilation loop took {e - s} seconds')
    return final

Now, in theory, the function should take ~0.8 seconds to run. When called in a loop on separate images, however, it takes ~1.5 seconds. If I add a time.sleep(1) after each function call, it drops back to approximately 0.8 seconds.
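
For context, the calling pattern looks roughly like this (simplified; get_next_image is just a placeholder for however the annealing step produces the next candidate image):

import time

for step in range(200000):
    im = get_next_image()   # placeholder for the next candidate binary image
    s = time.perf_counter()
    lt_map = local_thickness_cp(im)
    e = time.perf_counter()
    # time.sleep(1)         # with this pause, each call takes ~0.8 s; without it, ~1.5 s
    print(e - s)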

1 Answer

Answered by Laurent Facq:

I think it's because calls to the GPU are asynchronous: a CuPy function can return while the GPU is still working, but when you try to access the result, you have to wait until the GPU computation has finished.

Try calling

cp.cuda.stream.get_current_stream().synchronize()

just before calling

time.perf_counter()

and you'll get "synchronous", accurate timing.
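
Applied to the timing loop from the question, that would look something like this (same setup as in the question's example):

import time
import skimage
import cupy as cp
from cupyx.scipy.signal.signaltools import convolve

im = cp.random.random((250, 250, 250)) > 0.4
structuring_kernels = {r: cp.array(skimage.morphology.ball(r)) for r in range(1, 21)}

stream = cp.cuda.stream.get_current_stream()

for i in range(10):
    stream.synchronize()   # make sure previously queued GPU work is finished
    s = time.perf_counter()
    for j in range(20, 0, -1):
        convolve(im, structuring_kernels[j], mode='same', method='fft')
    stream.synchronize()   # wait for the convolutions launched above to complete
    e = time.perf_counter()
    print(e - s)           # now measures the actual GPU compute time per loop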