I need to perform a convolution on an image in-place, and by in-place I mean such that as the structuring element gets applied to different pixels, I want the result of the previous steps to overwrite the image. To put it in context, this is useful in Gauss-Seidel iteration.
I'm currently using scipy.ndimage.convolve1d, which clearly doesn't do in-place convolution as described above. I know how to write such a kernel with numba (essentially a for-loop that overwrites elements as it iterates over pixels), but I was wondering whether I could get further speedups on a GPU. The difficulty is that the array being operated on changes as each element is processed, so a naive GPU implementation runs into race conditions.
Here's a concrete example:
a = [1, 5, 3, 0, -4, 1]
weights = [1, 2, 1]
Here's what scipy.ndimage.convolve1d does (assume out is the result, also assume 0 for values extending the boundaries):
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> out[0], a = [1, 5, 3, 0, -4, 1]
# Step 2: 1*1 + 2*5 + 1*3 = 14 -> out[1], a = [1, 5, 3, 0, -4, 1]
# Step 3: 1*5 + 2*3 + 1*0 = 12 -> out[2], a = [1, 5, 3, 0, -4, 1]
# ...
Here's what I want:
# Step 1: 1*0 + 2*1 + 1*5 = 7 -> a[0], a = [7, 5 , 3 , 0, -4, 1]
# Step 2: 1*7 + 2*5 + 1*3 = 20 -> a[1], a = [7, 20, 3 , 0, -4, 1]
# Step 3: 1*20 + 2*3 + 1*0 = 26 -> a[2], a = [7, 20, 26, 0, -4, 1]
# ...
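For reference, the desired behaviour can be sketched in plain Python (the helper name inplace_conv1d is mine, not from any library):

```python
def inplace_conv1d(a, w):
    # Sequential in-place convolution: a[i] is overwritten before
    # position i+1 is processed, so later steps see updated values.
    # Boundaries are zero-padded, matching the example above.
    n = len(a)
    for i in range(n):
        left = a[i - 1] if i > 0 else 0
        right = a[i + 1] if i < n - 1 else 0
        a[i] = w[0] * left + w[1] * a[i] + w[2] * right
    return a

print(inplace_conv1d([1, 5, 3, 0, -4, 1], [1, 2, 1]))
# [7, 20, 26, 22, 15, 17]
```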
Here's my numba implementation for a 2D convolution with the following weights:
w = [[0,  1, 0],
     [1, -4, 1],
     [0,  1, 0]]
@numba.njit()
def myconv(u):
    Nx, Ny = u.shape
    for i in range(1, Nx-1):
        for j in range(1, Ny-1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4*u[i, j]
If I understood correctly, you are trying to run a 2D infinite impulse response (IIR) filter.
If your filter is separable, you can do that by applying scipy.signal.lfilter once along each axis. Note that this method shifts the output; if that matters, it may be preferable to use filtfilt, which runs the filter forward and backward and cancels the phase shift.
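A minimal sketch of that idea; the filter coefficients here are purely illustrative:

```python
import numpy as np
from scipy.signal import lfilter

# lfilter realizes y[n] = b[0]*x[n] + ... - a[1]*y[n-1] - ..., so each
# output feeds back into later outputs, much like the in-place scheme
# above. With b=[1], a=[1, -1] it reduces to a running sum:
x = np.arange(1.0, 6.0)
print(lfilter([1.0], [1.0, -1.0], x))  # cumulative sum: 1, 3, 6, 10, 15

# For a separable 2D filter, apply it once along each axis:
img = np.random.rand(16, 16)
out = lfilter([1.0], [1.0, -0.5], img, axis=0)
out = lfilter([1.0], [1.0, -0.5], out, axis=1)
```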
Edit
The example given by the OP is not separable to begin with; however, it is simple enough that the computation can be made reasonable using a cumulative sum.
Start from the original implementation above. Notice that in the inner j loop, the only already-modified value entering the sum is u[i, j-1], which is added to u[i, j]. That addition can be deferred and performed in a second loop without changing the output. Then notice that the second loop is just a cumulative sum.
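A sketch of that two-pass rewrite (plain NumPy here; @numba.njit can be applied exactly as in the original — the function name myconv_deferred is mine):

```python
import numpy as np

def myconv_deferred(u):
    # Same result as the original in-place loop, but the addition of the
    # already-updated neighbour u[i, j-1] is deferred to a second loop.
    Nx, Ny = u.shape
    for i in range(1, Nx - 1):
        # First pass: everything except the u[i, j-1] term.
        for j in range(1, Ny - 1):
            u[i, j] = u[i-1, j] + u[i+1, j] + u[i, j+1] - 4*u[i, j]
        # Second pass: u[i, j] += u[i, j-1] is a plain running sum.
        for j in range(1, Ny - 1):
            u[i, j] += u[i, j-1]
    return u
```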
This vectorizes one axis.
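The second loop can then be replaced by np.cumsum, eliminating the inner j loop entirely (again a sketch; the name myconv_cumsum is mine):

```python
import numpy as np

def myconv_cumsum(u):
    # First pass as a vectorized slice operation; the deferred
    # u[i, j-1] term becomes a cumulative sum along the row.
    Nx, Ny = u.shape
    for i in range(1, Nx - 1):
        u[i, 1:-1] = (u[i-1, 1:-1] + u[i+1, 1:-1]
                      + u[i, 2:] - 4*u[i, 1:-1])
        u[i, 1:-1] = np.cumsum(u[i, 1:-1]) + u[i, 0]
    return u
```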
This formulation is fully compatible with PyTorch, which supports GPUs out of the box.
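In PyTorch the same row-wise cumulative-sum formulation might look like this (a sketch; move the tensor to the GPU with `u = u.cuda()` to run there — the function name myconv_torch is mine):

```python
import torch

def myconv_torch(u):
    # Row-sequential, column-vectorized update; torch.cumsum handles
    # the sequential dependence along each row.
    Nx, Ny = u.shape
    for i in range(1, Nx - 1):
        row = (u[i-1, 1:-1] + u[i+1, 1:-1]
               + u[i, 2:] - 4*u[i, 1:-1])
        u[i, 1:-1] = torch.cumsum(row, dim=0) + u[i, 0]
    return u

u = torch.rand(64, 64)  # torch.rand(64, 64, device="cuda") on GPU
u = myconv_torch(u)
```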