I have a nested loop in which I want to work with a sub-buffer of an OpenCL buffer object. However, the sub-buffer's offset is not aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN of my device (which is 128 bytes). I used a bitwise operation to align the offset as required by the hardware:
cl_mem sub_buff;
for (int u = 0; u < u_max; u++) {
    for (int v = 0; v < v_max; v++) {
        // cl_buffer_region.origin is in bytes, so convert the element offset to bytes
        size_t offset = (size_t)(u * v_max + v) * size_of_slice * sizeof(float);
        size_t base_align = 128; // bytes
        // round up to the next multiple of base_align (must be a power of two)
        size_t aligned_offset = (offset + base_align - 1) & ~(base_align - 1);
        cl_buffer_region subBufferRegion = {aligned_offset,
                                            size_of_slice * sizeof(float)};
        sub_buff = clCreateSubBuffer(data_buff_d, CL_MEM_READ_WRITE,
                                     CL_BUFFER_CREATE_TYPE_REGION,
                                     &subBufferRegion, &error);
        // ... continue to work with sub buffer
    }
}
But doing so, I'm missing out on elements, because aligned_offset is rounded up to the next multiple of 128 bytes, so the sub-buffer starts past the first elements of the slice.
Is there a way to align the sub-buffer to the required 128 bytes and still get all size_of_slice elements? I want to avoid creating a new cl_mem buffer in each iteration.
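To make the arithmetic concrete, here is a minimal standalone sketch (no OpenCL calls) comparing the round-up used above with one possible alternative: rounding the origin down and carrying the remainder as a skip. The names `slice_region`, `align_slice_down`, and `align_up` are hypothetical helpers introduced only for illustration, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where a slice's sub-buffer would start and
   how far into it the slice's first byte lies. */
typedef struct {
    size_t origin; /* aligned byte offset, candidate for cl_buffer_region.origin */
    size_t size;   /* byte size, candidate for cl_buffer_region.size */
    size_t skip;   /* bytes between origin and the real start of the slice */
} slice_region;

/* Round up to the next multiple of align (what the loop above does).
   Any bytes between byte_offset and the result are skipped. */
static size_t align_up(size_t byte_offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    return (byte_offset + align - 1) & ~(align - 1);
}

/* Round the slice start DOWN to a multiple of align and grow the region
   by the remainder, so the region still covers the whole slice. */
static slice_region align_slice_down(size_t byte_offset, size_t slice_bytes,
                                     size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    slice_region r;
    r.origin = byte_offset & ~(align - 1);
    r.skip   = byte_offset - r.origin;
    r.size   = r.skip + slice_bytes;
    return r;
}
```

For example, with a made-up slice of 100 floats (400 bytes), the second slice starts at byte 400. Rounding up gives 512, which drops the first 112 bytes (28 floats) of the slice. Rounding down gives origin 384, skip 16, and size 416, so the slice is fully covered, but the kernel would then have to start reading at element `skip / sizeof(float)` (4 here) inside the sub-buffer. Whether that is acceptable depends on the code consuming the sub-buffer. Note also that regions built this way can overlap their neighbours, and the OpenCL specification leaves concurrent writes to overlapping sub-buffers undefined.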