I am using C# and CUDAfy.net (yes, this problem is easier in straight C with pointers, but I have my reasons for using this approach given the larger system).
I have a video frame grabber card that is collecting byte[1024 x 1024] image data at 30 FPS. Every 33.3 ms it fills a slot in a circular buffer and returns a System.IntPtr
that points to that un-managed 1D vector of *byte
; The Circular buffer has 15 slots.
On the GPU device (Tesla K40) I want to have a global 2D array that is organized as a dense 2D array. That is, I want something like the Circular Queue but on the GPU organized as a dense 2D array.
byte[15, 1024*1024] rawdata;
// if CUDAfy.NET supported jagged arrays I could use byte[15][1024*1024 but it does not
How can I fill in a different row each 33ms? Do I use something like:
gpu.CopyToDevice<byte>(inputPtr, 0, rawdata, offset, length) // length = 1024*1024
//offset is computed by rowID*(1024*1024) where rowID wraps to 0 via modulo 15.
// inputPrt is the System.Inptr that points to the buffer in the circular queue (un-managed)?
// rawdata is a device buffer allocated gpu.Allocate<byte>(1024*1024);
And in my kernel header is:
[Cudafy]
public static void filter(GThread thread, byte[,] rawdata, int frameSize, byte[] result)
I did try something along these lines. But there is no API pattern in CudaFy for:
GPGPU.CopyToDevice(T) Method (IntPtr, Int32, T[,], Int32, Int32, Int32)
So I used the gpu.Cast Function to change the 2D device array to 1D.
I tried the code below, but I am getting CUDA.net exception: ErrorLaunchFailed
FYI: When I try the CUDA emulator, it aborts on the CopyToDevice claiming that Data is not host allocated
public static byte[] process(System.IntPtr data, int slot)
{
Stopwatch watch = new Stopwatch();
watch.Start();
byte[] output = new byte[FrameSize];
int offset = slot*FrameSize;
gpu.Lock();
byte[] rawdata = gpu.Cast<byte>(grawdata, FrameSize); // What is the size supposed to be? Documentation lacking
gpu.CopyToDevice<byte>(data, 0, rawdata, offset, FrameSize * frameCount);
byte[] goutput = gpu.Allocate<byte>(output);
gpu.Launch(height, width).filter(rawdata, FrameSize, goutput);
runTime = watch.Elapsed.ToString();
gpu.CopyFromDevice(goutput, output);
gpu.Free(goutput);
gpu.Synchronize();
gpu.Unlock();
watch.Stop();
totalRunTime = watch.Elapsed.ToString();
return output;
}
You should consider using the GPGPU Async functionality that's built in for a really efficient way to move data from/to host/device and use the
gpuKern.LaunchAsync(...)
Check out http://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU for an efficient way to use this. Another great example can be found in CudafyExamples project, look for PinnedAsyncIO.cs. Everything you need to do what you're describing.
This is in
CudaGPU.cs
in Cudafy.Host project, which matches the method you're looking for (only it's async):