Copy unmanaged System.IntPtr byte vector into GPU row of 2D device byte array

I am using C# and CUDAfy.net (yes, this problem is easier in straight C with pointers, but I have my reasons for using this approach given the larger system).

I have a video frame grabber card that collects byte[1024 x 1024] image data at 30 FPS. Every 33.3 ms it fills a slot in a circular buffer and returns a System.IntPtr that points to that unmanaged 1D vector of bytes (byte*). The circular buffer has 15 slots.

On the GPU device (Tesla K40) I want a global, dense 2D array that mirrors the circular queue: one row per slot, each row holding one frame.

byte[15, 1024*1024] rawdata;
// if CUDAfy.NET supported jagged arrays I could use byte[15][1024*1024], but it does not

How can I fill in a different row every 33 ms? Would I use something like the following?

gpu.CopyToDevice<byte>(inputPtr, 0, rawdata, offset, length) // length = 1024*1024
// offset is computed as rowID*(1024*1024), where rowID wraps to 0 via modulo 15.
// inputPtr is the System.IntPtr that points to the (unmanaged) buffer in the circular queue.
// rawdata is a device buffer allocated with gpu.Allocate<byte>(1024*1024);

And my kernel header is:

[Cudafy]
public static void filter(GThread thread, byte[,] rawdata, int frameSize, byte[] result)

I tried something along these lines, but CUDAfy has no overload matching:

GPGPU.CopyToDevice(T) Method (IntPtr, Int32, T[,], Int32, Int32, Int32)

So I used the gpu.Cast function to view the 2D device array as a 1D array.

I tried the code below, but I get a CUDA.NET exception: ErrorLaunchFailed

FYI: When I try the CUDA emulator, it aborts on the CopyToDevice call, claiming that the data is not host allocated.

public static byte[] process(System.IntPtr data, int slot)
{
    Stopwatch watch = new Stopwatch();
    watch.Start();
    byte[] output = new byte[FrameSize];
    int offset = slot*FrameSize;
    gpu.Lock();
    byte[] rawdata = gpu.Cast<byte>(grawdata, FrameSize); // What is the size supposed to be? Documentation lacking
    gpu.CopyToDevice<byte>(data, 0, rawdata, offset, FrameSize * frameCount);
    byte[] goutput = gpu.Allocate<byte>(output);
    gpu.Launch(height, width).filter(rawdata, FrameSize, goutput);
    runTime = watch.Elapsed.ToString();
    gpu.CopyFromDevice(goutput, output);
    gpu.Free(goutput);
    gpu.Synchronize();
    gpu.Unlock();
    watch.Stop();
    totalRunTime = watch.Elapsed.ToString();
    return output;
}

There are 3 best solutions below

0
On BEST ANSWER

You should consider using the built-in GPGPU async functionality, which is a really efficient way to move data between host and device, together with gpuKern.LaunchAsync(...).

Check out http://www.codeproject.com/Articles/276993/Base-Encoding-on-a-GPU for an efficient way to use this. Another great example is PinnedAsyncIO.cs in the CudafyExamples project. It has everything you need to do what you're describing.

These overloads are in CudaGPU.cs in the Cudafy.Host project and match the method you're looking for (only they are async):

public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, DevicePtrEx devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[, ,] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[,] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
public void CopyToDeviceAsync<T>(IntPtr hostArray, int hostOffset, T[] devArray,
                                 int devOffset, int count, int streamId = 0) where T : struct;
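For the devOffset argument in these overloads, here is a minimal sketch of the row-offset arithmetic for a dense [15, 1024*1024] layout. It assumes the offset is counted in elements from the start of the device array with rows stored contiguously, which I have not checked against the CUDAfy source. RingLayout, SlotIndex and SlotOffset are hypothetical names for illustration, not part of CUDAfy.

```csharp
using System;

// Hypothetical helper, not part of CUDAfy: maps a running frame counter
// to the element offset of its row in a dense [SlotCount, FrameSize] array.
public static class RingLayout
{
    public const int SlotCount = 15;           // slots in the circular buffer
    public const int FrameSize = 1024 * 1024;  // bytes per frame

    // Row index wraps back to 0 after the last slot.
    public static int SlotIndex(long frameNumber)
    {
        return (int)(frameNumber % SlotCount);
    }

    // Assumption: the offset is measured in elements (bytes here) from the
    // start of the array, with rows stored contiguously (row-major).
    public static int SlotOffset(long frameNumber)
    {
        return SlotIndex(frameNumber) * FrameSize;
    }
}
```

With this layout the count passed alongside the offset would be a single frame (FrameSize), so that a copy into the last row does not run past the end of the array.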
1
On

If I understand your question properly, I think you want to convert the
byte* you get from the circular buffer into a multi-dimensional byte array to send to
the graphics card API.

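        // requires: using System; using System.Runtime.InteropServices;
        // The method name below is a placeholder for wherever this code lives.
        private static void FillFromCircularBuffer()
        {
            int slots = 15;
            int rows = 1024;
            int columns = 1024;

            // Try this: read the unmanaged memory one byte at a time (simple but slow)
            for (int currentSlot = 0; currentSlot < slots; currentSlot++)
            {
                IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);
                byte[,] rawForGpu = new byte[rows, columns];

                int offset = 0;
                for (int m = 0; m < rows; m++)
                    for (int n = 0; n < columns; n++)
                    {
                        rawForGpu[m, n] = Marshal.ReadByte(intPtrToUnManagedMemory, offset++);
                    }
                // then send rawForGpu to your GPU method
            }

            // Or try this: copy the whole frame at once, then reshape it
            for (int currentSlot = 0; currentSlot < slots; currentSlot++)
            {
                IntPtr intPtrToUnManagedMemory = CopyContextFrom(currentSlot);

                byte[] byteData = CopyIntPtrToByteArray(intPtrToUnManagedMemory, rows * columns);

                byte[,] rawForGpu = ConvertTo2DArray(byteData, rows, columns);
                // then send rawForGpu to your GPU method
            }
        }

        private static byte[] CopyIntPtrToByteArray(IntPtr source, int length)
        {
            // Marshal.Copy moves unmanaged bytes into a managed array
            byte[] managed = new byte[length];
            Marshal.Copy(source, managed, 0, length);
            return managed;
        }

        private static byte[,] ConvertTo2DArray(byte[] byteArr, int rows, int columns)
        {
            byte[,] data = new byte[rows, columns];
            int totalElements = rows * columns;
            // A rectangular array is stored row-major, so one block copy converts 1D to 2D
            Buffer.BlockCopy(byteArr, 0, data, 0, totalElements);
            return data;
        }

        private static IntPtr CopyContextFrom(int slotNumber)
        {
            // Placeholder: put the code that returns the byte* from the circular buffer here.
            // The Marshal calls above throw until this returns a real pointer.
            return IntPtr.Zero;
        }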
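If the goal is a host-side mirror of the [slots, frameSize] layout from the question, rather than a [rows, columns] image, a frame can be written straight into one row. This is only a sketch using standard .NET calls (Marshal.Copy and Buffer.BlockCopy). HostRing and StoreFrame are hypothetical names, and the reusable staging buffer is an assumption made to avoid allocating a new array for every frame.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration only.
public static class HostRing
{
    // Copies one frame from unmanaged memory into row `slot` of a
    // rectangular [slots, frameSize] array. `staging` is a reusable
    // managed buffer holding at least frameSize bytes.
    public static void StoreFrame(IntPtr frame, byte[,] ring, int slot, byte[] staging)
    {
        int frameSize = ring.GetLength(1);

        // Marshal.Copy only accepts 1D arrays, so go through the staging buffer.
        Marshal.Copy(frame, staging, 0, frameSize);

        // Rectangular arrays are row-major, so row `slot` starts at
        // byte offset slot * frameSize.
        Buffer.BlockCopy(staging, 0, ring, slot * frameSize, frameSize);
    }
}
```

The caller would pass slot = frameNumber % 15 so that the row index wraps in the same way as the grabber's circular buffer.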
1
On

For now, I propose one of these "solutions": 1. Run the program only in native mode (not in emulation mode), or 2. Do not handle the pinned-memory allocation yourself.

There seems to be an open issue with this at the moment, but it only happens in emulation mode.

See: https://cudafy.codeplex.com/workitem/636