I'm trying to use Alea to speed up a program I'm working on but I need some help.
What I need to do is a lot of bitcount and bitwise operations with values stored in two arrays.
For each element of my first array I have to do a bitwise & with each element of my second array, then count the bits set to 1 in the result of the &.
If that count is greater than or equal to a certain limit, I need to exit the inner loop and move on to the next element of the first array.
The first array is usually big, with millions of elements; the second one usually has fewer than 200,000 elements.
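In plain C# the logic looks roughly like this (a simplified sketch, not my exact code; BitOperations.PopCount comes from System.Numerics and needs .NET Core 3.0 or later):
using System.Numerics;
using System.Threading.Tasks;

// Sketch: for each element of arr1, AND it with every element of arr2 and count the
// set bits; keep the element only if no pair reaches the limit.
private static long[] CheckCpu(long[] arr1, long[] arr2, int limit)
{
    long[] result = new long[arr1.Length];
    Parallel.For(0, arr1.Length, i =>
    {
        long b = arr1[i];
        bool found = false;
        for (int j = 0; j < arr2.Length; j++)
        {
            if (BitOperations.PopCount((ulong)(b & arr2[j])) >= limit)
            {
                found = true;
                break;
            }
        }
        if (!found)
        {
            result[i] = b;
        }
    });
    return result;
}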
Trying to do all these operations in parallel on the GPU, here is my code:
[GpuManaged]
private long[] Check(long[] arr1, long[] arr2, int limit)
{
    Gpu.FreeAllImplicitMemory(true);
    var gpu = Gpu.Default;
    long[] result = new long[arr1.Length];

    gpu.For(0, arr1.Length, i =>
    {
        bool found = false;
        long b = arr1[i];

        // AND the current element with every element of arr2 and count the set bits;
        // stop as soon as one pair reaches the limit.
        for (int i2 = 0; i2 < arr2.Length; i2++)
        {
            if (LibDevice.__nv_popcll(b & arr2[i2]) >= limit)
            {
                found = true;
                break;
            }
        }

        // Keep the element only if no pair reached the limit.
        if (!found)
        {
            result[i] = b;
        }
    });

    return result;
}
This works as expected but is only a little faster than my version running in parallel on a quad-core CPU.
I'm certainly missing something here; it's my very first attempt to write GPU code.
By the way, my NVIDIA GPU is a GeForce GT 740M.
EDIT
The following code is 2x faster than the previous one, at least on my PC. Many thanks to Michael Randall for pointing me in the right direction.
private static int[] CheckWithKernel(Gpu gpu, int[] arr1, int[] arr2, int limit)
{
    // 16 blocks of 256 threads = 4096 threads; the kernel's grid-stride loop covers the rest.
    var lp = new LaunchParam(16, 256);
    var result = new int[arr1.Length];

    try
    {
        // Allocate device memory explicitly and copy the inputs to the GPU once.
        using (var dArr1 = gpu.AllocateDevice(arr1))
        using (var dArr2 = gpu.AllocateDevice(arr2))
        using (var dResult = gpu.AllocateDevice<int>(arr1.Length))
        {
            gpu.Launch(Kernel, lp, arr1.Length, arr2.Length, dArr1.Ptr, dArr2.Ptr, dResult.Ptr, limit);
            Gpu.Copy(dResult, result);
            return result;
        }
    }
    finally
    {
        Gpu.Free(arr1);
        Gpu.Free(arr2);
        Gpu.Free(result);
    }
}
private static void Kernel(int a1, int a2, deviceptr<int> arr1, deviceptr<int> arr2, deviceptr<int> arr3, int limit)
{
    // Grid-stride loop: each thread starts at its global index and advances by the
    // total number of launched threads, so the 16 x 256 threads cover all a1 elements.
    var iinit = blockIdx.x * blockDim.x + threadIdx.x;
    var istep = gridDim.x * blockDim.x;

    for (var i = iinit; i < a1; i += istep)
    {
        bool found = false;
        int b = arr1[i];

        for (var j = 0; j < a2; j++)
        {
            if (LibDevice.__nv_popcll(b & arr2[j]) >= limit)
            {
                found = true;
                break;
            }
        }

        if (!found)
        {
            arr3[i] = b;
        }
    }
}
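For reference, this is roughly how it gets called (a sketch, not a complete program; arr1, arr2 and limit are assumed to already hold the int data described above):
// Illustrative call site; Gpu.Default is Alea's default CUDA device.
var gpu = Gpu.Default;
int[] kept = CheckWithKernel(gpu, arr1, arr2, limit);
// kept[i] holds arr1[i] for every element that never reached the popcount limit.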
Update
It seems pinning won't work with GCHandle.Alloc().
However, the point of this answer is that you will get a much greater performance gain out of direct memory access:
http://www.aleagpu.com/release/3_0_3/doc/advanced_features_csharp.html
Directly Working with Device Memory
Original Answer
Disregarding the logic involved, or how relevant this is to GPU code, you could complement your Parallel routine and possibly speed things up by pinning your arrays in memory with GCHandle.Alloc() and the GCHandleType.Pinned flag and using direct pointer access (if you can run unsafe code).
Example
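Something along these lines (a sketch, not the exact code; the CheckPinned name is illustrative, BitOperations.PopCount stands in for whatever popcount you use on the CPU, and compiling requires /unsafe):
using System.Numerics;
using System.Runtime.InteropServices;

// Sketch: pin both arrays so the GC cannot move them, then walk them with raw pointers.
private static unsafe long[] CheckPinned(long[] arr1, long[] arr2, int limit)
{
    var result = new long[arr1.Length];
    GCHandle h1 = GCHandle.Alloc(arr1, GCHandleType.Pinned);
    GCHandle h2 = GCHandle.Alloc(arr2, GCHandleType.Pinned);
    try
    {
        long* p1 = (long*)h1.AddrOfPinnedObject().ToPointer();
        long* p2 = (long*)h2.AddrOfPinnedObject().ToPointer();
        for (int i = 0; i < arr1.Length; i++)
        {
            bool found = false;
            long b = p1[i];
            for (int j = 0; j < arr2.Length; j++)
            {
                if (BitOperations.PopCount((ulong)(b & p2[j])) >= limit)
                {
                    found = true;
                    break;
                }
            }
            if (!found)
            {
                result[i] = b;
            }
        }
    }
    finally
    {
        h1.Free();
        h2.Free();
    }
    return result;
}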
Additional Resources
GCHandle.Alloc Method (Object)
GCHandleType Enumeration
Unsafe Code and Pointers (C# Programming Guide)
Note: the advanced features documentation linked above has since moved; some of the samples and info have changed or moved in release 3.0.4.