Partial sorting algorithm, why is there such a difference in performance?

910 Views Asked by At

I have a list of randomly generated numbers, which contains 1900 numbers, and I want to obtain a sorted list of the top 190 numbers. I've written two versions of the partial sorting algorithm, 1st is a CPU version and 2nd is written so it can run on Cudafy.net. But there is a large difference in execution time between them, when run on the CPU, I was wondering if someone could shed some light on why, + is it possible to speed the 2nd version up further?

Note: the 2nd algorithm is going to be run on a GPU so I can't use linq or anything which wouldn't run on C as I will be using cudafy.net to run the code. Unfortunately cudafy.net also doesn't support jagged arrays.

Version 1:

/// <summary>
/// Sequentially runs through all the values in the array and identifies if 
/// the current number is less than the highest number in the sorted list.
/// </summary>
/// <param name="numbers"> Unsorted array of numbers.</param>
/// <param name="sortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
public static void NewSorter(int[] numbers, int[] sortedNumbers)
{
    for (int i = 0; i < numbers.Length; i++)
    {
        if (sortedNumbers[sortedNumbers.Length - 1] > numbers[i])
        {
            //Update numbers
            IdentifyPosition(sortedNumbers, numbers[i]);
        }
    }
}

/// <summary>
/// Identifies the position the number should be placed in the partial list of sorted numbers.
/// </summary>
/// <param name="sortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
/// <param name="NewNumber"> Number to be inserted.</param>
static void IdentifyPosition(int[] sortedNumbers, int NewNumber)
{
    for (int i = 0; i < sortedNumbers.Length; i++)
    {
        if (NewNumber < sortedNumbers[i])
        {
            //Offset and add.
            ArrayShifter(sortedNumbers, i, NewNumber);
            break;
        }
    }
}

/// <summary>
/// Moves all the elements to the right of a point up one and 
/// then places the new number in the specified point.
/// </summary>
/// <param name="SortedNumbers"> Array used to hold the partial list of sorted numbers.</param>
/// <param name="position"> Position in the array where the new number should be place.</param>
/// <param name="NewNumber"> Number to include in the array.</param>
static void ArrayShifter(int[] SortedNumbers, int position, int NewNumber)
{
    for (int i = SortedNumbers.Length - 1; i > position; i--)
    {
        SortedNumbers[i] = SortedNumbers[i - 1];
    }

    SortedNumbers[position] = NewNumber;
}

The above version executed in ~ 0.65 milliseconds.

Version 2:

/// <summary>
/// Sequentially runs through all the values in the array and identifies if 
/// the current number is less than the highest number in the sorted list.
/// </summary>
/// <param name="unsortedNumbers"> Unsorted numbers.</param>
/// <param name="lookbackCount"> Length of the array.</param>
/// <param name="sortedNumbers"> Array which will contain the partial list of sorted numbers.</param>
[Cudafy]
public static void CudaSorter(GThread thread, long[,] unsortedNumbers, int[] lookbackCount, long[,] sortedNumbers)
{
    int threadIndex = thread.threadIdx.x;
    int blockIndex = thread.blockIdx.x;
    int threadsPerBlock = thread.blockDim.x;
    int gpuThread = (threadIndex + (blockIndex * threadsPerBlock));

    if (gpuThread < 32)
    {
        int maxIndex = (lookbackCount[gpuThread] * 10) / 100;
        int maxLookback = lookbackCount[gpuThread];

        for (int i = 0; i < maxLookback; i++)
        {
            if (sortedNumbers[gpuThread, maxIndex] > unsortedNumbers[gpuThread, i])
            {
                //Update numbers
                IdentifyPosition2(sortedNumbers, unsortedNumbers[gpuThread, i], maxIndex, gpuThread);
            }
        }
    }
}


/// <summary>
/// Identifies the position in the sortedNumbers array where the new number should be placed.
/// </summary>
/// <param name="sortedNumbers"> Sorted numbers.</param>
/// <param name="newNumber"> Number to be included in the sorted array.</param>
/// <param name="maxIndex"> length of sortedNumbers array. </param>
/// <param name="gpuThread"> GPU thread index.</param>
[Cudafy(eCudafyType.Device)]
public static void CudaIdentifyPosition(long[,] sortedNumbers, long newNumber, int maxIndex, int gpuThread)
{
    for (int i = 0; i < maxIndex; i++)
    {
        if (newNumber < sortedNumbers[gpuThread, i])
        {
            //Offset and add.
            ArrayShifter2(sortedNumbers, i, newNumber, maxIndex, gpuThread);
            break;
        }
    }
}


/// <summary>
/// Shifts all the elements to the right of the specified position, 1 position
/// to the right, and insert the new number in the specified position.
/// </summary>
/// <param name="sortedNumbers"> Sorted Numbers.</param>
/// <param name="position"> Where the new number needs to be inserted.</param>
/// <param name="newNumber"> New number to insert.</param>
/// <param name="maxIndex"> Length of sortedNumbers array.</param>
/// <param name="gpuThread"> GPU thread index.</param>
[Cudafy(eCudafyType.Device)]
public static void CudaArrayShifter(long[,] sortedNumbers, int position, long newNumber, int maxIndex, int gpuThread)
{
    for (int i = maxIndex - 1; i > position; i--)
    {
        sortedNumbers[gpuThread, i] = sortedNumbers[gpuThread, i - 1];
    }

    sortedNumbers[gpuThread, position] = newNumber;
}

The above executes in 2.8 milliseconds i.e. ~ 4x slower.

I've already tried the following:

  1. Declared local variable for maxLookBack count and used that in the for loop => no improvement.
  2. Changed data types from long[,] to int[,] => 2.6 milliseconds (This isn't feasible as I need to use long.)
  3. Changed int[,] to int[] => 1.3 milliseconds (This isn't feasible either as I need to pass multiple arrays to the GPU to keep it occupied.) I was surprised how much this affected the time.

EDIT: I modified the code due to Henk's comments. I now ran the GPU version on the GPU with unsortedNumbers[32,1900] vs a single thread on the CPU sorting 1 array. Even when I multiply the CPU time by 32, it's still considerably lower than the GPU's time.

1

There are 1 best solutions below

0
On

After being disgraced here, I decided to read a little about the task in order to understand what it was about.

So, you need to select a subarray of minimum numbers from a large array, which then needs to be sorted. It’s probably better not to come up with an option for the CPU: run through the array, select the lows and immediately insert them into the final array by shifting the elements. Clearly, sorting will occur during the selection.

However, I can’t imagine how you can sample in parallel! In addition, you need to use well-parallelizing sorting algorithms. Otherwise, if the task is solved sequentially, the graphics core will certainly lose in speed to the CPU core, whose frequency is higher and data access is faster!

I think that merge sorting can help you with this. Only instead of fetching the lows and then sorting, try sorting everything right away! And then select N first or last elements.

Unfortunately, I’m not ready to edit your code right now. But I hope that was at least a little useful.