Why is this multithreading code running slower than single thread?


Hi,

I'm making my first attempt at multithreading in C# within Grasshopper. The function is a simple distance-between-two-points calculation, with two inputs of 10,000 points each. The code works, but it takes much, much longer than a single-threaded component. I'm curious if anyone could explain why this is.

using System;
using System.Collections;
using System.Collections.Generic;

using Rhino;
using Rhino.Geometry;

using Grasshopper;
using Grasshopper.Kernel;
using Grasshopper.Kernel.Data;
using Grasshopper.Kernel.Types;

using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;


  private void RunScript(List<Point3d> x, List<Point3d> y, ref object A)
  {

    var results = new ConcurrentDictionary<Point3d, double>(Environment.ProcessorCount, x.Count);

    List<double> output = new List<double>();
    foreach(Point3d pt in x) results[pt] = 0;


    Parallel.ForEach(x, pt =>
    {
      results[pt] = pt.DistanceTo(y[x.IndexOf(pt)]);
    });

    foreach(Point3d pt in x)
    {
      output.Add(results[pt]);
    }

    A = output;
  }

I know there must be heavy overhead in using a ConcurrentDictionary, two loops, and a List, but I assumed that the larger the population of points, the more benefit you would see over a single thread, which isn't the case. If I'm using a really inefficient method I'd like to know, or is there little chance the overhead will ever justify multithreading for such a simple task?
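
For reference, a single-threaded scripted version of the same calculation would look roughly like this (a minimal sketch, pairing points by index and assuming equal-length lists; the method name is illustrative):

private void RunScriptSingle(List<Point3d> x, List<Point3d> y, ref object A)
{
    // One pass over the points, paired by index; no threading overhead.
    var output = new List<double>(x.Count);
    for (int i = 0; i < x.Count; i++)
        output.Add(x[i].DistanceTo(y[i]));
    A = output;
}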

2 Answers

Answer by Johan Donne:

There are a few things you can do to optimize your code; I tried the following with yours:

  • Limit the number of threads you use (via ParallelOptions). Since the workload of each iteration is very small, this limits the parallelization overhead.
  • Use Parallel.For instead of Parallel.ForEach so you don't need to look up the index of each point. This has nothing to do with parallelization, but it dramatically reduces the workload.
  • Create the output from the results in one go instead of adding them one by one.

Implementing these in your method could give something like:

private void RunScriptOpt(List<Point3d> x, List<Point3d> y, ref List<double> A)
{
    var results = new ConcurrentDictionary<Point3d, double>(Environment.ProcessorCount, x.Count);

    Parallel.For(0, x.Count, new ParallelOptions { MaxDegreeOfParallelism = 4 }, (i) =>
    {
        results[x[i]] = x[i].DistanceTo(y[i]);
    });

    // ToList requires using System.Linq; note that the order of Values is
    // not guaranteed to match the input order (addressed below).
    var output = results.Values.ToList();

    A = output;
}

Benchmark results for both:

|        Method |         Mean |       Error |      StdDev |        Gen0 |      Gen1 |     Gen2 |  Allocated |
|-------------- |-------------:|------------:|------------:|------------:|----------:|---------:|-----------:|
| NaiveParallel | 105,161.0 us | 2,039.82 us | 2,652.35 us | 159600.0000 | 1000.0000 | 800.0000 | 1909.71 MB |
|   OptParallel |     736.1 us |     4.65 us |     4.12 us |     99.6094 |   98.6328 |  49.8047 |    1.18 MB |

Execution time is reduced to 0.7% of the original and memory allocation to 0.06% (greatly reducing the load on the garbage collector).

Obviously this optimized version will compare much more favorably to the single-threaded code than the original method...

The OP clarified in a comment: "I should clarify I'm trying to output the distances so they associate with the order of the points going in."

Accomplishing this is even simpler: you don't need a ConcurrentDictionary at all. Since each iteration calculates and stores its result completely independently of the other iterations, no synchronization is required and you can use a simple array:

private void RunScriptOpt(List<Point3d> x, List<Point3d> y, ref double[] A)
{
    double[] results = new double[x.Count];
    Parallel.For(0, x.Count, new ParallelOptions { MaxDegreeOfParallelism = 10 }, (i) =>
    {            
        results[i] = x[i].DistanceTo(y[i]);
    });                       
    
    A = results;
}

Of course this makes your code faster still (and reduces the memory allocations even further):

|        Method |          Mean |        Error |        StdDev |        Gen0 |       Gen1 |     Gen2 |     Allocated |
|-------------- |--------------:|-------------:|--------------:|------------:|-----------:|---------:|--------------:|
| NaiveParallel | 143,491.07 us | 9,306.979 us | 27,441.839 us | 159600.0000 | 32800.0000 | 600.0000 | 1955547.13 KB |
|   OptParallel |      27.47 us |     0.203 us |      0.190 us |      6.6528 |     2.2888 |        - |      81.42 KB |

Answer by Guru Stron:

I would argue that the biggest issue in the code is the use of IndexOf, which makes it effectively O(n^2), so I would start by testing a single-threaded approach. If you think about the data manipulation happening here, you are matching the incoming collections by index and computing a result, which is exactly what Zip is for. So try something like this (and add using System.Linq;):

private void RunScript(List<Point3d> x, List<Point3d> y, ref object A)
{
    // if needed - check that both sequences are of the same size
    A = x.Zip(y, (l, r) => l.DistanceTo(r))
        .ToArray();
}

Note that Zip will produce a result with the size of the shortest collection:

The method merges each element of the first sequence with an element that has the same index in the second sequence. If the sequences do not have the same number of elements, the method merges sequences until it reaches the end of one of them. For example, if one sequence has three elements and the other one has four, the result sequence will have only three elements.
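
For illustration, a small self-contained sketch of that truncation behavior (hypothetical values):

using System;
using System.Collections.Generic;
using System.Linq;

class ZipDemo
{
    static void Main()
    {
        var a = new List<int> { 1, 2, 3 };
        var b = new List<int> { 10, 20, 30, 40 };

        // If the lists must match in length, fail fast before zipping:
        // if (a.Count != b.Count) throw new ArgumentException("Length mismatch.");

        // Zip stops at the end of the shorter sequence: the result has
        // three elements and b's fourth element is ignored.
        var sums = a.Zip(b, (l, r) => l + r).ToArray();
        Console.WriteLine(string.Join(", ", sums)); // prints: 11, 22, 33
    }
}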

If you really have that many items, you can try PLINQ before going deep into different kinds of optimizations:

private void RunScript(List<Point3d> x, List<Point3d> y, ref object A)
{
    // if needed - check that both sequences are of the same size
    // Both sides must be ParallelQuery: the PLINQ Zip overload that accepts
    // a plain IEnumerable is obsolete and throws at runtime. AsOrdered
    // preserves the input order of the results.
    A = x
        .AsParallel().AsOrdered()
        .Zip(y.AsParallel().AsOrdered(), (l, r) => l.DistanceTo(r))
        .ToArray();
}

If this method is part of a critical path and you want to squeeze out as much performance as possible, you can go with the third option from the other answer: preallocate a result array of the needed size and fill it (no concurrent collection is needed here because there is no contention; every read/write targets a different index). Or take the similar single-threaded approach (preallocate the result array and use a for loop), since parallelization usually only makes sense for large inputs or relatively long computations. A sketch of that single-threaded variant follows.
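
A minimal sketch of that single-threaded, preallocated variant (the method name is illustrative):

private void RunScriptPrealloc(List<Point3d> x, List<Point3d> y, ref object A)
{
    // Preallocate the result array; indices preserve input order by construction.
    var results = new double[x.Count];
    for (int i = 0; i < x.Count; i++)
        results[i] = x[i].DistanceTo(y[i]);
    A = results;
}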

P.S.

Use BenchmarkDotNet for benchmarking, but note that in general there are a lot of moving parts, and the same code can behave differently depending on a number of factors (.NET version, OS, CPU architecture, ...).
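
For instance, a minimal BenchmarkDotNet harness comparing a single-threaded loop against the preallocated parallel version might look like this (a sketch: class and method names are illustrative, the random-point setup is invented for the example, and a RhinoCommon reference is assumed for Point3d):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Rhino.Geometry; // assumes a RhinoCommon reference

[MemoryDiagnoser]
public class DistanceBenchmarks
{
    private List<Point3d> x, y;

    [GlobalSetup]
    public void Setup()
    {
        // Generate two matching lists of 10,000 random points.
        var rnd = new Random(42);
        x = new List<Point3d>();
        y = new List<Point3d>();
        for (int i = 0; i < 10_000; i++)
        {
            x.Add(new Point3d(rnd.NextDouble(), rnd.NextDouble(), rnd.NextDouble()));
            y.Add(new Point3d(rnd.NextDouble(), rnd.NextDouble(), rnd.NextDouble()));
        }
    }

    [Benchmark(Baseline = true)]
    public double[] SingleThreaded()
    {
        var results = new double[x.Count];
        for (int i = 0; i < x.Count; i++)
            results[i] = x[i].DistanceTo(y[i]);
        return results;
    }

    [Benchmark]
    public double[] OptParallel()
    {
        // Each iteration writes to its own index, so no synchronization is needed.
        var results = new double[x.Count];
        Parallel.For(0, x.Count, i => results[i] = x[i].DistanceTo(y[i]));
        return results;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<DistanceBenchmarks>();
}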