Possibilities to improve performance using vectorization for the following function in C#?

236 Views Asked by At

I have a function that estimates correlation between two input arrays.

The input is feeded by a dataDict which is of type Dictionary<string, double[]> which has 153 keys with values as double array of size 1500.

For each individual key, I need to estimate its correlation with all other keys and store the result to a double[,] that has a size of double[dataDict.Count(), dataDict.Count()]

The following function prepares two double[] arrays whose correlation needs to be estimated.

public double[,] CalculateCorrelation(Dictionary<string, double?[]> dataDict, string corrMethod = "kendall")
        {
            CorrelationLogicModule correlationLogicModule = new CorrelationLogicModule();
            double[,] correlationMatrix = new double[dataDict.Count(), dataDict.Count()];
            for (int i = 0; i < dataDict.Count; i++)
            {                
                for (int j = 0; j < dataDict.Count; j++)
                {
                    var arrayA = dataDict[dataDict.ElementAt(i).Key].Cast<double>().ToArray();
                    var arrayB = dataDict[dataDict.ElementAt(j).Key].Cast<double>().ToArray();
                    correlationMatrix[i, j] = correlationLogicModule.Kendall_Formula(arrayA, arrayB);
                }
            }
            return correlationMatrix;
        }

The following function (I found it on internet from here) finds correlation between two input arrays using 'Kendall's' method.

public double Kendall_Formula(double[] Ticker1, double[] Ticker2)
        {
            double NbrConcord, NbrDiscord, S;
            NbrConcord = 0;
            NbrDiscord = 0;
            S = 0;

            for (int i = 0; i < Ticker1.Length - 1; i++)
            {
                for (int j = i + 1; j < Ticker1.Length; j++)
                {
                    //Compute the number of concordant pairs
                    if (((Ticker1[i] < Ticker1[j]) & (Ticker2[i] < Ticker2[j])) | ((Ticker1[i] > Ticker1[j]) & (Ticker2[i] > Ticker2[j])))
                    {
                        NbrConcord++;
                    }
                    //Compute the number of discordant pairs
                    else if (((Ticker1[i] > Ticker1[j]) & (Ticker2[i] < Ticker2[j])) | ((Ticker1[i] < Ticker1[j]) & (Ticker2[i] > Ticker2[j])))
                    {
                        NbrDiscord++;
                    }
                }
            }
            S = NbrConcord - NbrDiscord;
            //Proportion with the total pairs
            return 2 * S / (Ticker1.Length * (Ticker1.Length - 1));
        }

Moving this way forward, takes a very long time to calculate the correlations for all the keys. is there a possible way to optimize the performance?.

I am new to C# but I have been using Python for a long time and in Python using 'Numpys' and 'Pandas' I am sure the above operation would take seconds to compute. For e.g. lets say I had the above data in form of a pandas dataframe, then data[[list of columns]].corr('method') would lead the result in seconds. This is because pandas uses numpy under the hood which takes benefit from vectorization. I would like to learn how can I take benefit from vectorization to improve the performance of the above code in C# and if there are other factors I need to consider. Thank you!

1

There are 1 best solutions below

4
Matthew Watson On

You are using dataDict[dataDict.ElementAt(i).Key] to access the dictionary values in an undefined order. I don't know if that's what you intended, but the following code should give the the same results.

If you call dataDict.Values.ToArray(); you will get the dictionary values in the same order as you would when using foreach to iterate over it. That means that it will be the same as the order when using dataDict[dataDict.ElementAt(i).Key].

Therefore this code should be equivalent, and it should be faster:

public double[,] CalculateCorrelation(Dictionary<string, double?[]> dataDict, string corrMethod = "kendall")
{
    CorrelationLogicModule correlationLogicModule = new CorrelationLogicModule();

    var values = dataDict.Values.Select(array => array.Cast<double>().ToArray()).ToArray();

    double[,] correlationMatrix = new double[dataDict.Count, dataDict.Count];

    for (int i = 0; i < dataDict.Count; i++)
    {
        for (int j = 0; j < dataDict.Count; j++)
        {
            var arrayA = values[i];
            var arrayB = values[j];
            correlationMatrix[i, j] = correlationLogicModule.Kendall_Formula(arrayA, arrayB);
        }
    }
    return correlationMatrix;
}

Note that the .ElementAt() call in your original code is a Linq extension, not a member of Dictionary<TKey,TValue>. It iterates from the start of the dictionary EVERY TIME you call it - and it also returns items in an unspecified order. From the documentation: For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair<TKey,TValue> structure representing a value and its key. The order in which the items are returned is undefined.


Also:

You should change the bitwise & to logical && in your conditions. The use of & will prevent the compiler applying a boolean short-circuit optimisation, meaning that all the < / > comparisons will be performed, even if the first condition is false.