I am taking a course on parallel programming, and one of the topics is about reusing data inside the processor registers to perform as many computations as possible before needing to load more data from memory (this also expands to cache levels). The course's programming language is C++.
I am trying to apply this concept to the following problem: given a matrix M of vectors x_i, calculate the Pearson's Correlation Coefficient between each of the vectors in the matrix, and store the result in the matrix R. Since the coefficient is symmetrical, only the completion of the upper triangle of the result matrix is needed.
Initial implementation of the naive solution
The approach from which I started was to just traverse the rows of the matrix in a double loop. The matrix goes through a pre-processing phase, where I make use of vector types to perform multiple multiplications and sums at the same time.
To compute the coefficient, I need to compute the dot product between each vector, and the sum of each vector's elements.
#include <cmath>
#include <vector>
using namespace std;
#define PARTITION_SIZE 8
#define INIT_PARTITION {.0,.0,.0,.0,.0,.0,.0,.0}
typedef double double4_t __attribute__ ((vector_size (PARTITION_SIZE * sizeof(double))));
void correlate(int ny, int nx, const float *data, float *result) {
int partitions_per_row = nx % PARTITION_SIZE != 0 ? nx/PARTITION_SIZE + 1 : nx/PARTITION_SIZE;
vector<double4_t> processed_data((partitions_per_row) * ny);
/* Load data into vectorized format for easier processing */
// Traverse rows
for(int i = 0; i < ny; i++){
// Traverse partitions in row
for(int j = 0; j < partitions_per_row; j++){
// Traverse elements inside a partition
for (int k = 0; k < PARTITION_SIZE; ++k) {
int old_x_index = j * PARTITION_SIZE + k;
// If the element is inside the matrix, load it, otherwise load 0 for padding
// This way we can have an integer number of partitions per row
processed_data[i * partitions_per_row + j][k] = old_x_index < nx ? (data[i * nx + old_x_index] ): 0.0;
}
}
}
/* Calculate Pearson's coefficient between each vector of the matrix */
for(int i = 0; i < ny; i++){
double4_t x_partition;
double4_t y_partition;
for(int j = i; j < ny; j++){
double4_t sum_of_x_partition = INIT_PARTITION;
double4_t sum_of_y_partition = INIT_PARTITION;
double4_t sum_of_x_squared_partition = INIT_PARTITION;
double4_t sum_of_y_squared_partition = INIT_PARTITION;
double4_t sum_of_x_times_y_partition = INIT_PARTITION;
int x_row_index = i * partitions_per_row;
int y_row_index = j * partitions_per_row;
// Traverse partitions in row
for(int k = 0; k < partitions_per_row; k++){
x_partition = processed_data[x_row_index + k];
y_partition = processed_data[y_row_index + k];
// Calculate the sums of each partition
sum_of_x_partition += x_partition;
sum_of_y_partition += y_partition;
sum_of_x_squared_partition += x_partition * x_partition;
sum_of_y_squared_partition += y_partition * y_partition;
sum_of_x_times_y_partition += x_partition * y_partition;
}
// Agregate the different sums
for (int k = 1; k < PARTITION_SIZE; ++k) {
sum_of_x_partition[0] += sum_of_x_partition[k];
sum_of_y_partition[0] += sum_of_y_partition[k];
sum_of_x_squared_partition[0] += sum_of_x_squared_partition[k];
sum_of_y_squared_partition[0] += sum_of_y_squared_partition[k];
sum_of_x_times_y_partition[0] += sum_of_x_times_y_partition[k];
}
// Calculate the correlation
double numerator = nx * sum_of_x_times_y_partition[0] - sum_of_x_partition[0] * sum_of_y_partition[0];
double denominator = sqrt((nx * sum_of_x_squared_partition[0] - sum_of_x_partition[0] * sum_of_x_partition[0]) *
(nx * sum_of_y_squared_partition[0] - sum_of_y_partition[0] * sum_of_y_partition[0]));
double correlation = numerator / denominator;
result[i * ny + j] = correlation;
}
}
}
First implementation reusing registers
I couldn't see how to reuse data when traversing the matrix in the normal way, so I computed the transpose of the matrix. Now each row element contains an element from each vector that must be multiplied to the others and added to compute the dot product, and added to the sum of the vector's elements. After this, create another matrix with the data in vector format to make use of processor vector operations.
The idea is to save multiplications by, instead of multiplying elements one by one, multiply whole vectors and permuting their elements to obtain all the element combinations, instead of just v1[0]*v2[0], v1[1]*v1[1], etc. Not only that, but now we don't need to traverse the matrix's rows multiple times like in the initial version, but instead we traverse the same row multiple times. As far as I am aware, this is a much more efficient to read and write data.
To permute the vectors I use the built in functions _mm256_permute4x64_pdand _mm256_permute_pd.
v1 and v2 products (variable mult_vector) are then traversed and each component is added to the corresponding cell in temporal_mults. This is a bit of a tricky process because the multiplications end up disorganized.
After figuring out all of this, the program works, but it is way less efficient than the first solution. Evidently I have made a mistake (or many) somewhere along the way, but I don't know if the cause is a poor implementation, or that I didn't understood the technique in the first place.
Here is the code. The dot product of the vectors is stored in temporal_mults, and the vector sums in sums_vector.:
#include <cmath>
#include <vector>
#include <x86intrin.h>
using namespace std;
// The registers are 256 bits long, so we can fit 4 doubles in one register.
#define PARTITION_SIZE 4
#define INIT_PARTITION {.0,.0,.0,.0}
typedef double double4_t __attribute__ ((vector_size (PARTITION_SIZE * sizeof(double))));
static inline double4_t swap2(double4_t x) { return _mm256_permute4x64_pd(x, 0b01001110); }
static inline double4_t swap1(double4_t x) { return _mm256_permute_pd(x, 0b0101); }
/* Load data into vectorized format for easier processing.
It is a bit hard to wrap your head around the indeces,
but basically we are creating the transpose of the matrix,
vectorizing the data and adding padding at the same time.
*/
vector<double4_t> optimizeMatrix(int nx, int ny, int partitions_per_column, const float *data){
vector<double4_t> processed_data(nx*partitions_per_column);
// Traverse columns
for(int i = 0; i < nx; i++){
// Traverse rows
for(int j = 0; j < partitions_per_column; j++){
int processed_data_column = i * partitions_per_column;
int processed_vector_index = processed_data_column + j;
for (int k = 0; k < PARTITION_SIZE; ++k) {
int old_column = j * PARTITION_SIZE + k;
processed_data[processed_vector_index][k] =
old_column < ny ? data[old_column * nx + i] : .0;
}
}
}
return processed_data;
}
void computeRowMultiplications(
int ny,
int processed_data_row,
int partitions_per_column,
vector<double4_t> &processed_data,
vector<double4_t> &sums_vector,
vector<double> &temporal_mults){
for(int partition_a = 0; partition_a < partitions_per_column; partition_a++){
double4_t a000 = processed_data[processed_data_row + partition_a];
double4_t a001 = swap1(a000);
sums_vector[partition_a] += a000;
int starting_ele_index = partition_a * PARTITION_SIZE; // Translate partition number to vector index
int result_row_1 = starting_ele_index * ny; // Calculate result row for first element from vector index
for(int partition_b = partition_a; partition_b < partitions_per_column; partition_b++){
double4_t b000 = processed_data[processed_data_row + partition_b];
double4_t b001 = swap1(b000);
double4_t b010 = swap2(b000);
double4_t mult_vector[PARTITION_SIZE];
mult_vector[0] = a000 * b000;
mult_vector[1] = a000 * b001;
mult_vector[2] = a000 * b010;
mult_vector[3] = swap1(a001 * b010);
short contador_2 = -1, contador_4 = -2;
for(int perm = 0, perm_opposite = PARTITION_SIZE - 1; perm < PARTITION_SIZE; perm++, perm_opposite--){
contador_2 *= -1;
if (perm % 2 == 0) {
contador_4 *= -1;
}
temporal_mults[result_row_1 + partition_b * PARTITION_SIZE + perm] += mult_vector[perm][0];
temporal_mults[result_row_1 + ny + partition_b * PARTITION_SIZE + perm] += mult_vector[perm + contador_2][1];
temporal_mults[result_row_1 + ny + ny + partition_b * PARTITION_SIZE + perm] += mult_vector[perm + contador_4][2];
temporal_mults[result_row_1 + ny + ny + ny + partition_b * PARTITION_SIZE + perm] += mult_vector[perm_opposite][3];
}
}
}
}
void correlate(int ny, int nx, const float *data, float *result) {
int partitions_per_column = ny % PARTITION_SIZE != 0 ? ny/PARTITION_SIZE + 1 : ny/PARTITION_SIZE;
vector<double4_t> processed_data;
vector<double> temporal_mults( pow(partitions_per_column * PARTITION_SIZE, 2), 0.0);
vector<double4_t> sums_vector(partitions_per_column);
for (auto& vec : sums_vector) {
vec[0] = 0.0;
vec[1] = 0.0;
vec[2] = 0.0;
vec[3] = 0.0;
}
processed_data = optimizeMatrix(nx, ny, partitions_per_column, data);
/* Calculate the multiplication between vectors and store them in temporal_muls,
and the element sum of each vector and store them in sums_vectors */
for(int i = 0; i < nx; i++){
int current_row = i*partitions_per_column;
computeRowMultiplications(
ny,
current_row,
partitions_per_column,
processed_data,
sums_vector,
temporal_mults);
}
/* Use the values computed above to calculate the correlations between vectors,
and store them in results. */
for(int i = 0; i < ny; i++){
int current_row = i * ny;
double sum_i = sums_vector[i/PARTITION_SIZE][i % PARTITION_SIZE];
double squared_sum_i = temporal_mults[current_row + i];
for(int j = ny-1; j >= i; j--){
double sum_j = sums_vector[j/PARTITION_SIZE][j % PARTITION_SIZE];
double squared_sum_j = temporal_mults[j*ny + j];
int current_cell = current_row + j;
double sum_of_i_times_j = temporal_mults[current_cell];
double numerator = nx * sum_of_i_times_j - sum_i * sum_j;
double denominator = sqrt((nx * squared_sum_i - sum_i * sum_i) *
(nx * squared_sum_j - sum_j * sum_j));
result[current_cell] = numerator / denominator;
}
}
}
This is my first question in stack overflow. If I am making any mistake on how to present the problem or the question or anything like that, please let me know.
Any help is appreciated. This is also my first approach to the topic of parallel and efficient programming and I am sure that I have made many mistakes that I am not even aware of.