How to speed up the computation of absolute loss matrix in C?


I am running a Monte-Carlo experiment and evaluate the absolute loss function described below. Since it is very computationally intensive I would like to optimise my code and improve the speed further. My main code is in MATLAB, but I am evaluating the function in C using the MEX functionality of MATLAB.

The mathematical problem is as follows: I have a matrix D with dimensions (M times N). Usually M is around 20,000 and N takes values around {10, 30, 144}.


Effectively, I need to obtain a column vector L with dimensions (M times 1), defined element-wise (as implemented in the code below) as

L_i = (1/M) * sum_{j=1..M} sum_{k=1..N} |D_{j,k} - D_{i,k}|,   for i = 1, ..., M

My C function looks like this:

void absolute_loss(double *D, double *L, mwSize cols, mwSize rows)
{

  double aux;
  int i;
  int j;
  int k;
  for (i = 0; i < rows; i++) {
    for (j = 0; j < rows; j++){
      aux = 0;
      for  (k = 0; k < cols; k++) {
        aux = aux + fabs(D[j + rows * k] - D[i + rows * k]);
      }
      L[i] = L[i] + aux;
    }
  }

  for (i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

There are 2 best solutions below

Answer by chux - Reinstate Monica:

How to speed up the computation of absolute loss matrix

  • Enable compiler optimizations (per @Jesper Juhl).

  • If able, use float types and the float functions (fabsf() rather than fabs()). Sometimes up to 4x faster; for me, 8% faster.

  • Use restrict to let the compiler know the referenced data does not overlap. Otherwise the compiler must assume L[i] = ...; may change D[], and that prevents some optimizations.

  • For referenced data, use const where able.

  • Use consistent indexing types.

  • Change the index increment to step by rows, so the inner loop avoids the multiplication rows * k (per @DevSolar).

  • Index type: for me, size_t and unsigned were about the same; unsigned short was 5% faster.


void absolute_loss(const float * restrict D, float * restrict L,
    mwSize cols, mwSize rows) {
  mwSize rows_cols = rows*cols;
  for (mwSize i = 0; i < rows; i++) {
    for (mwSize j = 0; j < rows; j++){
      float aux = 0.0;
      for (mwSize k = 0; k < rows_cols; k += rows) {
        aux = aux + fabsf(D[j + k] - D[i + k]); // Note: fabsf
      }
      L[i] = L[i] + aux;
    }
  }
  for (mwSize i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

Notes:

I'd expect the following, or the like, at the start of the function.

for (mwSize i = 0; i < rows; i++) {
  L[i] = 0.0;
}

Tip: rather than rows, cols, i, j, use M, N, m, n to match the formula. I am not sure you have it right.


Candidate re-write that takes advantage of variable-length arrays, with sample usage:

#include <math.h>

typedef unsigned short mwSize;

// Note re-ordered parameters.
void absolute_loss(mwSize m_rows, mwSize n_cols, //
    float D[restrict m_rows][n_cols], float L[restrict m_rows]) {

  for (mwSize ell = 0; ell < m_rows; ell++) {
    float ell_sum = 0.0;
    for (mwSize n = 0; n < n_cols; n++) {
      float d_ell_n = D[ell][n];
      for (mwSize m = 0; m < m_rows; m++) {
        ell_sum += fabsf(D[m][n] - d_ell_n);
      }
    }
    L[ell] = ell_sum / (float) m_rows;
  }
}

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
  // Usually M is around 20,000 and N takes values around {10, 30, 144}.
  mwSize m_rows = (mwSize) (rand() % 1000 + (20000 - 1000));
  mwSize n_cols = (mwSize[3]) {10, 30, 144}[rand() % 3];

  float (*D)[m_rows][n_cols] = malloc(sizeof *D);
  assert(D);
  float (*L)[m_rows] = malloc(sizeof *L);
  assert(L);
  for (mwSize m = 0; m < m_rows; m++) {
    for (mwSize n = 0; n < n_cols; n++) {
      (*D)[m][n] = (float) (rand() % 1000 + 1);
    }
  }

  clock_t t0 = clock();
  absolute_loss(m_rows, n_cols, *D, *L);
  clock_t t1 = clock();
  // Print some of L
  for (mwSize ell = 0; ell < m_rows; ell++) {
    printf(" %-7g", (*L)[ell]);
    if (ell > 10) {
      printf("\n");
      break;
    }
  }
  printf("\n%g seconds.\n", (double) (t1 - t0) / CLOCKS_PER_SEC);
  free(L);
  free(D);
}

My time: 4.906 seconds.

Answer by Simon Goater:

You appear to have an unusual arrangement of rows and columns in your matrix D. Your function accesses the data with indexes that jump around a lot, which won't make best use of your memory cache. Rearranging the loops lets it process mostly contiguous elements, improving performance significantly. On my computer, this runs almost 3 times faster than the function you posted for M=10000, N=30.

void absolute_loss2(const double * D, double * L, mwSize cols, mwSize rows) {

  double Dtemp;
  int i, j, k, rowstimesk;
  for (i = 0; i < rows; i++) {
    L[i] = 0.0;
  }
  for (i = 0; i < rows; i++) {
    for  (k = 0; k < cols; k++) {
      rowstimesk = rows * k;
      Dtemp = D[i + rowstimesk];
      for (j = 0; j < rows; j++){
        L[j] += fabs(D[j + rowstimesk] - Dtemp);
      }
    }
  }

  for (i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

You could probably build an even faster function with SIMD at the cost of simplicity, especially if you want to maintain portability. It wouldn't be a trivial refactoring, though.