How to speed up the computation of absolute loss matrix in C?


I am running a Monte-Carlo experiment and evaluate the absolute loss function described below. Since it is very computationally intensive I would like to optimise my code and improve the speed further. My main code is in MATLAB, but I am evaluating the function in C using the MEX functionality of MATLAB.

The mathematical problem is as follows: I have a matrix D with dimensions (M times N). Usually M is around 20,000 and N takes values around {10, 30, 144}.


Effectively, I need to obtain a column vector L with dimensions (M times 1), defined element-wise (as implemented in the code below) as

L_i = (1/M) * sum_{j=1..M} sum_{k=1..N} |D_{j,k} - D_{i,k}|,   for i = 1, ..., M

My C function looks like this:

void absolute_loss(double *D, double *L, mwSize cols, mwSize rows)
{

  double aux;
  int i;
  int j;
  int k;
  for (i = 0; i < rows; i++) {
    for (j = 0; j < rows; j++){
      aux = 0;
      for  (k = 0; k < cols; k++) {
        aux = aux + fabs(D[j + rows * k] - D[i + rows * k]);
      }
      L[i] = L[i] + aux;
    }
  }

  for (i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

There are 2 best solutions below

Answer by chux - Reinstate Monica:

How to speed up the computation of absolute loss matrix

  • Enable compiler optimizations (per @Jesper Juhl).

  • If able, use float types and the float functions (fabsf() rather than fabs()). Sometimes up to 4x faster; for me, 8% faster.

  • Use restrict to let the compiler know the referenced data does not overlap. Otherwise the compiler must assume L[i] = ...; may change D[], and that prevents some optimizations.

  • For referenced data, use const where able.

  • Use consistent indexing types.

  • Change the index increment to step by rows, so the inner loop avoids the multiplication rows * k (per @DevSolar).

  • Index type: for me, size_t and unsigned were about the same; unsigned short was 5% faster.


void absolute_loss(const float * restrict D, float * restrict L,
    mwSize cols, mwSize rows) {
  mwSize rows_cols = rows*cols;
  for (mwSize i = 0; i < rows; i++) {
    for (mwSize j = 0; j < rows; j++){
      float aux = 0.0;
      for (mwSize k = 0; k < rows_cols; k += rows) {
        aux = aux + fabsf(D[j + k] - D[i + k]); // Note: fabsf
      }
      L[i] = L[i] + aux;
    }
  }
  for (mwSize i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

Notes:

I'd expect the following, or the like, at the start of the function.

for (mwSize i = 0; i < rows; i++) {
  L[i] = 0.0;
}

Tip: rather than rows, cols, i, j, use M, N, m, n to match the formula. I am not sure you have it right.


Candidate re-write that takes advantage of variable-length arrays, with sample usage:

#include <math.h>

typedef unsigned short mwSize;

// Note re-ordered parameters.
void absolute_loss(mwSize m_rows, mwSize n_cols, //
    float D[restrict m_rows][n_cols], float L[restrict m_rows]) {

  for (mwSize ell = 0; ell < m_rows; ell++) {
    float ell_sum = 0.0;
    for (mwSize n = 0; n < n_cols; n++) {
      float d_ell_n = D[ell][n];
      for (mwSize m = 0; m < m_rows; m++) {
        ell_sum += fabsf(D[m][n] - d_ell_n);
      }
    }
    L[ell] = ell_sum / (float) m_rows;
  }
}

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
  // Usually M is around 20,000 and N takes values around {10, 30, 144}.
  mwSize m_rows = (mwSize) (rand() % 1000 + (20000 - 1000));
  mwSize n_cols = (mwSize[3]) {10, 30, 144}[rand() % 3];

  float (*D)[m_rows][n_cols] = malloc(sizeof *D);
  assert(D);
  float (*L)[m_rows] = malloc(sizeof *L);
  assert(L);
  for (mwSize m = 0; m < m_rows; m++) {
    for (mwSize n = 0; n < n_cols; n++) {
      (*D)[m][n] = (float) (rand() % 1000 + 1);
    }
  }

  clock_t t0 = clock();
  absolute_loss(m_rows, n_cols, *D, *L);
  clock_t t1 = clock();
  // Print some of L
  for (mwSize ell = 0; ell < m_rows; ell++) {
    printf(" %-7g", (*L)[ell]);
    if (ell > 10) {
      printf("\n");
      break;
    }
  }
  printf("\n%g seconds.\n", (double) (t1 - t0) / CLOCKS_PER_SEC);
  free(L);
  free(D);
}

My time: 4.906 seconds.

Answer by Simon Goater:

You appear to have an unusual arrangement of rows and columns in your matrix D. Your function accesses the data with indexes that jump around a lot, which won't make best use of your memory cache. Rearranging the loops lets it process mostly contiguous elements, improving performance significantly. On my computer, this runs almost 3 times faster than the function you posted for M=10000, N=30.

void absolute_loss2(const double * D, double * L, mwSize cols, mwSize rows) {

  double Dtemp;
  int i, j, k, rowstimesk;
  for (i = 0; i < rows; i++) {
    L[i] = 0.0;
  }
  for (i = 0; i < rows; i++) {
    for  (k = 0; k < cols; k++) {
      rowstimesk = rows * k;
      Dtemp = D[i + rowstimesk];
      for (j = 0; j < rows; j++){
        L[j] += fabs(D[j + rowstimesk] - Dtemp);
      }
    }
  }

  for (i = 0; i < rows; i++) {
    L[i] /= rows;
  }
}

You could probably build an even faster function with SIMD at the cost of simplicity, especially if you want to maintain portability. It wouldn't be a trivial refactoring, though.