Low performance for an embarrassingly parallel code


I have this very simple, embarrassingly parallel code that I am using to learn OpenMP. However, I don't get the linear (let alone superlinear) speedup I expected.

#pragma omp parallel num_threads(cores)
{
    // Each thread multiplies its own independent pair of matrices.
    int id = omp_get_thread_num();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, row, column, column,
                1.0, MatrixA1[id], column, MatrixB[id], column,
                0.0, Matrixmultiply[id], column);
}

On Visual Studio with the Intel C++ Compiler XE 15.0, computing sgemm (matrix multiplication) on 288 by 288 matrices, I get 350 microseconds for cores=1 and 1177 microseconds for cores=4, which looks like sequential execution. I set the Intel MKL property to Parallel (I also tested Sequential) and the language setting to Generate Parallel Code (/Qopenmp). Is there any way to improve this? I am running on a quad-core Haswell processor.


1 Answer

Answered by a3mlord:

If a single computation takes only a few hundred microseconds, as you say, there is no way four threads can finish it faster. Essentially, your input data is too small to parallelize, because creating and synchronizing threads has its own overhead.

Try increasing the input size so that the computation takes a few seconds, then repeat the experiment.

You might then also run into false sharing, for example, but at this input size that is not yet a concern.

One thing you can do to improve performance is to vectorize the code, but in this case you can't, because you are using a library call; you would have to write the function yourself.