Questionable vectorization with column-by-column addressing order (C)


GCC reports that the loop using column-order addressing is vectorized, but from the compiler's diagnostics it is unclear what exactly is being vectorized.

Column order example

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define s_parameter     6
#define NMMax_Si        30000000

double* p_M[s_parameter];

void Inter(){
   long int k, s, t;
   double VR, VRR;
   double VRC[3];

   s = rand();
   t = rand();

   for (k = 0; k < 3; k++) { VRC[k] = p_M[k][s] - p_M[k][t]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

   printf ("%f", VR);
}

int main()
{
   int i;
   for (i = 0; i<s_parameter; i++) p_M[i] = (double*)aligned_alloc(64, NMMax_Si * sizeof(double));
   Inter();
   return 0;
}

After compilation using

gcc -g -lm -Wall -Wno-unused-but-set-variable -std=c17 -fopenmp -march=native -O3 -mavx2 -ftree-vectorize -fopt-info-vec-all main2.c

I got:

src/main2.c:21:18: optimized: loop vectorized using 16 byte vectors
src/main2.c:13:6: note: vectorized 1 loops in function.
src/main2.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main2.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main2.c:21:45: missed: statement clobbers memory: vect__7.13_58 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _54, vect_57, {  Nan,  Nan }, 1);
src/main2.c:21:57: missed: statement clobbers memory: vect__11.14_63 = __builtin_ia32_gatherdiv2df ({ 0.0, 0.0 }, _59, vect_57, {  Nan,  Nan }, 1);
src/main2.c:23:9: missed: statement clobbers memory: VR_34 = sqrt (VRR_25);
src/main2.c:25:4: missed: statement clobbers memory: printf ("%f", VR_33);

1. What exactly has been vectorized when column-by-column addressing is used? The row-order example below produces almost the same output, but without the "missed: statement clobbers memory" notes for the loop on line 21.

Row order example

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define s_parameter     6
#define NMMax_Si        30000000

double* p_M[NMMax_Si];

void Inter(){
   long int k, s, t;
   double VR, VRR;
   double VRC[3];

   s = rand();
   t = rand();

   for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

   printf ("%f", VR);
}

int main()
{
   int i;
   for (i = 0; i<NMMax_Si; i++) p_M[i] = (double*)aligned_alloc(64, s_parameter * sizeof(double));
   Inter();
   return 0;
}

with output:

src/main.c:21:18: optimized: loop vectorized using 16 byte vectors
src/main.c:13:6: note: vectorized 1 loops in function.
src/main.c:18:8: missed: statement clobbers memory: _1 = rand ();
src/main.c:19:8: missed: statement clobbers memory: _2 = rand ();
src/main.c:23:9: missed: statement clobbers memory: VR_35 = sqrt (VRR_26);
src/main.c:25:4: missed: statement clobbers memory: printf ("%f", VR_34);

2. Does the row-order approach vectorize differently?

3. Is there a way to vectorize the entire calculation of the final value of VR, not just the subtraction loop?

   for (k = 0; k < 3; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2];
   VR = sqrt(VRR);

4. Would padding each row with an extra zero element, so that the loop runs over 4 values instead of 3, improve the situation?

   // p_M[:][3] == 0; VRC must be declared as double VRC[4] here
   for (k = 0; k < 4; k++) { VRC[k] = p_M[s][k] - p_M[t][k]; }
   VRR = VRC[0] * VRC[0] + VRC[1] * VRC[1] + VRC[2] * VRC[2] + VRC[3] * VRC[3];
   VR = sqrt(VRR);