Intel classic compiler reports non-unit strided load in simple assignment

Consider the following loop, in which I set every element of an aligned array of complex numbers to zero. I would like the loop to use SIMD for the sake of speedup:

constexpr auto alignment = 16u;
struct alignas(alignment) Complex { double re; double im; };

// ...

constexpr auto size = 32u;
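// aligned_alloc takes a size in bytes, so scale the element count by sizeof(Complex)
auto* cv1 = static_cast<Complex*>(aligned_alloc(alignment, size * sizeof(Complex)));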

#pragma omp simd
#pragma vector aligned
for (auto i = 0u; i < size; ++i) {
    cv1[i] = Complex{0.0, 0.0}; // THIS IS THE PROBLEMATIC LINE
}

I am using #pragma omp simd to generate SIMD instructions and Intel's #pragma vector aligned to tell the compiler that my memory is aligned. With vectorization reports enabled, the compiler displays the following messages (see it on godbolt):

remark #15328: vectorization support: non-unit strided load was emulated for the variable <U5_V>, stride is 16
...
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 108 
remark #15477: vector cost: 119.500 
remark #15478: estimated potential speedup: 0.900 
remark #15485: serialized function calls: 1
remark #15488: --- end vector cost summary ---
...
remark #15489: --- begin vector function matching report ---
remark #15490: Function call: ?1memset with simdlen=4, actual parameter types: (vector,uniform,uniform)   [ <source>(26,9) ]
remark #26037: Library function call   [ <source>(26,9) ]
remark #15493: --- end vector function matching report ---

Apparently the non-unit strided load hampers proper vectorization and the estimated speedup is less than 1. Now let's write the loop like this:

constexpr auto zero = Complex{0.0, 0.0};
#pragma omp simd
#pragma vector aligned
for (auto i = 0u; i < size; ++i) {
    cv1[i] = zero;
}

Instead of assigning Complex{...} within the loop, I create a constant first and then assign it within the loop (see it on godbolt). Now the compiler reports:

remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 6 
remark #15477: vector cost: 1.500 
remark #15478: estimated potential speedup: 3.910 
remark #15488: --- end vector cost summary ---

which is what I would expect for such a simple loop.

Can anyone explain why this happens? Shouldn't the results be identical for both cases?

What I understand so far is that the compiler tries to be smart and sees that the assignments to cv1 could be replaced by a call to memset, which seems to impair the optimization (quick verification: replace 0.0 with some other number). Is there a way to disable this "optimization"?
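For anyone who wants to reproduce the quick verification, here is a minimal sketch that puts the three variants side by side. The helper names (alloc_complex, fill_zero_inline, fill_nonzero_inline, fill_zero_hoisted) are made up for this illustration and are not part of my original code. The code only performs the assignments; the differences show up in the vectorization report when each function is compiled with reports enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr std::size_t alignment = 16;
struct alignas(alignment) Complex { double re; double im; };

// Hypothetical helper: allocate `count` aligned Complex objects.
// std::aligned_alloc expects a size in bytes that is a multiple of the alignment.
Complex* alloc_complex(std::size_t count) {
    return static_cast<Complex*>(
        std::aligned_alloc(alignment, count * sizeof(Complex)));
}

// Variant 1: temporary built inside the loop, all bytes zero
// (the form that produced the memset remark in my report).
void fill_zero_inline(Complex* cv, std::size_t count) {
    #pragma omp simd
    #pragma vector aligned
    for (std::size_t i = 0; i < count; ++i) {
        cv[i] = Complex{0.0, 0.0};
    }
}

// Variant 2: same loop, but with a non-zero value, so the stores
// cannot be expressed as a memset to zero.
void fill_nonzero_inline(Complex* cv, std::size_t count) {
    #pragma omp simd
    #pragma vector aligned
    for (std::size_t i = 0; i < count; ++i) {
        cv[i] = Complex{1.0, 0.0};
    }
}

// Variant 3: constant hoisted out of the loop.
void fill_zero_hoisted(Complex* cv, std::size_t count) {
    constexpr auto zero = Complex{0.0, 0.0};
    #pragma omp simd
    #pragma vector aligned
    for (std::size_t i = 0; i < count; ++i) {
        cv[i] = zero;
    }
}
```

Calling each function on a buffer from alloc_complex(32) and comparing the three report sections should show whether the memset remark is tied to the all-zero value or to the temporary inside the loop. Compilers that do not know these pragmas simply ignore them, so the sketch also builds elsewhere, just without the Intel-specific reports.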
