Is it good or bad (performance-wise) to use std::vector<Vec8d>?


I am using Agner Fog's vectorclass library to use SIMD instructions (AVX specifically) in my application. Since it is best to use struct-of-arrays data structures to employ SIMD easily, I quite often use:

std::vector<Vec8d> some_var;

or even

struct some_struct {
    std::vector<Vec8d> a;
    std::vector<Vec8d> b;
}

I wonder if this is bad (performance-wise, or even just completely wrong?) considering that std::vector's internal Vec8d* array may in fact not be aligned.

3 Answers

Answer by Peter Cordes (score 4)

I would generally use vector<double>, and standard SIMD load/store intrinsics to access the data. That avoids tying the interface and all code that touches it to that specific SIMD vector width and wrapper library. You can still pad the size to a multiple of 8 doubles so you don't have to include cleanup handling in your loops.
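Padding the size to a multiple of 8 doubles is a one-liner; here is a minimal sketch (make_padded is a hypothetical helper name, not part of any library):

```cpp
#include <cstddef>
#include <vector>

// Round the element count up to a multiple of the SIMD width so a
// Vec8d-style loop can always process full 8-double chunks with no
// scalar cleanup; the padding elements are value-initialized to 0.0.
std::vector<double> make_padded(std::size_t n, std::size_t width = 8) {
    std::size_t padded = (n + width - 1) / width * width;
    return std::vector<double>(padded, 0.0);
}
```

The zero padding is harmless for most reductions (sums, dot products), but watch out for operations where extra zeros change the result, e.g. a minimum over the data.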

However, you might want to use a custom allocator for that vector<double> so you can get it to align your doubles. Unfortunately, even if that allocator's underlying memory allocation is compatible with new/delete, it will have a different C++ type than vector<double> so you can't freely assign / move it to such a container if you use that elsewhere.

I'd worry that if you do ever want to access individual double elements of your vector, doing Vec8vec[i][j] might lead to much worse asm (e.g. a SIMD load and then a shuffle or store/reload from VCL's operator[]) than vecdouble[i*8 + j] (presumably just a vmovsd), especially if it means you need to write a nested loop where you wouldn't otherwise need one.

avec.load(&doublevec[8]); should generate (almost or exactly) identical asm to avec = Vec8vec[1];. If the data is in memory, the compiler will need to use a load instruction to load it. It doesn't matter what "type" it had; types are a C++ thing, not an asm thing; a SIMD vector is just a reinterpretation of some bytes in memory.


But if this is the easiest way you can convince a C++17 compiler to align a dynamic array by 64, then it's maybe worth considering. Still nasty and will cause future pain if/when porting to ARM NEON or SVE, because Agner's VCL only wraps x86 SIMD last I checked. Or even porting to AVX2 will suck.

A better way might be a custom allocator (I think Boost has some already-written) that you can use as the 2nd template param to something like std::vector<double, aligned_allocator<64>>. This is also type-incompatible with std::vector<double> if you want to pass it around and assign it to other vector<>s, but at least it's not tied to AVX512 specifically.
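Such an allocator is only a few lines in C++17. The following is an illustrative sketch (not Boost's implementation; aligned_allocator and avec are names chosen here), using the over-aligned operator new introduced in C++17:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal C++17 aligned allocator sketch. The explicit rebind is needed
// because the non-type Align parameter defeats allocator_traits' default.
template <class T, std::size_t Align>
struct aligned_allocator {
    using value_type = T;
    template <class U> struct rebind { using other = aligned_allocator<U, Align>; };

    aligned_allocator() = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Align});
    }
    template <class U>
    bool operator==(const aligned_allocator<U, Align>&) const noexcept { return true; }
    template <class U>
    bool operator!=(const aligned_allocator<U, Align>&) const noexcept { return false; }
};

// 64-byte-aligned vector of plain doubles: no Vec8d in the interface.
using avec = std::vector<double, aligned_allocator<double, 64>>;
```

Because it is stateless, copies and swaps between avec objects are cheap; the type incompatibility with plain std::vector<double> remains, as noted above.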

If you aren't using a C++17 compiler (so std::vector doesn't respect alignof(T) > alignof(max_align_t) i.e. 16), then don't even consider this; it will fault when compilers like GCC and Clang use vmovapd (alignment-required) to store a __m512d.

You'll want to get your data aligned; 64-byte alignment makes a bigger difference with AVX512 than with AVX2 on current AVX512 CPUs (Skylake-X).

MSVC (and I think ICC) for some reason choose to always use unaligned load/store instructions (except for folding loads into memory source operands even with legacy SSE instructions, thus requiring 16-byte alignment) even when compile-time alignment guarantees exist. I assume that's why it happens to work for you.

For an SoA data layout, you might want to share a common size for all arrays, and use aligned_alloc (compatible with free, not delete) or something similar to manage sizes for double * members. Unfortunately there's no standard aligned allocator that supports an aligned_realloc, so you always have to copy, even if there was free virtual address space following your array that a non-crappy API could have let your array grow into without copying. Thanks, C++.
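A minimal sketch of that SoA layout with a shared size, assuming C++17's std::aligned_alloc (soa is a hypothetical name; note that aligned_alloc formally requires the byte count to be a multiple of the alignment, hence the round-up):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::aligned_alloc, std::free

// Struct-of-arrays: both double* members share one element count,
// and both allocations are 64-byte aligned for AVX-512.
struct soa {
    double*     a = nullptr;
    double*     b = nullptr;
    std::size_t n = 0;

    explicit soa(std::size_t count) : n(count) {
        // Round the byte count up to a multiple of the 64-byte alignment.
        std::size_t bytes = (count * sizeof(double) + 63) / 64 * 64;
        a = static_cast<double*>(std::aligned_alloc(64, bytes));
        b = static_cast<double*>(std::aligned_alloc(64, bytes));
    }
    ~soa() { std::free(a); std::free(b); }   // pair with free, not delete

    soa(const soa&) = delete;                // rule of three: no copies
    soa& operator=(const soa&) = delete;
};
```

(MSVC's CRT lacks std::aligned_alloc; there you would substitute _aligned_malloc / _aligned_free under the same structure.)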

Answer by A Fog (score 0)

std::vector will be properly aligned under C++17, which is required for the vector class library anyway. This will work OK. The std::vector template is relatively efficient. Several other standard container templates are very inefficient because they are implemented as linked lists with an awful lot of dynamic memory allocations and de-allocations.

If the size of the array is known at compile time, or if you have a sensible upper limit to the array size, then it may be more efficient to just make an old fashioned C array.

const int arraysize = 0x100;
alignas(64) double myarray[arraysize];  // AVX-512 benefits a lot from alignment
...
Vec8d a;
for (int i = 0; i < arraysize; i += a.size()) {
    a.load(myarray + i);
    // do your calculations here
}

If the array size is not known at compile time, then you may simply allocate your own array:

Vec8d * mydynamicarray = new Vec8d[mysize];

It is good practice to wrap the memory allocation in a container class with a destructor that cleans up the allocation:

~myContainerClass() {
    delete[] mydynamicarray;  // delete[] on a null pointer is a no-op
}
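As an alternative to a hand-written destructor, std::unique_ptr<T[]> provides the same cleanup automatically. A sketch using double for portability (make_array is a hypothetical helper; under C++17, new[] also respects an over-aligned element type like Vec8d):

```cpp
#include <cstddef>
#include <memory>

// The unique_ptr's default deleter calls delete[] when it goes out of
// scope, so no explicit destructor or null check is needed.
std::unique_ptr<double[]> make_array(std::size_t n) {
    return std::make_unique<double[]>(n);  // elements value-initialized to 0.0
}
```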
Answer by James Griffin (score 1)

It depends on how you intend to use some_struct, but rather than using two vectors (one for each member), you may prefer:

struct alignas(64) some_struct {
    double a[8];
    double b[8];
};

std::vector<some_struct> vector_of_struct_of_arrays{};

I find that my code is usually cleaner with this layout, and as was mentioned, this allows for the use of a different library in the future if you can't use vectorclass for some reason.
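Traversing that layout can be sketched as below; dot is a hypothetical function, and the scalar inner loop stands in for what would be a single Vec8d load of each array plus a multiply-add under VCL:

```cpp
#include <vector>

struct alignas(64) some_struct {
    double a[8];
    double b[8];
};

// Sum a[i] * b[i] across the whole container. Each some_struct holds
// exactly one SIMD vector's worth of each member, so with VCL the inner
// loop collapses to: Vec8d va; va.load(s.a); Vec8d vb; vb.load(s.b); ...
double dot(const std::vector<some_struct>& v) {
    double total = 0.0;
    for (const some_struct& s : v)
        for (int i = 0; i < 8; ++i)
            total += s.a[i] * s.b[i];
    return total;
}
```

Note that a std::vector of this over-aligned struct relies on C++17's aligned allocation, the same requirement discussed in the other answers.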