What is the best way to handle the leftover part of a row of data that's too small to fill the registers?
Consider an AVX512 loop working on 32-bit pixel data:
fnAVX512(npixels) {
    while (npixels >= 16) { process_row; npixels -= 16; }
}
When this finishes, npixels might not be zero. Or the function might have been called with narrow data in the first place.
I can think of three possible solutions:
1. Wrangle the data into the registers anyway by performing the load with a mask that excludes data outside of current interest. This might get messy, especially if the initial data wasn't wide enough to begin with: you can't backtrack the pointer to ensure you're in a "safe zone". Does, for example, _mm256_maskz_loadu_epiN guarantee not reading from memory locations that correspond to 0 mask bits, or could it cause an exception even if the mask is set up safely?

2. Fall back to scalar without regard for how many pixels are left, in this case anywhere between 1 and 15. This is easy, but feels wrong; the scalar code looks bad in comparison and performs accordingly. (In my case it's around 10% of the speed of AVX.)

3. If you already have 128-bit and 256-bit vector code for supporting older architectures, you could fall through those:
fnAVX512(npixels) {
    while (npixels >= 16) { process_row; npixels -= 16; }
    fnAVX2(npixels);
}

fnAVX2(npixels) {
    while (npixels >= 8) { process_row; npixels -= 8; }
    fnSSE(npixels);
}
...
And ultimately, with 3 or fewer pixels, do scalar. On the topic of trying to avoid that (and maybe this should be a separate question): how is MMX in 2023? Even though it doesn't have an extensive set of instructions, it should(?) be useful for certain types of data, such as 8-bit color channels packed into 32-bit integers, where scalar code needs a large number of operations.
If your problem allows it (pure vertical SIMD so reprocessing the same element twice is ok), a final vector that ends at the end of the array is good. It might or might not overlap with earlier data. If you're copying the result to a separate destination, this works very well, as long as your inputs are always wider than one vector. Otherwise it can take some care to get right and efficient when operating in-place.
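For the copy-to-a-separate-destination case, a minimal sketch of the idea might look like this; the add-a-constant kernel is a made-up stand-in for whatever process_row actually does, and it assumes npixels >= 16:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Add a constant to every 32-bit pixel (hypothetical stand-in for the real work).
// The final vector is loaded so that it ends exactly at the end of the data,
// possibly re-processing up to 15 pixels the main loop already handled.
void add_const_avx512(uint32_t *dst, const uint32_t *src, size_t npixels, uint32_t c)
{
    const __m512i vc = _mm512_set1_epi32((int)c);
    size_t i = 0;
    for (; i + 16 <= npixels; i += 16) {
        __m512i v = _mm512_loadu_si512(src + i);
        _mm512_storeu_si512(dst + i, _mm512_add_epi32(v, vc));
    }
    if (i < npixels) {                  // 1..15 leftover pixels
        size_t last = npixels - 16;     // requires npixels >= 16
        __m512i v = _mm512_loadu_si512(src + last);
        _mm512_storeu_si512(dst + last, _mm512_add_epi32(v, vc));
    }
}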
You can use 8-byte or 4-byte loads / stores with XMM registers as part of a cleanup strategy, like SSE2 movq xmm, [rdi]. No need to involve MMX and then have to run a slow emms instruction! The recent intrinsics like _mm_storeu_si64 and _mm_loadu_si32 that take a void* operand are cleaner wrappers than the earlier intrinsics for the same instructions. _mm_loadl_epi64(__m128i *) exists for movq, but there's no older intrinsic for movd. (Perhaps back in the bad old days, Intel thought you should use _mm_cvtsi32_si128 after dereferencing an int*, because their compiler (ICC) and MSVC don't care about strict-aliasing, and maybe also not alignment UB?)
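As a sketch of how those narrow accesses can finish off a row without any scalar code, here is a hypothetical add-a-constant tail for 1..3 leftover 32-bit pixels; it assumes a compiler new enough to provide the void* _mm_loadu_si64 / _mm_storeu_si64 / _mm_loadu_si32 / _mm_storeu_si32 intrinsics:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Handle up to 3 remaining pixels: a movq (2 pixels) then a movd (1 pixel).
static void tail_add_const_sse2(uint32_t *p, size_t npixels, uint32_t c)
{
    const __m128i vc = _mm_set1_epi32((int)c);
    if (npixels >= 2) {                          // movq: two 32-bit pixels
        __m128i v = _mm_loadu_si64(p);
        _mm_storeu_si64(p, _mm_add_epi32(v, vc));
        p += 2;
        npixels -= 2;
    }
    if (npixels) {                               // movd: the final pixel
        __m128i v = _mm_loadu_si32(p);
        _mm_storeu_si32(p, _mm_add_epi32(v, vc));
    }
}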
Beware that some GCC versions have a broken definition for _mm_loadu_si32, shuffling the data to the high 32 bits after loading. See "_mm_loadu_si32 not recognized by GCC on Ubuntu". Fortunately they also fixed the strict-aliasing and alignment UB bugs when fixing the more obvious bug, so there aren't any GCC versions with a silently-unsafe version that appears to work.
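If you have to support compilers where _mm_loadu_si32 is missing or broken, a memcpy into a temporary is the usual strict-aliasing- and alignment-safe workaround; the helper name here is made up:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Safe unaligned 32-bit load into the low element of an XMM register.
// Compilers optimize the memcpy away and emit a single movd load.
static inline __m128i load_u32_to_xmm(const void *p)
{
    uint32_t tmp;
    memcpy(&tmp, p, sizeof(tmp));
    return _mm_cvtsi32_si128((int)tmp);
}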
SIMD masked loads and stores like AVX-512 _mm256_maskz_loadu_epiN do suppress faults from masked elements in unmapped pages. With only AVX2 you get just 32-bit or 64-bit masking granularity, and you need a vector mask rather than a bitmask. (The exception is SSE2 _mm_maskmoveu_si128, where it's implementation-dependent whether faults on masked bytes are suppressed. It's also an NT store that bypasses cache and evicts, and partial-line NT stores suck; it's generally not useful on modern CPUs because it's slow even in the best case, and it's only available as a store, not a load.)
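A sketch of the masked-tail idea from the question, using the 512-bit forms so 1..15 leftover pixels are covered in one shot (the 256-bit _mm256_maskz_loadu_epi32 / _mm256_mask_storeu_epi32 from AVX-512VL work the same way); the add-a-constant kernel is again just a stand-in:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Handle 1..15 leftover 32-bit pixels with a masked load/store.
// Masked-out lanes don't fault even if they extend into an unmapped page.
static void tail_add_const_avx512(uint32_t *p, size_t npixels, uint32_t c)
{
    __mmask16 m = (__mmask16)((1u << npixels) - 1);   // npixels is 1..15 here
    __m512i v = _mm512_maskz_loadu_epi32(m, p);
    v = _mm512_add_epi32(v, _mm512_set1_epi32((int)c));
    _mm512_mask_storeu_epi32(p, m, v);
}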
If you have control of how you allocate your buffers, you can round allocation sizes up to a multiple of the vector width to make things easier. For images, it might not matter if you load a vector that has some pixels from the end of one row and some from the start of the next. But if you do need to do something different for each row, it can make sense to pad the storage geometry so the stride between rows is a multiple of the vector width, i.e. have some padding pixels at the end of each row if the actual image width isn't a multiple of 4 pixels / 16 bytes. With wide vectors and unrolled loops this could waste a lot of cache footprint, so maybe only pad up to a multiple of 16 bytes, and have your loop handle odd sizes down to a multiple of 16 bytes but not narrower.
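For example, rounding the stride and the allocation up might look something like this (4-byte pixels, rows padded to 16 bytes; alloc_image is a made-up helper):

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Allocate an image whose row stride is a multiple of 4 pixels (16 bytes),
// so 16-byte vectors never run off the end of a row.
uint32_t *alloc_image(size_t width_px, size_t height, size_t *stride_px)
{
    *stride_px = (width_px + 3) & ~(size_t)3;               // pad each row to 4 pixels
    size_t bytes = *stride_px * height * sizeof(uint32_t);
    bytes = (bytes + 63) & ~(size_t)63;                     // aligned_alloc wants a multiple of the alignment
    return (uint32_t *)aligned_alloc(64, bytes);
}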
See also