AVX2 narrowing conversion, from uint16_t to uint8_t


I'd like to narrow a 2d array from 16 to 8 bits, using AVX2. The C++ code that works is as follows:

  auto * s = reinterpret_cast<uint16_t *>(i_frame.Y);
  auto * d = narrowed.data();

  for (auto y = 0; y < i_frame.Height; y++, s += i_frame.Pitch_Luma / 2, d += o_frame.Width)
  {
      for (auto x = 0; x < i_frame.Width; x++)
      {
          d[x] = static_cast<uint8_t>(s[x]);
      }
  }

Then I thought perhaps it would be more efficient to use AVX2 (all our systems have AVX2 support):

 auto * s = reinterpret_cast<uint16_t *>(i_frame.Y);
 auto * d = narrowed.data();

 for (auto y = 0; y < i_frame.Height; ++y, s += i_frame.Pitch_Luma / 2, d += o_frame.Width)
 {
     for (auto x = 0; x < i_frame.Width; x += 16)
     {
         auto src = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));            
         auto v = _mm256_packus_epi16(src, _mm256_setzero_si256());

         v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));

         _mm_store_si128(reinterpret_cast<__m128i *>(d + x), _mm256_extracti128_si256(v, 0));
     }
 }

The question is whether my AVX2 conversion code is optimal and/or the correct way to do this. I may be missing an AVX2 instruction that makes this very easy, as was the case with the widening conversion.

Answer by harold (accepted):

vpackuswb and vpermq are fine for this, but you can arrange things so you get double the work done with those same instructions:

for (size_t x = 0; x < width; x += 32)
{
    auto src1 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));
    auto src2 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x + 16));
    // sources are known to be in the 0..255 range so no saturation happens
    auto v = _mm256_packus_epi16(src1, src2);

    v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));

    _mm256_store_si256(reinterpret_cast<__m256i *>(d + x), v);
}

This may not be quite a drop-in replacement since the unroll factor changed, so it may require additional care near the edge of the image (see the sketch below). You may also need an unaligned store if the destination is only 16-byte aligned (or increase the alignment if possible).
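For example, a minimal sketch of one way to handle a width that isn't a multiple of 32, assuming the same s / d / width variables as above: run the vector loop as far as it can, then finish the row with the scalar code from the question.

size_t x = 0;
for (; x + 32 <= width; x += 32)
{
    auto src1 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));
    auto src2 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x + 16));
    auto v = _mm256_packus_epi16(src1, src2);
    v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
    _mm256_store_si256(reinterpret_cast<__m256i *>(d + x), v);
}
for (; x < width; ++x)                 // scalar cleanup for the last 0..31 pixels
    d[x] = static_cast<uint8_t>(s[x]);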


vpackuswb interprets the source data as signed int16_t, and saturates values outside the 0..255 range as it packs down to uint8_t. For inputs that never have the highest bit set (e.g. 10-bit or 12-bit unsigned values in uint16_t elements), values above 255 will saturate to 255. But if the high bit is set, as with full-range uint16_t input, the value is treated as negative and saturated to 0. (vpacksswb, which does signed saturation to the -128..+127 range, isn't much more helpful when you want unsigned output.)

To truncate the bit-patterns (modulo instead of saturate), you'd want _mm256_and_si256(v, _mm256_set1_epi16(0x00FF)) on both inputs separately before packing.
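As a minimal sketch, assuming the same loop structure and s / d / x names as above, the truncating variant could look like this:

const __m256i low_bytes = _mm256_set1_epi16(0x00FF);    // keep only the low byte of each u16

auto src1 = _mm256_and_si256(
    _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x)), low_bytes);
auto src2 = _mm256_and_si256(
    _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x + 16)), low_bytes);

auto v = _mm256_packus_epi16(src1, src2);               // can't saturate: inputs are already 0..255
v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
_mm256_store_si256(reinterpret_cast<__m256i *>(d + x), v);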

Or if you want to keep the most-significant 8 bits of each uint16_t, you could shift them like _mm256_srli_epi16(src1, 2) to discard the low 2 bits of 10-bit data and put the rest at the bottom, ready for a saturating pack.
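For instance, a sketch of that 10-bit case inside the same loop (assuming the data really is 10-bit, so the shifted values fit in 0..255 and the pack never saturates; full-range uint16_t would use a shift count of 8 to keep the high byte):

auto src1 = _mm256_srli_epi16(
    _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x)), 2);    // drop the low 2 of 10 bits
auto src2 = _mm256_srli_epi16(
    _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x + 16)), 2);

auto v = _mm256_packus_epi16(src1, src2);
v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
_mm256_store_si256(reinterpret_cast<__m256i *>(d + x), v);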

Shift Right Logical shifts in zeros, so this is usable on full-range uint16_t. With a shift count of 8 for full-range u16, it's tempting to use whole-byte tricks like an unaligned load so the bytes we want are already at the bottom of each word element, but then we'd still have to AND. That could cost fewer uops (e.g. a memory source operand works for vpand, but not for shift-by-immediate until AVX-512) and uses non-shuffle uops that can run on more ports, but every other load would be a cache-line split, which may be a worse bottleneck than the front-end.

Answer by Peter Cordes:

If you have AVX-512 available, you might still just use the same _mm256_packus_epi16 / _mm256_permute4x64_epi64 strategy as in harold's answer if your data is in the 0..255 range, or if i16 to u8 saturation works for your use-case. On Intel CPUs, it will produce one vector of results every other clock cycle, from 2 vectors of inputs, if it doesn't bottleneck on cache bandwidth. (Both shuffles run on port 5 only, even for the YMM version.)

Optimizing for throughput per uop makes the most sense if you're doing this on the fly with some other work in the same pass over the data, or as you generate it; or if you can cache-block this so inputs and output are hot in L1d cache. (An average per cycle of loading 32 bytes and storing 16 probably does saturate L2 bandwidth on Intel, but shifting or ANDing would slow it a bit with front-end issue or back-end port bottlenecks.) If you're doing something nasty like bottlenecking on DRAM bandwidth, the best you can do is let the other logical core have more execution resources by doing this efficiently. And let out-of-order exec see farther for better HW prefetch.

You might use 512-bit vectors, so that's 64 bytes of results per 2 cycles = 32 bytes per cycle on Intel. That would need a vector constant for _mm512_permutexvar_epi64, because vpermq zmm, zmm, imm8 reuses the imm8 to do the same independent 256-bit shuffle in the two halves of the vector, since an 8-bit immediate only has room for four 2-bit fields.
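A rough sketch of that 512-bit version, assuming AVX-512BW, 64-byte-aligned buffers, and the same s / d / width names as before (the qword permutation constant undoes the per-128-bit-lane interleaving of vpackuswb):

const __m512i lane_fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
for (size_t x = 0; x < width; x += 64)
{
    auto src1 = _mm512_load_si512(s + x);
    auto src2 = _mm512_load_si512(s + x + 32);
    auto v = _mm512_packus_epi16(src1, src2);       // packs within each 128-bit lane
    v = _mm512_permutexvar_epi64(lane_fixup, v);    // vpermq zmm with a vector control
    _mm512_store_si512(d + x, v);
}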

On Zen 4, pack/permute gives 32 result bytes per cycle with either 32-byte or 64-byte vectors. (The YMM versions of the shuffles have 0.5c throughput on Zen 4, and L1d load/store throughput is 64 bytes loaded + 32 stored. 512-bit vectors occupy the execution unit for 2 cycles but are single-uop even for lane-crossing shuffles.)

Where AVX-512 shines is when you want to ignore the upper byte of your uint16_t input (truncating the bit-patterns, i.e. taking them modulo 256), or do actual unsigned saturation that treats the input as unsigned as well as the output range. Or on Zen 4, for a single-uop lane-crossing 2-input shuffle with byte granularity. (It's 3 uops on Intel, so no better than pack/permute in the case where the upper bytes are zero and the two are equivalent. :/)


If AVX-512VBMI is available, vpermt2b (_mm256_permutex2var_epi8) can pick bytes from 2 source vectors without restrictions, getting the whole job done in one instruction. It costs 3 uops on Ice Lake and Alder Lake (including two port-5-only). Only a single uop on Zen 4, though. https://uops.info/.
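A rough sketch with 512-bit vectors, assuming AVX-512VBMI, 64-byte-aligned data, and the same s / d / width names (the hypothetical idx constant selects byte 2*i of the concatenated 128-byte source pair, i.e. the low byte of each uint16_t):

alignas(64) uint8_t idx_bytes[64];
for (int i = 0; i < 64; ++i)
    idx_bytes[i] = static_cast<uint8_t>(2 * i);     // even bytes = low byte of each u16
const __m512i idx = _mm512_load_si512(idx_bytes);

for (size_t x = 0; x < width; x += 64)
{
    auto src1 = _mm512_load_si512(s + x);
    auto src2 = _mm512_load_si512(s + x + 32);
    auto v = _mm512_permutex2var_epi8(src1, idx, src2);   // one shuffle, no lane fixup needed
    _mm512_store_si512(d + x, v);
}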

With AVX-512 but not AVX-512VBMI, or if you don't want to load a 64-byte shuffle constant, you might prefer _mm256_cvtepi16_epi8 (vpmovwb), which produces a __m128i from one __m256i by truncation, ignoring the high byte. Unfortunately it narrows, so you need twice as many stores as with 2x AND / pack / permute or with vpermt2b.

Or for unsigned-to-unsigned saturation there's _mm256_cvtusepi16_epi8 (vpmovuswb), which unlike the pack instructions treats the input as uint16_t, saturating into the 0..255 range of the uint8_t output elements. (Or the __m512i-to-__m256i versions if you want to use 512-bit vectors.)
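As a sketch, assuming AVX-512BW + AVX-512VL and the question's 16-pixels-per-iteration loop structure, either conversion could be dropped in like this:

for (size_t x = 0; x < width; x += 16)
{
    auto src = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));
    auto lo  = _mm256_cvtusepi16_epi8(src);   // u16 -> u8 with unsigned saturation
    // auto lo = _mm256_cvtepi16_epi8(src);   // or: plain truncation of each word
    _mm_store_si128(reinterpret_cast<__m128i *>(d + x), lo);
}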

But unfortunately, Intel CPUs (Ice Lake and Sapphire Rapids/Alder Lake at least) implement both of these narrowing conversions as 2 uops. For a YMM source that's p15 + p5; for a ZMM source it's 2p5, since the port-1 vector ALU is shut down when 512-bit uops are in flight. So at best you can produce 16 bytes of results per clock cycle this way, vs. 32 per clock with vpermt2b zmm (which produces 64 bytes every other cycle).

vpmovqd is 1 uop on Intel, but all the other AVX-512 narrowing moves (anything involving byte or word elements, or any saturation) are 2 uops. Intel CPUs since Ice Lake can sustain 1/clock 64-byte stores to L1d, or 2/clock 256-bit stores to the same cache line which coalesce as they commit (so twice as many 256-bit stores isn't terrible, and is better if your data isn't 64-byte aligned). Bursts of 64-byte stores can enter the store buffer at 2/clock, but commit to L1d is 1/clock. But even Alder Lake can "only" load 2x 512-bit per clock, so narrowing can't usefully produce more than 64 bytes of results per clock anyway.

Zen 4 does vpmov[us]wb in one uop (with 1c throughput cost for ZMM vs. 0.5c for YMM/XMM, and lower latency than vpermt2b but the same ports). So the 1/clock shuffle throughput would limit this strategy to 1 source vector per cycle on Zen 4 (32 bytes of results per cycle), vs. 2 source vectors per clock for vpermt2b producing 64 bytes of results per cycle. If you're doing computations on the results, having them in one 512-bit vector is generally better, and so is 1 uop per 2 source vectors. But if you're just loading and storing the results back to memory, you're limited by L1d cache bandwidth to 64 bytes loaded (2x 256 or 1x 512) and 32 bytes stored per cycle anyway. (And the vpermt2b route does require loading a 64-byte shuffle constant.)

If your data does work with pack / permute, that might still be the best strategy even with 512-bit vectors. (Except on Zen 4, although it's not bad there either.)


https://uops.info/ data on the relevant instructions

Alder Lake P-cores are Golden Cove, the same cores as in Sapphire Rapids, so uops.info's Alder Lake test data (taken on early Alder Lake systems where AVX-512 could be enabled with the E-cores disabled) applies to SPR.

Unless otherwise noted, each execution unit can accept a new uop every cycle, except on Zen 4 where 512-bit uops occupy a port for 2 cycles. (So 1 uop for either of ports FP1 or FP2 (aka FP12) gives 0.5c throughput for 128- and 256-bit, 1c for 512-bit.) uops.info confirms those throughput numbers; I didn't make separate table columns for them since the instructions we're interested in don't have back-end bottlenecks other than ports, so the back-end cost is fully captured by the uop counts. (Except for stores: commit from the store buffer to L1d cache is only 1/clock on Intel for 64-byte stores, vs. 2/clock for 256-bit and narrower when two stores to the same cache line can coalesce, on Ice Lake and later. Ice Lake and later can execute 2x 512-bit stores per clock, but will soon bottleneck on the store buffer filling up.)

Intel CPUs shut down the vector ALUs on port 1 when any 512-bit uops are in flight (but not the integer ALUs on port 1: those include the imul and popcnt units and every integer uop with > 1 cycle of latency). So for example a p015 uop becomes p05 for the ZMM version of vpermt2b, or if using 512-bit uops in code surrounding vpermt2b ymm. (Hrm, does the first 512-bit uop have to wait for existing port-1 vector uops to drain from the scheduler or something? Otherwise how does it avoid sending a uop to port 0 while a uop for port 1 is expecting to use the high half of the port 0 execution unit via port 1?)

| ports and/or recip tput by uarch | Cascade Lake | Ice Lake | Alder Lake / SPR | Zen 4 |
|---|---|---|---|---|
| vpackuswb | 1 p5 [fused load] | 1 p5 [fused load] | 1 p5 [fused load] | 1 FP12 |
| vpermq (imm8 or vec control) | 1 p5 | 1 p5 | 1 p5 | 1 FP12 |
| vpermt2b (..._permutex2var_epi8) | not supported | 1p015 + 2p5 | 1p015 + 2p5 | 1 FP12 |
| vpand / vpandd | p015 [fused load] | p015 [fused load] | p015 [fused load] | 1 FP0123 (0.25c ymm / 0.5c zmm) |
| vpsrlw v,v/m,imm8 | p01 [no fusion] | p01 [no fusion] | p01 [no fusion] | 1 FP23 |
| vpmovwb (..._cvtepi16_epi8) | 2p5 | p15 + p5 | p15 + p5 | 1 FP12 |
| vpmovuswb (..._cvtusepi16_epi8) | same as above | " | " | " |
| loads/stores | | | | |
| _mm256_load_si256 | 0.5c | 0.5c | 0.333c | 0.5c |
| _mm256_store_si256 | 1c (store-addr can compete w. loads) | 0.5c burst / 1c sustained | 0.5c burst / 1c sustained | 1c despite 2 ports |
| _mm512_load_si512 | 0.5c | 0.5c | 0.5c despite 3 ports | 1c |
| _mm512_store_si512 | 1c (store-addr can compete w. loads) | 0.5c burst / 1c sustained | 0.5c burst / 1c sustained | 2c despite 2 ports |

(load/store intrinsics can optimize away depending on how they're used, e.g. on the same location twice. I'm using them as shorthand for the corresponding uops, either as a memory source operand or vmovdqa or vmovdqa32, or even vmovdqu with an address that happens not to cross a cache line, etc. This is of course assuming L1d cache hits.)

[fused load] means that if there's a memory source operand, the load can micro-fuse so the pair only takes one slot in the front-end and ReOrder Buffer (ROB), unless an indexed addressing mode makes Intel CPUs un-laminate it, which is a frequent problem with compiler-generated code. I think AMD CPUs can always keep memory source operands fused even with indexed addressing modes; uops.info still reports 1 total uop for these instructions with a memory source.

For vpermq, a memory source operand would be the data being shuffled, which in this loop comes from the preceding pack already in a register, so I didn't count it. On Intel, instructions that are more than 1 uop on their own generally can't micro-fuse a load; that's the case for vpermt2b (3 uops, or 4 with a memory source) vs. vpermt2d (1, or 1 micro-fused).

vpmovwb can't take a memory source, but it can take a memory destination. Narrowing instructions like AVX512F vpmovdb [rdi]{k1}, zmm0 originally existed as a way for Xeon Phi to do byte-masked and word-masked stores without AVX512VL or AVX512BW. Even single-uop vpmovqd can't micro-fuse the store with the ALU, but all forms let the store-address and store-data uops micro-fuse with each other so you save on code-size vs. a separate store. (But one 3-uop vpmovwb [rdi], zmm0 instruction could be worse for packing into the uop cache than a 2 + 1 since all 3 uops have to be in the same 6-uop line.) With intrinsics, it's up to the compiler to decide whether to fold _mm_storeu_si128 into a memory destination for a vpmovwb when storing the result of a _mm256_cvtepi16_epi8.

Note that vpsrlw on Zen 4 competes with shuffles for port FP2, so AVX2 with 2 shifts / 2 shuffles could only run 3 of the 4 back-end vector ALU uops per clock. vpand wouldn't have this problem; it could schedule to the two ports shuffles can't run on. (We'd also bottleneck on the front end, with the 2 loads + 1 store making a total of 7 uops. The AVX1/2 versions are just vpsrlw v,v,imm8; only the AVX-512 version of vpsrlw v,v/m,imm8 allows a memory source, which would get us down to 5 uops on Zen where it fuses, unlike on Intel. Plus there's loop overhead, depending on how much we unroll. AVX2 vpand ymm,ymm,ymm/m256 doesn't have this problem, and can micro-fuse even on Intel if you avoid indexed addressing modes.) Either way this is only a bottleneck if data is hot in L1d cache.

On Intel, vector shifts are p01 (or p0 for the ZMM version), vs. the shuffles we're interested in being port-5 only. So port 5 limits us to 2 cycles per result vector whether we need a shift or an AND; 7 front-end uops per 2 cycles if we need to shift is close to SKX / Cascade Lake's limit of 4/clock, but Ice Lake can handle a couple of uops of loop overhead. (And since shifts don't micro-fuse on Intel anyway, though they can on AMD, we can use indexed addressing modes, scaling by 2 in the loads and by 1 in the store. With vpand, there is a benefit to pointer increments on Intel if we unroll, though.)

We don't really get a choice between shift and AND; the input bit-layout dictates which one is needed. If your input layout is super flexible, choose to put the significant bits at the bottom, zero-extended, so you can use vpackuswb without any shift or AND at all.

If the shift count is 8, you can use vpermt2b with a control vector that grabs the upper bytes of each u16 instead of the lower.
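For example, reusing the hypothetical index construction from the earlier vpermt2b sketch, just offset by one:

alignas(64) uint8_t idx_bytes[64];
for (int i = 0; i < 64; ++i)
    idx_bytes[i] = static_cast<uint8_t>(2 * i + 1);   // odd bytes = high byte of each u16
const __m512i high_byte_idx = _mm512_load_si512(idx_bytes);
// then _mm512_permutex2var_epi8(src1, high_byte_idx, src2) keeps the top 8 bits of each word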

I omitted a Gracemont column for Alder Lake E-cores: there's no choice to make there, because it doesn't support AVX10 (which would provide 256-bit-only versions of the new AVX-512 instructions).