SSE Instruction to load Bytes with Zero Extension?


Let's say I have a pointer to a bunch of uint8_t values in RDI, and I want to load 4 of them into XMM0 and use SIMD instructions to multiply them with XMM1, where I have 4 float values stored.

How can I load the initial 4 uint8_t values into XMM0 so that each 32-bit "compartment" holds one uint8_t in its lower 8 bits with the upper 24 bits zeroed? Is there an instruction for that?

I hope my issue is understandable, and I am sorry for my very naive explanation of it.

movdqu xmm0, [rdi]

would load a full 16 bytes of packed uint8_t values (an XMMWORD), not what I need.

2 Answers

Homer512 (score 11)

For simplicity I ignore the floating-point multiplication; I assume using mulps isn't really that hard. The real challenge is the conversion, if you can't use fixed-point 16-bit integer math instead.

The Intel intrinsics actually come with an intrinsic just for this: _mm_cvtpu8_ps. But that is an MMX+SSE1 helper, not a single instruction, and it compiles very inefficiently with modern compilers1. In the early days of Intel's intrinsics, they provided more "helper function" intrinsics beyond the _mm_set ones, with the same naming scheme as the wrappers for single instructions.

For SSE2 there is no straightforward operation. A manual unpack sequence is good:

#include <immintrin.h>

void u8ps_sse2(float* out, unsigned char* in)
{
    __m128i in_low = _mm_loadu_si32(in);   // unaligned / aliasing-safe movd load
    __m128i zero = _mm_setzero_si128();
    __m128i as_u16 = _mm_unpacklo_epi8(in_low, zero);
    __m128i as_u32 = _mm_unpacklo_epi16(as_u16, zero);
    __m128 as_float = _mm_cvtepi32_ps(as_u32);
    _mm_storeu_ps(out, as_float);
   // or just return as_float if you want to do more with it.
}
u8ps_sse2(float*, unsigned char*):
        movd    xmm0, DWORD PTR [rsi]
        pxor    xmm1, xmm1
        punpcklbw       xmm0, xmm1
        punpcklwd       xmm0, xmm1
        cvtdq2ps        xmm0, xmm0
        movups  XMMWORD PTR [rdi], xmm0
        ret
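
To connect this back to the question (multiplying the converted values with four floats in XMM1), the SSE2 conversion feeds straight into _mm_mul_ps. A minimal sketch — the function name is my own, and I use a memcpy-based load in place of _mm_loadu_si32, which is missing from older compiler headers:

```c
#include <immintrin.h>
#include <string.h>

// Zero-extend 4 uint8_t values to 32-bit lanes, convert to float,
// then multiply with a vector of 4 floats (the mulps step from the question).
static __m128 u8_mul_ps(const unsigned char *in, __m128 factors)
{
    int tmp;
    memcpy(&tmp, in, 4);                 // aliasing- and alignment-safe 4-byte load
    __m128i v = _mm_cvtsi32_si128(tmp);  // movd
    __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);      // u8  -> u16
    v = _mm_unpacklo_epi16(v, zero);     // u16 -> u32
    return _mm_mul_ps(_mm_cvtepi32_ps(v), factors);
}
```

With -O2 the memcpy load compiles to the same single movd as _mm_loadu_si32.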

SSSE3 can use a pshufb to implement the zero-extension.

void u8ps_ssse3(float* out, unsigned char* in)
{
    __m128i in_low = _mm_loadu_si32(in);
    __m128i shuffle = _mm_set_epi8(
        -1, -1, -1, 3,
        -1, -1, -1, 2,
        -1, -1, -1, 1,
        -1, -1, -1, 0
    );
    __m128i zero_extended = _mm_shuffle_epi8(in_low, shuffle);
    __m128 as_float = _mm_cvtepi32_ps(zero_extended);
    _mm_storeu_ps(out, as_float);
}
u8ps_ssse3(float*, unsigned char*):
        movd    xmm0, DWORD PTR [rsi]
        pshufb  xmm0, XMMWORD PTR .LC0[rip]
        cvtdq2ps        xmm0, xmm0
        movups  XMMWORD PTR [rdi], xmm0
        ret
.LC0:
        .byte   0
        .byte   -1
        .byte   -1
        .byte   -1
        .byte   1
        .byte   -1
        .byte   -1
        .byte   -1
        .byte   2
        .byte   -1
        .byte   -1
        .byte   -1
        .byte   3
        .byte   -1
        .byte   -1
        .byte   -1

And finally, SSE4.1 gave us the proper instruction with pmovzxbd.

void u8ps_sse41(float* out, unsigned char* in)
{
    __m128i in_low = _mm_loadu_si32(in);
    __m128i zero_extended = _mm_cvtepu8_epi32(in_low);
    __m128 as_float = _mm_cvtepi32_ps(zero_extended);
    _mm_storeu_ps(out, as_float);
}
u8ps_sse41(float*, unsigned char*):
        pmovzxbd        xmm0, DWORD PTR [rsi]
        cvtdq2ps        xmm0, xmm0
        movups  XMMWORD PTR [rdi], xmm0
        ret

Footnote 1: using MMX+SSE1 _mm_cvtpu8_ps
// Highly not recommended.
void u8ps_sse1(float* out, unsigned char* in)
{
                       // strict aliasing violation and not alignment safe
    __m64 in_low = _mm_cvtsi32_si64(*(const int*) in);
    __m128 as_float = _mm_cvtpu8_ps(in_low);
    _mm_storeu_ps(out, as_float);
    _mm_empty();  // leave MMX state; required before subsequent x87 code
}
Aki Suihkonen (score 1)

On baseline SSE2, you need to

  // load 4 bytes (likely you can actually load 8 or 16 bytes for future savings)
  __m128i data = _mm_loadu_si32(input_stream);

  // interleave every byte with zero to cast to (u)int16_t
  data = _mm_unpacklo_epi8(data, _mm_setzero_si128());
  // interleave every word with zero to cast to (u)int32_t
  data = _mm_unpacklo_epi16(data, _mm_setzero_si128());
  // convert the integers to float
  __m128 fdata = _mm_cvtepi32_ps(data);
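
The comment about loading 8 or 16 bytes can be taken further: one 16-byte load plus the two unpack levels (lo and hi each) converts 16 bytes into four float vectors. A sketch under that assumption — the function name is mine:

```c
#include <immintrin.h>

// Convert 16 uint8_t values to 16 floats with a single 16-byte load (SSE2 only).
static void u8x16_to_ps(float *out, const unsigned char *in)
{
    __m128i zero = _mm_setzero_si128();
    __m128i b = _mm_loadu_si128((const __m128i *)in);
    __m128i w_lo = _mm_unpacklo_epi8(b, zero);   // bytes 0..7  -> u16
    __m128i w_hi = _mm_unpackhi_epi8(b, zero);   // bytes 8..15 -> u16
    _mm_storeu_ps(out +  0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_lo, zero)));
    _mm_storeu_ps(out +  4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_lo, zero)));
    _mm_storeu_ps(out +  8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(w_hi, zero)));
    _mm_storeu_ps(out + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(w_hi, zero)));
}
```

This amortizes the load and the pxor across four result vectors.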

On SSE4.1, there's a single instruction to expand uint8_t to uint32_t

    __m128i data = _mm_cvtepu8_epi32(_mm_loadu_si32(input_stream));

On SSSE3, one can use pshufb (vpshufb in its AVX-encoded form) as in

    __m128i data = _mm_loadu_si32(input_stream);
    __m128i shuf = _mm_set_epi8(-1,-1,-1,3, -1,-1,-1,2, -1,-1,-1,1, -1,-1,-1,0);
    data = _mm_shuffle_epi8(data, shuf);

If the next few operations contain addition with a constant (after multiplication with a constant), then one might be able to convert the original data with _mm_shuffle_epi8 into a floating point number of the format float_big + int_small.

Some example biases are float_big_23 = 1<<23 and float_big_15 = 1<<15, whose bit patterns are 0x4b0000.. and 0x4700..00. One needs a register containing both the float bias and bytes from the stream -- floatX .... d0 d1 d2 d3 d4 d5 d6 d7, e.g. after filling only the top 8 bytes of a register with _mm_loadh_pi. Then a suitable shuffle generates the 4 floats floatX + d0, floatX + d1, floatX + d2, floatX + d3, after which the bias must be subtracted. AFAIK, conversion by subtracting the magic value is not faster than the direct int->float conversion on any modern x64, but it lets one bake further offsets into the subtracted constant, possibly saving one sub/add instruction, while taking a possible penalty for mixing the integer and floating-point pipelines.
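
A minimal SSE2 sketch of the magic-value idea, assuming the 1<<23 bias (function name and the extra_offset parameter are my own illustration): OR the zero-extended bytes into the mantissa of 2^23, reinterpret as float, then subtract the bias — with any further additive offset folded into the same subtraction for free.

```c
#include <immintrin.h>
#include <string.h>

// uint8 -> float via the 2^23 exponent-bias trick (SSE2 only).
// Since the inputs fit in 8 bits, OR-ing them into the mantissa of
// 0x4B000000 (the bit pattern of 8388608.0f = 2^23) yields 2^23 + n
// exactly; subtracting (2^23 - extra_offset) recovers n + extra_offset.
static __m128 u8_to_ps_magic(const unsigned char *in, float extra_offset)
{
    int tmp;
    memcpy(&tmp, in, 4);                  // safe unaligned 4-byte load
    __m128i v = _mm_cvtsi32_si128(tmp);
    __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);       // u8  -> u16
    v = _mm_unpacklo_epi16(v, zero);      // u16 -> u32, values 0..255
    __m128i magic = _mm_set1_epi32(0x4B000000);
    __m128 biased = _mm_castsi128_ps(_mm_or_si128(v, magic)); // 2^23 + n
    return _mm_sub_ps(biased, _mm_set1_ps(8388608.0f - extra_offset));
}
```

All intermediate values here are exactly representable, so the result is exact for any offset whose magnitude stays well below 2^23.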