There are new AVX-512 VNNI instructions in Intel's Cascade Lake CPUs which can accelerate inference of neural networks on CPU. I integrated them into the Simd Library to accelerate Synet (my small framework for neural network inference) and obtained a significant performance boost.
In fact I used only one instruction, `_mm512_dpbusd_epi32` (`vpdpbusd`), which multiplies unsigned 8-bit integers with signed 8-bit integers and accumulates the products into 32-bit integer accumulators.
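For reference, a minimal sketch of how a single `_mm512_dpbusd_epi32` call is used (the wrapper name `madd_u8s8` is my own, and the block is guarded so it only compiles where AVX-512 VNNI is available):

```c
#include <stdint.h>
#if defined(__AVX512VNNI__)
#include <immintrin.h>

// For each of the 16 32-bit lanes of sum, multiply four unsigned 8-bit
// inputs with four signed 8-bit weights and accumulate the four products
// into that lane - the semantics of vpdpbusd (no saturation).
static inline __m512i madd_u8s8(__m512i sum, __m512i input, __m512i weight)
{
    return _mm512_dpbusd_epi32(sum, input, weight);
}
#endif
```

One call thus replaces the widen-multiply-add sequence that would otherwise be needed, which is where the speedup comes from.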
It would be great to perform analogous optimizations for NEON (the ARM platform).
So the question is:
Is there any analogous NEON instruction to emulate `vpdpbusd`? If there is no analogue, what is the best way to emulate the instruction?
A scalar implementation is below (to show exactly what the function must do):

inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
    // Note: per-element indexing of NEON vector types is a GCC/Clang extension.
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            sum[i] += int32_t(input[i * 4 + j]) * int32_t(weight[i * 4 + j]);
}
The most straightforward implementation of that requires 3 instructions: `vmovl.s8`, `vmovl.u8` to extend the signed and unsigned 8-bit values to 16 bit, followed by `vmlal.s16`, to do a signed lengthening 16-bit multiplication, accumulated into a 32-bit register. And as `vmlal.s16` only handles 4 elements, you'd need a second `vmlal.s16` to multiply and accumulate the following 4 elements - so 4 instructions for 8 elements. For AArch64 syntax, the corresponding instructions are `sxtl`, `uxtl` and `smlal`.

Edit: If the output elements should be aggregated horizontally, one can't use the fused multiply-accumulate instruction `vmlal`. Then the solution would be `vmovl.s8` and `vmovl.u8`, followed by `vmul.i16` (for 8 input elements), `vpaddl.s16` (to aggregate two elements horizontally), followed by another `vpadd.i32` to get the sum of 4 elements horizontally. So 5 instructions for 8 input elements, or 10 instructions for a full 128-bit vector, followed by one final `vadd.s32` to accumulate the final result to the accumulator. On AArch64, the equivalent of `vpadd.i32`, `addp`, can handle 128-bit vectors, so it's one instruction less there.

If you're using intrinsics, the implementation could look something like this:
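The answer's code block appears to have been cut off here; below is a sketch of the described horizontal approach as I understand it (my reconstruction, not the original answer's exact code). It is guarded so it only compiles where NEON is available, and the accumulator is passed by pointer to keep it plain C:

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Horizontal emulation of vpdpbusd: widen u8/s8 to 16 bit, multiply,
// pairwise-add twice so each group of 4 products collapses into one
// 32-bit sum, then add to the accumulator.
static inline void pdpbusd(int32x4_t* sum, uint8x16_t input, int8x16_t weight)
{
    // u8 -> s16 (values 0..255 fit in int16, so the reinterpret is safe)
    int16x8_t in_lo = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(input)));
    int16x8_t in_hi = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(input)));
    // s8 -> s16
    int16x8_t w_lo = vmovl_s8(vget_low_s8(weight));
    int16x8_t w_hi = vmovl_s8(vget_high_s8(weight));
    // 16-bit products; the worst case 255 * -128 = -32640 still fits in int16
    int16x8_t p_lo = vmulq_s16(in_lo, w_lo);
    int16x8_t p_hi = vmulq_s16(in_hi, w_hi);
    // pairwise widening add: two adjacent products -> one s32
    int32x4_t s_lo = vpaddlq_s16(p_lo);
    int32x4_t s_hi = vpaddlq_s16(p_hi);
#if defined(__aarch64__)
    // addp handles full 128-bit vectors on AArch64
    int32x4_t s = vpaddq_s32(s_lo, s_hi);
#else
    // on 32-bit ARM, vpadd only works on 64-bit halves
    int32x2_t lo = vpadd_s32(vget_low_s32(s_lo), vget_high_s32(s_lo));
    int32x2_t hi = vpadd_s32(vget_low_s32(s_hi), vget_high_s32(s_hi));
    int32x4_t s = vcombine_s32(lo, hi);
#endif
    *sum = vaddq_s32(*sum, s);
}
#endif
```

Here `vmulq_s16` and `vpaddlq_s16` correspond to the `vmul.i16` and `vpaddl.s16` instructions from the answer, and on AArch64 the two `vpadd.i32` steps collapse into a single `vpaddq_s32` (`addp`).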