There are new AVX-512 VNNI instructions in Intel's Cascade Lake CPUs which can accelerate inference of neural networks on CPU. I integrated them into the Simd Library to accelerate Synet (my small framework for neural network inference) and obtained a significant performance boost.
In fact I used only one instruction, `_mm512_dpbusd_epi32` (`vpdpbusd`), which multiplies unsigned 8-bit integers with signed 8-bit integers and accumulates the products into 32-bit integer accumulators.
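For reference, a minimal sketch of how a single `_mm512_dpbusd_epi32` call is used (the wrapper name `madd_u8s8` is my own, and the block is guarded so it only compiles where AVX-512 VNNI is available):

```c
#include <stdint.h>
#if defined(__AVX512VNNI__)
#include <immintrin.h>

// For each of the 16 32-bit lanes of sum, multiply four unsigned 8-bit
// inputs with four signed 8-bit weights and accumulate the four products
// into that lane - the semantics of vpdpbusd (no saturation).
static inline __m512i madd_u8s8(__m512i sum, __m512i input, __m512i weight)
{
    return _mm512_dpbusd_epi32(sum, input, weight);
}
#endif
```

One call thus replaces the widen-multiply-add sequence that would otherwise be needed, which is where the speedup comes from.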
It would be great to perform analogous optimizations for NEON (the ARM platform).
So the question is:
Is there any analogous NEON instruction to emulate `vpdpbusd`? If there is no analogue, what is the best way to emulate the instruction?
A scalar implementation is below (to show exactly what the function must do):

inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
    // Note: per-element indexing of NEON vector types is a GCC/Clang extension.
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            sum[i] += int32_t(input[i * 4 + j]) * int32_t(weight[i * 4 + j]);
}
The most straightforward implementation of that requires 3 instructions: `vmovl.s8`, `vmovl.u8` to extend the signed and unsigned 8-bit values to 16 bit, followed by `vmlal.s16`, to do a signed lengthening 16-bit multiplication, accumulated into a 32-bit register. And as `vmlal.s16` only handles 4 elements, you'd need a second `vmlal.s16` to multiply and accumulate the following 4 elements - so 4 instructions for 8 elements. For AArch64 syntax, the corresponding instructions are `sxtl`, `uxtl` and `smlal`.

Edit: If the output elements should be aggregated horizontally, one can't use the fused multiply-accumulate instruction `vmlal`. Then the solution would be `vmovl.s8` and `vmovl.u8`, followed by `vmul.i16` (for 8 input elements), `vpaddl.s16` (to aggregate two elements horizontally), followed by another `vpadd.i32` to get the sum of 4 elements horizontally. So 5 instructions for 8 input elements, or 10 instructions for a full 128-bit vector, followed by one final `vadd.s32` to accumulate the final result to the accumulator. On AArch64, the equivalent of `vpadd.i32`, `addp`, can handle 128-bit vectors, so it's one instruction less there.

If you're using intrinsics, the implementation could look something like this:
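The answer's code block appears to have been cut off here; below is a sketch of the described horizontal approach as I understand it (my reconstruction, not the original answer's exact code). It is guarded so it only compiles where NEON is available, and the accumulator is passed by pointer to keep it plain C:

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Horizontal emulation of vpdpbusd: widen u8/s8 to 16 bit, multiply,
// pairwise-add twice so each group of 4 products collapses into one
// 32-bit sum, then add to the accumulator.
static inline void pdpbusd(int32x4_t* sum, uint8x16_t input, int8x16_t weight)
{
    // u8 -> s16 (values 0..255 fit in int16, so the reinterpret is safe)
    int16x8_t in_lo = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(input)));
    int16x8_t in_hi = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(input)));
    // s8 -> s16
    int16x8_t w_lo = vmovl_s8(vget_low_s8(weight));
    int16x8_t w_hi = vmovl_s8(vget_high_s8(weight));
    // 16-bit products; the worst case 255 * -128 = -32640 still fits in int16
    int16x8_t p_lo = vmulq_s16(in_lo, w_lo);
    int16x8_t p_hi = vmulq_s16(in_hi, w_hi);
    // pairwise widening add: two adjacent products -> one s32
    int32x4_t s_lo = vpaddlq_s16(p_lo);
    int32x4_t s_hi = vpaddlq_s16(p_hi);
#if defined(__aarch64__)
    // addp handles full 128-bit vectors on AArch64
    int32x4_t s = vpaddq_s32(s_lo, s_hi);
#else
    // on 32-bit ARM, vpadd only works on 64-bit halves
    int32x2_t lo = vpadd_s32(vget_low_s32(s_lo), vget_high_s32(s_lo));
    int32x2_t hi = vpadd_s32(vget_low_s32(s_hi), vget_high_s32(s_hi));
    int32x4_t s = vcombine_s32(lo, hi);
#endif
    *sum = vaddq_s32(*sum, s);
}
#endif
```

Here `vmulq_s16` and `vpaddlq_s16` correspond to the `vmul.i16` and `vpaddl.s16` instructions from the answer, and on AArch64 the two `vpadd.i32` steps collapse into a single `vpaddq_s32` (`addp`).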