I'm new to ARM NEON intrinsics and was looking over the documentation for them. They provide a great set of examples, including one for matrix multiplication that uses the vector FMA instruction. However, I was rather confused by the last parameter. Here's an excerpt from the code.
C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
The 0, 1, 2, 3 at the end is the part that's confusing me. According to the documentation found here: https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vfmaq_laneq_f32 this refers to the lane. In the other documentation I've read, the lane refers to whether the packed variable is divided up into 64-, 32-, 16-, or 8-bit sized data types, which does not make sense in this context. I'm probably missing something, but to me it seems like they're using the same word here with a different meaning.
So what does lane mean in this context? What would happen if I reversed the order? What would happen if I set them all to 0?
Note: Here is the link to the matrix multiplication example https://developer.arm.com/documentation/102467/0201/Example---matrix-multiplication
Weird, the corresponding asm instruction (https://developer.arm.com/documentation/ddi0596/2021-03/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en, linked from the FMLA hyperlink in the intrinsics doc you linked) doesn't mention an index into one of the input vectors.
But FMLA (by element) does: it broadcasts one element of the second multiplier vector, instead of being purely vertical.
Its Operation pseudocode is:
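(abridged from the linked "FMLA (by element)" page; note where element2 is read)

bits(esize) element2 = Elem[operand2, index, esize];  // one element, picked by the immediate index

for e = 0 to elements-1
    element1 = Elem[operand1, e, esize];
    Elem[result, e, esize] = FPMulAdd(element1, element2, Elem[operand3, e, esize], FPCR);

V[d] = result;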
Notice that element2 is indexed outside the loop and used as the multiplier for each of the 4 elements inside the loop. (Or 2 elements for double-precision in a 128-bit vector, or for single-precision in a 64-bit vector, or 8 elements for 128-bit half-precision.)

Note the operand order, too: in asm, the accumulator is the first operand, so it can also be the destination. The intrinsic keeps the accumulator first as well (it's also the return value), rather than using the usual ISO C fma(mul1, mul2, add) order, which puts the addend last. You can see that in the example, where C0 is both passed first and assigned.
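A sketch of the mapping (my function and variable names, not from the docs):

#include <arm_neon.h>

// vfmaq_laneq_f32(acc, mul, v, lane) returns acc + mul * v[lane] in every lane.
// It compiles to FMLA acc.4S, mul.4S, v.S[lane], with acc as source and destination.
float32x4_t fma_lane_demo(float32x4_t acc, float32x4_t mul, float32x4_t v)
{
    return vfmaq_laneq_f32(acc, mul, v, 3);  // all four products use v[3]
}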
So the example is using each element of B0 as a multiplier for different A vectors, instead of doing separate broadcast-loads for each element of that row. That's something a matmul would want to do, as sketched below.

So the intrinsic docs are confusing because they linked (and copied pseudocode from) the pure-vertical FMLA instruction, not the lane-broadcast version.
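To make that concrete, here's a sketch (my variable names, same structure as Arm's example) of how the four chained FMAs build one column of C = A * B for 4x4 column-major matrices. Each lane of B0 (a column of B) scales a different column of A; reversing the lane numbers, or setting them all to 0, would pair the wrong B elements with the A columns and compute a different, wrong product.

#include <arm_neon.h>

// One column of C = A * B, all matrices 4x4 and column-major.
// A0..A3 are the columns of A; Bj is column j of B; the return value is column j of C.
float32x4_t matmul_col(float32x4_t A0, float32x4_t A1,
                       float32x4_t A2, float32x4_t A3,
                       float32x4_t Bj)
{
    float32x4_t Cj = vdupq_n_f32(0.0f);   // zero the accumulator
    Cj = vfmaq_laneq_f32(Cj, A0, Bj, 0);  // Cj += A0 * Bj[0]
    Cj = vfmaq_laneq_f32(Cj, A1, Bj, 1);  // Cj += A1 * Bj[1]
    Cj = vfmaq_laneq_f32(Cj, A2, Bj, 2);  // Cj += A2 * Bj[2]
    Cj = vfmaq_laneq_f32(Cj, A3, Bj, 3);  // Cj += A3 * Bj[3]
    return Cj;
}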
There's also vfmaq_n_f32, where the last arg is a scalar float32_t. It might be the same as vfmaq_laneq_f32(a, b, c, 0) but without having to create a C vector type from the scalar.
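For example (my names; assuming the two forms really are equivalent):

#include <arm_neon.h>

float32x4_t scalar_fma(float32x4_t acc, float32x4_t a, float s)
{
    return vfmaq_n_f32(acc, a, s);  // acc += a * s, broadcasting s to every lane
    // Presumably the same as: vfmaq_laneq_f32(acc, a, vdupq_n_f32(s), 0);
}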
For pure vertical FMA, you want vfmaq_f32, which just takes 3 float32x4_t vectors, no immediate. (The q distinguishes it from the version that takes 3 float32x2_t 64-bit vectors in D registers instead of 128-bit Q registers; 32-bit code used register widths and different mnemonics, instead of just arrangement specifiers with v names.) I found it by searching for the asm instruction mnemonic (FMLA) in the search bar of the intrinsics guide.
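For contrast, a minimal purely-vertical example (my names):

#include <arm_neon.h>

float32x4_t vertical_fma(float32x4_t acc, float32x4_t a, float32x4_t b)
{
    return vfmaq_f32(acc, a, b);  // each lane independently: acc[i] += a[i] * b[i]
}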