I am trying to implement the booth's multiplication algorithm in ARM assembly language.
Algorithm 2: Booth’s Algorithm to multiply two 32-bit numbers to produce a 64-bit result
Data: Multiplier in V , U = 0, Multiplicand in N
Result: The lower 64 bits of UV contain the result
1 i←0
2 prevBit ← 0
3 fori<32do
4 i←i+1
5 currBit ← LSB of V
6 if (currBit,prevBit) = (1,0) then
7 U←U−N
8 end
9 else if (currBit,prevBit) = (0,1) then
10 U←U+N
11 end
12 prevBit ← currBit
13 UV ← UV ≫ 1 (arithmetic right shift)
14 end
How do I perform the 13th step of the algorithm ? How do I perform asr on a 64 bit number stored as 32 bit parts on two registers ?
I have tried performing asr on both of the registers and then for the lower 32 bits I replace the MSB with the LSB of the upper 32 bits (stored before performing asr).
Ask a compiler:
int64_t asr_1(int64_t a){ return a>>1; }.Godbolt compiler explorer with GCC and clang for ARM
In the standard calling convention, the first arg and the return value are both in R1:R0 so compilers will make asm that operates in-place. For a shift-count of 1, Clang uses the Carry flag to get the bit from the bottom of the high half to the top of the low half.
GCC doesn't use the carry flag, using the same strategy it and clang use for shift counts between 2 and 31. The low half of an
int64_tis unsigned; a logical shift leaves zeros where you can OR in some bits from the high half. This strategy is less efficient than therrxtrick for a count of1, unlessrrxis slow on some CPUs.Clang with
-mthumbstrangely usesasrs.wbeforerrx, but still plainasrs(16-bit) forreturn a>>(32+5);which is easy:asrs r0, r1, #5;asrs r1, r1, #31(high half of the return val is either all-0 or all-1, according to the sign bit.)AFAICT there's no correctness reason for using
asrs.w.asrsis encodable as a 16-bit Thumb instruction, althoughrrxis only available with Thumb 2. This doesn't need ARMv8 or anything, in factclang -march=armv4tstill uses this; I just tend to use-mcpu=cortex-a77ora53because it's recent and easy to think of, and I want to know if there are any new tricks, and for tuning choices appropriate for modern ARM cores.cortex-m3orm0are also relevant for some projects. M0 notably lacks most Thumb 2 encodings:Variable-count shifts are harder, but compilers can still show you how, using predicated execution. GCC
-marmhas the shortest out of any of it or clang's versions forint64_t asr(int64_t a, int c){ return a>>c; }ARM shifts with counts >= 32 shift out all the bits, producing zero for logical shifts. So the
r1, lsl r3source operand is zero if it's actually the other bits we need, I think. Andasr r1, r1, r2to set the high half works likeasr r1, #31for r2==32 or higher; it's still the original input shift count.