Compare signed integers and return either 0 or -1 in Thumb2 assembly


In Thumb2 assembly, when r0 and r1 hold signed integers, I'd like to set r1 = -1 (i.e. 0xffffffff) if r0 < r1, and r1 = 0 otherwise.

I can simply code:

4288            cmp     r0, r1
bfb4            ite     lt
f04f 31ff       movlt.w r1, #-1
2100            movge   r1, #0

But I wonder if there is a more optimized way, either in cycles or in space.

If it were an unsigned comparison, I could use the carry flag:

4288            cmp     r0, r1
4189            sbcs    r1, r1

In ARM64, cset r1, lt would return 0 or 1, but I want to code in Thumb2 assembly.

1 Answer

Peter Cordes:

If your inputs have a limited enough range that the subtraction can't have signed overflow, you can use Jester's suggestion of using the sign bit of a subtraction:

@@ With limited-range inputs
   rsb r1, r0          @ r1 = r0 - r1; must not overflow as signed (V would be 0 if flags were set)
   asr r1, #31         @ broadcast the sign bit of r0-r1 to all 32 bits

This works as long as r0-r1 doesn't overflow a 2's complement signed integer. Then the sign of the result will indeed be negative when r0 < r1.

A failing case is r0 = -10, r1 = INT_MAX (i.e. -10 < INT_MAX): the mathematical result of the subtraction is -2147483657, but truncated to 32 bits we get 0x7ffffff7 (+2147483639). The V flag will be set, indicating signed overflow, and the N flag (sign bit) will be clear because the truncated result is not Negative, the opposite of the sign of the non-truncated mathematical result.

That's why signed compare conditions like lt check N != V instead of just N; that's how, for example, cmp / blt works correctly with these inputs.


If your code has to work correctly with arbitrary full-range inputs, I don't think there's any room for improvement, not even in code size. Using an lt condition, with either a branch or IT predication, seems like the only reasonable option; emulating a 2's complement comparison manually isn't going to be shorter than that.

Even outside an IT block, Thumb2 doesn't have a 2-byte instruction for setting a register to -1 (at least not one that compilers know about or use). movs doesn't sign-extend its immediate, and mvn/mvns with an immediate is a 4-byte instruction. So is orrs r0, #-1, and you wouldn't want that false dependency for performance anyway. So even if we could produce the result in a different register than either input, there would be no savings.

Current GCC and clang (Godbolt) prefer to set a register unconditionally and then predicate one mov-immediate to overwrite it. But that might just be a heuristic for Thumb mode: it saves code size when one of the constants allows a shorter instruction outside an IT block, or when predicating fewer instructions avoids filling up an IT block and needing a second IT, or avoids a pair that couldn't combine into one ITE. That could matter in a larger function, or if other things are predicated on the same condition, but isn't a problem here.

In ARM mode (-marm), GCC prefers cmp ; movge r0, #0 ; mvnlt r0, #0 for every -mcpu= I've looked at (cortex-a8, cortex-a53, cortex-a76, and unset). (I'm looking at a function, so it returns in r0, but the inputs are r0 and r1, so it's still the same situation as yours.)

So that's exactly the same as your strategy for Thumb mode. Unless an instruction inside an IT block is slower than one outside, you're probably best off doing what you're doing.

@ GCC -O3: probably no better than yours, but the same size
foo:
 cmp    r1, r0
 mov.w  r0, #4294967295     @ 0xffffffff; 4-byte instruction
 it     ge
 movge  r0, #0              @ 2-byte instruction
@ r0 and r1 are swapped vs. your version since functions return in r0