Can I force a Cortex-M4 ARM processor to use conditional instructions outside an IT block?

I need to profile different machine instructions for a project, so I'm running each instruction in a block of ~200 repetitions (using .rept in an __asm__ statement). The processor I'm using is an ARM Cortex-M4. I now need to test ARM's conditional instructions. If I enter something like

        ".rept 200\n\t"
        "addeq r1, r1, r1\n\t"
        ".endr\n\t"

I get

Error: thumb conditional instruction should be in IT block -- `addeq r1,r1,r1'

Now, IT blocks can have up to 4 instructions, so the best I could do with them is something like

        ".rept 200\n\t"
        "ITTTT EQ\n\t"
        ".rept 4\n\t"
        "addeq r1, r1, r1\n\t"
        ".endr\n\t"
        ".endr\n\t"

yielding a binary like

 80003ae:   bf01        itttt   eq
 80003b0:   1849        addeq   r1, r1, r1
 80003b2:   1849        addeq   r1, r1, r1
 80003b4:   1849        addeq   r1, r1, r1
 80003b6:   1849        addeq   r1, r1, r1

This way, however, 1 in 5 instructions will not be the one I want to profile (adding noise to the measurements I take). Since I heard that IT blocks are required by the Thumb-2 ISA, and that the full ARM instruction set can use conditional instructions without them, my question is: can I instruct the assembler to emit conditional instructions without IT blocks? Moreover, if I heard correctly and Thumb-2 requires them, is there a way to further reduce the "noise" (to better than 1 in 5 instructions)?

Thanks!


EDIT: I got a lot of useful comments (thanks!), but I realized I left out some information that is important for understanding my goal; I apologize for that. I'm trying to profile the power consumption of the CPU, so it does make a difference whether the IT instruction is "executed" or not, what the resulting binary encoding is, etc. The clock cycles needed are not the focus here.

I think this means (but correct me if I'm wrong) that even if Thumb-2 cleverly hides the IT block complexity, I should still see a power difference when measuring with a multimeter.

fuz On

The IT instruction is what makes the subsequent instructions conditional. You'll find that if you remove it, the instructions become unconditional. This is how the instruction encoding works. You can think of the IT block as a prefix to one or more instructions, modifying their behaviour to possibly no longer set flags and to execute conditionally. If you remove the prefix, execution is no longer conditional.

For a benchmark, I'd use IT blocks with one instruction each as that is the most common use case. Some ARM processors have a decoder with special support for this case, parsing the IT instruction and subsequent conditional instruction as one in the usual cases.

The Cortex-M4, on the other hand, does something else: if the preceding instruction (!) is a 16-bit instruction, the IT instruction is folded into it and effectively executed in zero cycles. This may solve your measurement problem, at least for the case of measuring 16-bit conditional execution.
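As a rough sketch of the single-instruction IT blocks suggested above, something like the following could be used. The function name `run_single_it_blocks` and the leading `cmp` (which only serves to make the EQ condition true) are my own additions for illustration, and the `#if` guard merely lets the file compile on non-Thumb-2 targets. Note that in this layout every `it` except the first directly follows a 16-bit `addeq`, so, if the folding described above applies, most of the IT instructions should be candidates for it.

```c
#include <stdint.h>

/* Hypothetical helper: 200 one-instruction IT blocks in a row.
 * Each "it eq" makes exactly the next instruction conditional. */
uint32_t run_single_it_blocks(uint32_t x)
{
#if defined(__thumb2__)
    __asm__ volatile(
        "cmp %0, %0\n\t"          /* sets Z, so the EQ condition holds */
        ".rept 200\n\t"
        "it eq\n\t"
        "addeq %0, %0, %0\n\t"    /* low register operand, so the 16-bit encoding can be used */
        ".endr\n\t"
        : "+l"(x)                 /* "l" = low register (r0-r7) */
        :
        : "cc");
#endif
    return x;
}
```

Swapping `eq` for `ne` in both places would give the variant where the conditional instructions are fetched but not executed, which may matter for a power measurement.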

Another thing you could do is to run the benchmark with IT blocks of different sizes and to then use arithmetic to compute how long the IT instructions took. Then you can remove that time from the total runtime. Generally speaking, conditionally executed instructions take the same time as unconditional instructions, though there may be exceptions (e.g. for control transfer instructions or those that write PC by some other means).
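To make the arithmetic concrete, here is a minimal sketch under an assumed linear cost model: running n conditional instructions in IT blocks of size k costs total(k) = n * cost_op + (n / k) * cost_it. The model, the struct, and the function name `solve_it_costs` are illustrative assumptions, not something taken from the Cortex-M4 documentation. Two measurements with different block sizes are enough to solve for both unknowns, and the "cost" can be cycles or energy alike.

```c
#include <assert.h>

/* Assumed model (hypothetical): total(k) = n * cost_op + (n / k) * cost_it,
 * where n conditional instructions are grouped into IT blocks of size k. */
typedef struct {
    double cost_it;   /* cost attributed to one IT instruction */
    double cost_op;   /* cost attributed to one conditional instruction */
} it_costs;

/* Solve the two-equation system from measurements at block sizes k1 and k2. */
it_costs solve_it_costs(double n, double k1, double total1,
                        double k2, double total2)
{
    it_costs c;

    assert(n > 0.0 && k1 > 0.0 && k2 > 0.0 && k1 != k2);

    /* Subtracting the two totals cancels the n * cost_op term. */
    c.cost_it = (total1 - total2) / (n * (1.0 / k1 - 1.0 / k2));
    /* Substitute back into the first measurement. */
    c.cost_op = (total1 - (n / k1) * c.cost_it) / n;
    return c;
}
```

For example, with n = 800 and made-up totals of 1600 for block size 1 and 1000 for block size 4, `solve_it_costs(800, 1, 1600, 4, 1000)` gives a cost of 1 per IT instruction and 1 per conditional instruction. If the IT instructions are folded away, the computed `cost_it` should come out close to zero instead.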