Let label1 and label2 be two sets of instructions, both ending with a RET instruction, such that label2 branches to label1 with a link. In other words, the code looks like this (I have numbered some positions for clarity in what follows):
label1:
# Some operations...
# (1)
RET
label2:
#Some operations
BL label1
#(2)
RET
I want to call label2 in main:
main:
# Some operations...
BL label2
#(3)
The behavior I want is that, after branching to label2 from main, execution continues from #(3). However, this is not what happens. When main executes BL label2, the link register holds the address of the desired return point #(3), and label2 runs. Inside label2, however, BL label1 overwrites the link register with the address of #(2). As a result, the RET after #(1) takes execution to #(2), and the RET after #(2) jumps back to #(2) once more, so the code loops forever and never reaches #(3).
High-level programming languages allow for nested use of functions. I can define a function f inside of which I call a function g, each with return statements. So the functionality I desire must be possible somehow. How can I make sequential or nested calls of BL some_label and RET that return to the first place where BL was called?
I'm very new to assembly so forgive me if the problem is somewhat trivial.
You basically tagged multiple architectures, so I will choose one.
The short answer is that you save the return address on the stack.
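To make the short answer concrete, here is a minimal sketch in C of a toy link-register machine. It is not real ARM semantics: the opcode names, the `run` function and the two programs are all made up for illustration, and they only mimic how BL writes the link register and how RET reads it. The `broken` program is the layout from the question; `fixed` differs only in that label2 saves lr on a stack before its own BL and restores it afterwards.

```c
#include <assert.h>

/* Toy link-register machine. The opcodes are hypothetical and only
   mimic how BL and RET treat the link register. */
enum op { OP_NOP, OP_BL, OP_RET, OP_PUSH_LR, OP_POP_LR, OP_HALT };

struct insn {
    enum op op;
    int target;              /* branch destination, used by OP_BL only */
};

/* Layout from the question: label1 at 0, label2 at 2, main at 6. */
const struct insn broken[] = {
    { OP_NOP, 0 },           /* 0: label1, position (1)             */
    { OP_RET, 0 },           /* 1: RET                              */
    { OP_NOP, 0 },           /* 2: label2                           */
    { OP_BL, 0 },            /* 3: BL label1, overwrites lr with 4  */
    { OP_NOP, 0 },           /* 4: position (2)                     */
    { OP_RET, 0 },           /* 5: RET goes to 4 again, not to main */
    { OP_BL, 2 },            /* 6: main, BL label2                  */
    { OP_HALT, 0 },          /* 7: position (3)                     */
};

/* Same layout, but label2 preserves its return address. */
const struct insn fixed[] = {
    { OP_NOP, 0 },           /* 0: label1                           */
    { OP_RET, 0 },           /* 1: RET                              */
    { OP_PUSH_LR, 0 },       /* 2: label2 saves lr on the stack     */
    { OP_BL, 0 },            /* 3: BL label1                        */
    { OP_POP_LR, 0 },        /* 4: position (2), restore lr         */
    { OP_RET, 0 },           /* 5: RET now goes back to main        */
    { OP_BL, 2 },            /* 6: main, BL label2                  */
    { OP_HALT, 0 },          /* 7: position (3)                     */
};

/* Executes prog starting at index start. Returns the number of
   instructions executed up to and including OP_HALT, or -1 if
   max_steps instructions run without reaching OP_HALT. */
int run(const struct insn *prog, int start, int max_steps)
{
    int stack[16];
    int sp = 0, lr = -1, pc = start;

    for (int steps = 1; steps <= max_steps; steps++) {
        struct insn i = prog[pc];
        switch (i.op) {
        case OP_NOP:     pc++; break;
        case OP_BL:      lr = pc + 1; pc = i.target; break;
        case OP_RET:     pc = lr; break;
        case OP_PUSH_LR: assert(sp < 16); stack[sp++] = lr; pc++; break;
        case OP_POP_LR:  assert(sp > 0); lr = stack[--sp]; pc++; break;
        case OP_HALT:    return steps;
        }
    }
    return -1;
}
```

Calling `run(broken, 6, 100)` runs out of steps, because execution keeps bouncing between position (2) and the RET that follows it, while `run(fixed, 6, 100)` reaches position (3) in main.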
If there is no nested call, then you do not need to save the return address (shown as lr, the link register, in this assembly language). The extra register r4 that gets saved alongside lr is not there because r4 is special; the calling convention the compiler is using dictates a 64-bit aligned stack, so it tosses in some other register, one that won't affect the code or the convention, to keep the stack aligned.
The second one calls a function, so one of two things can happen. In this case the compiler could have done a tail-call optimization, branching to the callee with a plain b instead of bl so that the callee's return goes straight back to the original caller, but it did not. Perhaps this is only obvious to a few, but I am using an older gcc (not cutting edge) and left the target at the default, which is armv4t. Perhaps for that reason the toolchain is not willing to deal with arm/thumb mode switching on a plain branch, whereas with bl the linker will put a veneer/trampoline in for you.
In the third one I forced the compiler to not be able to do a tail-call optimization. The result resembles the kind of thing you typically see, and what most people would write by hand.
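To reproduce the three cases, C source along these lines can be compiled and disassembled. This is only a sketch: the function names are hypothetical, and the exact instructions you get depend on the compiler, its version and its options. Something like `arm-none-eabi-gcc -O2 -c` followed by `arm-none-eabi-objdump -d` on the object file is one way to look at the output.

```c
#include <assert.h>

/* Hypothetical functions, one per case. In a real experiment callee()
   would live in a separate file so the compiler cannot inline it. */
unsigned int callee(unsigned int x)
{
    return x + 1;
}

/* Case 1: a leaf function. It makes no calls, so the return address
   can simply stay in lr. */
unsigned int leaf(unsigned int a, unsigned int b)
{
    return a + b;
}

/* Case 2: the call is the last thing the function does, so a compiler
   is allowed to turn it into a plain branch (a tail call). */
unsigned int tail_caller(unsigned int a)
{
    return callee(a);
}

/* Case 3: there is still work to do after the call returns, so lr
   has to be preserved across the call. */
unsigned int nested_caller(unsigned int a)
{
    return callee(a) + 1;
}

/* Sanity check: the results are the same however the calls are compiled. */
void check_cases(void)
{
    assert(leaf(1, 2) == 3);
    assert(tail_caller(1) == 2);
    assert(nested_caller(1) == 3);
}
```

The only difference between the second and third function is the `+ 1` after the call, which is enough to rule out a tail call and force the return address to be saved.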
Actually, what you expect the compiler to produce is a stack frame at the beginning and end of the function (save what needs preserving, including lr, on entry and restore it on exit), and then fill in the guts of the function in between.
Same gcc but a different arm architecture; over time more thumb interwork support was added.
If writing by hand, one could do it without stack frames: just before the nested branch, preserve what you need to preserve, then just after it, restore what you need to restore. So instead of a stack frame on the edges of the function, you save and restore as you go through the code. There is nothing wrong with that; it is a very hand-assembly way of doing it versus a compiled, high-level-language way of doing it.
Other architectures, non-arm ones for example, are designed differently; call and return may, for example, use the stack rather than a return register. In that case, as far as the return address goes, you can just keep making calls (8088/86, for example). You will eventually need to preserve other things so that nested calls do not trash anything besides the return address. This is where compiled languages choose a calling convention: a set of rules that lets a function be constructed in a generic way, so that you can nest or recurse as deeply as the stack allows.