What's the best way to handle a CALL in x64 assembly that should return to a slightly shifted return address? I'm mainly concerned with efficiency/execution speed. I'll briefly explain what I'm trying to do.
Background
I have a custom, interpreted visual scripting language, that gets compiled to native code. This language has builtin stack-based coroutines, and previously they were still handled semi-interpreted (with a separate stack-class to store the coroutine-data). I'm in the process of nativizing it entirely, so that only RSP is used.
One part of those coroutines is the ability for nested yielding, meaning that if a coroutine calls another yielding method, that method can internally yield to suspend the entire invocation. This information is handled via a "YieldState" struct, stored in a register. That means that for the new, fully nativized variant, we can just call a yielding method from a coroutine with a call instruction:
call 12345; // [rip+12345] => yieldingMethod
At least, in theory. As our coroutines are stack-based, we store local variables plainly on the stack, not in some sort of class like stackless coroutines might do. This requires cleanup (in case the coroutine is destroyed before finishing) to be handled via another method, which I call the "interrupt handler". Invoking such an interrupt handler is quite common in my practical use case, but not overly so. So my goal was to provide something that is faster than an exception handler (which usually requires some global lookup of the frame), but doesn't require explicitly setting this address for each call. So what I did was embed the interrupt-handler address between the call and the return address - since in the old version of the code we had to load the return address manually, this was not a problem:
lea rcx,[rip+25]; // 25 is the assumed byte-size up until the return address
mov rdx,rbx; // load non-native call stack
call prepareMethodYielding; // stores return-address on stack
jmp 12345; // actually call our "yieldingMethod"
mov r15,interruptAddress;
The last instruction is never executed - we lea the return address so that it is skipped. We only have it here to be able to look up the interrupt handler. Given a resume address, we can just decrement the pointer by 8, and we have the location of that resume's interrupt-handler address. The "mov r15" in our case is just there to allow us to disassemble the code properly; we could embed the address alone, but that would confuse any external disassembler.
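The "decrement by 8" lookup can be sketched in C. This is only an illustration under assumptions: the helper names are hypothetical, the embedded instruction is assumed to use the standard `mov r15, imm64` encoding (bytes 49 BF followed by the 8-byte immediate), and the resume address is assumed to point directly behind it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: `resume` points at the first byte after the embedded
   `mov r15, imm64`, so the 8-byte immediate sits at resume - 8. */
uint64_t interrupt_from_resume(const uint8_t *resume)
{
    uint64_t handler;
    /* Sanity check: REX.W+B prefix (49) and opcode B8+7 (BF) for r15. */
    assert(resume[-10] == 0x49 && resume[-9] == 0xBF);
    memcpy(&handler, resume - 8, sizeof handler); /* unaligned-safe load */
    return handler;
}

/* Builds a fake 10-byte code fragment and looks the handler up again. */
uint64_t demo_old_layout(void)
{
    uint8_t code[10] = {0x49, 0xBF};
    uint64_t handler = 0x0000123456789ABCull; /* made-up address */
    memcpy(code + 2, &handler, sizeof handler);
    return interrupt_from_resume(code + 10);
}
```

The `memcpy` is there because the immediate is generally not 8-byte aligned inside the instruction stream.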
The actual problem
Now in the new version, there is no "prepareMethodYielding", but only a call - at least, optimally. But "call" in itself doesn't allow us to do a modified return-address, so here I'm faced with a few options, and I want to know which one is the best.
Option A - lea + push + jmp
Our first option is to simulate the "call", but push the return-address manually:
lea rax,[rip+10h]
push rax
jmp A6 // yieldingMethod
This requires 3 instructions, but no access to memory.
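As a cross-check of the 10h displacement, here is a small C sketch of the arithmetic. It assumes that the jmp uses the 5-byte rel32 form and that the 10-byte `mov r15, imm64` embedding still follows the jmp as in the old layout; the names are hypothetical.

```c
#include <assert.h>

/* Instruction sizes in bytes (standard x86-64 encodings). */
enum {
    LEA_RAX_RIP_SIZE   = 7,  /* 48 8D 05 disp32; not part of the sum below */
    PUSH_RAX_SIZE      = 1,  /* 50 */
    JMP_REL32_SIZE     = 5,  /* E9 rel32 */
    MOV_R15_IMM64_SIZE = 10  /* 49 BF imm64 */
};

/* RIP-relative addressing is relative to the END of the lea, so the
   displacement only has to skip the push, the jmp and the embedded mov. */
int option_a_lea_displacement(void)
{
    int disp = PUSH_RAX_SIZE + JMP_REL32_SIZE + MOV_R15_IMM64_SIZE;
    assert(disp == 0x10); /* matches lea rax,[rip+10h] */
    return disp;
}
```

If the assembler picked the 2-byte short jmp instead, the displacement would shrink accordingly, so a code generator would have to fix the jmp form or compute the value per call site.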
Option B - push from memory
We could reduce the number of instructions by storing the return address in some area of constant memory:
push qword ptr[rip+1234] // return-address stored here
jmp A6 // yieldingMethod
Now we need only one push and no intermediate register, though now we need a memory access, and that memory could potentially be further away in the data section.
Option C - modify the return address in the called function
Another option that I see would be to adjust the return-address that is produced by call inside the called method. All those methods here are compiled using my own calling convention, so they don't adhere to x64 or any other.
// caller
call A6 // yielding method
// callee, first instruction
add qword ptr[rsp],10 // size of interrupt-embedding is always the same
This would also be only one additional instruction, with a small encoding. Though just from a design point of view, I don't like it very much, since it couples the callee to how the caller embeds the interrupt address - though, if this was the most efficient variant, I might still go for it.
Option D - don't modify the return-address at all
Our last option is to not modify the return-address at all, but instead change how lookup and return is handled.
call 12345; // yieldingMethod
mov r15,interruptAddress; // is actually executed now (but value is not used)
So here, we would change where we look up the interrupt address (as the return address now points in front of the fake instruction, instead of behind it). Then, upon return from the call, we would execute the movabs instruction, but discard the value it loaded. The upside here is that overall code size is the smallest, since we don't need to add any instructions that aren't already there. However, we are executing a 10-byte mov instruction, which could be slower than some of the other variants. It kind of depends on what the CPU is doing - if it already decodes the fake instruction, even if execution never reaches it, it might be best to just execute it instead of modifying the return address. The same applies if the CPU can somehow detect during register renaming that the instruction has no effect, as its value is never read; then it could effectively be free. At the moment, I'm using an otherwise unused register so that my own assembler can distinguish the embedding, but in that case it would probably make sense to use a register that is overwritten soon after, I assume. Though I'm unsure what would actually happen here.
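For Option D, the changed lookup could look roughly like this in C. It is a sketch under assumptions: the function names are hypothetical, and the return address is assumed to point at a `mov r15, imm64` encoded as 49 BF plus an 8-byte immediate, so the handler address sits 2 bytes behind the return address instead of 8 bytes in front of it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: `ret` is the return address pushed by call, which
   now points AT the fake `mov r15, imm64` rather than behind it. */
uint64_t interrupt_from_return(const uint8_t *ret)
{
    uint64_t handler;
    assert(ret[0] == 0x49 && ret[1] == 0xBF); /* mov r15, imm64 */
    memcpy(&handler, ret + 2, sizeof handler); /* skip prefix and opcode */
    return handler;
}

/* Builds the 10-byte fake instruction and reads the address back. */
uint64_t demo_option_d(void)
{
    uint8_t code[10] = {0x49, 0xBF};
    uint64_t handler = 0x00007FF612345678ull; /* made-up address */
    memcpy(code + 2, &handler, sizeof handler);
    return interrupt_from_return(code);
}
```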
Conclusion
So, which of those 4 options seems the most efficient to you? I'm also open to other ideas, though the general design of how the coroutines are done is finished and functional, so something like using a statemachine-based approach which IIRC some coroutines use, is not really an option here.
Here's a variant of option D that could work:
The x86 architecture has a long nop instruction, nop r/m32, which has no effect. The operand of this instruction is ignored and can be a memory operand. If you use this instruction with a modr/m operand that has a 32 bit displacement, you can effectively embed a 32 bit number in the instruction stream with no harm.
While your interrupt address is a 64 bit address, it could be possible to express it as a 32 bit distance from some base address, permitting you to get away with a shorter encoding. Or use a pair of such long nop instructions to encode the full 64 bits.
This could look like:
An advantage to this is that running one NOP after return is even cheaper than an instruction to modify the return address. More importantly, it avoids a mispredict from the return-address stack predictor, which assumes that call and ret will be paired the normal way.
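To make the byte layout concrete, here is a C sketch of emitting and decoding such a long nop. It assumes the 7-byte form `nop dword ptr [rax+disp32]` (bytes 0F 1F 80 followed by the displacement) and a handler stored as a signed 32 bit offset from a base address chosen by the code generator; the function names and the base value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LONG_NOP_SIZE = 7 }; /* 0F 1F /0, modr/m 80 = [rax+disp32], disp32 */

/* Writes a long nop whose displacement field carries `payload`. */
void emit_long_nop(uint8_t *out, int32_t payload)
{
    out[0] = 0x0F;
    out[1] = 0x1F;
    out[2] = 0x80; /* mod=10 (disp32), reg=000, rm=000 (rax) */
    memcpy(out + 3, &payload, sizeof payload);
}

/* `ret` is the return address pushed by call; it points at the long nop.
   The handler is recovered as base + signed 32 bit offset. */
uint64_t handler_from_nop(const uint8_t *ret, uint64_t base)
{
    int32_t offset;
    assert(ret[0] == 0x0F && ret[1] == 0x1F && ret[2] == 0x80);
    memcpy(&offset, ret + 3, sizeof offset);
    return base + (uint64_t)(int64_t)offset; /* sign-extend, then add */
}

/* Round trip: embed an offset of -0x40 relative to a made-up base. */
int demo_long_nop(void)
{
    uint8_t code[LONG_NOP_SIZE];
    emit_long_nop(code, -0x40);
    return handler_from_nop(code, 0x1000) == 0x1000 - 0x40;
}
```

With this layout the call site stays a plain call followed by a 7-byte instruction, and the lookup reads 4 bytes at return address + 3.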