The way PLT usage is specified in the System V ABI (and implemented in practice) is, schematically, something like this:
# A call from somewhere in code is into a PLT slot
# (In reality the target is not an absolute address; on x86-64 the call is typically rip-relative)
0x500:
call 0x1000
...
0x1000:
.PLT1: jmp [0x2000] # the slot for f in the binary's GOT
pushq $index_f
jmp .PLT0
...
0x2000:
# initially jumps back to .PLT to call the lazy-binding routine:
.GOT1: 0x1005
# but after that is called:
0x3000 # the address of the real implementation of f
...
0x3000:
f: ....
My question is:
Isn't the first jmp in the PLT slot redundant? Couldn't this work with an indirect call through the GOT instead? For example:
0x500:
call [0x2000]
...
0x1000:
.PLT1: pushq $index_f
jmp .PLT0
...
0x2000:
# initially jumps back to .PLT to call the lazy-binding routine:
.GOT1: 0x1000
# but after that is called:
0x3000 # the address of the real implementation of f
...
0x3000:
f: ....
This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linker/ELF community to find extra bytes in the 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra .plt.sec indirection; 1, 2).
The basic issue is that the original call (at 0x500) is being generated by the compiler, and at that point, the compiler does not know whether this symbol will eventually be in this dynamic object or not. So it generates a simple call (direct, PC relative) as that is the most efficient for the common case of a local call within a dynamic object.
It is not until the linker runs that we know whether this is a symbol in another dynamic object, a globally visible one in this object (that might be overridden), or a local function. For a local function the linker just leaves it as a direct call, but in the first two cases it creates a PLT entry for the symbol and makes the call go to the PLT entry.
Your suggestion would save a jump, but would require knowing at compile time for every call whether it needs a PLT entry or not, or would require switching between a direct and indirect call at link time based on whether the PLT was needed or not. On x86, direct and indirect calls are different sizes, so being able to change would be pretty tricky.