Avoiding self-modifying code (SMC) machine clears when writing to executable memory


I have run into a weird issue where the CPU appears to believe that I am modifying code that is currently executing, and repeatedly triggers self-modifying code (SMC) machine clears.

My (simplified) program does the following:

  1. Allocate an executable buffer.
  2. Copy a 64-byte payload to some position X in the buffer.
  3. Call the payload at position X.
  4. Go back to step 2.

...for 100'000'000 iterations.

main.c:

#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

extern void smc(void *bufferPtr, void *bufferEndPtr);

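int main(void)
{
    const int BUFFER_LENGTH = 4096;

    // Anonymous mappings should pass fd = -1 for portability
    void *bufferPtr = mmap(0, BUFFER_LENGTH, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (bufferPtr == MAP_FAILED)
    {
        printf("mmap failed: %s\n", strerror(errno));
        return 1;
    }
    // Cast to char * because arithmetic on void * is a GNU extension
    void *bufferEndPtr = (char *)bufferPtr + BUFFER_LENGTH;
    printf("Instruction block buffer: %p\n", bufferPtr);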
    
    smc(bufferPtr, bufferEndPtr);
    
    return 0;
}

smc.asm:

[section .text]

align 64
payload:
    ret

%define BUFFER_STEP 64

align 64
[global smc]
; rdi: bufferPtr
; rsi: bufferEndPtr
smc:
    push r10
    push r11
    push r12

    mov rax, 100_000_000
    mov r10, rdi ; r10 points to the beginning of the buffer
    mov r11, rdi ; r11 points to the current buffer position
    mov r12, rsi ; r12 points to the end of the buffer

.loop:
    ; Done?
    dec rax
    je .end
    
    mov rcx, 64
    mov rdi, r11
    lea rsi, [rel payload]
    
    ; Store
    rep movsb
    
    ; Call
    call r11
    
    ; Move buffer pointer
    lea r11, [r11 + BUFFER_STEP]
    cmp r11, r12
    jb .next
    mov r11, r10

.next:
    jmp .loop

.end:
    pop r12
    pop r11
    pop r10
    ret

Compile with:

nasm smc.asm -f elf64 -o smc.o
gcc -c main.c -O2 -o main.o
gcc main.o smc.o -o prog

I measure the program's execution time and the MACHINE_CLEARS.SMC performance counter using:

sudo perf stat -e r04c3 ./prog
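
For anyone who wants to count the event around just the smc() call rather than the whole process, here is a minimal sketch using the Linux perf_event_open system call. It assumes the same raw encoding as the r04c3 above (event select 0xC3, unit mask 0x04); raw encodings are model-specific, so this may not hold on other microarchitectures. The helper names raw_config, open_smc_counter and count_smc_clears are made up for illustration, and opening the counter may require elevated privileges or a relaxed perf_event_paranoid setting.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Compose an x86 raw perf config from event select and unit mask.
   raw_config(0xC3, 0x04) yields 0x04c3, i.e. the "r04c3" passed to perf stat. */
uint64_t raw_config(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* Open a user-space-only counter on the calling thread.
   Assumption: 0xC3/0x04 is MACHINE_CLEARS.SMC, as in the perf command above.
   Returns a file descriptor, or -1 with errno set. */
int open_smc_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = raw_config(0xC3, 0x04);
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;  /* count user-space events only */
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count events across one call to fn(). Returns -1 if the counter
   could not be opened or read. */
long long count_smc_clears(void (*fn)(void))
{
    uint64_t count = 0;
    int fd = open_smc_counter();
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ssize_t n = read(fd, &count, sizeof count);
    close(fd);
    return n == (ssize_t)sizeof count ? (long long)count : -1;
}
```

In main.c this would be used by wrapping the smc(bufferPtr, bufferEndPtr) call in a small void function and passing that to count_smc_clears.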

Results on an Intel Core i7-7567U:

BUFFER_LENGTH (bytes)  BUFFER_STEP (bytes)  MACHINE_CLEARS.SMC  Execution time (seconds)
1 x 4K                 0                    199'999'982         14.53
1 x 4K                 64                   199'999'740         14.91
256 x 4K               2048                 105'550'699         7.89
256 x 4K               4096                 130'573'069         9.83

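Although I am shifting the store destination (writing to a different location each time), I still get more than a hundred million SMC machine clears in every configuration, which works out to roughly one to two clears per iteration and leads to a massive performance penalty.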
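
To make the per-iteration figure concrete, here is a small sketch that divides each measured count by the loop count. The counts are copied from the results table; the iteration count is the nominal 100'000'000, and the function names are only illustrative.

```c
#include <stdio.h>

/* Average number of SMC machine clears per loop iteration. */
double clears_per_iteration(unsigned long long clears, unsigned long long iterations)
{
    return (double)clears / (double)iterations;
}

/* Print clears per iteration for each row of the results table. */
void print_clears_per_iteration(void)
{
    /* Nominal loop count from the question. */
    const unsigned long long iterations = 100000000ULL;

    /* Buffer configuration and MACHINE_CLEARS.SMC count, copied from the table. */
    const struct {
        const char *config;
        unsigned long long clears;
    } rows[] = {
        { "1 x 4K, step 0",      199999982ULL },
        { "1 x 4K, step 64",     199999740ULL },
        { "256 x 4K, step 2048", 105550699ULL },
        { "256 x 4K, step 4096", 130573069ULL },
    };

    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-22s %.2f clears/iteration\n",
               rows[i].config,
               clears_per_iteration(rows[i].clears, iterations));
}
```

Calling print_clears_per_iteration() gives about 2.00 for both single-page runs and about 1.06 and 1.31 for the two 256-page runs, so even the best case still takes more than one machine clear per payload call.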

Adding various fences and/or serializing instructions before/after the store does not yield any considerable improvement. Note that, while the shifting somewhat reduces the number of machine clears, it also leads to a large number of branch target mispredictions at the call instruction.

When I run the same program with a 4K buffer, 0-byte steps, an mfence after the store, and call payload instead of call r11 (so the freshly written bytes are never executed), it only takes around 1.74 seconds, which is what I would expect given the total number of executed instructions.

What is causing this huge number of machine clears, and how can I work around that?
