Why does RDRAND lead to data cache accesses and misaligned loads on Zen 3?


I am currently benchmarking and optimizing a program that makes heavy use of rdrand instructions.

When looking for suspected performance penalties from misaligned loads/stores, I noticed an excessively high value of the ls_misal_loads.ma64 (64-byte misaligned loads) performance counter, which clearly wasn't caused by the program's memory accesses alone. In fact, the value seemed to directly depend on the number of rdrand instructions executed.

Furthermore, perf reported a very high number of data cache accesses, which also appear to be caused by rdrand.


Take the following minimal example program (rd.asm):

bits 64
global main
main:

  mov rdx, 1000000  ; counter

.loop:
  rdrand rax
  dec rdx
  jne .loop

  ret

Assemble and link with

nasm -f elf64 rd.asm
gcc rd.o

Then

perf stat -e instructions,all_data_cache_accesses,ls_misal_loads.ma64 -- ./a.out

yields for counter = 1,000,000:

 Performance counter stats for './a.out':

         3,666,525      instructions
        24,422,483      all_data_cache_accesses
         3,022,185      ls_misal_loads.ma64

...and for counter = 2,000,000:

 Performance counter stats for './a.out':

         6,695,889      instructions
        48,458,162      all_data_cache_accesses
         6,016,069      ls_misal_loads.ma64

So, doubling the number of executed rdrand instructions seems to double the number of data cache accesses and misaligned loads.
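
To put rough per-iteration numbers on this, the difference between the two runs can be divided by the 1,000,000 extra iterations, which cancels out the fixed startup cost of the process. This is only a sketch of the arithmetic: per_iter is a hypothetical helper name, not part of perf, and it assumes awk is available.

```shell
#!/bin/sh
# per_iter is a hypothetical helper (not a perf feature): it prints
# (count_at_2M - count_at_1M) / extra_iterations with two decimals.
per_iter() {
  awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN { printf "%.2f\n", (b - a) / n }'
}

# Counter values copied from the two perf stat runs above.
per_iter 3666525 6695889 1000000     # instructions
per_iter 24422483 48458162 1000000   # all_data_cache_accesses
per_iter 3022185 6016069 1000000     # ls_misal_loads.ma64
```

This prints 3.03, 24.04 and 2.99, i.e. about 3 instructions (as expected for the loop body), about 24 data cache accesses, and about 3 64-byte misaligned loads per loop iteration.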

The measurements were done on an AMD EPYC 7763 CPU.


My questions:

  • What is going on here? Why does rdrand (seem to) produce data cache accesses, even though it is supposed to be implemented entirely within the CPU?
  • Can this high number for the given performance counter be dismissed as an artifact, or does it imply a further performance penalty besides the one caused by the latency of rdrand itself?