Why do the results of reading hardware counters with PAPI depend on where PAPI_library_init is called?


I am using the PAPI library to read hardware counters. I have noticed that where I call PAPI_library_init(PAPI_VER_CURRENT) influences the results I get. My initialization and read of the array look like this:

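int retval;

// Position A: init before the array is touched (currently commented out).
/*
retval = PAPI_library_init(PAPI_VER_CURRENT);

if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}
*/

// First loop: write every element of the array.
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
    //_mm_clflush(&array[i]); flushing makes no difference.
}
_mm_mfence();

// Second loop: read every element back.
for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}
_mm_mfence();

// Position B: init after the array has been written and read (currently active).
retval = PAPI_library_init(PAPI_VER_CURRENT);

if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}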

I believe the second loop, which reads the array, is needed because of the coherence protocol, but that should not matter much here. After this, I add the MEM_LOAD_RETIRED native events to the event set I want to read, and I call PAPI_read around this third loop (once before the loop and once after it, printing the difference at the end):

for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}

where arr_size is 1000 and each element of the array is 64 bytes (the size of a cache line). I have disabled all the prefetchers. I compile with gcc using the -O3 optimization flag and link with -lpapi. With this code, for the third loop I get:

L1_HIT: 64, L1_MISS: 1011, L2_HIT: 15, L2_MISS: 996.

However, if I uncomment PAPI_library_init before the array initialization and comment it out after, the results I get are:

L1_HIT: 73, L1_MISS: 1004, L2_HIT: 990, L2_MISS: 14.

I am testing this on a Skylake server; the cache sizes are:

L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K

Now I am a bit confused about why PAPI initialization would influence these results. It is the L2 hit and miss counts that change. All I need is the third loop, and I believe the effect of the first two loops on the counters is not taken into account.

So any hint would be helpful, as all the documentation says is this: "PAPI_library_init() initializes the PAPI library. It must be called before any low level PAPI functions can be used. If your application is making use of threads PAPI_thread_init (3) must also be called prior to making any calls to the library other than PAPI_library_init()."
