I am studying how the Page Miss Handler (PMH) and the page-structure caches work after a TLB/STLB miss, and my measurements with the perf tool's counters produced unexpected results. I benchmarked a program that writes to random locations in a 1GB array under different configurations of 2MB and 4KB pages.
The results show a correlation between the L1 TLB/DTLB hits and the runtime, which is contrary to what I expected.
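
For scale, here is a rough sketch of how many pages back the array at each page size and how much address space a TLB level can map. It assumes 1GB means 2^30 bytes, and the entry count is a placeholder assumption rather than a measured or confirmed figure for this CPU; the helper names are made up for illustration.

```python
# Illustrative only: page counts for the array and the address range
# ("reach") that a TLB level can map. Nothing here is measured.
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def pages_needed(buffer_bytes, page_bytes):
    """Number of pages required to back the buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


ARRAY_BYTES = 1 * GIB        # assumption: "1GB" means 2**30 bytes
ASSUMED_STLB_ENTRIES = 1024  # assumption: check the CPU documentation

for name, page in (("4KB", 4 * KIB), ("2MB", 2 * MIB)):
    count = pages_needed(ARRAY_BYTES, page)
    reach = tlb_reach(ASSUMED_STLB_ENTRIES, page)
    print(f"{name} pages: {count} needed, reach {reach // MIB} MiB")
```

Under these assumptions the array spans 262144 pages of 4KB but only 512 pages of 2MB, so random writes should miss the TLB far more often in the 4KB configuration.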

I suspect there are other TLB/STLB accesses that I did not take into account, either during the page walk or while searching the PML4/PDP/PDE caches after a TLB/STLB miss.
This is the flow I imagine happens, according to my lecture notes:

How can I determine whether this is the case?
EDIT:
1. Benchmark code: https://gist.github.com/HodBadichi/e08c1039cc22dd97c33cc9ab52fa97c4
2. It was tested on a Haswell Xeon E3 with the following spec, AFAIK:
