I keep coming up against this issue, so I would be grateful for shared insights from the resident experts. I can't be the only one who keeps encountering this.
I know I'm using old hardware, but I'm interested in the relevant issues for this sort of hardware and newer. I'm using a 3.2 GHz Ivy Bridge Xeon with 32 GB of ECC RAM. It has only 8 MB of L3 cache, but since I'm performing random reads and writes on multi-gigabyte buffers, I don't think the L3 cache size is very relevant to this question.
I've found it takes ~100 ns to read and update a 64-bit integer at a random position in a large buffer of the same. The buffer is aligned, so each update should only need to load one cache line. In a single-threaded application, that means I get around 10-15 million updates per second. Supposing I had millions of independent updates like this to perform, is there a faster way of getting that work done than processing them in sequence? I know I could multi-thread the application, but with only a few hundred cycles of latency per access, would it even be worth having more threads than CPU cores? I'm looking for a 10X+ speedup, not just 2-4X. I have an OpenCL program that does this on an old Tesla GPU with GDDR5 RAM, and I can get well over a billion updates per second, but efficient GPU programming is difficult and not friendly to branching code.
My Ivy Bridge CPU doesn't support AVX gather/scatter instructions. Would these instructions issue requests for all elements before necessarily receiving any of the responses from memory, or do they serialise the memory requests? The reported cycles-per-instruction figures are meaningless in this context because the accesses would mostly miss cache. If they can be used to hide the latency and keep multiple elements in flight at once, though, that could be really useful.
I don't think there is, but am I missing a trick here? Is there some way in C to use non-blocking memory reads/writes? Many thanks.