So, I am currently working on graph sampling and have developed an algorithm that runs on multiple CPU cores. It uses a lock-free hash set and a locking job queue. Basically, a thread takes a job from the queue and, while processing it, may add new jobs to the queue; this continues until all jobs are finished. The hash set is used to keep the samples unique across all threads, and inserting a sample into it is the main thing a thread does, which is why I implemented the hash set to be lock-free. This is the approach I followed.
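In case it helps, here is a minimal Java-ish sketch of the structure (illustrative only, not my real code; the names are made up and ConcurrentHashMap.newKeySet() stands in for my custom lock-free hash set):

```java
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class SamplerSketch {
    private final BlockingQueue<Long> jobs = new LinkedBlockingQueue<>(); // locking job queue
    private final Set<Long> samples = ConcurrentHashMap.newKeySet();      // stand-in for the lock-free set
    private final AtomicInteger pending = new AtomicInteger();            // jobs submitted but not yet finished

    void submit(long job) {
        pending.incrementAndGet();
        jobs.add(job);
    }

    // Each worker runs this loop; at least one job must be submitted
    // before the workers start, or they exit immediately.
    void workerLoop() throws InterruptedException {
        while (pending.get() > 0) {
            Long job = jobs.poll(10, TimeUnit.MILLISECONDS);
            if (job == null) continue;            // queue momentarily empty, re-check
            if (samples.add(job)) {               // thread-safe "insert if new"
                for (long next : expand(job)) {   // processing may create new jobs
                    submit(next);
                }
            }
            pending.decrementAndGet();            // job fully processed
        }
    }

    // Placeholder for the actual sampling step that derives follow-up jobs.
    private long[] expand(long job) {
        return new long[0];
    }
}
```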
I ran it on a machine with 16 cores and on a machine with 24 cores; however, the single-threaded version of the algorithm is consistently an order of magnitude faster than the multi-threaded one. I don't really understand why, and I have run out of ideas to try.
I first checked with perf whether there is an issue with cache misses. Here are the outputs:
Single-threaded:
146.062.677 cpu_core/cache-references/ (99,82%)
94.803.081 cpu_atom/cache-references/ (0,20%)
67.115.478 cpu_core/cache-misses/ (99,82%)
49.006.160 cpu_atom/cache-misses/ (0,20%)
8.206.501.423 cpu_core/cycles/ (99,82%)
5.437.332.727 cpu_atom/cycles/ (0,20%)
12.932.996.905 cpu_core/instructions/ (99,82%)
6.916.605.356 cpu_atom/instructions/ (0,20%)
2.751.128.600 cpu_core/branches/ (99,82%)
1.550.092.039 cpu_atom/branches/ (0,20%)
89.326 faults
68 migrations
1,688575537 seconds time elapsed
1,577569000 seconds user
0,112396000 seconds sys
Multi-threaded:
264.637.972 cpu_core/cache-references/ (81,19%)
57.264.669 cpu_atom/cache-references/ (33,65%)
116.415.033 cpu_core/cache-misses/ (81,19%)
1.638.935 cpu_atom/cache-misses/ (33,65%)
26.938.532.172 cpu_core/cycles/ (81,19%)
14.019.200.396 cpu_atom/cycles/ (33,65%)
23.240.603.882 cpu_core/instructions/ (81,19%)
2.915.516.452 cpu_atom/instructions/ (33,65%)
4.629.780.022 cpu_core/branches/ (81,19%)
563.800.488 cpu_atom/branches/ (33,65%)
229.810 faults
474 migrations
2,308616324 seconds time elapsed
2,295040000 seconds user
4,175755000 seconds sys
So, I am not sure if I am interpreting these stats correctly: cache references count how often something is fetched from the cache, and cache misses count how often those fetches have to go to the next level of the memory hierarchy. (I think this refers to the L3 cache, right?) The multi-threaded version definitely has more misses, but it also has more references, and the ratio is pretty much the same as for the single-threaded version. Or does the ratio not matter at all?
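One hypothesis I have is that the absolute numbers matter more than the ratio here, because every contended atomic write makes the cache line holding that data bounce between cores. A toy Java test like the following (again illustrative, not my code) should make that effect visible, e.g. with perf c2c:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContentionDemo {
    static final AtomicLong shared = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                // Every increment is a read-modify-write on the same
                // cache line, so the line ping-pongs between cores.
                for (int i = 0; i < 10_000_000; i++) {
                    shared.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(shared.get());
    }
}
```

With one thread there is hardly any coherence traffic; with many threads the same per-thread work takes far longer because of the cache-line transfers, without the miss ratio necessarily telling the whole story.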
I also tried using a lock-free job queue, but the timing results were pretty much the same. (Admittedly, there is a bug in that version, but it only occurs when the GC cleans up before the program finishes, so it shouldn't affect the measurements.) This made sense to me, though: compared to how frequently the threads insert into the hash set, they rarely touch the queue.
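Conceptually, that experiment was just swapping the blocking queue for a CAS-based one; in the Java-ish sketch from above it would look like this (not my actual implementation, ConcurrentLinkedQueue is just a standard lock-free stand-in):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LockFreeQueueSketch {
    public static void main(String[] args) {
        // ConcurrentLinkedQueue is a lock-free (CAS-based)
        // multi-producer/multi-consumer queue.
        Queue<Long> jobs = new ConcurrentLinkedQueue<>();
        jobs.add(42L);
        // Unlike a BlockingQueue, poll() returns null immediately when the
        // queue is empty, so the worker loop has to spin or back off
        // instead of blocking.
        Long job = jobs.poll();
        System.out.println(job); // prints 42
    }
}
```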
I did compare it with a Python library that also does graph sampling, and funnily enough it is about as slow as my multi-threaded version (at least in order of magnitude). So I am not sure if this is a scaling issue, where you need quite a lot of cores to overcome the larger overhead of thread-safe operations?
And yeah, I am currently stuck. I don't necessarily expect to get it working, but I would like to understand why there is no speedup.