Why does Rust performance of `Arc` vs `Rc` vary with the size of the referenced vector?

While doing the rustlings exercises on smart pointers, I was wondering what the performance cost of an Arc referencing a list of numbers is when multiple threads read that list.

So I created a simple program based on the Arc exercise. It calculates the sum of the numbers in a vector: first ten times in one thread using Rc, taking the average time, then ten times in ten parallel threads using Arc, and compares the times taken:

use std::sync::Arc;
use std::rc::Rc;
use std::thread;
use std::time::Instant;

fn main() {
    let numbers: Vec<_> = (0..1_000_000_u128).collect();
    let shared_arc = Arc::new(numbers.clone());
    let shared_rc = Rc::new(numbers);

    let mut joinhandles = Vec::new();

    // Test Rc
    let start = Instant::now();

    for _ in 0..10 {
        (*shared_rc).iter().sum::<u128>();
    }

    let t1 = start.elapsed();
    println!("{:?}", t1 / 10); // average time per serial sum


    // Test Arc
    let t2 = Instant::now(); // start time for the parallel test

    for _ in 0..10 {
        let thread_numbers = Arc::clone(&shared_arc);

        joinhandles.push(thread::spawn(move || {
            (*thread_numbers).iter().sum::<u128>();
        }))
    }

    // Join all threads
    for handle in joinhandles {
        handle.join().unwrap();
    }

    let t3 = t2.elapsed();
    println!("{:?}",  t3);
}

I ran this program for varying sizes of the numbers vector and consistently found run times such as:

Size of numbers    Time Rc    Time Arc
      1 000 000     4.7 ms      8.5 ms
     10 000 000      47 ms       99 ms
    100 000 000     497 ms     1088 ms

The ten threads using the Arc typically take, in total, about twice as long as the single thread that just uses Rc!

From my understanding, Arc only differs from Rc in that Arc updates its reference count atomically, so it is thread-safe, while Rc isn't. I'd then expect the performance cost of Arc to vary with the number of threads (specifically, the number of calls to Arc::clone in the code above), but not with the size of the vector that either reference counter points to.
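To isolate that per-clone cost, here is a small sketch (separate from the program above) that times only the clone/drop traffic; std::hint::black_box keeps the optimiser from removing the clones:

use std::hint::black_box;
use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    let rc = Rc::new(vec![0_u128; 1_000_000]);
    let arc = Arc::new(vec![0_u128; 1_000_000]);

    // Each iteration bumps the count on clone and decrements it on drop;
    // the vector itself is never copied, so its size should not matter.
    let start = Instant::now();
    for _ in 0..10_000_000 {
        black_box(Rc::clone(&rc));
    }
    println!("Rc::clone:  {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..10_000_000 {
        black_box(Arc::clone(&arc));
    }
    println!("Arc::clone: {:?}", start.elapsed());
}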

I guess the solution is either of the following:

  • The performance cost is actually due to the threaded nature of the above test with Arc; I'm not sure exactly how threads work
  • The (A)Rc-object is somehow edited every time a thread dereferences it. If so, why?

But which is it? I'd love to hear your thoughts!

Edit:

To clarify, this was run without compiler optimisation. My CPU is an i7 8700K (6 cores, 12 threads, 12 MB cache).

1 Answer

Answered by prog-fh:

You are not comparing Arc against Rc but rather parallel versus serial execution. In your first loop, if you use shared_arc instead of shared_rc, you will probably not see any difference in performance.

Moreover, the sums you wrote could be optimised away by the compiler, because their result is discarded. You should wrap them in std::hint::black_box(). You do run with --release, don't you?
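Applied to the serial loop from your program, that would be, as a sketch:

for _ in 0..10 {
    std::hint::black_box((*shared_rc).iter().sum::<u128>());
}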

During serial execution, the 16 MB array (1,000,000 values of 16 bytes each) probably fits in the L3 cache, and since the traversal is sequential, prefetching will probably keep the L1/L2 caches filled. This computation then runs at full speed.

Depending on your hardware architecture, and especially if you have more threads than hardware cores, those threads will probably fight for the cores. The resulting context switches degrade performance through L1/L2 cache thrashing: the data prefetched for one thread is evicted from the L1/L2 caches after a context switch, to make room for the newly scheduled thread's data, and so on at every subsequent switch. When the array is bigger, the same problem also hits the L3 cache.
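One way to limit that fight is to ask the OS how much parallelism is actually available and cap the worker count there; a sketch of the parallel test with such a cap (std::thread::available_parallelism is standard):

use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let numbers: Arc<Vec<u128>> = Arc::new((0..1_000_000).collect());

    // Spawn no more workers than the OS reports as available,
    // so threads are less likely to fight over the same cores.
    let n_workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let start = Instant::now();
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let nums = Arc::clone(&numbers);
            thread::spawn(move || std::hint::black_box(nums.iter().sum::<u128>()))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("{} workers: {:?}", n_workers, start.elapsed());
}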

In general, when a problem is memory-bound (most of the time is spent loading data rather than computing), parallelisation does not help much (the memory bandwidth cannot stretch any further, except on NUMA systems) and can even degrade performance through cache thrashing if too many threads are involved. Moreover, if the OS schedules two threads on the same hardware core, they compete fiercely with each other. In my experience, optimal parallel performance comes from carefully binding the threads to distinct CPU cores (assuming no other CPU-intensive task is running on the system), as sketched below.
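The standard library has no affinity API, so for the binding I use a third-party crate; a sketch assuming the core_affinity crate (add core_affinity to Cargo.toml):

use std::thread;

fn main() {
    // Query the usable core ids, then pin exactly one worker to each.
    let core_ids = core_affinity::get_core_ids().expect("could not query core ids");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|id| {
            thread::spawn(move || {
                // Pin this worker to one specific core before doing any work.
                core_affinity::set_for_current(id);
                // ... memory-bound computation goes here ...
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}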


Hey, I didn't realise it at first sight: your serial performance result is divided by ten, but the parallel one is not. So the parallel execution is actually much faster than the serial one. The speed-up is not ideal (×N for N threads) because there is always some overhead (thread management and scheduling, competition for cache and memory bandwidth...).
But above all, the problem here is clearly memory-bound (perf says so), so we cannot expect the ten threads to pull ten times more data through memory in the same amount of time.
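To make the comparison like-for-like, divide both totals by the number of sums: with the figures from the question, 8.5 ms of wall-clock time for ten parallel sums is 0.85 ms per sum, against 4.7 ms per serial sum, i.e. roughly a 5.5× throughput gain on 6 cores. As a sketch, reusing t1 and t3 from the question's program:

// Both loops perform 10 sums, so compare the wall-clock time per sum.
println!("serial,   per sum: {:?}", t1 / 10);
println!("parallel, per sum: {:?}", t3 / 10);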