I'm quite new to Rust, so I might be missing something simple. I am using Rust 1.70.0-nightly. Here is the necessary code:
```rust
fn selection_sort(original_vec: &mut Vec<i32>) -> Vec<i32> {
    for i in 0..original_vec.len() - 1 {
        let mut smallest: usize = i;
        for j in i + 1..original_vec.len() {
            if original_vec[j] < original_vec[smallest] {
                smallest = j;
            }
        }
        if smallest != i {
            original_vec.swap(i, smallest);
        }
    }
    original_vec.to_vec()
}
```
```rust
// helper function for testing (uses builtin sort function)
fn rust_sort<A, T>(mut array: A) -> A
where
    A: AsMut<[T]>,
    T: Ord,
{
    let slice = array.as_mut();
    slice.sort();
    array
}
```
```rust
const TEST_VECS: [[i32; 10]; 6] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [1, 2, 7, 2, 9, 9, 7, 10, 2, 1],
    [0, 4, 1, 3, 9, 12, 3, 0, 13, 8],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [-1, -5, -10, 1, 10, 2, -123, -34, 0, -32],
    [i32::MAX, -32984, i32::MIN, 31648, 5349857, -30954, -2343285, 0, 1, i32::MIN],
];
```
```rust
extern crate test; // needs #![feature(test)] at the crate root
use test::Bencher;

#[bench]
fn bench_rust_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            (&mut i.to_vec()).sort()
        }
    })
}

#[bench]
fn bench_selection_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            selection_sort(&mut i.to_vec());
        }
    })
}
```
When I run cargo bench:
```
$ cargo bench
   Compiling rust-algs v0.1.0 (/home/josia/projects/rust-algs)
    Finished bench [optimized] target(s) in 0.25s
     Running unittests src/lib.rs (target/release/deps/rustalgs-ae260c07593c3aad)

running 3 tests
test test_selection_sort ... ignored
test bench_rust_sort      ... bench: 106 ns/iter (+/- 8)
test bench_selection_sort ... bench: 102 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 1 ignored; 2 measured; 0 filtered out; finished in 7.65s
```
I tried this many times, and even renamed the benchmark functions to change the order they run in. No matter what, my custom selection sort still comes out faster. My guess is that the problem lies in having to wrap the built-in sort in a helper function. Calling `.sort()` directly wouldn't work: if I cloned `TEST_VECS` into a vector once, the first bench iteration would sort it and every later iteration would just re-sort already-sorted data. And if I cloned it inside the bench closure, the cloning cost would dominate the measurement, so I wouldn't be benchmarking only the code I'm trying to time.
Is there a problem with the way I am benchmarking these functions, or is my custom function somehow just faster?
TL;DR:

- Rustc `-C opt-level=3` fully inlines and fully unrolls your `selection_sort` for this compile-time-constant size=10. With `opt-level=2` it inlines but compiles to a loop like the source.
- Rust's `.sort` uses Insertion Sort for sizes this small, but for some reason it doesn't inline. So that's actually running a not-unrolled loop with a size that's a runtime variable (as far as it knows). Note the caller doing `mov esi, 10` / ... / `call core::slice::sort::insertion_sort_shift_left`. I think this is the key thing that's making `.sort` slightly slower.
- Copying from `TEST_VECS` is fine although maybe a bit small; modern CPUs' branch predictors might be learning the pattern of branching and doing better than you'd expect as part of a larger program, for unsorted inputs of that size. You could have more than 6 slices and still not be getting any L1d cache misses.
- It would be better to avoid alloc/dealloc inside the repeat loop: use `.clone_from_slice` (as described here) into a slice or vector that gets allocated outside the repeat loop. Both loops are paying for the same alloc/dealloc overhead, though, so that should balance out. The compiler already does the actual copying efficiently here. Rust's `.sort` doesn't internally alloc anything for slice sizes this small (using Insertion Sort), and passing/returning a Vec optimizes away to just in-place sorting, no extra copying.
- `black_box` could make this more robust against potential optimizations that defeat the repeat loop, but it looks like that isn't happening with current versions.
- Selection Sort has O(N^2) time complexity (best and worst case). Insertion Sort is O(N) for already-sorted arrays, which is one reason it's the standard choice for small problems (and for the small sub-problems of divide-and-conquer algorithms on larger arrays), but in general it's O(N^2) average and worst case. For large problems, all O(N^2) sorts are garbage, so `.sort` and `.sort_unstable` use O(N log N) algorithms that are much faster than `selection_sort`. Murasame's answer measured a factor-of-15 speed difference for size=1000. (https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sort / https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)
- If your CPU is Skylake-family, look for a Rust equivalent of "How can I mitigate the impact of the Intel jcc erratum on gcc?" - without controlling for that factor, one microbenchmark might get bitten by it while another doesn't.
So we can say that Selection Sort compiles to faster asm with current Rustc with `-C opt-level=3`, for small compile-time-constant Vecs that aren't (close to) already sorted. Based on what we're seeing in the asm as the apparent reason for these results, I'd expect things to be different if any of those factors were different. Perhaps even different tuning defaults for different ISAs could make it not fully unroll. Also, it's somewhat important that mainstream x86-64 CPUs can run such large loops fully efficiently (via a large-enough uop cache).

`selection_sort` inlined and fully unrolled, `.sort` didn't

Your measurements are probably correct for this specific case, of size=10 as a compile-time constant that the compiler can see. `rustc -C opt-level=3` is fully inlining and fully unrolling your `selection_sort`, but emitting a `call` to the standard library's Insertion Sort. So `.sort()` includes the loop overhead of dealing with a runtime-variable slice size.

(I don't know how to get Matt Godbolt's compiler explorer to show me asm for the actual `#[bench]` functions, so I made my own simple counted-loop caller, on the assumption that inlining choices would be basically the same in `b.iter(|| { ... })` loops. Godbolt - change which call is uncommented to see asm for the other one.)

Normally Insertion Sort is just as good as Selection Sort for unsorted data like yours (both O(N^2) with similar constant factors), but it does have more data movement, which maybe doesn't lend itself as well to fully unrolling for a compile-time-constant `.len()`. Or maybe it's some difference in the way it comes from the standard library, vs. your `selection_sort` being simple, fully there in the source file, and written for `i32` specifically? But `.sort` does inline and optimize away the logic that chooses `insertion_sort_shift_left`, without any of the large-slice code still being present, so nothing is inherently stopping the compiler; it just chose not to.

Anyway, I think fully unrolling the outer and inner loops is what's making `selection_sort` fast here. After copying a TEST_VEC into a newly-allocated Vec, the asm goes right into comparing.

Test data
Sorting the same data repeatedly is more or less fine, although "only" 6x 10-element sorts might be small enough for your CPU to learn the pattern of branching. (Modern IT-TAGE predictors use previous history to index a prediction, making it possible to predict patterns for the same branch, so this doesn't necessarily give the fully-unrolled `selection_sort` a big advantage; especially since you're sorting 6 different slices, even full unrolling doesn't make a branch go the same way every time.) Your testing with `perf stat` finding only about a 1.5% mispredict rate seems lowish, but that does include loop branches. Try making the arrays a bit longer and/or adding another few.

A few years ago, I was playing with a hand-written asm Bubble Sort from a code-golf exercise, and my Skylake was able to learn its pattern of branching for input sizes up to about 14 elements IIRC, with near-zero mispredicts until I made the size larger. And that's Bubble Sort, one of the worst O(N^2) sorts, plus its loop branching. I was benchmarking it with a caller that just did a couple `vmovdqa ymm` stores from vector regs, which is the cheapest possible way to re-generate a small input for an in-place sort. Rust is loading that data each repeat-loop iteration since there are 6 different sources, but that's only negligibly more expensive. Scalar loads inside the sort function can load right away without even waiting for the vector stores to commit to L1d cache; modern x86 CPUs have good store-to-load forwarding for scalar reloads of vector stores. Everything hits in L1d cache.

Another option would be a vectorized xorshift+, like in "What's the fastest way to generate a 1 GB text file containing random digits?", to get fresh data to sort every iteration. Using the sorted array as the seed for an LFSR (like Murasame's answer) would be a way to benchmark latency instead of throughput, but it will also have a store-forwarding stall when you do a SIMD vector load of the array that was just written by scalar stores. (Assuming the compiler auto-vectorizes your loop.)
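As a sketch of the fresh-data idea: a plain scalar xorshift64 (my stand-in for the vectorized xorshift+ the linked answer uses; the buffer size and iteration count are arbitrary) refills the buffer each repeat-loop iteration, so the sort never sees the same input pattern twice.

```rust
// Scalar xorshift64 PRNG: a simple stand-in for a vectorized xorshift+
// generator. Any nonzero seed works; the state never becomes zero.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn main() {
    let mut seed = 0x9E37_79B9_7F4A_7C15_u64; // arbitrary nonzero seed
    let mut buf = [0i32; 10];
    for _ in 0..3 {
        // Stand-in for the benchmark repeat loop: refill with fresh
        // pseudo-random data, then sort it in place.
        for slot in buf.iter_mut() {
            *slot = xorshift64(&mut seed) as i32;
        }
        buf.sort();
        assert!(buf.windows(2).all(|w| w[0] <= w[1]));
    }
    println!("last sorted buffer: {:?}", buf);
}
```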
O(N^2) sorts like Selection Sort are slow on bigger arrays
I assume you already know this, but Selection Sort has O(N^2) time complexity, so it's garbage for larger sizes. Rust's `.sort` and `.sort_unstable` use O(N log N) sorts for larger problems (an iterative merge sort "inspired by timsort", and pattern-defeating quicksort, respectively), but use Insertion Sort for small sizes, or for the small sub-problems of the larger sorts. Here the size is a compile-time constant, and Rust's `.sort` function can inline enough to optimize away the branching on size that picks a strategy, so it's just your `selection_sort` vs. the standard library's Insertion Sort (`core::slice::sort::insertion_sort_shift_left`) for problems of this size.

(Surprisingly, `.sort_unstable` inlines less (Godbolt), calling `core::slice::sort::recurse`, which checks the size at run-time for being less than 21, otherwise continuing into its quicksort.)

Insertion Sort and Selection Sort do well for small arrays where more complex algorithms would spend too much time sub-dividing an already-small problem. That's why real-world sorts use Insertion Sort once they've sub-divided and partially sorted enough to create small sub-problems. (Or a sorting network, perhaps using SIMD vectors.)
Stable sorting is irrelevant when the objects being sorted are simple integers, not structs with a key and some other data. Equal `i32` objects are indistinguishable. But it seems `.sort` doesn't try to detect that, since we get different asm from `.sort` and `.sort_unstable`.

`black_box`

You don't do anything to stop the sort from being optimized away; the resulting vector is unused. But it appears current versions of `rustc -C opt-level=3` miss that optimization, so your benchmark appears valid.

Using `black_box(sort_result)` would be safer / future-proof, although if the compiler did fully see through your benchmark, it could just copy from already-sorted arrays to satisfy the `black_box` requirement to materialize a sort result somewhere. So to be fully safe, you'd use `black_box` on the input slice or vector and then on the output; hopefully `black_box` makes the compiler forget what values are there, i.e. treat it like a function call that might have modified the slice, not just read it.

And avoiding alloc/dealloc
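A sketch of that double-`black_box` pattern, using `std::hint::black_box` (stable since Rust 1.66) with an ordinary loop standing in for `b.iter`; the data and iteration count are arbitrary:

```rust
use std::hint::black_box;

fn main() {
    let mut v = vec![1, 3, 2, 9, 6, 7, 4, 10, 8, 5];
    for _ in 0..3 {
        // black_box on the input: the compiler must assume the contents
        // are unknown, so it can't notice the data is already sorted
        // after the first iteration.
        let data = black_box(&mut v);
        data.sort();
        // black_box on the output: the sorted result must actually be
        // materialized, so the sort can't be optimized away.
        black_box(data);
    }
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
    println!("{:?}", v);
}
```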
The copying is super cheap, just 2 vector (`movups`) and 1 scalar (`mov rdx, [r15 + 32]` / `mov [rax + 32], rdx`) load+store pairs.

But the alloc/dealloc could be avoided with a better benchmark. I was wrong about this in comments; I was looking for a `call some_longer_name`, but actually Rustc likes to load library function pointers into registers (if they're going to be called inside loops), so you have to watch for the `call` mnemonic to spot instructions like `call rbp`.

"How can I copy a vector to another location and reuse the existing allocated memory?" describes how to use `.clone_from` or `.clone_from_slice()` to copy the data from a slice into an already-allocated `Vec`. I looked at the asm to confirm that there were no `call` instructions inside the `for i in TEST_VECS {` loop, unlike the original. (Godbolt)
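A sketch of that shape, with the allocation hoisted out of the repeat loop (a trimmed two-entry stand-in for `TEST_VECS`, and an arbitrary iteration count in place of `b.iter`):

```rust
// Trimmed stand-in for the question's 6-entry TEST_VECS.
const TEST_VECS: [[i32; 10]; 2] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [1, 2, 7, 2, 9, 9, 7, 10, 2, 1],
];

fn main() {
    // Allocate once, outside the repeat loop.
    let mut scratch = vec![0i32; 10];
    for _ in 0..3 {
        for src in &TEST_VECS {
            // Copy into the existing allocation: no alloc/dealloc
            // inside the timed region. Panics if lengths differ.
            scratch.clone_from_slice(src);
            scratch.sort();
        }
    }
    assert!(scratch.windows(2).all(|w| w[0] <= w[1]));
    println!("{:?}", scratch);
}
```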
If I understand correctly, `.as_slice()` on a `Vec` is like C++ `.data()` on a `std::vector`, except it's a slice so it carries a length. Anyway, the idea is to avoid asking the compiler to actually store the pointer and length to stack space. But I'm not sure that worked: I am still seeing stores like `mov [rsp + 24], r14` ; `mov qword ptr [rsp + 32], 10`, etc.

Using `Vec` in the first place is unnecessary; for these small sizes, just a slice of stack space would be fine, and probably would have avoided any alloc/dealloc overhead even if we had been cloning from constant slices, since the compiler would know it could reuse the stack space.

This function does a 240-byte `memcpy` of `TEST_VECS` to stack space every outer iteration. (It copies from there to the `Vec` object.) I don't know why that's happening, but I barely know Rust. It happens without any use of `black_box`, so it's not being triggered by that.
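For completeness, a sketch of the no-`Vec` variant: copy each test case into a plain stack array and sort that in place (again with a trimmed stand-in for `TEST_VECS`), so there's no heap allocation anywhere.

```rust
// Trimmed stand-in for the question's 6-entry TEST_VECS.
const TEST_VECS: [[i32; 10]; 2] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [1, 2, 7, 2, 9, 9, 7, 10, 2, 1],
];

fn main() {
    for src in &TEST_VECS {
        // Copying a [i32; 10] is a plain memcpy into stack space; the
        // compiler can reuse the same stack slot every iteration.
        let mut arr: [i32; 10] = *src;
        arr.sort(); // arrays auto-deref to slices, so .sort() works
        assert!(arr.windows(2).all(|w| w[0] <= w[1]));
        println!("{:?}", arr);
    }
}
```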