I'm trying an experiment with intrinsics in Rust: I make a big vector of floats, then record the time it takes to multiply all of them by a constant. Next I try the same thing with SSE intrinsics. On a relatively new laptop with 50 million floats, it comes out to about 23 ms either way. When I do something similar with AVX, it's even slower, around 67 ms. I allocate all the memory I need for both the input and output before starting the timer.
Here's how I do it with simple f32s:
use rand::Rng;
use std::time::Instant;

let mut rng = rand::thread_rng();
let x: Vec<f32> = (0..N).map(|_| rng.gen::<i16>() as f32).collect();
let mut z_native: Vec<f32> = vec![0.0; N];
let t0 = Instant::now();
for i in 0..N {
    // get_unchecked / get_unchecked_mut require an unsafe block
    unsafe { *z_native.get_unchecked_mut(i) = *x.get_unchecked(i) * std::f32::consts::PI; }
}
println!("Native {:?}", t0.elapsed());
Here's how I do it with SSE:
use std::arch::x86_64::{self, __m128};

let mut x_sse: Vec<__m128> = Vec::with_capacity(N / 4);
let y_sse: __m128 = unsafe { x86_64::_mm_set_ps1(std::f32::consts::PI) };
let mut z_sse: Vec<__m128> = Vec::with_capacity(N / 4);
for i in 0..(N / 4) {
    unsafe {
        // _mm_set_ps takes its arguments from the high lane down to the low lane
        x_sse.push(x86_64::_mm_set_ps(
            x[(i * 4) + 3], x[(i * 4) + 2],
            x[(i * 4) + 1], x[(i * 4) + 0],
        ));
        z_sse.push(x86_64::_mm_setzero_ps());
    }
}
let t0 = Instant::now();
for i in 0..(N / 4) {
    unsafe { *z_sse.get_unchecked_mut(i) = x86_64::_mm_mul_ps(*x_sse.get_unchecked(i), y_sse); }
}
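For reference, here's a variant I've seen suggested that skips copying into a `Vec<__m128>` entirely and loads straight from the `f32` slice with `_mm_loadu_ps` (unaligned load). This is a sketch, not code from the post; `mul_sse` is a name I made up:

```rust
// Sketch: multiply an f32 slice by a constant with SSE, loading and
// storing directly on the slices instead of a separate Vec<__m128>.
#[cfg(target_arch = "x86_64")]
fn mul_sse(x: &[f32], z: &mut [f32], c: f32) {
    use std::arch::x86_64::*;
    assert_eq!(x.len(), z.len());
    let n4 = x.len() / 4 * 4;
    unsafe {
        let cv = _mm_set_ps1(c);
        for i in (0..n4).step_by(4) {
            // loadu/storeu work on any &[f32], regardless of alignment
            let v = _mm_loadu_ps(x.as_ptr().add(i));
            _mm_storeu_ps(z.as_mut_ptr().add(i), _mm_mul_ps(v, cv));
        }
    }
    // Scalar tail for lengths that aren't a multiple of 4
    for i in n4..x.len() {
        z[i] = x[i] * c;
    }
}

fn main() {
    let x: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let mut z = vec![0.0f32; x.len()];
    #[cfg(target_arch = "x86_64")]
    mul_sse(&x, &mut z, std::f32::consts::PI);
    #[cfg(not(target_arch = "x86_64"))]
    for i in 0..x.len() {
        z[i] = x[i] * std::f32::consts::PI;
    }
    println!("{:?}", &z[..4]);
}
```

This also sidesteps the setup cost of `_mm_set_ps`, which builds each vector lane by lane.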
Is there something I'm missing about how this is supposed to work? I know memory alignment makes a difference, but I read that a Vec's buffer is always aligned to the alignment of its element type, so the `Vec<__m128>` should be 16-byte aligned. I also print out randomly selected values from the results at the end to make sure the loops don't somehow get optimized out. Thanks!
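To double-check the alignment assumption, this sketch prints the alignment of the Vec buffers. It uses a 16-byte-aligned stand-in type (`F32x4`, a name I made up) instead of `__m128` so it runs on any architecture:

```rust
use std::mem::align_of;

// Stand-in for __m128: same 16-byte size and alignment, but portable.
#[repr(align(16))]
#[derive(Clone, Copy)]
struct F32x4([f32; 4]);

fn main() {
    let x: Vec<f32> = vec![0.0; 16];
    let x_sse: Vec<F32x4> = vec![F32x4([0.0; 4]); 4];
    // A Vec's buffer is aligned to align_of::<T>(): 4 bytes for f32,
    // 16 bytes for a 16-byte-aligned element type like __m128.
    println!("align_of::<f32>()   = {}", align_of::<f32>());
    println!("align_of::<F32x4>() = {}", align_of::<F32x4>());
    println!("x     ptr % 16 = {}", x.as_ptr() as usize % 16);
    println!("x_sse ptr % 16 = {}", x_sse.as_ptr() as usize % 16);
}
```

So the `Vec<__m128>` is guaranteed 16-byte aligned, while the `Vec<f32>` is only guaranteed 4-byte alignment (though allocators usually hand back more).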
Edit:
I built with cargo run --release, so I don't know exactly which compiler flags I'm using beyond whatever Cargo defaults to in release mode.
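In case it matters: as far as I understand, release mode only changes optimization settings, not the target CPU; rustc still targets the baseline x86-64 feature set by default. Something like this in `.cargo/config.toml` (or the equivalent `RUSTFLAGS` environment variable) would opt into the build machine's full feature set, including AVX. This is a sketch of my setup, not something from the original build:

```toml
# .cargo/config.toml
# Let rustc use every feature of the CPU it's building on (AVX etc.).
[build]
rustflags = ["-C", "target-cpu=native"]
```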