No Speedup in Float Multiply with Rust SSE Intrinsics

I'm running an experiment with SIMD intrinsics in Rust: I build a large vector of floats and time how long it takes to multiply every element by a constant, then do the same thing with SSE intrinsics. On a relatively new laptop with 50 million floats, both versions take about 23 ms. A similar AVX version is even slower, at about 67 ms. All memory for both the input and the output is allocated before the timer starts.
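
For scale, here is a rough back-of-the-envelope sketch of the memory traffic those timings imply. It assumes each element is read once and written once and ignores caches; the helper name effective_bandwidth_gb_s is made up for illustration.

```rust
/// Effective memory traffic in GB/s for one pass that reads `n` f32 values
/// and writes `n` f32 values in `elapsed_ms` milliseconds.
fn effective_bandwidth_gb_s(n: usize, elapsed_ms: f64) -> f64 {
    // Assumption: one 4-byte read plus one 4-byte write per element.
    let bytes_moved = (n * std::mem::size_of::<f32>() * 2) as f64;
    let seconds = elapsed_ms / 1000.0;
    bytes_moved / seconds / 1e9
}

fn main() {
    let n = 50_000_000;
    println!("scalar / SSE: {:.1} GB/s", effective_bandwidth_gb_s(n, 23.0));
    println!("AVX:          {:.1} GB/s", effective_bandwidth_gb_s(n, 67.0));
}
```

This works out to roughly 17 GB/s for the 23 ms runs and roughly 6 GB/s for the 67 ms run. If the first figure is close to what the laptop's memory can sustain, the loop may be limited by memory traffic rather than by arithmetic, though that is only a guess from these numbers.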

Here's how I do it with simple f32s:

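use rand::Rng;
use std::time::Instant;

const N: usize = 50_000_000; // 50 million floats, as described above

let mut rng = rand::thread_rng();
let x: Vec<f32> = (0..N).map(|_| rng.gen::<i16>() as f32).collect();

let mut z_native: Vec<f32> = (0..N).map(|_| 0.0).collect();

let t0 = Instant::now();
for i in 0..N {
    // SAFETY: i < N, and both vectors have length N.
    unsafe {
        *z_native.get_unchecked_mut(i) = *x.get_unchecked(i) * std::f32::consts::PI;
    }
}
println!("Native {:?}", t0.elapsed());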

Here's how I do it with SSE:

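use std::arch::x86_64::{self, __m128};

let mut x_sse: Vec<__m128> = vec![];
// SAFETY (all unsafe blocks below): SSE is part of the x86_64 baseline.
let y_sse: __m128 = unsafe { x86_64::_mm_set_ps1(std::f32::consts::PI) };
let mut z_sse: Vec<__m128> = vec![];

// Pack the input four floats at a time (untimed setup).
for i in 0..(N/4) {
    x_sse.push(unsafe { x86_64::_mm_set_ps(
        x[(i*4)+3], x[(i*4)+2],
        x[(i*4)+1], x[(i*4)+0],
    ) });
    z_sse.push(unsafe { x86_64::_mm_setzero_ps() });
}

let t0 = Instant::now();
for i in 0..(N/4) {
    // SAFETY: i < N/4, and both vectors have length N/4.
    unsafe {
        *z_sse.get_unchecked_mut(i) = x86_64::_mm_mul_ps(*x_sse.get_unchecked(i), y_sse);
    }
}
println!("SSE {:?}", t0.elapsed());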
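
For comparison, here is a minimal sketch that runs the SSE multiply directly over f32 slices with unaligned loads and stores, instead of first copying into a Vec<__m128>. The function name scale_sse is hypothetical, and this is just one possible arrangement, not a known fix for the timing.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_loadu_ps, _mm_mul_ps, _mm_set1_ps, _mm_storeu_ps};

/// Hypothetical helper: writes `input[i] * k` into `output[i]`, four lanes at a time.
#[cfg(target_arch = "x86_64")]
fn scale_sse(input: &[f32], output: &mut [f32], k: f32) {
    assert_eq!(input.len(), output.len());
    let mut src_chunks = input.chunks_exact(4);
    let mut dst_chunks = output.chunks_exact_mut(4);
    // SAFETY: SSE is baseline on x86_64, and every chunk holds exactly four
    // f32 values, so each 16-byte unaligned load/store stays in bounds.
    unsafe {
        let kv = _mm_set1_ps(k);
        for (src, dst) in (&mut src_chunks).zip(&mut dst_chunks) {
            let v = _mm_loadu_ps(src.as_ptr());
            _mm_storeu_ps(dst.as_mut_ptr(), _mm_mul_ps(v, kv));
        }
    }
    // Scalar tail for lengths that are not a multiple of four.
    for (src, dst) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *dst = *src * k;
    }
}

/// Portable fallback so the sketch still builds on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
fn scale_sse(input: &[f32], output: &mut [f32], k: f32) {
    for (src, dst) in input.iter().zip(output.iter_mut()) {
        *dst = *src * k;
    }
}

fn main() {
    let x: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let mut z = vec![0.0f32; x.len()];
    scale_sse(&x, &mut z, std::f32::consts::PI);
    println!("{:?}", z);
}
```

Working on the original f32 buffers avoids the separate packing step, and the unaligned load/store intrinsics do not require 16-byte alignment of the slices.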

Is there something I'm missing about how this is supposed to work? I know memory alignment makes a difference, but I read that a Vec's buffer is always aligned to the alignment of its element type. I also print a few randomly selected values from the results at the end to make sure the computation doesn't get optimized out. Thanks!
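
The alignment point can be checked directly. Below is a small sketch, with a made-up helper name is_aligned_to, that prints the alignment each element type requires and tests where the buffers actually start.

```rust
use std::mem::align_of;

/// Hypothetical helper: true if the buffer starts at an address divisible by `bytes`.
fn is_aligned_to<T>(buf: &[T], bytes: usize) -> bool {
    (buf.as_ptr() as usize) % bytes == 0
}

fn main() {
    let floats: Vec<f32> = vec![1.0; 16];
    println!("align_of::<f32>() = {}", align_of::<f32>());
    // Guaranteed: the buffer satisfies the alignment of f32.
    println!("f32 buffer aligned for f32: {}", is_aligned_to(&floats, align_of::<f32>()));
    // Not guaranteed: the allocator may or may not hand back a 16-byte-aligned block.
    println!("f32 buffer 16-byte aligned: {}", is_aligned_to(&floats, 16));

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::__m128;
        let packed: Vec<__m128> = Vec::with_capacity(4);
        println!("align_of::<__m128>() = {}", align_of::<__m128>());
        println!(
            "__m128 buffer aligned for __m128: {}",
            is_aligned_to(&packed, align_of::<__m128>())
        );
    }
}
```

So a Vec<__m128> is suitably aligned for aligned SSE loads, while a Vec<f32> is only guaranteed 4-byte alignment.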

Edit: I built with cargo run --release, so I don't know the exact compiler flags; it's whatever Cargo defaults to in release mode.
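
One way to see what the build actually enabled is to print it from the program itself. This is a minimal sketch; the function name build_summary is an assumption for illustration, and it reports only a few x86 features.

```rust
/// Hypothetical helper: summarizes settings that were fixed at compile time.
fn build_summary() -> String {
    format!(
        "debug_assertions={} sse2={} avx={} avx2={}",
        cfg!(debug_assertions),          // false in Cargo's default release profile
        cfg!(target_feature = "sse2"),   // features the compiler was allowed to use
        cfg!(target_feature = "avx"),
        cfg!(target_feature = "avx2"),
    )
}

fn main() {
    println!("compile time: {}", build_summary());

    // What the CPU running the program supports, regardless of compiler flags.
    #[cfg(target_arch = "x86_64")]
    println!(
        "runtime:      avx={} avx2={}",
        is_x86_feature_detected!("avx"),
        is_x86_feature_detected!("avx2")
    );
}
```

On x86_64 targets SSE2 is part of the baseline, so the compiler is free to use it even in the plain f32 loop, whereas AVX is normally enabled only with something like -C target-feature=+avx or -C target-cpu=native.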
