How is a Rust --release build slower than Go?


I'm trying to learn about Rust's concurrency and parallel computing, so I threw together a small program that iterates over a vector of vectors as if it were an image's pixels. At first I wanted to see how much faster iter vs par_iter would be, so I threw in a basic timer -- which is probably not amazingly accurate. However, I was getting crazy high numbers, so I put together a similar piece of code in Go, which allows for easy concurrency, and it ran ~585% faster!

Rust was tested with --release

I also tried using a native thread pool, but the results were the same. I looked at how many threads I was using and messed around with that for a bit as well, to no avail.
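
For reference, I was changing the thread count through rayon's ThreadPoolBuilder, with something along these lines (the 8 is just a placeholder):

use rayon::ThreadPoolBuilder;

// Pin rayon's global pool to a fixed number of threads; this must run
// before any parallel work and can only be called once per process.
ThreadPoolBuilder::new()
    .num_threads(8)
    .build_global()
    .expect("failed to build global thread pool");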

What am I doing wrong? (Don't mind the definitely-not-performant way of creating a random-value-filled vector of vectors; a tidier setup is sketched below, but it sits outside the timed region anyway.)
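
That tidier setup might look like this -- same Vec<Vec<u16>> shape as in the full code below:

// Setup-only sketch: reuse one RNG handle instead of calling
// rand::thread_rng() for every element.
let mut rng = rand::thread_rng();
let fake_image: Vec<Vec<u16>> = (0..pixel_size)
    .map(|_| (0..4).map(|_| rng.gen_range(0..=u16::MAX)).collect())
    .collect();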

Rust code (~140ms)

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

Go code (~24ms)

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func normalise(value uint16, min uint16, max uint16) float32 {
    return float32(value-min) / float32(max-min)
}

func main() {
    const pixelSize = 9000000
    var fakeImage [][]uint16

    // Create a new random number generator
    src := rand.NewSource(time.Now().UnixNano())
    rng := rand.New(src)

    for i := 0; i < pixelSize; i++ {
        var pixel []uint16
        for j := 0; j < 4; j++ {
            pixel = append(pixel, uint16(rng.Intn(1<<16)))
        }
        fakeImage = append(fakeImage, pixel)
    }

    normalised_image := make([][4]float32, pixelSize)
    var wg sync.WaitGroup

    // Time starts now
    now := time.Now()
    chunkSize := 300_000
    numChunks := pixelSize / chunkSize
    if pixelSize%chunkSize != 0 {
        numChunks++
    }

    for i := 0; i < numChunks; i++ {
        wg.Add(1)

        go func(i int) {
            // Loop through the pixels in the chunk
            for j := i * chunkSize; j < (i+1)*chunkSize && j < pixelSize; j++ {
                // Normalise the pixel values
                _r := normalise(fakeImage[j][0], 0, ^uint16(0))
                _g := normalise(fakeImage[j][1], 0, ^uint16(0))
                _b := normalise(fakeImage[j][2], 0, ^uint16(0))
                _a := normalise(fakeImage[j][3], 0, ^uint16(0))

                // Set the pixel values
                normalised_image[j][0] = _r
                normalised_image[j][1] = _g
                normalised_image[j][2] = _b
                normalised_image[j][3] = _a
            }

            wg.Done()
        }(i)
    }

    wg.Wait()

    elapsed := time.Since(now)
    fmt.Println("Time taken:", elapsed)
}

There are 3 answers below.

Mark Saving (best answer)

The most important initial change for speeding up your Rust code is using the right type. In Go, you use a [4]float32 to represent an RGBA quadruple, while in Rust you use a Vec<f32>. The right type for performance is [f32; 4], an array known to contain exactly 4 floats. An array of known size need not be heap-allocated, while a Vec is always heap-allocated. This improves your performance drastically - on my machine, it's a factor of 8 difference.
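
A quick way to see the difference, on a typical 64-bit target:

    fn main() {
        // [f32; 4] stores its four floats inline: 16 bytes, no heap allocation.
        assert_eq!(std::mem::size_of::<[f32; 4]>(), 16);
        // Vec<f32> is a (pointer, capacity, length) header on 64-bit targets,
        // and the floats themselves live in a separate heap allocation.
        assert_eq!(std::mem::size_of::<Vec<f32>>(), 24);
    }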

Original snippet:

    let fake_image: Vec<Vec<u16>> = (0..pixel_size).map(|_| {
        (0..4).map(|_| {
            rand::thread_rng().gen_range(0..=u16::MAX)
        }).collect()
    }).collect();

... 

    let _normalised_image: Vec<Vec<Vec<f32>>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<Vec<f32>> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            vec![r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

New snippet:

    let fake_image: Vec<[u16; 4]> = (0..pixel_size).map(|_| {
        let mut result: [u16; 4] = Default::default();
        result.fill_with(|| rand::thread_rng().gen_range(0..=u16::MAX));
        result
    }).collect();

...

    let _normalised_image: Vec<Vec<[f32; 4]>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<[f32; 4]> = chunk.iter().map(|i| {
            let r = normalise(i[0], 0, u16::MAX);
            let g = normalise(i[1], 0, u16::MAX);
            let b = normalise(i[2], 0, u16::MAX);
            let a = normalise(i[3], 0, u16::MAX);
            
            [r, g, b, a]
        }).collect();

        normalised_chunk
    }).collect();

On my machine, this results in a roughly 7.7x speedup, bringing Rust and Go roughly to parity. The overhead of doing a heap allocation for every single quadruple slowed Rust down drastically and drowned out everything else; eliminating this puts Rust and Go on more even footing.

Second, make sure the two programs do the same amount of work. Your Rust code calculates a normalized r, g, b, and a; if the Go code only calculated _r, _g, and _b, it would get a slight unfair advantage over Rust, since it would be doing less work. (The Go code shown above does compute all four channels.)

Third, you are still not quite doing the same thing in Rust and Go. In Rust, you split the original image into chunks and, for each chunk, produce a Vec<[f32; 4]>. This means you still have a bunch of chunks sitting around in memory that you'll later have to combine into a single final image. In Go, you split the original image into chunks and, for each chunk, write the results into a common array. We can rewrite your Rust code further to mimic the Go code exactly. Here is what this looks like in Rust:

let _normalized_image: Vec<[f32; 4]> = {
    let mut destination = vec![[0_f32; 4]; pixel_size];

    fake_image
        .par_chunks(chunk_size)
        // The "zip" function lets us iterate over a chunk of the input
        // array together with the matching chunk of the destination array.
        .zip(destination.par_chunks_mut(chunk_size))
        .for_each(|(i_chunk, d_chunk)| {
            // Sanity check: the chunks should be of equal length.
            assert!(i_chunk.len() == d_chunk.len());
            for (i, d) in i_chunk.iter().zip(d_chunk) {
                let r = normalise(i[0], 0, u16::MAX);
                let g = normalise(i[1], 0, u16::MAX);
                let b = normalise(i[2], 0, u16::MAX);
                let a = normalise(i[3], 0, u16::MAX);

                *d = [r, g, b, a];

                // Alternatively, we could write the inner loop as:
                // for j in 0..4 {
                //     d[j] = normalise(i[j], 0, u16::MAX);
                // }
            }
        });
    destination
};

Now your Rust code and your Go code are truly doing the same thing. I suspect you'll find the Rust code is slightly faster.

Finally, if you were doing this in real life, the first thing you should try would be using map as follows:

    let _normalized_image = fake_image.par_iter().map(|&[r, g, b, a]| {
        [ normalise(r, 0, u16::MAX),
          normalise(g, 0, u16::MAX),
          normalise(b, 0, u16::MAX),
          normalise(a, 0, u16::MAX),
        ]
    }).collect::<Vec<_>>();

This is just as fast as manually chunking on my machine.
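
That makes sense: rayon's collect on an indexed parallel iterator writes each result directly into its final slot in the destination Vec, so manual chunking buys nothing here.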

trust_nickol

use rand::Rng;
use std::time::Instant;
use rayon::prelude::*;

fn normalise(value: u16, min: u16, max: u16) -> f32 {
    (value - min) as f32 / (max - min) as f32
}

type PixelU16 = (u16, u16, u16, u16);
type PixelF32 = (f32, f32, f32, f32);

fn main() {
    let pixel_size = 9_000_000;
    let fake_image: Vec<PixelU16> = (0..pixel_size).map(|_| {
        let mut rng = rand::thread_rng();
        (rng.gen_range(0..=u16::MAX), rng.gen_range(0..=u16::MAX),
         rng.gen_range(0..=u16::MAX), rng.gen_range(0..=u16::MAX))
    }).collect();

    // Time starts now.
    let now = Instant::now();

    let chunk_size = 300_000;

    let _normalised_image: Vec<Vec<PixelF32>> = fake_image.par_chunks(chunk_size).map(|chunk| {
        let normalised_chunk: Vec<PixelF32> = chunk.iter().map(|i| {
            let r = normalise(i.0, 0, u16::MAX);
            let g = normalise(i.1, 0, u16::MAX);
            let b = normalise(i.2, 0, u16::MAX);
            let a = normalise(i.3, 0, u16::MAX);

            (r, g, b, a)
        }).collect::<Vec<_>>();

        normalised_chunk
    }).collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed: {:.2?}", elapsed);
}

I switched the inner type from a Vec to a tuple, and on my machine this is already about 10 times faster than the solution you provided. Like a [u16; 4] array, a (u16, u16, u16, u16) tuple is stored inline with no per-pixel heap allocation, which is where the speedup comes from. Speed could maybe be increased even further by cutting out the outer Vecs and using an Arc<Mutex<Vec<Pixel>>> or an mpsc channel to reduce the number of heap allocations.

YthanZhang

Vec<Vec<T>> is usually not recommended because it's not very cache friendly, and since you have Vec<Vec<Vec<T>>>, the situation is even worse.

The memory allocations themselves also cost a lot of time.

A small change is to switch the type to Vec<Vec<[T; N]>>: the innermost Vec<T> always holds exactly 4 u16s or f32s, so it can be a fixed-size array. This alone reduced the processing time on my PC from ~110ms down to ~11ms.

fn rev1() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<Vec<[f32; 4]>> = fake_image
        .par_chunks(chunk_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
                .collect()
        })
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r1): {:.2?}", elapsed);
}

However, this still requires a lot of allocations and copies. If a new vector is not needed, in-place mutation can be even faster: ~5ms.
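
rev2 and rev4 below call a normalise_f32 helper that isn't shown here; a minimal definition, mirroring the u16 normalise above, would be:

// Assumed helper: f32 counterpart of `normalise`, used by rev2 and rev4.
fn normalise_f32(value: f32, min: f32, max: f32) -> f32 {
    (value - min) / (max - min)
}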

pub fn rev2() {
    let pixel_size = 9_000_000;
    let chunk_size = 300_000;
    let mut fake_image: Vec<Vec<[f32; 4]>> = (0..pixel_size / chunk_size)
        .map(|_| {
            (0..chunk_size)
                .map(|_| {
                    core::array::from_fn(|_| {
                        rand::thread_rng().gen_range(0..=u16::MAX) as f32
                    })
                })
                .collect()
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|chunk| {
        chunk.iter_mut().for_each(|rgba: &mut [f32; 4]| {
            rgba.iter_mut().for_each(|v: &mut _| {
                *v = normalise_f32(*v, 0f32, u16::MAX as f32)
            })
        })
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r2): {:.2?}", elapsed);
}

Here the outer Vec<Vec<T>> is still not ideal: accessing an element through the nested structure is slower than through a flat array, although flattening doesn't produce a significant performance improvement in this particular situation.

/// Create a new flat Vec from fake_image
pub fn rev3() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let fake_image: Vec<[u16; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| rand::thread_rng().gen_range(0..=u16::MAX))
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    let _normalized_image: Vec<[f32; 4]> = fake_image
        .par_iter()
        .map(|rgba: &[u16; 4]| rgba.map(|v| normalise(v, 0, u16::MAX)))
        .collect();

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r3): {:.2?}", elapsed);
}

/// In place mutation of a flat Vec
pub fn rev4() {
    let pixel_size = 9_000_000;
    let _chunk_size = 300_000;

    let mut fake_image: Vec<[f32; 4]> = (0..pixel_size)
        .map(|_| {
            core::array::from_fn(|_| {
                rand::thread_rng().gen_range(0..=u16::MAX) as f32
            })
        })
        .collect();

    // Time starts now.
    let now = Instant::now();

    fake_image.par_iter_mut().for_each(|rgba: &mut [f32; 4]| {
        rgba.iter_mut()
            .for_each(|v: &mut _| *v = normalise_f32(*v, 0f32, u16::MAX as f32))
    });

    // Timer ends.
    let elapsed = now.elapsed();
    println!("Time elapsed (r4): {:.2?}", elapsed);
}
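
For completeness, a minimal driver for the four revisions could look like this, assuming the file-level imports from the question (use rand::Rng; use rayon::prelude::*; use std::time::Instant;) plus the normalise and normalise_f32 helpers shown above:

// Runs each revision in turn and prints its timing line.
fn main() {
    rev1();
    rev2();
    rev3();
    rev4();
}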