fmin and fmax are much slower than a simple conditional operator


I was working on some C++ code to process video frames and found that std::fmin and std::fmax are much slower than a simple conditional operator. I've simplified my code as follows (and made it more C++-style, as suggested in the comments):

#include <cmath>
#include <chrono>
#include <iostream>
#include <memory>

void func()
{
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto mem = ptr.get();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                // *mem++ = (tmp > 0.f ? (tmp < 255.f ? tmp : 255.f) : 0.f) + 0.5f;
                *mem++ = std::round(std::fmin(255.f, std::fmax(0.f, tmp)));
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "cost " << std::chrono::duration<double, std::milli>(end - start).count() << "ms\n";
}

int main() {
    int i = 5;
    while (i--) func();
}

This code reports a cost of about 20ms on my machine (compiled with g++ -O3 test.cpp):

cost 21.1324ms
cost 20.8892ms
cost 19.9664ms
cost 19.9693ms
cost 19.9603ms

And if I replace the standard library math functions with my own code (by uncommenting the line above), the cost drops to about 4ms:

cost 3.90695ms
cost 3.48335ms
cost 3.02623ms
cost 2.65635ms
cost 2.76906ms

I've also tried std::fmin, std::fmax (and std::round) separately; each of them is much slower on its own. For example, *mem++ = std::fmax(0.f, tmp); gives:

cost 9.31014ms
cost 8.86421ms
cost 7.8366ms
cost 7.86914ms
cost 7.82036ms

versus *mem++ = tmp > 0.f ? tmp : 0.f;, which gives:

cost 3.50026ms
cost 3.05906ms
cost 2.33485ms
cost 2.36281ms
cost 2.38488ms

As for std::round: if I simply remove it and run *mem++ = std::fmin(255.f, std::fmax(0.f, tmp));, the cost improves by about 7ms:

cost 13.4067ms
cost 13.2468ms
cost 12.1877ms
cost 12.2698ms
cost 12.1878ms

Two things confuse me:

  1. I thought std::fmin, std::fmax, and std::round were all constexpr, so there should be no function-call overhead.
  2. I know std::round does more than simply adding 0.5f and assigning to an integer, but it is still much slower than I expected.

g++ -v on my system (Ubuntu 20.04, x86-64):

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

There is 1 answer below.

Answered by RandomBits

As always with performance questions, the hardware and the software stack are both important, and empirical measurement is the arbiter of truth. In this particular instance, the platform makes a big difference that is not intuitive or easily predicted.

I used nanobench to test three different options for the computation in question. Here are the results for two different platforms (one arm64, one x86).

M1, macOS 13.4, Clang-16

        ns/op    op/s   err%  total  benchmark
 3,166,542.00  315.80  13.1%   0.04  compare
 1,988,667.00  502.85   7.3%   0.02  round-fminmax
 1,911,292.00  523.21   3.6%   0.02  clamp

Xeon, Ubuntu 20.04, Clang-17

         ns/op    op/s  err%  total  benchmark
  6,763,898.00  147.84  0.5%   0.08  compare
 10,629,358.00   94.08  0.2%   0.13  round-fminmax
  5,131,994.00  194.86  0.0%   0.06  clamp

Sample Code

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include "nanobench.h"

template<class T>
auto clamp(const T& v, const T& lo, const T& hi) {
    return v < lo ? lo : hi < v ? hi : v;
}

template<class Op>
void func(uint8_t *mem, Op&& op)
{
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                *mem++ = op(tmp);
            }
        }
    }
}

int main() {
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto *mem = ptr.get();

    ankerl::nanobench::Bench().run("compare", [&]() {
        func(mem, [](float x) {
            return (x > 0 ? (x < 255 ? x : 255) : 0) + 0.5;
        });
    });

    ankerl::nanobench::Bench().run("round-fminmax", [&]() {
        func(mem, [](float x) {
            return std::round(std::fmin(255.f, std::fmax(0.f, x)));
        });
    });

    ankerl::nanobench::Bench().run("clamp", [&]() {
        func(mem, [](float x) {
            return clamp(x + 0.5f, 0.0f, 255.0f);
        });
    });
}