fmin and fmax are much slower than a simple conditional operator


I was working on some C++ code to process video frames and found that std::fmin and std::fmax are much slower than a simple conditional operator. I've simplified my code as follows (and made it more C++-style, as suggested in the comments):

#include <cmath>
#include <chrono>
#include <iostream>
#include <memory>

void func()
{
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto mem = ptr.get();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                // *mem++ = (tmp > 0.f ? (tmp < 255.f ? tmp : 255.f) : 0.f) + 0.5f;
                *mem++ = std::round(std::fmin(255.f, std::fmax(0.f, tmp)));
            }
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "cost " << std::chrono::duration<double, std::milli>(end - start).count() << "ms\n";
}

int main() {
    int i = 5;
    while (i--) func();
}

This code reports a cost of about 20ms on my machine (compiled with g++ -O3 test.cpp):

cost 21.1324ms
cost 20.8892ms
cost 19.9664ms
cost 19.9693ms
cost 19.9603ms

And if I replace the standard library math functions with my own code (by uncommenting the line above), the cost drops to about 4ms:

cost 3.90695ms
cost 3.48335ms
cost 3.02623ms
cost 2.65635ms
cost 2.76906ms

I've also tried std::fmin, std::fmax (and std::round) separately; each of them is much slower on its own. For example, *mem++ = std::fmax(0.f, tmp); gives:

cost 9.31014ms
cost 8.86421ms
cost 7.8366ms
cost 7.86914ms
cost 7.82036ms

versus *mem++ = tmp > 0.f ? tmp : 0.f;, which gives:

cost 3.50026ms
cost 3.05906ms
cost 2.33485ms
cost 2.36281ms
cost 2.38488ms

As for std::round: if I simply remove it and run *mem++ = std::fmin(255.f, std::fmax(0.f, tmp));, the cost improves by about 7ms:

cost 13.4067ms
cost 13.2468ms
cost 12.1877ms
cost 12.2698ms
cost 12.1878ms

Two things confuse me:

  1. I thought std::fmin, std::fmax, and std::round were all constexpr, so there should be no function-call overhead.
  2. I know std::round does more than simply adding 0.5f and assigning to an integer, but it is still much slower than I expected.

g++ -v on my system (Ubuntu 20.04, x86-64):

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.4.0-1ubuntu1~20.04.1' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-Av3uEd/gcc-9-9.4.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

There is 1 answer below.

Answered by RandomBits

As always with performance questions, the hardware and the software stack are both important, and empirical measurement is the arbiter of truth. In this particular instance, the platform makes a big difference that is not intuitive or easily predicted.

I used nanobench to test three different options for the computation in question. Here are the results for two different platforms (one arm64, one x86).

M1, macOS 13.4, Clang-16

        ns/op    op/s   err%  total  benchmark
 3,166,542.00  315.80  13.1%   0.04  compare
 1,988,667.00  502.85   7.3%   0.02  round-fminmax
 1,911,292.00  523.21   3.6%   0.02  clamp

Xeon, Ubuntu 20.04, Clang-17

         ns/op    op/s  err%  total  benchmark
  6,763,898.00  147.84  0.5%   0.08  compare
 10,629,358.00   94.08  0.2%   0.13  round-fminmax
  5,131,994.00  194.86  0.0%   0.06  clamp

Sample Code

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include "nanobench.h"

template<class T>
auto clamp(const T& v, const T& lo, const T& hi) {
    return v < lo ? lo : hi < v ? hi : v;
}

template<class Op>
void func(uint8_t *mem, Op&& op)
{
    for (int i = 0; i != 720; ++i) {
        for (int j = 0; j != 1280; ++j) {
            for (int k = 0; k != 3; ++k) {
                float tmp = i + j + k;
                *mem++ = op(tmp);
            }
        }
    }
}

int main() {
    std::unique_ptr<uint8_t[]> ptr(new uint8_t[1280 * 720 * 3]);
    auto *mem = ptr.get();

    ankerl::nanobench::Bench().run("compare", [&]() {
        func(mem, [](float x) {
            return (x > 0 ? (x < 255 ? x : 255) : 0) + 0.5;
        });
    });

    ankerl::nanobench::Bench().run("round-fminmax", [&]() {
        func(mem, [](float x) {
            return std::round(std::fmin(255.f, std::fmax(0.f, x)));
        });
    });

    ankerl::nanobench::Bench().run("clamp", [&]() {
        func(mem, [](float x) {
            return clamp(x + 0.5f, 0.0f, 255.0f);
        });
    });
}