Consider the following trivial example:
#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <QRandomGenerator>
#include <QDateTime>
#include <QVector>
#include <functional>

inline void reader(int * d, int c, std::function<void(int *)> foo) {
    while (c--) foo(d++);
}

template <class FOO>
inline void readerT(int * d, int c, FOO foo) {
    while (c--) foo(d++);
}

inline int summer(int * d, int c) {
    int s = 0;
    reader(d, c, [&s](int * dd){ s += *dd; });
    return s;
}

inline int summerT(int * d, int c) {
    int s = 0;
    readerT(d, c, [&s](int * dd){ s += *dd; });
    return s;
}

int main(int argc, char* argv[]) {
    QCoreApplication a(argc, argv);
    int tcount = 100000000;
    QVector<int> vec; vec.resize(tcount);
    QRandomGenerator gen = QRandomGenerator::securelySeeded();
    gen.generate(vec.begin(), vec.end());
    QElapsedTimer t;
    t.restart();
    qDebug() << summer(vec.data(), tcount) << t.elapsed(); // 110 ms
    t.restart();
    qDebug() << summerT(vec.data(), tcount) << t.elapsed(); // 15 ms
    t.restart();
    int sum = 0;
    for (const auto & i : vec) { sum += i; }
    qDebug() << sum << t.elapsed(); // 15 ms
    return a.exec();
}
The test makes it evident that the non-template solution is considerably slower, whereas the template solution appears to be as fast as the regular loop, i.e. it has zero overhead.
Why is the compiler not able to reach the same level of optimization for the non-template version?
Note: compiled with GCC 11 at -O3.
The problem boils down to the fact that std::function's call mechanism is too complex to be reliably inlined. Among the things necessary for an efficient std::function implementation are type erasure of the stored callable, an indirect call through a stored function pointer, and a heap allocation for callables that do not fit the small-object buffer. For the compiler to undo all of these things, it would need to apply heap elision, constant propagation of the erased pointer, and inlining of the resulting indirect call pretty much perfectly. Modern compilers are still quite bad at optimizing anything to do with function pointers. You're simply expecting too much.
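To make those moving parts concrete, here is a minimal sketch of the kind of type-erasing wrapper std::function has to be. ErasedFn is a hypothetical name, and real implementations also add a small-buffer optimization and copy support; this is only an illustration of the indirections involved.

#include <utility>

// Hypothetical sketch of a type-erasing callable wrapper, showing the
// parts a std::function-like type needs (real ones also add a
// small-buffer optimization and copy/move support).
template <class R, class Arg>
class ErasedFn {
public:
    template <class F>
    ErasedFn(F f)
        : obj_(new F(std::move(f))),                 // heap allocation for the callable
          call_([](void * o, Arg a) -> R {           // type erasure: the concrete type F
              return (*static_cast<F *>(o))(a);      // survives only inside these pointers
          }),
          destroy_([](void * o) { delete static_cast<F *>(o); }) {}

    ErasedFn(const ErasedFn &) = delete;
    ErasedFn & operator=(const ErasedFn &) = delete;
    ~ErasedFn() { destroy_(obj_); }

    // Every invocation is an indirect call through call_; to inline the
    // stored lambda, the compiler must first prove what call_ points to.
    R operator()(Arg a) const { return call_(obj_, a); }

private:
    void * obj_;
    R (*call_)(void *, Arg);
    void (*destroy_)(void *);
};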
On the other hand, the template version is very easily inlined, as seen in the assembly output (https://godbolt.org/z/vvG3fnT5f). If you can inline reader, that enables partial loop unrolling, auto-vectorization, data flow analysis, and other optimizations which improve performance by an order of magnitude. Inlining makes or breaks it all.

Example of Missed Optimizations
Just to illustrate how bad modern compilers are in this field:
Neither Clang nor GCC inlines the call in the linked example (https://godbolt.org/z/zqjeG5xff); both emit an actual indirect call through the function pointer, even though the theoretical optimum in this very trivial example would be to resolve the target and inline the call outright.
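The snippet behind the link is not reproduced here, but a comparable minimal case (an assumed illustration, not the exact code from the link) is the pattern from the question itself: even with the lambda and the std::function fully visible in one translation unit, GCC 11 at -O3 leaves the type-erased dispatch in the hot loop, which is exactly what the 110 ms vs. 15 ms numbers above show.

#include <functional>

// Assumed minimal repro of the pattern measured above: everything is
// visible to the compiler, yet the indirect dispatch of std::function
// is not reduced to a plain summation loop (hence the ~7x slowdown in
// the benchmark from the question).
inline int sum_via_std_function(const int * d, int c) {
    int s = 0;
    std::function<void(const int *)> foo = [&s](const int * dd) { s += *dd; };
    while (c--) foo(d++);
    return s;
}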
This trivial example is asking too much, and inlining all of a std::function call dispatch perfectly is asking even more.

Heap elision and function pointer inlining are generally not something compiler writers spend much effort on, since it is reasonable to assume that if a developer is using function pointers and dynamic allocations, they probably need them, and the compiler will not be able to simply optimize them out.