C++: efficient compile-time looping and Boost::hana::while_() runtime cost


Problem statement

I have been experimenting with strategies to resolve loops at compile time with C++. The algorithmic problem I aim to solve is as follows:

given an integer 4-tuple I = (i1,i2,i3,i4) with components between 0 and 3(d-1) such that i1+i2+i3+i4 = 3(d-1), find all J = (j1,j2,j3,j4), K = (k1,k2,k3,k4) and L = (l1,l2,l3,l4) with components between 0 and d-1, each summing to d-1, such that I = J + K + L.

Once I have found one such J,K,L, I have a mathematical expression to compute. So the flow of the program is:

input I
for J <= I
    for K <= I - J s.t. I - J - K >= 0
         L = I-J-K
         compute<J,K,L>(data)
    endfor
endfor
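The pseudocode above can be sketched as a purely-runtime enumeration. Here `enumerate`, the global `d`, and the counter are illustrative stand-ins I'm adding (the innermost statement in the real program would be `compute<J,K,L>(data)`), and the early `continue`s are for brevity only; the actual loop bounds can be tightened to avoid them:

```cpp
#include <array>

// Purely-runtime sketch of the pseudocode above. `enumerate` and the
// counter are illustrative stand-ins: in the real program the innermost
// statement would be compute<J,K,L>(data). Early `continue`s are used
// for brevity; the actual loop bounds can be tightened to avoid them.
constexpr int d = 2; // assumption: small example dimension

int enumerate(const std::array<int, 4>& I) {
    int count = 0;
    for (int j1 = 0; j1 < d; ++j1)
    for (int j2 = 0; j2 < d; ++j2)
    for (int j3 = 0; j3 < d; ++j3) {
        const int j4 = (d - 1) - j1 - j2 - j3; // J must sum to d-1
        if (j4 < 0 || j4 >= d) continue;
        for (int k1 = 0; k1 < d; ++k1)
        for (int k2 = 0; k2 < d; ++k2)
        for (int k3 = 0; k3 < d; ++k3) {
            const int k4 = (d - 1) - k1 - k2 - k3; // K must sum to d-1
            if (k4 < 0 || k4 >= d) continue;
            // L = I - J - K; every component must lie in [0, d-1]
            const int l1 = I[0] - j1 - k1, l2 = I[1] - j2 - k2,
                      l3 = I[2] - j3 - k3, l4 = I[3] - j4 - k4;
            if (l1 < 0 || l1 >= d || l2 < 0 || l2 >= d ||
                l3 < 0 || l3 >= d || l4 < 0 || l4 >= d) continue;
            ++count; // here: compute<J,K,L>(data)
        }
    }
    return count;
}
```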

I have no trouble walking the set of multi-indices of interest and can do it with simple for loops (6 nested) free of breaks or continues. I have previously observed runtime looping of this sort to be too slow for my requirements; benchmark below illustrates this. At the opposite end, I have computed by hand the indices at play for d = 2 and a routine carrying out a sequence of compute<J,K,L> with hard-coded J,K,L is very fast. I am now looking for a solution using compile-time logic to conserve speed comparable to hand-coded solutions while allowing more generic implementations (for any d, though tuples are always 4-tuples).

Attempted solution

In a previous question, I asked about compile-time for loops. The solution was as follows:

[&]<int... jj1>(std::integer_sequence<int, jj1...>) {
    ([&]<int j1>() {
        // loop body with j1 as the constexpr index
    }.template operator()<jj1>(), ...);
}(std::make_integer_sequence<int, N>{});
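As a self-contained sketch of this construct (C++20; the sum-of-squares body is illustrative only, not from my program):

```cpp
#include <utility>

// Self-contained sketch of the integer_sequence construct above (C++20).
// The sum-of-squares body is illustrative only; in the real program the
// body would be one of the nested index loops. Each index j is a genuine
// compile-time constant inside the inner lambda.
template <int N>
int sum_of_squares() {
    int total = 0;
    [&]<int... jj>(std::integer_sequence<int, jj...>) {
        ([&]<int j>() {
            total += j * j; // loop body, j usable as a constexpr index
        }.template operator()<jj>(), ...);
    }(std::make_integer_sequence<int, N>{});
    return total;
}
```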

with N the loop bound. Since then, I have found an alternative method using Boost.Hana's hana::while_:

hana::while_(hana::less_equal.than(hana::int_c<N>), 0_c, [&](auto jj1) {
    constexpr int j1 = jj1;
    // loop body
    return jj1 + 1_c;
});

From reading the documentation, my understanding was that Hana is a template meta-programming library with exclusively compile-time logic. Using such hana::while_ loops, my program now consists of 6 nested loops with a single runtime instruction at the deepest level:

hana::while_(auto j1_c) 
    hana::while_(auto j2_c) 
        hana::while_(auto j3_c) 
            hana::while_(auto k1_c) 
                hana::while_(auto k2_c) 
                    hana::while_(auto k3_c) 
                        constexpr this;
                        constexpr that;
                        not_constexpr_function<this,that>(dynamicdata);
// 6 hana::while_ termination

Each of those hana::while_ loops is like the one above; I wrote them more succinctly here for clarity. There is only one line in the entire program which is not constexpr: the call using dynamic data.

I implemented this, an alternate version using the other constexpr for-loop construction (labelled integer_seq), a version with exclusively runtime logic (labelled runtime), and a hand-written routine where I copy-pasted printf results from the hana version (or any other) and hard-coded the indices in the call to compute. To be clear, this means I just unrolled the loops by hand, and the computations are done in exactly the same order. Observed speeds are as follows:

                 gcc -Ofast -flto (Linux)    clang -Ofast -flto (macOS)
Runtime                 673,963 /s                   357,125 /s
Hana::while_          3,567,962 /s                 1,048,172 /s
integer_seq           1,373,362 /s                 compile fails
Hand-written         12,743,705 /s                 3,485,922 /s

EDIT: I had originally posted hand-coded times for a routine ported from Fortran with a different order of operations. I rewrote it in the exact order of operations as this C++ Boost.Hana implementation, and it happens to be faster by about 1.5x (probably fewer cache misses).

hana::while_ is faster than runtime logic by a factor of 3-5 but still slower than hand-written code by a factor of ~3. This is a lot of overhead for supposedly compile-time-only machinery. The fact that the inner function is templated on the indices makes me think the logic really is operating at compile time. But then, why the 3x slowdown?

EDIT2: I was asked about optimization flags. The initial table was generated using either gcc on Linux or clang on MacOS (and a different machine) with -Ofast and -flto. I hadn't enabled -march=native though, so here goes (only relevant ones):

                 gcc -Ofast -flto -march=native (Linux)    clang -Ofast -flto -march=native (macOS)
Hana::while_           4,224,602 /s                              1,980,352 /s
Hand-written          14,318,259 /s                              4,385,473 /s

This shifts the times slightly further in favour of hana::while_, but it still lags behind the hand-written version (which is now actually machine-generated code) by 2-3x.

EDIT3: @Jesper_Juhl mentions the dangers of -Ofast. It also seems to benefit intensive floating point operations the most, and this option won't be used in production. So let's try the more realistic -O3 and see if Hana catches up:

                 gcc -O3 -flto -march=native (Linux)    clang -O3 -flto -march=native (macOS)
Hana::while_           4,160,842 /s                           2,003,093 /s
Hand-written           9,221,753 /s                           5,218,938 /s

The hand-written routine is really line after line of floating-point computations, so it's only natural that it suffered so much from going from -Ofast to -O3. Curiously, hana::while_ was barely slowed down. This suggests that the -Ofast optimizations are not being applied to the same degree: either they are not applied to the computational kernel at all (at least some of the heuristics), or there are inter-call optimizations that cannot be done in this context. Clang also randomly decided to make the hand-written version about 20% faster with -O3 than with -Ofast, which is also very puzzling; I recompiled everything, reran with -Ofast, and indeed obtain about 4.3M/s versus 5.2M/s with -O3.

Question

How is hana::while_ introducing runtime overhead when all the logic is happening at compile time? I thought code within these constructs would be equivalent to unrolled loops with the indices as good as hard-coded.

Is it that calling a templated function with many different index combinations creates so many function instantiations that the program is somehow slowed down? Or does the compiler have trouble optimizing these function calls?

Is there a better way to go about this problem? I've since written a code-generating routine; the resulting code is very fast, but it will be more troublesome to maintain. I'd prefer a purely-C++, within-program solution to this performance problem.

In summary: how can I use C++ meta-programming concepts (compile time computations) to achieve the same performance as hand-written code?


There is 1 answer below.

Answer by davidhigh

This became too long for a comment and doesn't answer the original question.


I would suggest you first need better algorithms, and only then better implementations. In your case, for a single dimension of your 4-tuple, you are looking for the integer partition into exactly three components.

i1 = j1 + k1 + l1

This problem is solved in general in the book Combinatorial Algorithms by Kreher and Stinson. They also provide an algorithm and code which solve this problem; in particular, see Algorithm 3.4 from their book (both the book and the code are freely available online).

Your task is then to transform this algorithm into compile-time meta-programming. Assuming you have it available as PART(N,k), you can apply it as follows:

  • First, you restrict yourself only to ordered tuples i1<=i2<=i3<=i4, and the same for J,K,L. You can generate all those by iterating through PART(3*(d-1),4). The general partitions are obtained by permuting through the 4 components.
  • For any component of the tuple i_n, you generate the partitions PART(i_n,3) and thus obtain j_n, k_n and l_n.
  • The ordered solution is then obtained by combining all the components into the 4-tuple.

This might be more complex, but is nearly independent of the tuple length, whereas your solution scales polynomially in the tuple length.
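As a hedged runtime sketch of what such a PART(n,k) enumerator could look like (this is NOT Kreher and Stinson's Algorithm 3.4, just a plain recursive enumeration; zero-valued parts are allowed here, since the tuple components may equal 0):

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of a PART(n, k) enumerator. This is NOT Kreher and
// Stinson's Algorithm 3.4, just a plain recursive enumeration of all
// non-increasing k-tuples of non-negative integers summing to n
// (zero parts allowed, matching tuple components that may equal 0).
void part_rec(int n, int k, int max_part,
              std::vector<int>& cur,
              std::vector<std::vector<int>>& out) {
    if (k == 0) {
        if (n == 0) out.push_back(cur); // a complete partition of n
        return;
    }
    for (int p = std::min(n, max_part); p >= 0; --p) {
        cur.push_back(p);
        part_rec(n - p, k - 1, p, cur, out); // next part <= p keeps order
        cur.pop_back();
    }
}

std::vector<std::vector<int>> PART(int n, int k) {
    std::vector<std::vector<int>> out;
    std::vector<int> cur;
    part_rec(n, k, n, cur, out);
    return out;
}
```

For instance, PART(3, 4) yields the three ordered tuples (3,0,0,0), (2,1,0,0) and (1,1,1,0), which matches the d = 2 case of the question.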