Kiss_fft is a pretty simple FFT implementation; it's almost textbook. It breaks down larger DFT's into repeated smaller FFT's, by factoring the FFT size. It has optimized butterfly functions for radix 2,3,4 and 5.
The 2,3 and 5 butterflies are small primes, commonly found when factoring, but the radix-4 butterfly is an optimized case that saves a few multiplications compared to using the radix-2 twice*. For instance, for a 32-point FFT, the size 32 is factored into 4x4x2 - two stages of kf_bfly4 followed by one of kf_bfly2
However, this means you can only have one kf_bfly2 stage. If your FFT size was a multiple of 4, you wouldn't have had two kf_bfly2 stages, but a single kf_bfly4 stage. And that also means that the kf_bfly2 function works on "arrays" of length 1.
In code, the declaration is
static void kf_bfly2(kiss_fft_cpx * Fout, const size_t fstride, const kiss_fft_cfg st, int m)
where Fout is an "array" of size m, i.e. always 1. The butterfly loops over Fout, and the compiler of course can't do the numerical analysis to show that m==1. But since this is the last butterfly, it is called quite often.
Q1) Is my analysis correct? Is kf_bfly2 indeed called only with m==1?
Q2) Assuming that this is indeed a missed optimization, there are two possible approaches. We could just drop the m parameter from kf_bfly2, or we could change the factor analysis to factor 32 as 2x4x4 (move kf_bfly2 up front, call it once at top level for arrays of size 4x4=16). What would be best?
[*] The radix-4 butterfly ends up having factors +1,-1, +i and -i, which can be implemented as additions and subtractions instead. See Why is the kiss_fft's forward and inverse radix-4 calculation different?
Edit
The question has a one-off error. The function doesn't follow the usual convention;
m==1is the upper bound. The input is in fact an array of length 2, not a scalar. The remainder of the answer below is correct, and there's indeed an optimization possibleOptimized code (also for C++)
GitHub
Original answer
To understand the value of "m", it helps to understand how Kiss FFT works: it turns a large FFT recursively into smaller FFT's, by factoring the FFT size. And there are multiple possible factorizations: 32 is 4x4x2 but also 2x4x4.
Now that
mparameter inkf_bfly2and the other butterfly functions is the product of all the factors that follow. So, using 32 again as an example,kf_bfly4will be called with m=8 and m=2.I thought that
m=1inkf_bfly2because in my tests it always happened to be the last factor, if it was a factor. If you use a 221 point FFT, the last factor will be 17, not 2.But 34 is factored as 2x17, and in that case
m==17. There is no special reason for this, this just happens to be an artifact of the implementation ofkf_factor. It tests factorization by checkingn%4, n%2, n%3, n%5, n%7, n%9, n%11 .... That's not the most effective way, of course.Checking n%9is pointless as you already found two factors of 3 earlier.But note from the position of
n%4there is freedom in checking the factors. We can simply checkn%2last. If we do this, then 34 will factor as17x2instead of2x17. This may seem like a trivial optimization, but it does allow us to remove the loop inkf_bfly2. In addition, it now only uses the first twiddle factor - and that is a very convenient{1 + 0i}. So the only multiplication inkf_bfly2is also eliminated.The optimizations don't stop here - since
kf_bfly2is now a whole lot smaller, and there is only one call site*, your compiler will likely inlinekf_bfly2intokf_work. And we know from the changed factoring that2can only be the last factor, so we can stop the recursion inkf_workas well.TL;DR
When you change the order of factorization, you unlock multiple optimizations in a row.