I'm trying to write a C program that is as fast as numpy.sum on an array of doubles, but seem to fail.
Here is how I'm measuring numpy performance:
import numpy as np
import time
SIZE=4000000
REPS=5
xs = np.random.rand(SIZE)
print(xs.dtype)
for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")
The output is:
float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)
Now trying to do something similar in C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 4000000
#define REPS 5

double *make_random_array(long array_size) {
    double *array = malloc(array_size * sizeof(double));
    if (array == NULL)
        return NULL;
    srand(0);
    for (size_t i = 0; i < array_size; ++i) {
        array[i] = (double)rand() / RAND_MAX;
    }
    return array;
}

double sum_array(const double *array, long size) {
    double sum = 0.0;
    for (size_t i = 0; i < size; ++i) {
        sum += array[i];
    }
    return sum;
}

int main() {
    double *xs = make_random_array(SIZE);
    if (xs == NULL) return 1;
    for (int i = 0; i < REPS; i++) {
        clock_t start_time = clock();
        double r = sum_array(xs, SIZE);
        clock_t end_time = clock();
        double dt = (double)(end_time - start_time) / CLOCKS_PER_SEC;
        printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
    }
    free(xs);
    return 0;
}
Compiling this with gcc -o main -Wall -O3 -mavx main.c and running it, the output is:
1850.14 MFLOPS (1999882.86)
1857.01 MFLOPS (1999882.86)
1900.24 MFLOPS (1999882.86)
1903.86 MFLOPS (1999882.86)
1906.58 MFLOPS (1999882.86)
Clearly this is slower than numpy.
According to top, CPU usage of the Python process is around 100%, so it doesn't look like numpy is parallelizing anything.
The C code appears to use 256-bit AVX registers (when compiling with -S, there are vaddsd instructions on xmm0). This seems to be the best option, as the machine I'm on doesn't appear to support AVX-512:
$ egrep 'model name|flags' /proc/cpuinfo | head -n2
model name : 13th Gen Intel(R) Core(TM) i9-13900K
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
What sort of trick does numpy do to beat this piece of C code?
Your loop is not optimized at all because strict FP math is the default. XMM0 is a 128-bit register; YMM0 is the corresponding 256-bit register. vaddsd is ADD Scalar Double, using only the low element of XMM0: https://felixcloutier.com/x86/addsd

With clang -O3 -ffast-math -march=native to let it vectorize and unroll (by 4x), you get a 16x speedup: 4x each from AVX and from instruction-level parallelism (see Wikipedia or Modern Microprocessors: A 90-Minute Guide!), with an array small enough to not bottleneck on L3 cache bandwidth. (Another approximately 2x is available with an array that fits in L1d, not just L2, e.g. with #pragma clang loop interleave_count(8) to unroll with more accumulators, for code you've cache-blocked to usually get hits in L1d cache.)

Your Raptor Lake CPU has two fully-pipelined vector-FP add units, with pipeline length 3 (cycles of latency before a result is ready to be an input to another add). This answer includes my results on an i7-6700k Skylake, which is the same except that its FP-add pipelines have 4-cycle latency.
@Jérôme Richard commented that NumPy just does scalar pairwise summation for sums of FP arrays, which gains some ILP over a purely naive serial loop. That can be OK if you're bottlenecked on DRAM bandwidth anyway. One upside is numerical consistency across ISAs and available SIMD features, achieved by not using them.
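For reference, pairwise summation is just a divide-and-conquer re-association of the adds. A minimal sketch of the idea (not NumPy's actual implementation, which uses a larger, unrolled base case):

// Minimal pairwise-summation sketch: recursively split the array and sum
// each half, falling back to a plain loop for small blocks.  This only
// shows the idea; NumPy's real version uses a bigger base case with
// several accumulators instead of recursing all the way down.
static double pairwise_sum(const double *a, long n) {
    if (n <= 8) {                 // small base case: naive serial sum
        double s = 0.0;
        for (long i = 0; i < n; ++i)
            s += a[i];
        return s;
    }
    long half = n / 2;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}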
You're looking for vaddpd ymm0, ymm0, [rdi] (ADD Packed Double on a 256-bit vector). GCC will do this with -ffast-math, which among other things lets it pretend FP math ops are associative, changing the rounding error. (For the better in this case, e.g. if you compare against a long double sum or Kahan error-compensated summation; it's a step in the direction of the same idea as pairwise summation.) See also https://gcc.gnu.org/wiki/FloatingPointMath

This gives about a factor-of-4 speedup, since FP ALU latency (1 vector instead of 1 scalar per 3 cycles on your CPU) is still a worse bottleneck than L3 bandwidth, and definitely worse than L2 cache bandwidth.
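If you want a higher-accuracy reference to check the -ffast-math result against, Kahan compensated summation is only a few lines; build the reference without -ffast-math, since re-association would let the compiler simplify the compensation away. A sketch:

// Kahan (compensated) summation: 'c' carries the rounding error of each add
// and feeds it back in, giving a much more accurate reference result than a
// plain running sum.  Compile this reference without -ffast-math, otherwise
// the compensation terms can be optimized away.
static double kahan_sum(const double *a, long n) {
    double sum = 0.0, c = 0.0;
    for (long i = 0; i < n; ++i) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;        // the part of y that got rounded off
        sum = t;
    }
    return sum;
}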
SIZE=4000000 times sizeof(double) is 30.52 MiB, so it does fit in your high-end Raptor Lake's 36 MiB L3 cache. But to go much faster, you're going to need to reduce SIZE and crank up REPS (and perhaps put a repeat-loop inside the timed region). The whole program is pretty short to profile with perf stat, under 100 milliseconds on my i7-6700k Skylake with DDR4-2666, and much of that is startup. It's also pretty short to be timing with clock() instead of clock_gettime.
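A drop-in way to time with clock_gettime instead of clock() (a sketch; the helper name is mine):

// clock_gettime(CLOCK_MONOTONIC) gives nanosecond-resolution wall-clock
// timestamps, unlike clock(), whose granularity is often much coarser.
#include <time.h>

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// In main(), instead of the clock() pair:
//   double t0 = now_seconds();
//   double r  = sum_array(xs, SIZE);
//   double dt = now_seconds() - t0;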
Your per-core cache sizes are 48 KiB L1d and 2 MiB L2 (on the Golden Cove P-cores; less/more on a single Gracemont E-core). See https://en.wikipedia.org/wiki/Raptor_Lake and https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/ .

SIZE=6144 would make the array the full size of L1d. If we aim for a 40 KiB footprint for just the array, leaving room for some other stuff, SIZE=5120. Might as well make it aligned by 32 bytes with aligned_alloc so we can read it from L1d cache at 3 vectors (96 bytes) per clock cycle instead of having a cache-line split every other vector. (See https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ , https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-throughput-limit , and https://uops.info/ .)
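A sketch of the allocation side with C11 aligned_alloc (which wants the size to be a multiple of the alignment; 5120 doubles is):

// 32-byte-aligned allocation so every 256-bit load comes from one cache
// line instead of splitting across two every other vector.
#include <stdlib.h>

#define SIZE 5120                 // 40 KiB of doubles, fits in a 48 KiB L1d

double *make_aligned_random_array(long n) {
    // 5120 * 8 = 40960 bytes, a multiple of the 32-byte alignment.
    double *array = aligned_alloc(32, n * sizeof(double));
    if (array == NULL)
        return NULL;
    for (long i = 0; i < n; ++i)
        array[i] = (double)rand() / RAND_MAX;
    return array;
}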
To get anywhere near peak FLOPS (within a factor of 2, since we're not using FMA), we need to run two vaddpd instructions every clock cycle. But vaddpd has a latency of 3 cycles on your Golden Cove P-cores (Alder/Raptor Lake), so the latency x bandwidth product is 6 vaddpd in flight at once. That's the minimum number of dependency chains, preferably at least 8; anything less leaves loop-carried dependency chains as the bottleneck, not throughput. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).)

So you're looking for additional instructions in the inner loop like vaddpd ymm1, ymm1, [rdi+32]. Golden Cove's 3c latency / 0.5c reciprocal throughput for vaddps/pd comes from dedicated SIMD-FP add ALUs, instead of the 4-cycle mul/FMA pipeline that had also handled add/sub since Skylake. Unlike Haswell, Golden Cove (Alder/Raptor Lake P-core) has two such ALUs, so add throughput is still as good as FMA.
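In the C source, that corresponds to writing the reduction with several independent accumulators. A sketch with 4 (with -ffast-math the compiler can vectorize each chain and widen or duplicate them further):

// Four independent accumulators = four dependency chains, so the FP adds
// can overlap instead of serializing on one 'sum'.  Assumes size is a
// multiple of 4 for brevity; real code needs a cleanup loop for the tail.
double sum_array_4acc(const double *array, long size) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (long i = 0; i < size; i += 4) {
        s0 += array[i + 0];
        s1 += array[i + 1];
        s2 += array[i + 2];
        s3 += array[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   // pairwise-reduce the accumulators
}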
GCC's -funroll-loops is useless here, unrolling the loop but still with only one accumulator vector (even with #pragma omp simd reduction(+:sum) and -fopenmp). Clang will unroll with 4 accumulators by default. With -march=raptorlake it will unroll by 20, but still with only 4 accumulators, so 5 adds into each vector. And it uses indexed addressing modes like [rdi + 8*rcx + 32], so each vaddpd ymm, ymm, [reg+reg*8] un-laminates to 2 uops, not reducing front-end cost nearly as much as it could. There's only one array involved, so there wouldn't even be any extra cost to use a pointer increment instead of an index; it's not doing anything clever like indexing relative to the end of the array with a negative index that counts up toward zero. But this isn't a bottleneck here: Golden Cove's wide front-end (6 uops) can issue 3 such vaddpd [mem+idx] instructions per cycle, staying ahead of the back-end (2/clock). Even 4-wide Skylake can keep up with this limited unroll.

#pragma clang loop interleave_count(8) before the for() gets clang to unroll with more accumulators. (With more than 8 it ignores the hint and just does 4 :/) This is probably only a good idea for code you expect to get L1d hits; if you expect your array to come from L2 or farther away, the default is good. Of course, the non-interleaving part of the unroll is just a waste of code-size in that case, too, and would cost more cleanup code if the size weren't a compile-time constant. Docs: https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations
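Applied to the question's loop, the hint looks like this (the pragma is a clang extension; GCC just warns about and ignores unknown pragmas):

// Ask clang to vectorize the reduction with 8 interleaved accumulators.
// Worth it mainly when the data is expected to be hot in L1d cache.
double sum_array(const double *array, long size) {
    double sum = 0.0;
#pragma clang loop interleave_count(8)
    for (long i = 0; i < size; ++i)
        sum += array[i];
    return sum;
}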
With the defaults, no pragmas, just -O3 -ffast-math -march=native (on Skylake also using -mbranches-within-32B-boundaries), we get the same unroll by 20 with an interleave of 4 accumulators that clang uses for Raptor Lake. (It also fully unrolls the REPS timing/printing loop, so this big loop is repeated 5 times. That's almost certainly worse than spending 1 register and a couple of instructions to reuse code that's already hot in cache.)

vs. with the pragma: unroll by 16, with 8 accumulators, when inlined into main. 4000 isn't quite a multiple of 16 x 4, so the loop exit condition is in between sets of 8 adds, in the middle of the loop.

I tried changing the source to encourage the compiler to increment a pointer, but clang doesn't take the hint: it invents an index counter in a register and uses addressing modes like [rdi + r8*8 + 0x20].
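Something along these lines (a sketch of the pointer-increment idea, not the exact source I used):

// Pointer-increment form of the loop: the hope is simple [pointer + disp8]
// addressing modes that stay micro-fused, plus one pointer add per unrolled
// iteration.  Clang rewrites it back into an indexed form anyway.
double sum_array_ptr(const double *array, long size) {
    double sum = 0.0;
    const double *end = array + size;
    for (const double *p = array; p != end; ++p)
        sum += *p;
    return sum;
}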
Updated microbenchmark source code
With a const int inner_reps = 1000000; repeat count of sums inside each timed interval, and some measures to make sure the optimizer doesn't defeat it (Godbolt; SIZE also reduced to 4000 to fit in my 32 KiB L1d), on my Skylake at 4.2 GHz I get the expected 16x speedup.

GCC 13.2.1 and clang 16.0.6 on Arch GNU/Linux, kernel 6.5.
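The rough shape of the timed region is something like this (a sketch of the idea, not the exact code behind the Godbolt link; the empty GNU C asm statement is one way to stop the compiler from hoisting the sum out of the repeat loop):

// Repeat the sum many times per timed interval so the work dwarfs timing
// overhead.  The empty asm statement with a "+m" operand on the whole array
// tells the compiler the array may have changed, so it can't compute the
// sum once and reuse it.
#define SIZE 4000
const int inner_reps = 1000000;

double sum_array(const double *array, long size);   // the function under test

double timed_region(double *xs) {
    double r = 0.0;
    for (int k = 0; k < inner_reps; ++k) {
        asm volatile("" : "+m"(*(double (*)[SIZE])xs));  // pretend xs changed
        r = sum_array(xs, SIZE);
    }
    return r;   // use the result so the work isn't dead code
}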
(With perf stat -d, I also confirmed the L1d cache miss rate was under 1%. With a larger array size, like 20000, which fits in Skylake's 256 KiB L2 cache but not L1d, I still got fairly close to 1 vector per clock throughput.)

The JCC erratum workaround (Skylake-family only, not your CPU) gave negligible further speedup in this case; the front-end wasn't the bottleneck even with legacy decode, since un-lamination happens at issue, so the decoders aren't choking on 2-uop instructions. And average uops_issued.any throughput was still only 2.18 / clock with the 4x unroll.

So we get a factor-of-16 speedup on Skylake from vectorizing with AVX (4x) and from the instruction-level parallelism of 4 accumulators (4x). This is still only averaging just slightly better than 1 vaddpd per clock cycle (since there's ILP between repeat-loop iterations), because clang's 4 dep chains are only half of Skylake's 4-cycle latency x 2 insn/cycle throughput = 8 FP math instructions in flight.

Unrolling with only 4 accumulators leaves another factor of 2 on the table (for Skylake; less for Alder Lake and later. Update: we got most of it with the pragma). But it's only attainable with data hot in L1d cache, with careful cache-blocking, or if you're doing more work with data while it's in registers (higher computational intensity, not just 1 add per load). Getting the full 2x would also require an optimizer that's aware of Sandybridge-family un-lamination, which clang's obviously isn't. Clang's default of 4 accumulators seems reasonable, and more accumulators would mean more init and cleanup work; although unrolling by 20 with only 4 accumulators seems excessive, like a waste of I-cache / uop-cache footprint.

Perf counter results
Counting in user-space only on i7-6700k Skylake (EPP=performance) with Linux kernel & perf 6.5. This is for the entire process, including startup, but the inner repeat count of 1 million means the vast majority of its total time is spent in the loop we care about, not startup.
Scalar loop:
Performance counter stats for './a.out' (GCC -O3 -march=native, without -ffast-math):
Note the 0 counts for fp_arith_inst_retired.256b_packed_double: no 256-bit SIMD instructions at all.

Vectorized but not unrolled:
Performance counter stats for './a.out' (GCC -O3 -march=native -ffast-math):
Vectorized, unrolled by 20 with 4 accumulators:
Performance counter stats for './a.out' (clang -O3 -march=native -ffast-math, with the JCC-erratum workaround):
Note the slightly higher count of 256-bit vector instructions: that's 3 extra vaddpd to reduce the 4 accumulators down to 1 vector before the horizontal sum down to 1 scalar. (The horizontal sum starts with a vextractf128 of the high half and then uses 128-bit vector instructions, so this counter doesn't include them, but they still compete with the start of the next iteration's work.)

Vectorized, unrolled by 16 with 8 accumulators:
Performance counter stats for './a.out' (clang -O3 -march=native -ffast-math with #pragma ... interleave_count(8)):

Even more cleanup work after the loop, 7x vaddpd to get down to 1 vector. And not a 2x speedup; instead it bottlenecks on 16.040 uops / 4.180 GHz ~= 3.87 average uops issued per clock, with most cycles issuing Skylake's max of 4. This is because Clang/LLVM doesn't know how to tune for Intel CPUs and uses indexed addressing modes. (uops executed is actually lower than uops issued, so there's very little micro-fusion of loads with ALU ops, plus the 8x vxorps zeroing of the 8 accumulators before each repeat-loop iteration, which needs an issue slot but no back-end execution unit.) 7.180 / 4.18 GHz = an average of 1.71 256-bit FP instructions executed per clock cycle.
perf stat --all-user)Hand-written asm loop:
Fixing the front-end bottleneck by avoiding indexed addressing modes gets us up to 1.81 vaddpd per clock, up from 1.71. (Not 2.0, because imperfect uop scheduling loses the occasional cycle with no slack to make it up.) That's about 30281.47 MFLOP/s at 4.2 GHz on a single core.
As a starting point, I used clang -O3 -fno-unroll-loops -S -march=native -ffast-math -Wall arr-sum.c -masm=intel -o arr-sum-asm.S on the C version with the unroll pragma, so that loop was still interleaved with 8 accumulators, but only unrolled by 8, not 16. And the outer repeat loop stayed rolled up, so I only had to hand-edit one copy of the asm loop (inlined into main).

The ds prefixes on a couple of instructions are there to work around the JCC erratum. Note that none of these instructions need disp32 addressing modes, since I put the pointer increment in the right place to benefit from the full -0x80 .. +0x7f disp8 range, actually going from -0x80 to +0x60. So machine-code size is much smaller than clang's, with instruction lengths of 5 bytes (or 4 for [rdi+0]). The add ends up needing an imm32, but there's just one of it. And critically, these loads stay micro-fused, cutting front-end uop bandwidth almost in half: uops_issued.any is now about 2.32 per cycle, with plenty of headroom for uop-cache fetch and other front-end bottlenecks.

Unrolling by 10 instead of 8 only brings a tiny speedup, e.g. 662.49 ms total time on a good run, and best MFLOPS 30420.80, with IPC around 2.41.
L1d cache bandwidth is the last bottleneck on SKL before saturating the FP pipes; it can't quite sustain 2x 32-byte loads per clock. Changing 3 of the instructions to add a register to itself (sum7 += sum7) speeds up the 10-accumulator version to 619.11 ms total time, best MFLOPS 32424.90, 2.58 IPC, an average of 1.957 256-bit uops per clock. (And the FP ports still have to compete with a couple of 128-bit adds in the cleanup.)
Raptor Lake can do 3 loads per clock even for vectors, so that shouldn't be a problem there.