SSE/AVX registers could be viewed as integer or floating point BigNums. That is, one could neglect that there exist lanes at all. Does there exist an easy way to exploit this point of view and use these registers as BigNums either singly or combined? I ask because from what little I've seen of BigNum libraries, they almost universally store and do arithmetic on arrays, not on SSE/AVX registers. Portability?
Example:
Say you store the contents of a SSE register as a key in a std::set, you could compare these contents as a BigNum.
I think it may be possible to implement BigNum with SIMD efficiently but not in the way you suggest.
Instead of implementing a single BigNum using a SIMD register (or with an array of SIMD registers) you should process multiple BigNums at once.
Let's consider 128-bit addition. Let 128-bit integers be defined by a pair of high and low 64-bit values and let's assume we want to add a 128-bit integer
(y_low, y_high)to a 128-bit integer(x_low, x_high). With the scalar 64-bit registers this requires only two instructionsWith SSE/AVX the problem, as others have explain, is that there is no SIMD carry flags. The carry flag has to be calculated and then added. This requires a 64-bit unsigned comparison. The only realistic option for this with SSE is from the AMD XOP instruction
vpcomgtuqThis uses four instructions to add two pairs of 128-bit numbers. With the scalar 64-bit registers this requires four instructions as well (two
addand twoadc).With AVX2 we can add four pairs of 128-bit numbers at once. But there is no 256-bit wide 64-bit unsigned instruction from XOP. Instead we can do the following for
a<b:The
sign64register can be precomputed so only three instructions are really necessary. Therefore, adding four pairs of 128-bit numbers with AVX2 can be done with six instructionswhereas the scalar registers need eight instructions.
AVX512 has a single instruction for doing 64-bit unsigned comparison
vpcmpuq. Therefore, it should be possible to add eight pairs of 128-bit numbers using only four instructionsWith the scalar register it would require 16 instructions to add eight pairs of 128-bit numbers.
Here is a table with a summary of the number of SIMD instructions (called nSIMD) and the number of scalar instructions (called nscalar) necessary to add a number of pairs (called npairs) of 128-bit numbers
Note that XOP2 does not exist yet and I am only speculating that it may exist at some point.
Note also that to do this efficiently the BigNum arrays needs to be stored in an array of struct of array (AoSoA) form. For example using
lto mean the lower 64-bits andhto mean the high 64-bits an array of 128-bit integers stores as an array of structs like thisshould instead be stored using an AoSoA like this