SIMD method to get all consecutive sums of 4 or 8 DWORD integers (prefix-sum within each vector)

20 Views Asked by At

I have 4 or 8 DWORD integer values. Call them vector V.

I want to calculate/accumulate all values, like so:

  1. V[0] + V[1]
  2. V[0] + V[1] + V[2]
  3. V[0] + V[1] + V[2] + V[3]
  4. etc, up to V[0] + ... + V[7]

I need all of those individual values, not just the scalar sum of the whole register. If there's more output than that, that's fine, and I can just ignore it if so.

Is there an SIMD way to do this that is likely to be more efficient than simply adding them the naive way?

Or is there a mathematical trick/shortcut to this instead or in conjunction with some wide instruction?

Overflow is guaranteed not to occur due to pre-conditions on the incoming values already that end up meaning the sum of all values can never exceed about 2^28-1, so I'm not worried about overflow, and the values are unsigned.

I feel like this is probably something that exists and I just don't know the right terminology for the operation. (Editor's note: this is a prefix sum, cumulative sum, inclusive scan, or "scan".)

Or is this something that is so simple and fast in x86 on modern silicon that the potential costs of explicitly using wide instructions aren't worth it?

The values are already vectorized for other operations, so I've already paid that cost, at least.

As for available CPU capabilities, CPU will always be Sapphire Rapids equivalent or later.

0

There are 0 best solutions below