I have a section of code where I assign one of two values to the x members of an array of POINTs, and then add a second array to the first. Both arrays are fixed at 16 elements. Initially I had something like:
const int icLeft = left + ((buttonWidth - iconLen) >> 1);
const int icRight = icLeft + iconLen;
mWData.iconPts[0].x = mWData.iconPts[2].x = mWData.iconPts[4].x = mWData.iconPts[7].x = icLeft;
mWData.iconPts[8].x = mWData.iconPts[9].x = mWData.iconPts[10].x = mWData.iconPts[14].x = icLeft;
mWData.iconPts[1].x = mWData.iconPts[3].x = mWData.iconPts[5].x = mWData.iconPts[6].x = icRight;
mWData.iconPts[11].x = mWData.iconPts[12].x = mWData.iconPts[13].x = mWData.iconPts[15].x = icRight;
...and then added the second array with a function:
add16Pts(mWData.iconPts, pts);
...defined as:
static inline void add16Pts(LPPOINT p0, const POINT *pAdd)
{
    __m256i *cp = (__m256i *) p0;
    const __m256i *ap = (const __m256i *) pAdd;
    for (int i = 0; i < 4; i++) cp[i] = _mm256_add_epi32(cp[i], ap[i]);
}
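(A note on the casts: dereferencing a __m256i * obtained from an LPPOINT assumes the data is 32-byte aligned, and I can't tell from the snippet whether mWData.iconPts and pts are declared with that alignment. If they aren't guaranteed to be, a variant using explicit unaligned loads and stores is the safe formulation; on recent AVX2 hardware the unaligned forms cost the same as the aligned ones whenever the data happens to be aligned anyway:)

static inline void add16Pts(LPPOINT p0, const POINT *pAdd)
{
    __m256i *cp = (__m256i *) p0;
    const __m256i *ap = (const __m256i *) pAdd;
    for (int i = 0; i < 4; i++)
    {
        // explicit unaligned load / add / store; no 32-byte alignment requirement
        const __m256i sum = _mm256_add_epi32(_mm256_loadu_si256(cp + i), _mm256_loadu_si256(ap + i));
        _mm256_storeu_si256(cp + i, sum);
    }
}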
Now this worked fine, but then I got to thinking: assigning values to an array of 16 x 2 ints, surely this is something AVX could eat for breakfast and still be hungry! So after some hours of research and experimentation I replaced the above code with:
__m256i rL;
// each 64-bit lane packs one POINT: lane 0 = {x = icLeft, y = 0}, lane 1 = {x = icRight, y = 0};
// lanes 2-3 are don't-care, the permutes never select them
rL.m256i_i64[0] = (u64)(left + ((buttonWidth - iconLen) >> 1));   // icLeft
rL.m256i_i64[1] = rL.m256i_i64[0] + iconLen;                      // icRight
__m256i *aa = (__m256i *) mWData.iconPts;
aa[0] = _mm256_permute4x64_epi64(rL, 0b01000100);   // points 0-3:   L R L R
aa[1] = _mm256_permute4x64_epi64(rL, 0b00010100);   // points 4-7:   L R R L
aa[2] = _mm256_permute4x64_epi64(rL, 0b01000000);   // points 8-11:  L L L R
aa[3] = _mm256_permute4x64_epi64(rL, 0b01000101);   // points 12-15: R R L R
const __m256i *addp = (__m256i *) pts;
aa[0] = _mm256_add_epi32(aa[0], addp[0]);
aa[1] = _mm256_add_epi32(aa[1], addp[1]);
aa[2] = _mm256_add_epi32(aa[2], addp[2]);
aa[3] = _mm256_add_epi32(aa[3], addp[3]);
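(One aside on the setup: the m256i_i64 member access is MSVC-specific. If portability ever matters (my assumption; it may not here), the same two lanes can be built with a standard intrinsic:)

const int icLeft  = left + ((buttonWidth - iconLen) >> 1);
const int icRight = icLeft + iconLen;
// lane 0 = {x = icLeft, y = 0}, lane 1 = {x = icRight, y = 0}; upper lanes are don't-care
const __m256i rL = _mm256_set_epi64x(0, 0, (long long)(unsigned) icRight, (long long)(unsigned) icLeft);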
Now the replacement works a treat, but the question still bothering me is whether it might be more performant to fold each permute straight into its add:
__m256i *aa = (__m256i *) mWData.iconPts;
const __m256i *addp = (__m256i *) pts;
aa[0] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000100), addp[0]);
aa[1] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b00010100), addp[1]);
aa[2] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000000), addp[2]);
aa[3] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000101), addp[3]);
_mm256_add_epi32 has a latency of 1 and a reciprocal throughput of 0.33, so in theory it can do 3 per cycle (in a perfect world); the permute4x64, at around latency 3 and throughput 1 on recent Intel cores, is the more expensive half. Plus, having all 16 points touched at the same time (or nearly) might help, as opposed to doing 4 assignments followed by 4 adds. This section of code might be run repeatedly thousands of times, so performance matters. But I'm not an expert on the mysterious goings-on of CPU caching, pipelining, etc., so I would be interested to hear which way would be better, and also whether there is anything else I can do to make it faster.
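For what it's worth, below is a minimal timing harness sketch for comparing the two variants. Everything in it (the Pt stand-in for POINT, the runSeparate/runFused names, the made-up icLeft/icRight values and the iteration count) is mine, not from the code above, and with a kernel this small the optimizer may hoist or fold work, so check the generated assembly and treat the numbers as a sanity check rather than an answer:

#include <immintrin.h>
#include <chrono>
#include <cstdint>
#include <cstdio>

struct Pt { int32_t x, y; };                 // stand-in for POINT

alignas(32) static Pt dst[16];               // stands in for mWData.iconPts
alignas(32) static Pt off[16];               // stands in for pts

// variant 1: four permutes, then four adds
static void runSeparate(const __m256i &rL, const __m256i *addp, __m256i *aa)
{
    aa[0] = _mm256_permute4x64_epi64(rL, 0b01000100);
    aa[1] = _mm256_permute4x64_epi64(rL, 0b00010100);
    aa[2] = _mm256_permute4x64_epi64(rL, 0b01000000);
    aa[3] = _mm256_permute4x64_epi64(rL, 0b01000101);
    aa[0] = _mm256_add_epi32(aa[0], addp[0]);
    aa[1] = _mm256_add_epi32(aa[1], addp[1]);
    aa[2] = _mm256_add_epi32(aa[2], addp[2]);
    aa[3] = _mm256_add_epi32(aa[3], addp[3]);
}

// variant 2: each permute folded into its add
static void runFused(const __m256i &rL, const __m256i *addp, __m256i *aa)
{
    aa[0] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000100), addp[0]);
    aa[1] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b00010100), addp[1]);
    aa[2] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000000), addp[2]);
    aa[3] = _mm256_add_epi32(_mm256_permute4x64_epi64(rL, 0b01000101), addp[3]);
}

template <typename F>
static void timeIt(F f, const char *name)
{
    const __m256i rL = _mm256_set_epi64x(0, 0, 140, 100);   // made-up icRight / icLeft
    __m256i *aa = (__m256i *) dst;
    const __m256i *addp = (const __m256i *) off;
    int64_t sink = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000000; i++)
    {
        off[i & 15].x = i;                   // vary an input so the kernel can't be hoisted wholesale
        f(rL, addp, aa);
        sink += dst[i & 15].x;               // consume an output so the stores stay observable
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%-25s %.1f ms (sink=%lld)\n", name, ms, (long long) sink);
}

int main()
{
    for (int i = 0; i < 16; i++) off[i] = { i, 3 * i };      // arbitrary offsets
    timeIt(runSeparate, "separate permute + add");
    timeIt(runFused, "fused permute + add");
    return 0;
}

Build with optimizations and AVX2 enabled (e.g. /O2 /arch:AVX2 with MSVC, or -O2 -mavx2 with GCC/Clang).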