I am trying to convert the below chi-square function in c code to SSE2 intrinsics
I am getting the correct output for both the functions. and I have measured the time it takes for both functions to run using some random 4KB data that I have generated.and on average, I am seeing about 70-90 ms performance improvement
I am just wondering whether are there any further optimisations that I am missing that may further improve the performance. Any clues on this will be helpful
Normal C-Code:
int observed[256] = {0};
double chiSquare = 0.0;
double expected = (double)size / 256; // Constant expected value
// Calculate frequency of each byte value
for (int i = 0; i < size; i++) {
observed[data[i]]++;
}
// Calculate the chi-square statistic
for (int i = 0; i < 256; i++) {
double diff = observed[i] - expected;
chiSquare += (diff * diff) / expected;
}
return chiSquare;
SSE2 Intrinsics:
int observed[256] = {0};
const double expected = (double)size / 256; // Make 'expected' a constant
double chiSquare = 0.0;
// Process data in 16-byte (128-bit) chunks
for (int i = 0; i < size; i += 16) {
__m128i dataChunk = _mm_loadu_si128((__m128i*)(data + i));
// Unpack 8-bit values into 16-bit values for counting
__m128i dataUnpacked = _mm_unpacklo_epi8(dataChunk, _mm_setzero_si128());
// Extract and process 8 values in parallel
for (int j = 0; j <= 1; j++) {
uint16_t values[8];
_mm_storeu_si128((__m128i*)values, dataUnpacked);
for (int k = 0; k < 8; k++) {
observed[values[k]]++;
}
dataUnpacked = _mm_unpackhi_epi8(dataChunk, _mm_setzero_si128());
}
}
// Calculate the chi-square statistic using SSE2 intrinsics
__m128d sum = _mm_setzero_pd();
for (int i = 0; i < 256; i += 2) {
__m128d observedVec = _mm_set_pd(observed[i + 1], observed[i]);
__m128d diff = _mm_sub_pd(observedVec, _mm_set1_pd(expected));
__m128d squaredDiff = _mm_mul_pd(diff, diff);
__m128d result = _mm_div_pd(squaredDiff, _mm_set1_pd(expected));
sum = _mm_add_pd(sum, result);
}
// Sum up the results in the sum
double sumArray[2];
_mm_storeu_pd(sumArray, sum);
for (int i = 0; i < 2; i++) {
chiSquare += sumArray[i];
}
return chiSquare;
}**
Your SSE2 version of the function benchmarks slower (~25%) on my Westmere i5 laptop than your scalar function. I made a slight improvement (~20% with 4KB data) to the performance of your scalar function on my machine. Also, the SSE2 function doesn't work for all 'size' values, which I'm sure you already know. Anyway, my function is below.
I think SSE2 doesn't offer much hope for optimisation of the histogram stage as chtz pointed out. You might have more luck with AVX2 but I haven't investigated.