I was researching why data alignment on specific byte boundaries (4-byte, 8-byte, etc., depending on the hardware) affects computing performance. I came across this example by IBM: https://developer.ibm.com/articles/pa-dalign/
The test cases were not included, so I wrote a small C++ program to test the 8-byte access "granularity" (in IBM's terms) case, which is Listing 4 in the IBM article linked above:
#include <iostream>
#include <chrono>
#include <cstdint>  // uint8_t, uint32_t, uint64_t
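void Munge64( void *data, uint32_t size ) {
    double *data64 = (double*) data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*) data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */
    // Negate each full 8-byte chunk.
    while( data64 != data64End ) {
        // Increment separately: before C++17, "*data64++ = -*data64" reads and
        // modifies data64 without sequencing, which is undefined behaviour.
        *data64 = -*data64;
        ++data64;
    }
    // Negate the remaining bytes one at a time.
    while( data8 != data8End ) {
        *data8 = -*data8;
        ++data8;
    }
}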
int main() {
    const uint32_t bufferSize = 125000; // 125000 * 8 bytes = 1 MB
    uint64_t Buffer[bufferSize];

    auto start_time_aligned = std::chrono::high_resolution_clock::now();
    Munge64(Buffer, bufferSize*8);
    auto end_time_aligned = std::chrono::high_resolution_clock::now();
    // Calculate the duration for aligned case
    auto duration_aligned = std::chrono::duration_cast<std::chrono::microseconds>(end_time_aligned - start_time_aligned);
    std::cout << "Aligned buffer execution time: " << duration_aligned.count() << " microseconds" << std::endl;

    auto start_time_unaligned = std::chrono::high_resolution_clock::now();
    Munge64(Buffer+4, bufferSize*8); // +4 to make the buffer access unaligned
    auto end_time_unaligned = std::chrono::high_resolution_clock::now();
    // Calculate the duration for unaligned case
    auto duration_unaligned = std::chrono::duration_cast<std::chrono::microseconds>(end_time_unaligned - start_time_unaligned);
    std::cout << "Unaligned buffer execution time: " << duration_unaligned.count() << " microseconds" << std::endl;
    return 0;
}
As can be seen in the C++ code above, the first while loop handles the 64-bit chunks, whereas the second while loop handles the remaining 8-bit chunks (in case size is not a multiple of 8 bytes).
The IBM article says that the test was conducted with a 10 MB buffer and that the unaligned case is approximately 4,610% (!) slower. However, my program cannot go beyond 1 MB because it crashes with a segmentation fault (there may be problems in the while loops that advance the pointers). In my test, I used a 1 MB buffer and Buffer+4 as the first access address. My results are nowhere near those in the IBM article: aligned and unaligned access give very similar results, averaging ~155 ms for aligned access and ~147 ms for unaligned access.
An important side note: the IBM test used a PowerBook G4, which is rather old (production stopped in 2006).
My question is: is my test completely wrong, or have processors changed so much that unaligned and aligned accesses now give similar results? For instance, is using Buffer+4 as the starting address a correct way to produce unaligned access? Also, why do I get a segmentation fault when I try to use a 10 MB buffer?