The following snippet is from OpenCV's find_obj.cpp, which is a demo of using SURF:
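double
compareSURFDescriptors( const float* d1, const float* d2, double best, int length )
{
    double total_cost = 0;
    assert( length % 4 == 0 );
    int i;
    for( i = 0; i < length; i += 4 )
    {
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
        if( total_cost > best )
            break;
    }
    return total_cost;
}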
As far as I can tell, it is computing the Euclidean distance. What I do not understand is why it does so in groups of 4. Why not calculate the whole thing at once?
Usually things like this are done to make SSE optimizations possible. SSE registers are 128 bits wide and can hold 4 floats, so the 4 subtractions can be done in parallel with a single instruction.
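To illustrate what such an SSE version could look like, here is a minimal sketch using compiler intrinsics. The function name compareSURFDescriptorsSSE is made up for this example and this is not code from OpenCV. It assumes an x86 target with SSE support, and it does the subtraction and squaring in single precision, so results may differ slightly from the double-based original.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical SSE variant of compareSURFDescriptors; a sketch, not the OpenCV code.
double compareSURFDescriptorsSSE(const float* d1, const float* d2, double best, int length)
{
    assert(length % 4 == 0);
    double total_cost = 0;
    for (int i = 0; i < length; i += 4)
    {
        // Unaligned loads, because nothing here guarantees 16-byte alignment.
        __m128 a = _mm_loadu_ps(d1 + i);
        __m128 b = _mm_loadu_ps(d2 + i);
        __m128 diff = _mm_sub_ps(a, b);      // four subtractions in one instruction
        __m128 sq = _mm_mul_ps(diff, diff);  // four squarings in one instruction

        // Sum the four lanes by storing them and adding in scalar code.
        float lanes[4];
        _mm_storeu_ps(lanes, sq);
        total_cost += static_cast<double>(lanes[0]) + lanes[1] + lanes[2] + lanes[3];

        // Same early exit as the original: stop once we are worse than the best match.
        if (total_cost > best)
            break;
    }
    return total_cost;
}
```

The group of four in the source maps directly onto one register's worth of floats, which is why the length is required to be a multiple of 4.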
Another upside: you have to check the loop counter only after every fourth difference. That makes the code faster even if the compiler doesn't use the opportunity to generate SSE code. For example, VS2008 didn't, not even with -O2:
double t0 = d1[i] - d2[i];
00D91666 fld dword ptr [edx-0Ch]
00D91669 fsub dword ptr [ecx-4]
double t1 = d1[i+1] - d2[i+1];
00D9166C fld dword ptr [ebx+ecx]
00D9166F fsub dword ptr [ecx]
double t2 = d1[i+2] - d2[i+2];
00D91671 fld dword ptr [edx-4]
00D91674 fsub dword ptr [ecx+4]
double t3 = d1[i+3] - d2[i+3];
00D91677 fld dword ptr [edx]
00D91679 fsub dword ptr [ecx+8]
total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
00D9167C fld st(2)
00D9167E fmulp st(3),st
00D91680 fld st(3)
00D91682 fmulp st(4),st
00D91684 fxch st(2)
00D91686 faddp st(3),st
00D91688 fmul st(0),st
00D9168A faddp st(2),st
00D9168C fmul st(0),st
00D9168E faddp st(1),st
00D91690 faddp st(2),st
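For comparison, a non-unrolled version might look like the sketch below (the name squaredDistanceSimple is made up for this example). It tests the loop counter and compares against best once per element, so four times as often as the grouped version in the question.

```cpp
// Hypothetical one-element-at-a-time version, for comparison only.
double squaredDistanceSimple(const float* d1, const float* d2, double best, int length)
{
    double total_cost = 0;
    for (int i = 0; i < length; ++i)  // loop test runs once per element
    {
        double t = d1[i] - d2[i];
        total_cost += t * t;
        if (total_cost > best)        // early-exit test also runs once per element
            break;
    }
    return total_cost;
}
```

One side effect of the finer-grained check: when the early exit triggers, this version can return a smaller partial sum than the grouped one, because it stops in the middle of what would have been a group of four. Both still return a value greater than best in that case, which is all a caller looking for the nearest descriptor needs.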