Is there any difference in precision or performance between the normal sqrtps/pd intrinsics and the SVML versions:
__m128d _mm_sqrt_pd (__m128d a) [SSE2]
__m128d _mm_svml_sqrt_pd (__m128d a) [SSE?]
__m128 _mm_sqrt_ps (__m128 a) [SSE]
__m128 _mm_svml_sqrt_ps (__m128 a) [SSE?]
I know that SVML intrinsics like _mm_sin_ps are actually functions consisting of potentially multiple asm instructions, so they should be slower than any single multiply or even divide. However, I'm curious why these functions exist if hardware-level intrinsics are available.
Were these SVML functions created before SSE2? Or is there a difference in precision?
I've inspected the code gen in MSVC.
_mm_svml_sqrt_pd compiles into a function call; the called function consists of a single sqrtpd followed by ret.
_mm_svml_sqrt_ps compiles into a function call; the called function consists of a single sqrtps followed by ret.
The _mm_sqrt_pd and _mm_sqrt_ps intrinsics compile to inlined sqrtpd and sqrtps.
A possible explanation (just a guess): SVML was intended to have CPU dispatch, but the version compiled for MSVC has this CPU dispatch disabled. The goal may have been to implement it differently for Xeon Phi, and the Xeon Phi version may not be included in the MSVC build of SVML.
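To reproduce the inlining observation, a minimal sketch like the following can be compiled and disassembled. The helper names sqrt2_hw and sqrt2_hw_selfcheck are made up for illustration. Only the plain SSE2 intrinsic is used here, because _mm_svml_sqrt_pd is only available in toolchains that ship SVML.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_sqrt_pd */

/* Hypothetical helper: square roots of two doubles via the hardware
   intrinsic. In an optimized build this is expected to contain an
   inlined sqrtpd rather than a function call. */
void sqrt2_hw(const double in[2], double out[2])
{
    __m128d v = _mm_loadu_pd(in);   /* unaligned load of two doubles */
    __m128d r = _mm_sqrt_pd(v);     /* packed double-precision sqrt */
    _mm_storeu_pd(out, r);          /* unaligned store of both results */
}

/* Hypothetical self-check: returns 1 if perfect squares give exact roots. */
int sqrt2_hw_selfcheck(void)
{
    const double in[2] = { 4.0, 9.0 };
    double out[2];
    sqrt2_hw(in, out);
    return out[0] == 2.0 && out[1] == 3.0;
}
```

Swapping in the SVML intrinsic under MSVC or the Intel compiler and comparing the disassembly of the two variants should show the call versus inline difference described above.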
With the Intel compiler, the call goes through svml_dispmd.dll, and there is an actual dispatch function (a real indirect jump, ff 25 42 08 00 00), which ends up in vsqrtpd for me.
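The general idea of CPU dispatch through an indirect jump can be sketched as below. This is not how SVML is actually implemented. It is only an illustration, and it assumes GCC or Clang on x86, since it relies on their __builtin_cpu_supports builtin and target attribute. All function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_sqrt_pd */

typedef __m128d (*sqrt_pd_fn)(__m128d);

/* Baseline variant: compiled with the default target, so it uses sqrtpd. */
static __m128d sqrt_pd_sse2(__m128d a)
{
    return _mm_sqrt_pd(a);
}

/* AVX variant (assumption: GCC/Clang target attribute): the same intrinsic
   is expected to be emitted as the VEX-encoded vsqrtpd. */
__attribute__((target("avx")))
static __m128d sqrt_pd_avx(__m128d a)
{
    return _mm_sqrt_pd(a);
}

/* Resolved on first use; later calls go through the pointer,
   i.e. an indirect call or jump. */
static sqrt_pd_fn resolved_sqrt_pd = 0;

__m128d dispatched_sqrt_pd(__m128d a)
{
    if (!resolved_sqrt_pd) {
        resolved_sqrt_pd = __builtin_cpu_supports("avx") ? sqrt_pd_avx
                                                         : sqrt_pd_sse2;
    }
    return resolved_sqrt_pd(a);
}
```

For a plain square root both variants compute the same correctly rounded result, so dispatch only changes the instruction encoding, which fits the observation that the dispatched path ends in vsqrtpd.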