I'd like to know whether it is worth optimizing my Vector3 class's operations with NEON/SIMD, as I did for my Vector2 class.
As far as I know, NEON registers hold either two or four floats at a time, so for Vector3 we would need something like this:
Vector3 Vector3::operator * (const Vector3& v) const
{
#if defined(__ARM_NEON__)
// extra step: allocate a fourth float
const float v4A[4] = {x, y, z, 0};
const float v4B[4] = {v.x, v.y, v.z, 0};
float32x4_t r = vmulq_f32(*(float32x4_t*)v4A, *(float32x4_t*)v4B); // vmulq_f32 is the 4-lane multiply; vmul_f32 takes float32x2_t
return *(Vector3*)&r;
#else
return Vector3(x * v.x, y * v.y, z * v.z);
#endif
}
Is this safe? Would this extra step still be faster than non-SIMD code in most scenarios (say, on arm64)?
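For comparison, here is a minimal sketch of the variant I am also considering, which replaces the pointer casts with explicit loads and stores (vld1q_f32 / vst1q_f32) and copies the three lanes back out instead of reinterpreting the register as a Vector3. The standalone Vector3 struct and the free function name mulPadded are only for illustration and are not my real class; the scalar branch is there so the snippet also compiles on non-ARM targets. I have not benchmarked it.

```cpp
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Stand-in for the real class: three contiguous floats, nothing else.
struct Vector3 {
    float x, y, z;
};

// Hypothetical helper: component-wise product of two Vector3 values.
inline Vector3 mulPadded(const Vector3& a, const Vector3& b)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Extra step: pad each operand to four floats so it fills a q register.
    const float a4[4] = {a.x, a.y, a.z, 0.0f};
    const float b4[4] = {b.x, b.y, b.z, 0.0f};
    float out[4];
    // vld1q_f32 / vst1q_f32 load and store through float pointers,
    // so no float32x4_t* cast is needed on either side.
    vst1q_f32(out, vmulq_f32(vld1q_f32(a4), vld1q_f32(b4)));
    return Vector3{out[0], out[1], out[2]}; // drop the padding lane
#else
    // Scalar fallback, same result lane by lane.
    return Vector3{a.x * b.x, a.y * b.y, a.z * b.z};
#endif
}
```

Calling mulPadded(Vector3{1, 2, 3}, Vector3{4, 5, 6}) should give the same components as the scalar path on either branch, so my question is really whether the padding and the extra copies leave any speed advantage.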