Dot products are a fundamental and ubiquitous operation in 3D applications: physics engines, modelling, robotics, and so on. Let me give you a quick rundown of how I’ve been exploring them lately.
So, first off, I started with the classic approach: a function that computes the dot product and takes its arguments by pointer. This method is straightforward and great for general use. It looks something like this:
float dotProduct3D(float* A, float* B) {
    float dot = 0.0f;
    for (int i = 0; i < 3; i++) {
        dot += A[i] * B[i];
    }
    return dot;
}
Then I tried another approach, where the function takes its arguments by value. It’s a bit different but equally effective:
typedef struct {
    float x;
    float y;
    float z;
} Vector3D;

float dotProduct3D(Vector3D A, Vector3D B) {
    return A.x * B.x + A.y * B.y + A.z * B.z;
}
But wait, there’s more! I delved into SIMD instructions to harness their power and optimize the computation. Even with SIMD you can still pass the arguments by value, which in my benchmarks ran about 1.6x faster than passing by reference:
#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

// Assumes the unused fourth lane of both inputs is zero.
__m128 dotProduct3D(__m128 vecA, __m128 vecB) {
    __m128 mulResult = _mm_mul_ps(vecA, vecB);        // (px, py, pz, 0) per-lane products
    __m128 sum1 = _mm_hadd_ps(mulResult, mulResult);  // (px+py, pz+0, px+py, pz+0)
    __m128 sum2 = _mm_hadd_ps(sum1, sum1);            // dot product in every lane
    return sum2;
}
Pretty cool, huh? It’s one of those unexpected discoveries that make coding adventures even more exciting.
What is SIMD anyway?
SIMD (single instruction, multiple data) instructions are like the cool kids on the block when it comes to speeding up computations. They’re all about doing multiple things at once, which is super handy for tasks like dot products. With SIMD you can crunch numbers in parallel, making your code run faster and smoother. However, the code becomes harder to read, write, and port across platforms.
Reviewing the x86 assembly of the dot product
Finally, here’s the side-by-side comparison on Godbolt of the assembly generated by Clang 14.0 with optimizations on (and SSE3 enabled for the intrinsics version):
dotProduct3D(float __vector(4), float __vector(4)):
        mulps   xmm0, xmm1
        haddps  xmm0, xmm0
        haddps  xmm0, xmm0
        ret
dotProduct3D(Vector3D, Vector3D):
        movaps  xmm4, xmm0
        mulps   xmm4, xmm2
        shufps  xmm4, xmm4, 85
        mulss   xmm0, xmm2
        addss   xmm0, xmm4
        mulss   xmm1, xmm3
        addss   xmm0, xmm1
        ret
dotProduct3D(float*, float*):
        movss   xmm0, dword ptr [rdi]
        movss   xmm1, dword ptr [rdi + 4]
        mulss   xmm0, dword ptr [rsi]
        xorps   xmm2, xmm2
        addss   xmm2, xmm0
        mulss   xmm1, dword ptr [rsi + 4]
        addss   xmm1, xmm2
        movss   xmm0, dword ptr [rdi + 8]
        mulss   xmm0, dword ptr [rsi + 8]
        addss   xmm0, xmm1
        ret
The assembly for the SIMD version looks pretty slick and optimized compared to the other two. It’s the shortest of the three and, since the arguments arrive already packed in registers, it avoids the memory loads of the pointer version entirely, which leads to faster computations.