Optimize dot products on x86

Dot products are a fundamental and ubiquitous operation in 3D applications: physics engines, modelling, robotics, and so on. Let me give you a quick rundown on how I’ve been exploring them lately.

So, first off, I started with the classic approach: a function that computes the dot product and takes its arguments by pointer. This method is straightforward and great for general use. It looks something like this:

float dotProduct3D(float* A, float* B) {
    float dot = 0.0f;
    for (int i = 0; i < 3; i++) {
        dot += A[i] * B[i];
    }
    return dot;
}

Then I tried another approach, where the function takes its arguments by value. It’s a bit different but just as straightforward:

typedef struct {
    float x;
    float y;
    float z;
} Vector3D;

float dotProduct3D(Vector3D A, Vector3D B) {
    return A.x * B.x + A.y * B.y + A.z * B.z;
}

But wait, there’s more! I delved into SIMD instructions to harness their power and optimize the computation. Even with SIMD, you can still pass the arguments by value, which in my benchmarks ran about 1.6x faster than passing them by pointer:

#include <xmmintrin.h>

__m128 dotProduct3D(__m128 vecA, __m128 vecB) {
    __m128 mul = _mm_mul_ps(vecA, vecB);                            // {ax*bx, ay*by, az*bz, aw*bw}
    __m128 yy  = _mm_shuffle_ps(mul, mul, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast ay*by
    __m128 zz  = _mm_shuffle_ps(mul, mul, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast az*bz
    return _mm_add_ss(_mm_add_ss(mul, yy), zz);                     // dot product lands in the low lane
}
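
For reference, a comparison like that can be timed with a simple loop. The sketch below is illustrative rather than the exact harness behind the numbers: dot_ptr and dot_simd are stand-in names for the pointer and __m128 versions above, and clock() with a fixed iteration count is just the simplest thing that works.

#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>

// Stand-in declarations for the pointer and __m128 versions shown above.
float  dot_ptr(float* A, float* B);
__m128 dot_simd(__m128 vecA, __m128 vecB);

// A volatile sink keeps the compiler from optimizing the loops away.
volatile float sink;

int main(void) {
    enum { N = 100000000 };
    float a[4] = { 1.0f, 2.0f, 3.0f, 0.0f };
    float b[4] = { 4.0f, 5.0f, 6.0f, 0.0f };
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    clock_t t0 = clock();
    for (int i = 0; i < N; i++) sink = dot_ptr(a, b);
    clock_t t1 = clock();
    for (int i = 0; i < N; i++) sink = _mm_cvtss_f32(dot_simd(va, vb));
    clock_t t2 = clock();

    printf("by pointer: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("by value:   %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}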

Pretty cool, huh? It’s one of those unexpected discoveries that make coding adventures even more exciting.

What is SIMD anyway?

SIMD (Single Instruction, Multiple Data) instructions are like the cool kids on the block when it comes to speeding up computations. They’re all about doing multiple things at once, which is super handy for tasks like dot products. With SIMD, you can crunch numbers in parallel, making your code run faster and smoother. However, the code becomes harder to read, write, and port across platforms.
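
To make that concrete, here’s a tiny standalone example (separate from the dot-product code above) that adds four pairs of floats with a single SSE instruction instead of four separate scalar additions:

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    // Pack four floats per register; lane 0 is the last argument of _mm_set_ps.
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     // {1, 2, 3, 4}
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); // {10, 20, 30, 40}

    // One instruction performs all four additions at once.
    __m128 c = _mm_add_ps(a, b);                        // {11, 22, 33, 44}

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}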

Reviewing the x86 assembly of the dot product

Finally, here’s a side-by-side comparison of the assembly produced by Clang 14.0, taken from Godbolt:

dotProduct3D(float __vector(4), float __vector(4)): 
  mulps xmm0, xmm1
  addps xmm0, xmm0
  addps xmm0, xmm0
  ret

dotProduct3D(Vector3D, Vector3D): 
  movaps xmm4, xmm0
  mulps xmm4, xmm2
  shufps xmm4, xmm4, 85  
  mulss xmm0, xmm2
  addss xmm0, xmm4
  mulss xmm1, xmm3
  addss xmm0, xmm1
  ret

dotProduct3D(float*, float*): 
  movss xmm0, dword ptr [rdi]
  movss xmm1, dword ptr [rdi + 4]
  mulss xmm0, dword ptr [rsi]
  xorps xmm2, xmm2
  addss xmm2, xmm0
  mulss xmm1, dword ptr [rsi + 4]
  addss xmm1, xmm2
  movss xmm0, dword ptr [rdi + 8]
  mulss xmm0, dword ptr [rsi + 8]
  addss xmm0, xmm1
  ret

The assembly for the SIMD version looks pretty slick compared to the other two: the arguments already live in registers, so there are no memory loads, and it was the fastest variant in my benchmarks.
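
One practical note: the register-passing version assumes the data already lives in an __m128. If your vectors are plain structs in memory, you can wrap it with something like the sketch below (the wrapper name is made up for illustration); it reuses the Vector3D struct and the __m128 dotProduct3D from earlier.

#include <xmmintrin.h>

// Assumes the Vector3D struct and the __m128 dotProduct3D defined above.
float dotVec3(Vector3D A, Vector3D B) {
    __m128 va = _mm_set_ps(0.0f, A.z, A.y, A.x); // lanes: {x, y, z, 0}
    __m128 vb = _mm_set_ps(0.0f, B.z, B.y, B.x);
    return _mm_cvtss_f32(dotProduct3D(va, vb));  // scalar result from the low lane
}

Whether the packing cost is worth it depends on how long the vectors stay in registers between operations.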
