Data-oriented optimization

From data-oriented design follows data-oriented optimization, that is: reason about a problem in view of the data you’re going to transform, and seek out optimization opportunities there.

Today I want to look at a finite element analysis (FEA) in which 80% of the time goes into collision detection, in particular its final stage, where we first test primitives for collision and then compute reaction forces for the primitives in contact. These functions are crucial and challenging in many applications, from robotics to computer graphics, but FEA demands a greater degree of accuracy. That accuracy drives up the computing cost, so let’s see whether reasoning about the problem can improve on the current state of the art.

Before getting started

There are a few things to set up first:

  • Replace the force-inline statements with __attribute__((__noinline__)) so that profiling is more informative (a sketch of this switch follows the list)
  • Build with the default RelWithDebInfo configuration from CMake so that both optimization flags and debug symbols are included
  • Set -march=native to make the most of the vector instructions on my machine
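A minimal sketch of how the first item might look; the PROFILING define and the FE_INLINE macro name are hypothetical, not taken from the actual codebase:

/* Hypothetical build switch: under PROFILING, force functions out of line so
   each one shows up as its own frame in the profiler; otherwise inline them. */
#ifdef PROFILING
#  define FE_INLINE __attribute__((__noinline__))
#else
#  define FE_INLINE inline
#endif

The build itself would then be configured along the lines of cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS="-march=native" .., which covers the last two items.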

Now I can start profiling my application.

Profiling

Building with Apple Clang 15, I get this high-level profiling information:

| Task                 | CPU (sec)  | (%)  |
|----------------------|------------|------|
| Element processing   | 2.6597E+00 | 3.6  |
|   Internal forces    | 2.3526E+00 | 3.2  |
|   Solve equations    | 3.4086E-01 | 0.5  |
| Contact algorithm    | 5.9047E+01 | 80.9 |
|   Traverse trees     | 6.8024E-05 | 0.0  |
|   Primitive test     | 5.8873E+01 | 80.7 |
| Generate output      | 1.0176E+01 | 13.9 |
|   Binary             | 3.6518E-02 | 0.1  |
|   Ascii              | 1.0116E+01 | 13.9 |
| Unmeasured           | 1.1099E+00 | 1.5  |

**Total CPU time: 72.99 s**

Basically this is the program’s own timing output telling us to focus on the primitive tests, as these alone take about 80% of the whole FEA simulation. Now I’m going to dive deeper and run this under the Instruments profiler (awesome tool, by the way).

Apple Instruments gives me a slightly different total runtime (83.60 s), but the function that tests a pair of quadrilaterals for collision takes 19 s, distributed as follows:

  1. 48.8% in a function that finds the intersection between two polygons in 2D
  2. 15.0% for a simple dot product between two 3D vectors
  3. 3.2% in normalizing 3D vectors
  4. 0.2% in cross product computation

I’m going to take a stab at the dot products first, since I wrote earlier about using SIMD to optimize the dot product with exactly this kind of data-oriented work in mind.

The current implementation is a trivial scalar one:

/* Scalar 3-D dot product: three multiplies and two adds. */
inline double VecDotProd3d( double* a, double* b )
{
  return( a[ 0 ] * b[ 0 ] + a[ 1 ] * b[ 1 ] + a[ 2 ] * b[ 2 ] );
}

Compiler Explorer, on Clang 15, gives assembly that resembles what I get from my Apple Clang 15:

+0x00	pushq       %rbp
+0x01	movq        %rsp,          %rbp
+0x04	movq        %rdi,          -0x8(%rbp)
+0x08	movq        %rsi,          -0x10(%rbp)
+0x0c	movq        -0x8(%rbp),    %rax
+0x10	movsd       (%rax),        %xmm1      ## xmm1 = mem[0],zero
+0x14	movq        -0x10(%rbp),   %rax
+0x18	movsd       (%rax),        %xmm2      ## xmm2 = mem[0],zero
+0x1c	movq        -0x8(%rbp),    %rax
+0x20	movsd       0x8(%rax),     %xmm0   ## xmm0 = mem[0],zero
+0x25	movq        -0x10(%rbp),   %rax
+0x29	mulsd       0x8(%rax),     %xmm0
+0x2e	mulsd       %xmm2,         %xmm1
+0x32	addsd       %xmm0,         %xmm1
+0x36	movq        -0x8(%rbp),    %rax
+0x3a	movsd       0x10(%rax),    %xmm0  ## xmm0 = mem[0],zero
+0x3f	movq        -0x10(%rbp),   %rax
+0x43	movsd       0x10(%rax),    %xmm2  ## xmm2 = mem[0],zero
+0x48	mulsd       %xmm2,         %xmm0
+0x4c	addsd       %xmm1,         %xmm0
+0x50	popq        %rbp
+0x51	retq

Out of the 11.68 s spent computing dot products, the six most expensive instructions, with addsd and mulsd occupying the top three spots, are:

(Instruments screenshot: per-instruction costs of the scalar dot product, with addsd and mulsd at the top.)

My x86 CPU does have AVX support, and I’m going to test whether replacing addsd and mulsd with their vector counterparts gives me a 4x speedup. The vector function is in the previous post, and I’m going to make some changes so that the function signature remains unchanged - I’ll trust the compiler with the rest.

First attempt: add intrinsics

The generated code now uses ymm registers, but the results are not what I expected: the time for dot product computations goes up from 11.68 s to 31.59 s! Surprisingly, vadd* and vmul* account for less than 500 ms; the most expensive operations are now the vmovap* moves:

(Instruments screenshot: vmovap* moves dominating the per-instruction costs.)
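The code of that first attempt is the vector function from the earlier post and isn’t reproduced here; a minimal sketch of what it plausibly looked like, assuming unaligned loads and inputs padded to four doubles with a zero in the last lane:

#include <immintrin.h>

/* Sketch of the first attempt: same reduction as the final version below,
   but the loads are unaligned. Assumes a and b point to four doubles,
   the fourth being zero. */
inline double VecDotProd3d( double* a, double* b )
{
    const __m256d vec_a = _mm256_loadu_pd( a );   /* unaligned load */
    const __m256d vec_b = _mm256_loadu_pd( b );
    const __m256d prod  = _mm256_mul_pd( vec_a, vec_b );

    const __m128d hi  = _mm256_extractf128_pd( prod, 1 );
    __m128d sum       = _mm_add_pd( _mm256_castpd256_pd128( prod ), hi );
    sum               = _mm_hadd_pd( sum, sum );
    return _mm_cvtsd_f64( sum );
}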

Second attempt: aligned data

I was wrong to trust that the compiler would lay out the memory the right way: the vector function now spends its time loading unaligned data from memory. This is slow, and the compiler cannot do any better on its own.

I edit the function and its signature as follows:

#include <immintrin.h>

inline double VecDotProd3d(const double* restrict a, const double* restrict b){
    /* Both pointers must be 32-byte aligned and padded to four doubles,
       with the fourth element kept at zero. */
    const __m256d vec_a = _mm256_load_pd(a);
    const __m256d vec_b = _mm256_load_pd(b);
    const __m256d prod = _mm256_mul_pd(vec_a, vec_b);

    /* Horizontal sum: (p0+p2, p1+p3), then p0+p1+p2+p3 in the low lane. */
    const __m128d sum_high = _mm256_extractf128_pd(prod, 1);
    const __m128d sum_low = _mm256_castpd256_pd128(prod);
    __m128d sum = _mm_add_pd(sum_low, sum_high);

    sum = _mm_hadd_pd(sum, sum);

    return _mm_cvtsd_f64(sum);
}
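One caveat the function relies on: the caller’s vectors must sit on 32-byte boundaries and be padded to four doubles, with the last lane kept at zero so it contributes nothing to the horizontal sum. A minimal sketch of one way to declare such storage, using C11 alignas; the Vec3d name is hypothetical:

#include <stdalign.h>

/* Hypothetical padded, 32-byte-aligned 3-D vector type. */
typedef struct
{
    alignas(32) double v[ 4 ];   /* v[3] must stay 0.0 */
} Vec3d;

/* Usage:
   Vec3d a = { { 1.0, 2.0, 3.0, 0.0 } };
   Vec3d b = { { 4.0, 5.0, 6.0, 0.0 } };
   double d = VecDotProd3d( a.v, b.v );   // 32.0
*/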

Conclusions

Now data is loaded into registers from aligned chunks of memory - if it were not aligned, the load would crash - and this gives a boost of more than 10%. I put the function inlining back for this final comparison, and this is the result for the primitive test function:

| Method     | Time (s) |
|------------|----------|
| Old method | 65.5     |
| New method | 53.2     |
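(53.2 s against 65.5 s works out to a 1.23x speedup, i.e. roughly a 19% reduction in primitive test time.)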

This looks good, but it is far from the 4x speedup I expected. The profiler can tell me more: when I look at how the data are moved in the disassembly, I see about 20 vmovapd instructions appear, shuffling data on the stack. The actual multiplications and additions complete in about 500 ms, but Instruments shows that overall about 4 s go into moving data up and down the stack.

This gain is enough for today: given how little code I had to change, it is a significant speedup for a large codebase. The way to improve further is rather invasive: refactor more code to pass variables in registers and avoid the calls to vmovapd altogether, as sketched below. But again, enough for today.
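To make that direction concrete, a speculative sketch (my extrapolation, not code from the project): if the operands already live in ymm registers when the function is called, the body emits no loads and no stack traffic.

#include <immintrin.h>

/* Speculative refactor: operands arrive in registers, so only the multiply,
   the lane extraction, and the adds remain. */
static inline double VecDotProd3dReg( __m256d a, __m256d b )
{
    const __m256d prod = _mm256_mul_pd( a, b );
    const __m128d hi   = _mm256_extractf128_pd( prod, 1 );
    __m128d sum        = _mm_add_pd( _mm256_castpd256_pd128( prod ), hi );
    sum                = _mm_hadd_pd( sum, sum );
    return _mm_cvtsd_f64( sum );
}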

Creative Commons Attribution-ShareAlike (CC BY-SA)