Measure your cache line

This post was inspired by Measuring the size of the cache line empirically and it’s my first step towards quantifying the impact of false sharing in multi-threaded applications.

My goal is to measure the cache line size on any system, and for that I need a fast and portable solution. These are the main steps:

  1. Create testing environment - I’ll start with Docker
  2. Write a C program that writes results to a json file
  3. Evaluate and compare the results - with a python helper script

Create testing environment

Surely, I could have just cloned the repo provided in the reference article, but I want to learn, compare and have fun 🏄‍♂️.

Pull my standard Ubuntu 22 dev environment

docker pull mmontanari/devenv:buildenv-ubuntu-22

keep it running

docker run -d --name buildenv mmontanari/devenv:buildenv-ubuntu-22 tail -f /dev/null

check it runs

docker ps

use it as a dev container in VS Code with “Attach to Running Container…”

Write a C program that writes results

Actually, no. I’m too lazy and want to test whether Claude can write this for me. I will only prompt it to reproduce the reference article, without providing the article’s GitHub code to Claude.

This is what I got as response:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFFER_SIZE (32 * 1024 * 1024)  // 32 MB
#define NUM_ITERATIONS 10
#define MAX_STRIDE 512

double get_time() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void strided_copy(char *src, char *dst, size_t size, size_t stride) {
    for (size_t i = 0; i < size; i += stride) {
        dst[i] = src[i];
    }
}

int main() {
    char *src = malloc(BUFFER_SIZE);
    char *dst = malloc(BUFFER_SIZE);
    if (!src || !dst) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    // Initialize source buffer
    for (int i = 0; i < BUFFER_SIZE; i++) {
        src[i] = (char)i;
    }

    FILE *fp = fopen("cache_line_results.json", "w");
    if (!fp) {
        fprintf(stderr, "Failed to open output file\n");
        return 1;
    }

    fprintf(fp, "{\n\"results\": [\n");

    for (size_t stride = 16; stride <= MAX_STRIDE; stride++) {
        double min_speed = 1e9, max_speed = 0, total_speed = 0;

        for (int iter = 0; iter < NUM_ITERATIONS; iter++) {
            double start = get_time();
            strided_copy(src, dst, BUFFER_SIZE, stride);
            double end = get_time();

            double elapsed = end - start;
            double speed = (BUFFER_SIZE / 1e9) / elapsed;  // GB/s

            if (speed < min_speed) min_speed = speed;
            if (speed > max_speed) max_speed = speed;
            total_speed += speed;
        }

        double avg_speed = total_speed / NUM_ITERATIONS;

        fprintf(fp, "  {\"stride\": %zu, \"min\": %.2f, \"max\": %.2f, \"avg\": %.2f}%s\n",
                stride, min_speed, max_speed, avg_speed,
                stride == MAX_STRIDE ? "" : ",");
    }

    fprintf(fp, "]\n}\n");
    fclose(fp);

    free(src);
    free(dst);
    return 0;
}

build and run with:

gcc -O3 -o cache_line_test cache_line_test.c && ./cache_line_test

all runs as expected 💯

Evaluate and compare the results

My dev container is missing pip and matplotlib, so let’s fix that:

apt-get update -y && apt-get install python3-pip -y
pip install matplotlib

I don’t need to mess around with virtual environments here since the container will be destroyed afterwards.

Next, let’s run the python script that Claude generated from the same prompt that gave me the C code - yes, you read that right: one prompt generated all the artifacts and matched the output plot, thanks to Claude’s vision capabilities.

import json
import matplotlib.pyplot as plt

# Read the JSON file
with open('cache_line_results.json', 'r') as f:
    data = json.load(f)

# Extract data
strides = [result['stride'] for result in data['results']]
min_speeds = [result['min'] for result in data['results']]
max_speeds = [result['max'] for result in data['results']]
avg_speeds = [result['avg'] for result in data['results']]

# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(strides, min_speeds, label='Min Speed', marker='o')
plt.plot(strides, max_speeds, label='Max Speed', marker='o')
plt.plot(strides, avg_speeds, label='Avg Speed', marker='o')

plt.xlabel('Stride (bytes)')
plt.ylabel('Speed (GB/s)')
plt.title('Cache Line Size Test Results')
plt.legend()
plt.xscale('log', base=2)
plt.grid(True)

# Add vertical lines at powers of 2
for x in [32, 64, 128, 256]:
    plt.axvline(x=x, color='gray', linestyle='--', alpha=0.5)

plt.savefig('cache_line_test_results.png')
plt.show()

This runs just fine as well.

Results

This is my first attempt:

This is really noisy and I can’t tell whether the speed rises at 64 or 128. I decided to repeat the test after replacing malloc with aligned_alloc, which reduces the noise significantly.

Better now! Repeating the test gives more consistent results, and the cache line on my laptop seems to be 128 bytes. Common wisdom led me to expect 64 bytes instead. Let’s do more testing.

Compare against reference tests

Since I could not find any official information from Intel about my i9 CPU, to validate my results I ran the reference code, which clearly takes a very similar approach. It runs in 8 minutes.

While it ran I compared the two programs: mine from Claude, the reference from ChatGPT. They are very similar, but the devil may be in the details. Also, I’d be very surprised if the python code could actually reproduce the image it was prompted with 🤩. I’m very curious to see how they compare.

Here are the results from the reference code:

They match! This measures a cache line of 128 bytes (the cache line length corresponds to the stride at which the speed curve leaves its plateau and starts rising steeply). Basically, the two programs produce consistent results. I looked for major differences, made some tests and changes, but could not find anything substantially different.

Conclusion

Generating code with tools like ChatGPT, Claude or other LLMs is great, but not perfect. These black boxes sometimes work great (see the python snippet for plotting), sometimes they don’t. When they don’t, I basically have to debug someone else’s code, and that takes time. Still, in this case, I got lucky and the result is genuinely impressive.

I gave an extremely simple prompt and augmented it with the full article (text and images). In return, I obtained perfectly running software that reproduced the desired results. Awesome 🤓.

Creative Commons Attribution-ShareAlike (CC BY-SA)