Mechanical Sympathy


Preamble

Developers have to deal with a lot of complexity. As tooling and abstractions improve, we become shielded from what happens under the hood - it’s all magic. This is great most of the time, but when building systems where performance really matters, the shelter we’ve grown accustomed to becomes a hindrance.

In the world of software, Mechanical Sympathy - a term popularized by Martin Thompson - is the understanding of how the underlying hardware works, and the use of that knowledge to build performant systems. This is especially important where performance is critical: high-frequency trading systems, financial exchanges, video games, and so on.

Interesting Examples

Summing a Matrix

Consider the following code snippet:

public class Example {
    public static void main(String[] args) {
        // Create N x N matrix of integers
        var N = 10000;
        var matrix = new int[N][N];
        for (var i = 0; i < N; i++) {
            for (var j = 0; j < N; j++) {
                matrix[i][j] = j + i;
            }
        }
 
        // Calculate sum of all elements in the matrix
        var t1 = System.currentTimeMillis();
        var sum = 0L;
        for (var i = 0; i < N; i++) {
            for (var j = 0; j < N; j++) {
                sum += matrix[j][i];
            }
        }
        var t2 = System.currentTimeMillis();
 
        // Print the results
        System.out.println("Sum: " + sum);
        System.out.println("Time taken: " + (t2 - t1) + "ms");
    }
}

Running the above code snippet results in the following output on my M1 Max:

Sum: 999900000000
Time: 473ms

One simple change to the code can result in a significant performance improvement:

- 17               sum += matrix[j][i];
+ 17               sum += matrix[i][j];
Sum: 999900000000
Time: 67ms

That’s roughly a 7x difference in performance. The two versions are logically equivalent and perform exactly the same number of additions, so why the huge difference?

Memory Access Patterns

When the CPU reads from memory, it reads in chunks called cache lines (typically 64 bytes) and stores them in the cache. Elements that are close together in memory are effectively “prefetched” together: touching one element pulls its neighbours into the cache, so accessing them shortly afterwards is fast. This is known as spatial locality.

When we read the cells row by row (matrix[i][j], with j varying fastest), the CPU loads the entire cache line containing matrix[i][j], even though we only want a single element at that point in time. With 4-byte ints and 64-byte lines, one load pulls in 16 consecutive elements, so the subsequent reads (j+1, j+2, …) are already in the cache.

sum += matrix[0][0]  // cache miss: line holding matrix[0][0..15] loaded
sum += matrix[0][1]  // cache hit
sum += matrix[0][2]  // cache hit

When we read the cells column by column (matrix[j][i], with the row index varying fastest), every access lands in a different row - a different array object, and a different cache line:

sum += matrix[0][0]  // cache miss on row 0
sum += matrix[1][0]  // cache miss on row 1
sum += matrix[2][0]  // cache miss on row 2

Each of those misses loads a full 64-byte line but uses only 4 bytes of it, and with N = 10000 the early rows’ lines are long evicted by the time we wrap around for the next column. The repeated cache misses result in slower performance.

The (language-agnostic) takeaway is that sequential access over contiguous memory is faster than scattered, non-contiguous access.
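
To see the same effect without matrices, here is a minimal sketch (the class name StrideExample is my own; exact timings will vary by machine and JVM) that sums one large array twice - once sequentially, and once jumping a full cache line per access. Both passes do the same number of additions, but the strided pass uses only 4 of the 64 bytes each miss brings in:

public class StrideExample {
    public static void main(String[] args) {
        // Large enough that the array cannot fit in any CPU cache
        var data = new int[64_000_000]; // ~256 MB
        for (var i = 0; i < data.length; i++) {
            data[i] = i;
        }

        // Sequential pass: consecutive elements share cache lines
        var t1 = System.currentTimeMillis();
        var sum = 0L;
        for (var i = 0; i < data.length; i++) {
            sum += data[i];
        }
        var t2 = System.currentTimeMillis();
        System.out.println("Sequential: " + (t2 - t1) + "ms, sum=" + sum);

        // Strided pass: same additions, but each access jumps a full
        // cache line (16 ints * 4 bytes = 64 bytes)
        var stride = 16;
        var t3 = System.currentTimeMillis();
        sum = 0L;
        for (var offset = 0; offset < stride; offset++) {
            for (var i = offset; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        var t4 = System.currentTimeMillis();
        System.out.println("Strided:    " + (t4 - t3) + "ms, sum=" + sum);
    }
}

On most machines the strided pass comes out several times slower, despite doing identical work.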

To further illustrate this point, consider another example.

Iterating over LinkedList vs ArrayList

Java’s LinkedList and ArrayList are two common List implementations. Under the hood, an ArrayList stores its elements in a single contiguous backing array, while a LinkedList allocates a separate node object for every element and chains the nodes together with pointers. A simplified sketch of the two layouts (not the actual JDK source):
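
// Simplified, illustrative sketches - not the real JDK classes
class SimplifiedArrayList {
    Object[] elements;      // one contiguous block of references
}

class SimplifiedLinkedList {
    Node first;

    static class Node {     // one separate heap allocation per element
        Object item;
        Node next;
        Node prev;
    }
}

Iterating over an ArrayList therefore walks one flat array, while iterating over a LinkedList chases pointers from node to node, with each hop potentially landing on a cold cache line. With that in mind, consider summing a list of 100 million integers: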

import java.util.*;
 
public class Example {
    public static void main(String[] args) {
 
        // Create a list of 100M integers
        List<Integer> list = new LinkedList<>();
        for (int num = 0; num < 100_000_000; num++) {
            list.add(num);
        }
 
        // Iterate through the list and sum
        var t1 = System.currentTimeMillis();
        long sum = 0L;
        for (int num : list) {
            sum += num;
        }
        var t2 = System.currentTimeMillis();
 
        // Print the results
        System.out.println("Sum: " + sum);
        System.out.println("Time: " + (t2 - t1) + "ms");
    }
}

Running the above code snippet results in the following output on my M1 Max:

Sum: 4999999950000000
Time: 922ms

By changing the list implementation from LinkedList to ArrayList, we can see a drastic difference in iteration speed:

- 7        List<Integer> list = new LinkedList<>();
+ 7        List<Integer> list = new ArrayList<>();
Sum: 4999999950000000
Time: 132ms
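
One caveat: both lists store boxed Integer objects, so even the ArrayList holds an array of references to values scattered around the heap. Pushing the idea further, a plain primitive array keeps the values themselves contiguous - a hypothetical sketch (not benchmarked here) that would typically iterate faster still:

public class PrimitiveExample {
    public static void main(String[] args) {
        // Values live directly in the array - no boxing, no pointers to chase
        var data = new int[100_000_000];
        for (var i = 0; i < data.length; i++) {
            data[i] = i;
        }

        var sum = 0L;
        for (var num : data) {
            sum += num;
        }
        System.out.println("Sum: " + sum);
    }
}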

Conclusion

Understanding how the underlying hardware works can help us write more performant code, when it matters.
