Mechanical Sympathy
Preamble
Developers have to deal with a lot of complexity. As tooling and abstractions get better, we become shielded from what happens underneath the hood - it’s all magic. This is great most of the time, but in building systems where performance really matters, the shelter we’ve grown accustomed to becomes a hindrance.
In the world of software, Mechanical Sympathy - a term popularized by Martin Thompson - is the practice of understanding how the underlying hardware works and using that knowledge to build performant systems. This is especially important where performance is critical, such as high-frequency trading systems, financial exchanges, and video games.
Interesting Examples
Summing a Matrix
Consider the following code snippet:
public class Example {
    public static void main(String[] args) {
        // Create N x N matrix of integers
        var N = 10000;
        var matrix = new int[N][N];
        for (var i = 0; i < N; i++) {
            for (var j = 0; j < N; j++) {
                matrix[i][j] = j + i;
            }
        }

        // Calculate sum of all elements in the matrix
        var t1 = System.currentTimeMillis();
        var sum = 0L;
        for (var i = 0; i < N; i++) {
            for (var j = 0; j < N; j++) {
                sum += matrix[j][i];
            }
        }
        var t2 = System.currentTimeMillis();

        // Print the results
        System.out.println("Sum: " + sum);
        System.out.println("Time: " + (t2 - t1) + "ms");
    }
}
Running the above code snippet results in the following output on my M1 Max:
Sum: 999900000000
Time: 473ms
One simple change to the code can result in a significant performance improvement:
- sum += matrix[j][i];
+ sum += matrix[i][j];
Sum: 999900000000
Time: 67ms
That’s nearly a 7x speedup. The two snippets are logically equivalent and have the same time complexity, so where does this huge difference come from?
Memory Access Patterns
When the CPU reads from memory, it reads in chunks called cache lines - typically 64 bytes - and stores them in the cache. Elements that are close together in memory are therefore effectively “prefetched” into the cache (spatial locality), resulting in faster access times when they are eventually read.
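To make the cache-line effect concrete, here is a minimal, machine-dependent sketch (the class name and sizes are my own, and results vary across CPUs): it sums all the elements of a large int array, then sums only every 16th element. Since a stride of 16 ints is exactly one 64-byte cache line, the second pass still touches every cache line, and despite doing 1/16 of the additions it often takes a comparable - not 16x smaller - amount of time.

```java
public class CacheLineSketch {
    public static void main(String[] args) {
        // 16M ints = 64 MB, likely larger than the last-level cache
        int n = 16_000_000;
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i;
        }

        // Pass A: sum every element
        long t1 = System.currentTimeMillis();
        long sumAll = 0L;
        for (int i = 0; i < n; i++) {
            sumAll += data[i];
        }
        long t2 = System.currentTimeMillis();

        // Pass B: sum every 16th element - one access per 64-byte cache line
        long t3 = System.currentTimeMillis();
        long sumStride = 0L;
        for (int i = 0; i < n; i += 16) {
            sumStride += data[i];
        }
        long t4 = System.currentTimeMillis();

        System.out.println("All elements: " + (t2 - t1) + "ms (sum " + sumAll + ")");
        System.out.println("Every 16th:   " + (t4 - t3) + "ms (sum " + sumStride + ")");
    }
}
```

The per-element cost of pass B is much higher: memory traffic, not arithmetic, dominates.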
When we read the cells row by row (matrix[i][j]), the CPU fetches the 64-byte cache line containing matrix[i][j], which also holds that element’s neighbors in row i, although we’re only interested in a single element at that point in time. Subsequent reads of the rest of the row (j+1, j+2, …) are faster because those elements are already in the cache.
sum += matrix[0][0] // cache miss: the line holding matrix[0][0..15] is fetched and cached
sum += matrix[0][1] // cache hit
sum += matrix[0][2] // cache hit
When we read the cells column-first (matrix[j][i]), every access lands in a different row array:
sum += matrix[0][0] // cache miss on row 0
sum += matrix[1][0] // cache miss on row 1
sum += matrix[2][0] // cache miss on row 2
The repeated cache misses result in slower performance.
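In Java specifically, this behavior follows from how 2D arrays are laid out: int[N][N] is not one contiguous block but an array of references to independent row arrays. A minimal sketch (the class name is illustrative):

```java
public class LayoutSketch {
    public static void main(String[] args) {
        int[][] m = new int[3][4];

        // Each row is its own contiguous int[] object on the heap.
        int[] row0 = m[0];
        row0[2] = 42;
        System.out.println(m[0][2]);      // prints 42: row0 and m[0] are the same object
        System.out.println(m[0] == m[1]); // prints false: rows are distinct objects,
                                          // possibly far apart on the heap
    }
}
```

So column-first traversal hops between separate heap objects on every access, while row-first traversal walks one contiguous array at a time.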
The (language-agnostic) take-away is that contiguous memory access is faster than non-contiguous, erratic memory access.
To further illustrate this point, consider another example:
Iterating over LinkedList vs ArrayList
Java’s LinkedList and ArrayList are two common List implementations.
Under the hood:
LinkedList allocates a new node object on the heap for each element in the list. These nodes are scattered throughout the heap, resulting in non-contiguous memory access.
ArrayList stores its elements in a single backing array on the heap - a contiguous segment of memory that is reallocated as the list grows. This allows contiguous memory access, improving performance.
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class Example {
    public static void main(String[] args) {
        // Create a list of 100M integers
        List<Integer> list = new LinkedList<>();
        for (int num = 0; num < 100_000_000; num++) {
            list.add(num);
        }

        // Iterate through the list and sum
        var t1 = System.currentTimeMillis();
        long sum = 0L;
        for (int num : list) {
            sum += num;
        }
        var t2 = System.currentTimeMillis();

        // Print the results
        System.out.println("Sum: " + sum);
        System.out.println("Time: " + (t2 - t1) + "ms");
    }
}
Running the above code snippet results in the following output on my M1 Max:
Sum: 4999999950000000
Time: 922ms
By changing the list implementation from LinkedList to ArrayList, we can see a drastic difference in iteration speed:
- List<Integer> list = new LinkedList<>();
+ List<Integer> list = new ArrayList<>();
Sum: 4999999950000000
Time: 132ms
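One further point worth noting (my addition, not part of the original benchmark): both List variants store references to boxed Integer objects, so even ArrayList only keeps the references contiguous - the Integer values themselves still live as separate heap objects. A primitive int[] keeps the values themselves contiguous and typically iterates faster still. A minimal sketch:

```java
public class PrimitiveSketch {
    public static void main(String[] args) {
        // Same 100M values, but as primitive ints in one contiguous array
        int n = 100_000_000;
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }

        var t1 = System.currentTimeMillis();
        long sum = 0L;
        for (int v : values) {
            sum += v;
        }
        var t2 = System.currentTimeMillis();

        System.out.println("Sum: " + sum); // prints 4999999950000000, matching the list versions
        System.out.println("Time: " + (t2 - t1) + "ms");
    }
}
```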
Conclusion
Understanding how the underlying hardware works can help us write more performant code when it matters.