[Optimization] CPU, Cache and Data Structures

This post is about data structures and how they compare in performance when the CPU's prefetch mechanism and other cache effects are taken into account. It shows that linked lists are bad in most cases and how to deal with variable sharing.

The university view

Almost everyone who studied computer science has learned that there are some major differences in big O complexity between data structures like lists, arrays, trees, etc. The basic differences can be seen here:

                     Time Complexity (Average)              Time Complexity (Worst)          Space (Worst)
Data Structure       Access    Search    Insert    Delete   Access  Search  Insert  Delete
Array                Θ(1)      Θ(n)      Θ(n)      Θ(n)     O(1)    O(n)    O(n)    O(n)     O(n)
Stack                Θ(n)      Θ(n)      Θ(1)      Θ(1)     O(n)    O(n)    O(1)    O(1)     O(n)
Queue                Θ(n)      Θ(n)      Θ(1)      Θ(1)     O(n)    O(n)    O(1)    O(1)     O(n)
Doubly-Linked List   Θ(n)      Θ(n)      Θ(1)      Θ(1)     O(n)    O(n)    O(1)    O(1)     O(n)
Hash Table           N/A       Θ(1)      Θ(1)      Θ(1)     N/A     O(n)    O(n)    O(n)     O(n)
Binary Search Tree   Θ(log n)  Θ(log n)  Θ(log n)  Θ(log n) O(n)    O(n)    O(n)    O(n)     O(n)

So basically this is true, but there is more to consider when using data structures in practice.

Memory, Cache and CPU

When talking about cache-friendly code, we need to think about what the bottleneck of modern CPUs is. There are several things to know, but in the end the main bottleneck is memory access. The registers are very fast but have a small capacity, while the main memory is rather big but has to send its data to the CPU via the northbridge, which is slow.

Between the registers and the main memory, a modern processor has a hierarchical cache structure. The L1 cache is the fastest and is a local cache, meaning it is used by only a single core. The L2 cache is slower than L1 and also local. The L3 cache is slower again, but offers a powerful mechanism to share data between cores, so it is a shared cache for all cores. All these caches are managed by hardware. If a cache does not contain the requested data, a cache miss occurs. Avoiding cache misses is key to performance.

Want more details? Please take a look at this article about memory: What every programmer should know about memory

Dataflow from Memory to ALU

Level         Access Time (ns)   Typical Size   Technology
Registers     ~1                 64 bit         CMOS
L1 (local)    2-5                32 KB          SRAM
L2 (local)    5-10               256 KB         SRAM
L3 (shared)   10-40              8-16 MB        SRAM
Main Memory   60                 8-64 GB        DRAM

Coalesced Reads and Prefetching

Data between the CPU and main memory is transferred in chunks (cache lines, typically 64 bytes), so if you have variables stored one after another, the CPU fetches a block that contains more than just the currently requested variable. That means reading these variables in the same order as they are stored in memory is significantly faster than reading variables that are spread around. Reading data as it is stored in memory is called coalesced reading.

Prefetching is a mechanism that tries to read data from main memory before it is used explicitly. The CPU fetches some chunks of main memory located behind the data currently being processed, because it assumes that this data will be used soon. Prefetching is very powerful if the data is not fragmented.
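
To make this concrete, here is a minimal Java sketch (the class name, the sizes and the plain nanoTime measurement are my own; a serious benchmark would use a harness like JMH). It reads the same array once sequentially, so the prefetcher can stream cache lines ahead, and once with a large stride that defeats it:

public class SketchPrefetch {
    public static void main(String[] args) {
        int n = 1 << 24;                    // 16M ints = 64 MB, far larger than L3
        int[] data = new int[n];
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {       // coalesced: one new cache line per 16 ints
            sum += data[i];
        }
        long sequential = System.nanoTime() - t0;

        int stride = 4096;                  // jump far enough that prefetching cannot help
        t0 = System.nanoTime();
        for (int s = 0; s < stride; s++) {
            for (int i = s; i < n; i += stride) {
                sum += data[i];
            }
        }
        long strided = System.nanoTime() - t0;

        System.out.println("sequential: " + sequential + " ns, strided: "
                + strided + " ns, sum: " + sum);
    }
}

Both loops touch every element exactly once; only the access order differs, and on typical hardware the strided version comes out several times slower.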

Consequences

We don't want data to be fragmented in memory, and we don't want unaligned data. Moreover, if we need to iterate through data, we should do it the right way (coalesced). If we share data between threads, we don't want to suffer from false sharing.

In Java the JIT compiler does a lot of optimization, so we don't need to be as careful as in C++, but we still have some major things to look after.

Linked lists suck

The best example of the limitations of big O notation is the LinkedList, which is used very often, especially in Java. If you think about linked lists from the cache perspective, they are horrible in most cases, because linked list implementations often lead to fragmented memory. Inserting and adding items over time spreads the list around the heap, so if you ever iterate over such a list, it will perform very poorly: many iteration steps will cause a cache miss and require a read from main memory. Vectors or ArrayLists are much better in most cases. Even where you would expect them to be bad in theory, such as when inserting items, they outperform linked structures, because they are cleverly implemented: they reserve additional capacity for new items and keep their elements in contiguous chunks.

A rule of thumb is to avoid linked lists, especially when you iterate through the list from time to time. One additional remark: there is a design technique called data-driven design where linked lists are not that bad.
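
A minimal sketch of the difference (class name and sizes are my own; again, nanoTime is only indicative):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListSketch {
    // Iterate once over the list and sum the values.
    static long sumAll(List<Integer> list) {
        long sum = 0;
        for (int v : list) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // ArrayList keeps its elements in one contiguous backing array;
        // LinkedList allocates a separate node object per element, which
        // may end up scattered across the heap.
        List<Integer> arrayList = new ArrayList<>(n);
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            arrayList.add(i);
            linkedList.add(i);
        }

        long t0 = System.nanoTime();
        long a = sumAll(arrayList);
        long tArray = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long b = sumAll(linkedList);
        long tLinked = System.nanoTime() - t0;

        System.out.println("ArrayList: " + tArray + " ns, LinkedList: "
                + tLinked + " ns (" + a + ", " + b + ")");
    }
}

Note that both lists box their ints into Integer objects here; a plain int[] would be more cache friendly still.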

Want to get more details? Bjarne Stroustrup: Why you should avoid Linked Lists

Iterate smart

If you iterate through multidimensional arrays, take a step back and think about how they are stored in memory. This is a very bad idea:
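
A sketch of the pattern in Java (the class and the names image, process, width and height are illustrative, not from the original snippet):

public class BadIteration {
    static long result = 0;

    static void process(int pixel) {
        result += pixel;
    }

    public static void main(String[] args) {
        int height = 4096, width = 4096;
        int[][] image = new int[height][width]; // each row is stored contiguously

        // The inner loop walks down a column, so every access jumps to a
        // different row array and usually to a different cache line.
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                process(image[y][x]);           // column-wise access + a call per pixel
            }
        }
    }
}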

This iterates through the image column by column, not row by row, which is a very bad idea, as the data is stored row after row (the innermost index should be the one counting up). Making a function call per element is also not ideal here, but the compiler will inline it for you in most cases.
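
Swapping the two loops in the sketch above makes the access coalesced, and inlining the body by hand also removes the per-pixel call:

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        result += image[y][x];  // row-wise: one fetched cache line serves many pixels
    }
}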

Don't share everything

Avoiding false sharing is much harder to achieve. False sharing occurs when threads write to different variables that happen to live on the same cache line, so the line constantly bounces between cores even though no data is logically shared. The best way to avoid it is to use as few shared variables as possible and to rely on atomic types, which are optimized. Another way to go is divide and conquer or MapReduce: do the computation locally (thread-local) and aggregate the results when all threads have finished.
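
A minimal sketch of that thread-local pattern (all names are my own): each task accumulates into a plain local variable, so no cache line bounces between cores during the hot loop, and the partial results are combined exactly once at the end.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalAggregation {
    public static void main(String[] args) throws Exception {
        int[] data = new int[10_000_000];
        Arrays.fill(data, 1);

        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = data.length / threads;

        List<Future<Long>> parts = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = (t == threads - 1) ? data.length : from + chunk;
            parts.add(pool.submit(() -> {
                long local = 0;                  // thread-local accumulator, no sharing
                for (int i = from; i < to; i++) {
                    local += data[i];
                }
                return local;                    // hand back the partial result once
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) {
            total += f.get();                    // aggregate after all threads finished
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }
}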

I hope you liked this post. Please share!
