The Importance of Correct Memory Alignment

19 November 2009

Developing high-performance applications is not an easy task. It is challenging not only from a programming perspective; knowing all the possible optimizations is almost impossible. Using compiler optimizations is a first step, but even those cannot fix every oddity the developer programmed. In this blog post I want to give an example of a performance optimization that you can add to your code with few modifications.

RAM — random is not sequential

Even though one might think access to main memory comes at constant cost, almost everybody has heard of the so-called cache hierarchy and the different cache levels available in modern CPUs. Typically that is the last time you thought about them, and you never cared about them again, because access to memory is handled by the applications you are using. As soon as you start writing your own application, it is time to think about the whole topic again.

Back to the beginning: in short, how does the whole memory, cache, register thing work? First the program issues a load from a certain memory address, followed by a lookup in the cache hierarchy (L1, L2). If the data is not available there, a load request to main memory is triggered and the data is transferred from memory to the L2 cache and from there to the L1 cache.

The important take-away is that a single memory location is never transferred on its own; instead a whole cache line is transferred (let's leave prefetching out of scope). Depending on your processor architecture this is usually 64 bytes. Depending on the speed of your RAM, the whole load takes at minimum around 60 clock cycles, a lot of time during which your program does nothing but wait for this specific memory. Now imagine that from those 64 bytes you consume only 4 bytes, or even worse only a single byte, and then proceed to the next cache line! Depending on your algorithms this means you are loading 64 bytes and throwing 60 away, loading 64 bytes, throwing 60 bytes away. That does not sound very efficient, does it?
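To get a feeling for the effect, here is a minimal sketch (array size and the stride of 16 ints, i.e. 64 bytes, are assumptions for illustration) that consumes every byte of each cache line in one pass and only a single int per line in the other:

#include <cstddef>
#include <ctime>
#include <iostream>

static const std::size_t N = 16 * 1024 * 1024;  // 64 MB of ints
static int data[N];

int main()
{
    // Dense pass: all 16 ints (64 bytes) of every loaded cache
    // line are consumed.
    std::clock_t t0 = std::clock();
    long long dense = 0;
    for (std::size_t i = 0; i < N; ++i)
        dense += data[i];
    std::clock_t t1 = std::clock();

    // Sparse pass: only one int per cache line is consumed, the
    // other 60 bytes are loaded and thrown away. It performs 1/16
    // of the additions but still has to pull in every cache line,
    // so for a memory-bound array it takes almost as long.
    long long sparse = 0;
    for (std::size_t i = 0; i < N; i += 16)
        sparse += data[i];
    std::clock_t t2 = std::clock();

    std::cout << "dense:  " << (t1 - t0) << " clocks (" << dense << ")\n"
              << "sparse: " << (t2 - t1) << " clocks (" << sparse << ")\n";
}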

Optimizing Data Structures

The first step is to optimize the data structures your program is using. Typical means are checking the width of each type, checking that the most important fields are defined first in the structure, etc. Remember that C and C++ support structs with bit fields in case you need them!
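As a small illustration (the member names are made up), ordering the largest members first avoids padding holes, and bit fields pack several small flags into one underlying word:

#include <iostream>

// Naive layout: the compiler inserts padding after the small
// members to align the following ones, so this typically occupies
// 24 bytes on a 64-bit platform.
struct Wasteful {
    char   tag;    // 1 byte + 7 bytes padding
    double value;  // 8 bytes
    char   flag;   // 1 byte + 3 bytes padding
    int    count;  // 4 bytes
};

// Largest members first: 14 bytes of data, padded to 16 in total.
struct Compact {
    double value;
    int    count;
    char   tag;
    char   flag;
};

// Bit fields squeeze several small flags into one unsigned int.
struct Flags {
    unsigned dirty    : 1;
    unsigned visible  : 1;
    unsigned priority : 4;
};

int main()
{
    std::cout << sizeof(Wasteful) << " " << sizeof(Compact)
              << " " << sizeof(Flags) << std::endl;  // e.g. 24 16 4
}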

Optimizing Data Loading

In the last step you probably spent some time optimizing the layout of your data structure, but there is another point you might not have thought about before: the alignment of the complete data structure inside your virtual address space.

The picture above shows a memory block that is subdivided into attributes and rows. C.w is the width of all attributes together, L.w is the width of a single cache line, and P.w is the width of an operation working only on the first two attributes. The interesting part of the diagram is C.o, the offset of the first row of this container from the beginning of the cache line. As you can see, C.o virtually moves through all rows of this fixed-width container, with every row having a different offset into the cache line.

The downside of this varying C.o is that it is not predictable how many cache misses will occur when the data structure is accessed. Even more important, when offsetting to a certain position inside the memory block it is impossible to calculate whether the access incurs an additional cache miss, because the initial offset is unknown.

To overcome this issue we can use posix_memalign() to explicitly align an allocated memory block. This function aligns the pointer of the memory area to a boundary that is set via its second parameter. A typical example would be the following:

#include <stdlib.h>
#include <iostream>

#define BLOCK_SIZE 1024
typedef unsigned char BYTE;

int main()
{
    BYTE* data = NULL;
    // posix_memalign() expects a void**, so the BYTE** needs a cast;
    // a non-zero return value signals that the allocation failed.
    if (posix_memalign(reinterpret_cast<void**>(&data), 64, BLOCK_SIZE) != 0)
        return 1;

    // ... do something ...
    free(data);

    std::cout << "Finished!" << std::endl;
}

In the above case the allocation always happens at a 64-byte boundary; as a result the beginning of this memory block is always aligned to the beginning of a cache line. Based on this knowledge it is possible to calculate when an additional cache miss will happen.
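With an aligned base pointer the arithmetic becomes trivial. A sketch of that calculation (the row width of 20 bytes is an assumption for illustration), telling for each row whether it straddles a line boundary:

#include <cstddef>
#include <iostream>

static const std::size_t LINE = 64;  // cache line width L.w
static const std::size_t ROW  = 20;  // row width C.w (assumed)

// With an aligned base, row i starts at byte offset i * ROW, so the
// cache lines it touches depend on i alone, not on where the block
// happened to land in memory.
static std::size_t lines_touched(std::size_t i)
{
    std::size_t first = (i * ROW) / LINE;            // line of first byte
    std::size_t last  = (i * ROW + ROW - 1) / LINE;  // line of last byte
    return last - first + 1;  // 2 means one additional cache miss
}

int main()
{
    for (std::size_t i = 0; i < 5; ++i)
        std::cout << "row " << i << " spans "
                  << lines_touched(i) << " cache line(s)\n";
}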

In certain cases it might be advisable to extend the width of a data type to an exact fraction or multiple of a cache line to further optimize memory access. A possible case could be a data structure that is 60 bytes wide: extending it with an empty 4-byte variable to a size of 64 bytes ensures that each element starts at the beginning of a new cache line.
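A minimal sketch of that idea (the 60-byte payload is the assumed example from above): pad the record to exactly 64 bytes so that, combined with an aligned allocation, every element begins on its own cache line.

#include <iostream>

struct Record {
    unsigned char payload[60];  // the actual 60 bytes of data
    unsigned char pad[4];       // explicit padding up to 64 bytes
};

int main()
{
    // With posix_memalign(..., 64, ...) for an array of Record,
    // element i starts at offset i * 64, i.e. on a line boundary.
    std::cout << sizeof(Record) << std::endl;  // 64
}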

In any case I can only recommend consulting the hardware documentation of the CPU you are working with from time to time. When you are working with Intel hardware, take a look at Intel's developer documentation; I especially recommend the Architectures Optimization Reference Manual.

– Martin
