Memory Padding and The Art of Memory Packing Part1

Who should read this

This article is about a technique for reducing the memory footprint of C programs -manually repacking C structure declarations for reduced size. To read it, you will require basic knowledge of the C programming language.
You need to know this technique if you intend to write code for memory-constrained embedded systems, or operating-system kernels. It is useful if you are working with application data sets so large that your programs routinely hit memory limits. It is good to know in any application where you really care about minimizing cache-line misses.
Finally, knowing this technique is a gateway to other esoteric C topics. You are not an advanced C programmer until you have grasped it.

 

Alignment requirements

The first thing to understand is that, on modern processors, the way your C compiler lays out basic C datatypes in memory is constrained in order to make memory accesses faster.
Storage for the basic C datatypes on an x86 or ARM processor doesn’t normally start at arbitrary byte addresses in memory. Rather, each type except char has an alignment requirement; chars can start on any byte address, but 2-byte shorts must start on an even address, 4-byte ints or floats must start on an address divisible by 4, and 8-byte longs or doubles must start on an address divisible by 8. Signed or unsigned makes no difference.
The jargon for this is that basic C types on x86 and ARM are self-aligned. Pointers, whether 32-bit (4-byte) or 64-bit (8-byte) are self-aligned too.
Self-alignment makes access faster because it facilitates generating single-instruction fetches and puts of the typed data. Without alignment constraints, on the other hand, the code might end up having to do two or more accesses spanning machine-word boundaries. Characters are a special case; they’re equally expensive from anywhere they live inside a single machine word. That’s why they don’t have a preferred alignment.
I said “on modern processors” because on some older ones forcing your C program to violate alignment rules (say, by casting an odd address into an int pointer and trying to use it) didn’t just slow your code down, it caused an illegal instruction fault.

Padding

Now we’ll look at a simple example of variable layout in memory. Consider the following series of variable declarations in the top level of a C module:

int8_t *p;
int8_t c;
int32_t x;

If you didn’t know anything about data alignment, you might assume that these three variables would occupy a continuous span of bytes in memory. That is, on a 32-bit machine 4 bytes of pointer would be immediately followed by 1 byte of char and that immediately followed by 4 bytes of int. And a 64-bit machine would be different only in that the pointer would be 8 bytes.
Here’s what actually happens (on an x86 or ARM or anything else with self-aligned types). The storage for p starts on a self-aligned 4- or 8-byte boundary depending on the machine word size. This is pointer alignment – the strictest possible.
The storage for c follows immediately. But the 4-byte alignment requirement of x forces a gap in the layout; it comes out as though there were a fourth intervening variable, like this:

int8_t *p;      /* 4 or 8 bytes */
int8_t c;       /* 1 byte */
int8_t pad[3];  /* 3 bytes */
int32_t x;      /* 4 bytes */

The pad[3] character array represents the fact that there are three bytes of waste space in the structure. The old-school term for this was “slop”. The value of the padding bits is undefined; in particular it is not guaranteed that they will be zeroed.
Compare what happens if x is a 2-byte short:

int8_t *p;
int8_t c;
int16_t x;

In that case, the actual layout will be this:

int8_t *p;      /* 4 or 8 bytes */
int8_t c;       /* 1 byte */
int8_t pad[1];  /* 1 byte */
int16_t x;      /* 2 bytes */

On the other hand, if x is a long on a 64-bit machine

int8_t *p;
int8_t c;
int64_t x;

we end up with this:

int8_t *p;     /* 8 bytes */
int8_t c;      /* 1 byte
int8_t pad[7]; /* 7 bytes */
int64_t x;     /* 8 bytes */

 Structure alignment and padding

In general, a struct instance will have the alignment of its widest scalar member. Compilers do this as the easiest way to ensure that all the members are self-aligned for fast access.Also, in C the address of a struct is the same as the address of its first member – there is no leading padding.
Consider this struct:

struct s1 {
    int8_t *p;
    int8_t c;
    int64_t x;
};

Assuming a 64-bit machine, any instance of struct s1 will have 8-byte alignment. The memory layout of one of these looks unsurprising, like this:

struct s1 {
    int8_t *p;     /* 8 bytes */
    int8_t c;      /* 1 byte
    int8_t pad[7]; /* 7 bytes */
    int64_t x;     /* 8 bytes */
};

It’s laid out exactly as though variables of these types has been separately declared. But if we put c first, that’s no longer true.

struct s2 {
    int8_t c;      /* 1 byte */
    int8_t pad[7]; /* 7 bytes */
    int8_t *p;     /* 8 bytes */
    int64_t x;     /* 8 bytes */
};

If the members were separate variables, c could start at any byte boundary and the size of pad might vary. Because struct s2 has the pointer alignment of its widest member, that’s no longer possible. Now c has to be pointer-aligned, and following padding of 7 bytes is locked in.
Now let’s talk about trailing padding on structures. To explain this, I need to introduce a basic concept which most people call the stride address of a structure. It is the first address following the structure data that has the same alignment as the structure.
The general rule of trailing structure padding is this: the compiler will behave as though the structure has trailing padding out to its stride address. This rule controls whatsizeof() will return.
Consider this example on a 64-bit x86 or ARM machine:

struct s3 {
    int8_t *p;     /* 8 bytes */
    int8_t c;      /* 1 byte */
};
struct s3 single;
struct s3 quad[4];

You might think that sizeof(struct s3) should be 9, but it’s actually 16. The stride address is that of (&p)[2]. Thus, in the quad array, each member has 7 bytes of trailing padding, because the first member of each following struct wants to be self-aligned on an 8-byte boundary. The memory layout is as though the structure had been declared like this:

struct s3 {
    int8_t *p;     /* 8 bytes */
    int8_t c;      /* 1 byte */
    int8_t pad[7];
};

For contrast, consider the following example:

struct s4 {
    int16_t s;     /* 2 bytes */
    int8_t c;      /* 1 byte */
};

Because s only needs to be 2-byte aligned, the stride address is just one byte after c, and struct s4 as a whole only needs one byte of trailing padding. It will be laid out like this:

struct s4 {
    int16_t s;     /* 2 bytes */
    int8_t c;      /* 1 byte */
    int8_t pad[1];
};

and sizeof(struct s4) will return 4.
Here’s a last important detail: If your structure has structure members, the internal structs want to have the alignment of longest scalar too. Suppose you write this:

struct s5 {
    int8_t c;
    struct s5_internal {
        int8_t *p;
        int16_t x;
    } internal;
};

The  int8_t *p member in the internal struct forces the outer struct to be pointer-aligned as well as the inner. Actual layout will be like this on a 64-bit machine:

struct s5 {
    int8_t c;           /* 1 byte*/
    int8_t pad1[7];     /* 7 bytes */
    struct s5_internal {
        int8_t *p;      /* 8 bytes */
        int16_t x;      /* 2 bytes */
        int8_t pad2[6]; /* 6 bytes */
    } internal;
};

This structure gives us a hint of the savings that might be possible from repacking structures. Of 24 bytes, 13 of them are padding. That’s more than 50% waste space!

Bitfields

Now let’s consider bitfields. What they give you the ability to do is declare structure fields of smaller than character width, down to 1 bit, like this:

struct s6 {
    int16_t s;
    int8_t c;
    int32_t x:1;
    int32_t y:4;
    int32_t z:7;
};

The thing to know about bitfields is that they are implemented with word- and byte-level mask and rotate instructions operating on machine words, and cannot cross word boundaries. C99 guarentees that bit-fields will be packed as tightly as possible, provided they don’t cross storage unit boundaries.
Assuming we’re on a 32-bit machine, that implies that the layout may look like this:

struct s6 {
    int16_t s;       /* 2 bytes */
    int8_t c;        /* 1 byte */
    int32_t x:1;    /* total 1 bit */
    int32_t y:4;    /* total 5 bits */
    int32_t pad1:3;    /* pad to an 8-bit boundary */
    int32_t z:7;  /* 7 bits */
    int32_t pad2:25;   /* pad to 32 bits */
};

But this isn’t the only possibility, because the C standard does not specify that bits are allocated low-to-high. So the layout could look like this:

struct s6 {
    int16_t s;       /* 2 bytes */
    int8_t c;        /* 1 byte */
    int32_t pad1:3;    /* pad to an 8-bit boundary */
    int32_t x:1;    /* total 1 bit */
    int32_t y:4;  /* total 5 bits */
    int32_t pad2:25;   /* pad to 32 bits */
    int32_t z:7;  /* 7 bits */
};

That is, the padding could precede rather than following the payload bits.
Note also that, as with normal structure padding, the padding bits are not guaranteed to be zero; C99 mentions this.Note that the base type of a bit field is interpreted for signedness but not necessarily for size. It is up to implementors whether “int16_t x:1” or “int32_t x:1” are supported, and whether those base types change the size of the storage unit the field is packed into.
The restriction that bitfields cannot cross machine word boundaries means that, while the first two of the following structures pack into one and two 32-bit words as you’d expect, the third (struct s9) takes up three 32-bit words, in the last of which only one bit is used.

struct s7 {
    int32_t bigfield:31;      /* 32-bit word 1 begins */
    int32_t littlefield:1;
};
struct s8 {
    int32_t bigfield1:31;     /* 32-bit word 1 begins /*
    int32_t littlefield1:1;
    int32_t bigfield2:31;     /* 32-bit word 2 begins */
    int32_t littlefield2:1;
};
struct s9 {
    int32_t bigfield1:31;     /* 32-bit word 1 begins */
    int32_t bigfield2:31;     /* 32-bit word 2 begins */
    int32_t littlefield1:1;
    int32_t littlefield2:1;   /* 32-bit word 3 begins */
};

On the other hand, struct s8 would fit into a single 64-bit word if the machine has those.

Structure reordering

Now that you know how and why compilers insert padding in and after your structures we’ll examine what you can do to squeeze out the slop. This is the art of structure packing.
The first thing to notice is that slop only happens in two places. One is where storage bound to a larger data type (with stricter alignment requirements) follows storage bound to a smaller one. The other is where a struct naturally ends before its stride address, requiring padding so the next one will be properly aligned.
The simplest way to eliminate slop is to reorder the structure members by decreasing alignment. That is: make all the pointer-aligned subfields come first, because on a 64-bit machine they will be 8 bytes. Then the 4-byte ints; then the 2-byte shorts; then the character fields.
So, for example, consider this simple linked-list structure:

struct s10 {
    int8_t c;
    struct s10 *p;
    int16_t x;
};

With the implied slop made explicit, here it is:

struct s10 {
    int32_t c;          /* 1 byte */
    int8_t pad1[7];    /* 7 bytes */
    struct s10 *p;     /* 8 bytes */
    int16_t x;         /* 2 bytes */
    int8_t pad2[6];    /* 6 bytes */
};

That’s 24 bytes. If we reorder by size, we get this:

struct s11 {
    struct s11 *p;
    int8_t x;
    int8_t c;
};

Considering self-alignment, we see that none of the data fields need padding. This is because the stride address for a (longer) field with stricter alignment is always a validly-aligned start address for a (shorter) field with less strict requirements. All the repacked struct actually requires is trailing padding:

struct s11 {
    struct s11 *p; /* 8 bytes */
    int16_t x;         /* 2 bytes */
    int8_t c;          /* 1 byte */
    int8_t pad[5];     /* 5 bytes */
};

Our repack transformation drops the size from 24 to 16 bytes. This might not seem like a lot, but suppose you have a linked list of 200K of these? The savings add up fast, especially on memory-constrained embedded systems or in the core part of an OS kernel that has to stay resident.
Note that reordering is not guaranteed to produce savings. Applying this technique to an earlier example, struct s9, we get this:

struct s12 {
    struct s12_internal {
        int8_t *p;      /* 8 bytes */
        int32_t x;        /* 4 bytes */
    } internal;
    int8_t c;           /* 1 byte*/
};

With padding written out, this is

struct s12 {
    struct s12_internal {
        int8_t *p;      /* 8 bytes */
        int32_t x;        /* 4 bytes */
        int8_t pad[4];  /* 4 bytes */
    } internal;
    int8_t c;           /* 1 byte*/
    int8_t pad[7];      /* 7 bytes */
};

It’s still 24 bytes because c cannot back into the inner struct’s trailing padding. To collect that gain you would need to redesign your data structures.
The question you should be asking yourselves right now is  why, if reordering for minimal slop is so simple, C compilers don’t do it automatically. The answer: C is a language originally designed for writing operating systems and other code close to the hardware. Automatic reordering would interfere with a systems programmer’s ability to lay out structures that exactly match the byte and bit-level layout of memory-mapped device control blocks.

Other packing techniques

The riskiest form of packing is to use unions. If you know that certain fields in your structure are never used in combination with certain other fields, consider using a union to make them share storage. But be extra careful and verify your work with regression testing, because if your lifetime analysis is even slightly wrong you will get bugs ranging from crashes to (much worse) subtle data corruption.

Overriding alignment rules

Sometimes you can force your compiler into not using the processor’s normal alignment rules by using a pragma, usually #pragma pack. GCC  has __attribute__((packed))  you can attach to individual structure declarations; GCC has an -fpack-struct option for entire compilations.
Do not do this casually, as it forces the generation of more expensive and slower code. Usually you can save as much memory, or almost as much, with the techniques I describe here.The only good reason for #pragma pack is if you have to exactly match your C data layout to some kind of bit-level hardware or protocol requirement, like a memory-mapped hardware port, and violating normal alignment is required for that to work. If you’re in that situation, and you don’t already know everything else I’m writing about here, you’re in deep trouble and I wish you luck.

This concludes this article about memory padding and the art of memory packing Part1.
In the next article, I will continue talking about memory padding and packing but with hands-on examples using Atmel studio 7 IDE.
If you have any questions or comments about this article please leave it and we’ll answer it as soon as possible.
If you have any questions leave it on our facebook page using this hashtag ‪#‎embeddedx_knowledge_sharing  and stay tuned.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s