FluxFlu

Ashley

SWAR and Hamming Weight

Binary Hamming weight, or population count, refers to the number of binary digits that are set in a given number.

For example, the population count of 0b01101 would be 3. Pretty simple stuff.

While looking for an efficient implementation of this operation, I came across the following C code. Bear with me if you aren't familiar with C; the language itself doesn't particularly matter here.

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x + (x >> 4)) & 0x0F0F0F0F;
                    x = (x * 0x01010101) >> 24;
                    return x;
                }
                

Now, unless you're already familiar with this type of optimization, this code is likely very difficult to parse.

What follows is the de-optimized version, which should be much more readable.

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = (x & 0x55555555) + ((x >> 1 ) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2 ) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4 ) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8 ) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

This still isn't particularly understandable, so I'll start by explaining the logic behind it.

SIMD stands for Single Instruction, Multiple Data. It allows for multiple values to be operated on using a single instruction; it is a technique most well-known for its usage with vector and matrix operations.

SWAR stands for SIMD Within A Register. It's similar to SIMD, but instead of using purpose-built types like vectors, we use a single number.

Given that context, it becomes clear what this code is doing. We start with 32 separate 1-bit numbers and successively add neighboring numbers into bigger ones, doubling their size and halving their count each time.


For example, let's take 0xDEADBEEF. The binary representation of this number is 0b11011110101011011011111011101111. That's 24 ones, or a population count of 24.

We treat each of these bits as a single 1-bit number, giving a 32-number-long array of 1-bit numbers:

                [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]

Now, let's fold it. For each set of two numbers, we add them together. This results in a 16-number-long array of 2-bit numbers:

                [10, 01, 10, 01, 01, 01, 10, 01, 01, 10, 10, 01, 10, 01, 10, 10]

We do this again: for each set of two numbers, we add them together. This results in an 8-number-long array of 4-bit numbers:

                [0011, 0011, 0010, 0011, 0011, 0011, 0011, 0100]

Fold this into a 4-number-long array of 8-bit numbers:

                [00000110, 00000101, 00000110, 00000111]

Then we get a 2-number-long array of 16-bit numbers:

                [0000000000001011, 0000000000001101]

Finally, we end up with 0b00000000000000000000000000011000. This is the binary representation of the number 24, which is, as shown before, the population count of the number we started with.


It makes sense why this works: treating each bit in a binary number as a single 1-bit number and then adding them all together would logically result in the population count. But this is clearly a strange way of actually adding them.

This may seem like a roundabout method for calculating population count, but it's one of the most efficient ways to do it without a dedicated hardware instruction. This is because of SWAR.

Let's look back at the first line of the de-optimized code.

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = (x & 0x55555555) + ((x >> 1 ) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2 ) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4 ) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8 ) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

If we run our original example, x = 0xDEADBEEF, through this, we end up with (x & 0x55555555) + ((x >> 1) & 0x55555555) == 0b10011001010110010110100110011010. This series of bits may seem familiar, and that's because it's the same as the 16-number-long array of 2-bit numbers from before.

But the most pressing question is, "why does that work?" First, it's important to note that 0x55555555 == 0b01010101010101010101010101010101. This means that (x & 0x55555555) keeps only the odd-numbered bits (counting from 1 at the least significant end).

To continue with the example, the odd-numbered bits of 0b11011110101011011011111011101111 are 0b01010100000001010001010001000101.

And in the same vein, ((x >> 1) & 0x55555555) results in only the even-numbered bits, this time shifted to the right. The even-numbered bits of 0b11011110101011011011111011101111 are 0b10001010101010001010101010101010. But since it's shifted to the right, we get 0b01000101010101000101010101010101.


The incredible result of this is that we have obtained two arrays, each containing half of every pair of 1-bit numbers. What this means is, we can now add them together, resulting in a 16-number-long array of 2-bit numbers. Hence,

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = (x & 0x55555555) + ((x >> 1 ) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2 ) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4 ) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8 ) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

Given that each line in this code actually does the same thing, I won't explain it all in depth. The following, with each mask written out in binary, should be self-explanatory:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    // 0x55555555 == 0b01010101010101010101010101010101
                    x = (x & 0x55555555) + ((x >> 1 ) & 0x55555555);

                    // 0x33333333 == 0b00110011001100110011001100110011
                    x = (x & 0x33333333) + ((x >> 2 ) & 0x33333333);

                    // 0x0F0F0F0F == 0b00001111000011110000111100001111
                    x = (x & 0x0F0F0F0F) + ((x >> 4 ) & 0x0F0F0F0F);

                    // 0x00FF00FF == 0b00000000111111110000000011111111
                    x = (x & 0x00FF00FF) + ((x >> 8 ) & 0x00FF00FF);

                    // 0x0000FFFF == 0b00000000000000001111111111111111
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);

                    return x;
                }
                

So now that the basic form is cleared up, let's move on to optimization.

We can start with something very basic. The first line of code in the function can be optimized as follows:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = (x & 0x55555555) + ((x >> 1 ) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2 ) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4 ) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8 ) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

Becomes:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

These two forms are equivalent, though it's far from obvious at a glance. Since a bare claim isn't really a satisfying explanation, I'll write a table.

                Given a 2-bit number,
                
                (11 >> 1) & 0x55555555 == 01
                (10 >> 1) & 0x55555555 == 01
                (01 >> 1) & 0x55555555 == 00
                (00 >> 1) & 0x55555555 == 00

                then the pre-optimization version

                (11 & 0x55555555) + 01 == 10
                (10 & 0x55555555) + 01 == 01
                (01 & 0x55555555) + 00 == 01
                (00 & 0x55555555) + 00 == 00

                and now the post-optimization version

                11 - 01 == 10
                10 - 01 == 01
                01 - 00 == 01
                00 - 00 == 00
                

In truth, there is a simple arithmetic identity behind it: a 2-bit field holding bits a and b has the value 2a + b, while its population count is a + b, which is exactly (2a + b) - a. Subtracting each pair's top bit from the pair itself therefore yields the count directly, saving one mask-and-add. Let's move on.



An interesting property you may have noticed about this function is that each n-bit partial sum always lies in the range [0, n]. We start with 1-bit numbers in the range [0, 1], and end with a 32-bit number in the range [0, 32].

And this is fine, of course, I'm not super concerned about data compression within an i32. But this does mean some very interesting things for our optimizations.

Notably, it brings us from here:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
                    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
                    x = (x & 0x0000FFFF) + ((x >> 16) & 0x0000FFFF);
                    return x;
                }
                

To here:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x + (x >> 4)) & 0x0F0F0F0F;
                    x = (x + (x >> 8)) & 0x00FF00FF;
                    x = (x + (x >> 16)) & 0x0000FFFF;
                    return x;
                }
                

This trick works because of all the free space we have by the time we start using 4 whole bits to encode values in the range [0, 4].

Effectively, the reason is that we now have a free bit. When dealing with 2-bit values in the range [0, 2], a value can still occupy the full 2 bits (0b10 == 2). That is no longer so once we reach 4-bit fields: a value in the range [0, 4] never needs all 4 bits, and even the sum of two such values, at most 8, still fits within 4 bits. This headroom solves the issue of neighboring sums carrying into each other, meaning we can now add the arrays before we apply the bit-mask.

That brings us here:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x + (x >> 4)) & 0x0F0F0F0F;
                    x = (x + (x >> 8)) & 0x00FF00FF;
                    x = (x + (x >> 16)) & 0x0000FFFF;
                    return x;
                }
                

But that's still different from the final function:

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x + (x >> 4)) & 0x0F0F0F0F;
                    x = (x * 0x01010101) >> 24;
                    return x;
                }
                

So how do we get there? It's actually quite simple.

What we do in these last two folds is take a 4-number-long array of 8-bit numbers, add it together into a 2-number-long array of 16-bit numbers, and then add that together into a 32-bit number.

But it's important to remember that our actual goal is simply to reduce that 4-number-long array of 8-bit numbers into a single 32-bit number. Put into more ordinary terms, it is to take four bytes and add them together.

Now that this has been made clear, the optimization becomes far easier to understand.

The 4-number-long array of 8-bit numbers that we start with will be written as [a, b, c, d].

The un-optimized solution is to fold twice. This transforms the array as follows:

                    [a, b, c, d] + (x >> 8) =
                    [a, a + b, b + c, c + d]

                    [a, a + b, b + c, c + d] & 0x00FF00FF =
                    [0, a + b, 0, c + d]

                    [0, a + b, 0, c + d] + (x >> 16) =
                    [0, a + b, 0, a + b + c + d]

                    [0, a + b, 0, a + b + c + d] & 0x0000FFFF =
                    [0, 0, 0, a + b + c + d]
                

And here's the optimized version.

                    [a, b, c, d] * 0x01010101 =

                      [a, b, c, d] << 24
                    + [a, b, c, d] << 16
                    + [a, b, c, d] << 8
                    + [a, b, c, d]
                    =
                     (overflow, discarded) | (kept 32 bits)
                    [a, a + b, a + b + c  | a + b + c + d, b + c + d, c + d, d]

                    [a + b + c + d, b + c + d, c + d, d] >> 24 =
                    [0, 0, 0, a + b + c + d]
                

We thereby reach the same result using far fewer instructions. And it is thus that we arrive at the final optimized function.

                extern inline uint32_t i32_popcount(uint32_t x) {
                    x = x - ((x >> 1) & 0x55555555);
                    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
                    x = (x + (x >> 4)) & 0x0F0F0F0F;
                    x = (x * 0x01010101) >> 24;
                    return x;
                }