Entirely over-optimizing bit-doubling

Two years ago, I came across a problem at work that I thought might become a bottleneck. I got interested in optimizing it, and even though it turned out not to be necessary in that instance, it was still a very good learning opportunity. The problem itself is incredibly simple, yet a lot of interesting algorithms and lessons stem from it.

The problem is that of duplicating each bit in a slice of bytes into one of double the size.
So, for example, the slice
[0b00000001, 0b00000010] would become
[0b00000000, 0b00000011, 0b00000000, 0b00001100].

A naive implementation takes every bit out of the source, shifts it by the correct amount, and then places it in the output:

pub fn double_array_sisd(source: &Vec<u8>) -> Vec<u8> {
    let size = source.len();
    let mut output = vec![0; size * 2];

    for i in 0..size {
        for j in 0..8 {
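            // source bit j lands in output byte `byte` (bits 0-3 go to the second
            // output byte, bits 4-7 to the first) at positions 2j and 2j+1 (mod 8)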
            let byte = 1 - (j / 4);
            let bit = (source[i] >> j) & 1;
            output[i * 2 + byte] |= bit << ((j * 2) % 8);
            output[i * 2 + byte] |= bit << ((j * 2 + 1) % 8);
        }
    }
    output
}

The code for this and all other implementations in this post can be found in this repository.

On a 9950X with 10MB of input data, this function takes about 10ms and therefore gets a throughput of ~2.75 GiB/s. For this blog post, I define throughput as input plus output data (30MB total in this case). And we'll be looking at optimizing for that system: a 9950X running Ubuntu 24.04.3 LTS with Rust 1.91.0-nightly. The memory speed and such doesn't matter here, because all tests fit in the L3 cache.
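(As a quick sanity check of those numbers: 10 MB in plus 20 MB out over roughly 10 ms is 3 GB/s, which is about 2.8 GiB/s.)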

Now, you may think that copying over a single bit at a time is obviously the biggest problem here, but first, let's look at the assembly to make sure the compiler isn't already doing something clever.

Do note that this post contains a lot of code.

wayyyy too much assembly for how simple the code is
double_array_sisd:
        push    rbp
        push    rbx
        sub     rsp, 72
        lea     rax, [rsi + rsi]
        mov     qword ptr [rsp + 8], rax
        mov     qword ptr [rsp + 16], rcx
        cmp     rax, rcx
        jne     .LBB0_6
        test    rsi, rsi
        je      .LBB0_5
        mov     eax, 1
.LBB0_3:
        cmp     rax, rcx
        jae     .LBB0_7
        movzx   r8d, byte ptr [rdi]
        mov     r11d, r8d
        and     r11b, 1
        lea     r10d, [r11 + r11]
        or      r10b, byte ptr [rdx + rax]
        mov     r9d, r8d
        shr     r9b
        mov     ebx, r9d
        and     bl, 1
        lea     ebp, [4*rbx]
        shl     bl, 3
        or      bl, bpl
        or      bl, r11b
        mov     r11d, r8d
        shr     r11b, 2
        and     r11b, 1
        mov     ebp, r11d
        shl     bpl, 4
        shl     r11b, 5
        or      r11b, bpl
        or      r11b, bl
        lea     ebx, [8*r8]
        and     bl, 64
        mov     ebp, r8d
        shl     bpl, 4
        and     bpl, -128
        or      bpl, bl
        or      bpl, r11b
        or      bpl, r10b
        mov     byte ptr [rdx + rax], bpl
        mov     r11d, r8d
        shr     r11b, 4
        and     r11b, 1
        lea     r10d, [r11 + r11]
        or      r10b, byte ptr [rdx + rax - 1]
        mov     ebx, r8d
        shr     bl, 5
        and     bl, 1
        lea     ebp, [4*rbx]
        shl     bl, 3
        or      bl, bpl
        or      bl, r11b
        mov     r11d, r8d
        shr     r11b, 6
        and     r11b, 1
        mov     ebp, r11d
        shl     bpl, 4
        shl     r11b, 5
        or      r11b, bpl
        or      r11b, bl
        and     r9b, 64
        and     r8b, -128
        or      r8b, r9b
        or      r8b, r11b
        or      r8b, r10b
        mov     byte ptr [rdx + rax - 1], r8b
        inc     rdi
        add     rax, 2
        dec     rsi
        jne     .LBB0_3
.LBB0_5:
        add     rsp, 72
        pop     rbx
        pop     rbp
        ret
.LBB0_6:
        [panic!]
.LBB0_7:
        [panic!]

Right, so, that's not very easy to read, but looking at it closely, and knowing that things like * and % get optimized into bitwise operations, you can see that this code does indeed follow the structure of the source and doubles every bit individually, one at a time. This tells us that manual optimization is almost certainly worth it, if we can handle multiple bits at once.

Rust config for max performance

The only change (other than compiling for release) is to tell the compiler to compile for this exact CPU with all its features.

For this, I simply add the following to the .cargo/config.toml:

[build]
rustflags = ["-C", "target-cpu=native"]

The obvious solution: A lookup table

pub fn double_array_lookup_u8(array: &[u8]) -> Vec<u8> {
    let lookup: [u16; 256] = (0..=255u8)
        .map(|x| {
            let mut res = 0u16;
            for j in 0..8 {
                let bit = (x as u16 >> j) & 1;
                res |= bit << (j * 2);
                res |= bit << (j * 2 + 1);
            }
            res
        })
        .collect::<Vec<u16>>()
        .try_into()
        .unwrap();

    array
        .iter()
        .flat_map(|&x| lookup[x as usize].to_be_bytes())
        .collect()
}

While this does incur a fixed setup cost per call, that could easily be avoided by calculating the table at compile time, or by using something like lazy_static to calculate it only once at runtime.
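
For example, the whole table can live in a constant built by a const fn. A minimal sketch (build_lookup and LOOKUP_U8 are just made-up names for this sketch, not something from the repository):

const fn build_lookup() -> [u16; 256] {
    let mut table = [0u16; 256];
    let mut x = 0;
    while x < 256 {
        let mut j = 0;
        while j < 8 {
            let bit = ((x >> j) & 1) as u16;
            table[x] |= bit << (j * 2);
            table[x] |= bit << (j * 2 + 1);
            j += 1;
        }
        x += 1;
    }
    table
}

// evaluated entirely at compile time, so the per-call setup cost disappears
const LOOKUP_U8: [u16; 256] = build_lookup();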

This is indeed a lot faster at 2ms, or 14.5 GiB/s, but I think we can do better. This lookup table is pretty large at 512 bytes (256*2), and thus fits easily into L1 (80 KB per core), but not into a cache line (64 bytes), so every byte needs an additional memory operation.

Well, then just make a smaller LUT

This one splits the byte in two, then looks up each half (4 bits) and returns those. It also has the benefit that the lookup is small enough to reasonably just be typed as a constant.

pub fn double_array_lookup_u4(array: &[u8]) -> Vec<u8> {
    let doubled_array: Vec<u8> = array
        .iter()
        .flat_map(|&x| {
            let high_nibble = (x >> 4) as usize;
            let low_nibble = (x & 0b0000_1111) as usize;

            #[rustfmt::skip]
            const LOOKUP: [u8; 16] = [
                0b00000000,
                0b00000011,
                0b00001100,
                0b00001111,
                0b00110000,
                0b00110011,
                0b00111100,
                0b00111111,
                0b11000000,
                0b11000011,
                0b11001100,
                0b11001111,
                0b11110000,
                0b11110011,
                0b11111100,
                0b11111111,
            ];

            [LOOKUP[high_nibble], LOOKUP[low_nibble]]
        })
        .collect();
    doubled_array
}

2.7ms and 11 GiB/s... Huh, that's actually slower. I guess doing half a byte per lookup isn't worth it: fitting the table into a single cache line doesn't buy anything when the bigger table already sat in L1, the fastest cache.

Squaring as bit-moving

When I presented this problem to two groups of friends independently (months apart), both noticed that if you take a number with just a single bit set and multiply it by itself, the bit moves to the position at twice its previous index. That happens to be exactly where we want each bit to end up (plus the position one above it, but that's a simple shift-by-1 and OR). The problem with using this is that if more than one bit is set, the bits might "collide", or interact with each other, in the multiplication. For example, 0b0000_0111 * 0b0000_0111 = 0b0011_0001 instead of the 0b0001_0101 we would like. So we need to mask out some bits and then combine multiple iterations of that.
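
To make that concrete, here is a tiny standalone check of both the squaring property and the collision (just an illustration, not code from the repository):

fn main() {
    // a single bit at index k moves to index 2*k when the number is squared
    assert_eq!((0b0000_0100u16).pow(2), 0b0001_0000); // bit 2 -> bit 4
    // with more than one bit set, the cross terms of the multiplication get in the way
    assert_eq!((0b0000_0111u16).pow(2), 0b0011_0001); // not the 0b0001_0101 we want
}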

Here is a fairly straightforward implementation of this by Laura. Notice that instead of the shift-by-1-and-OR, she used a multiplication by 3 to do the actual doubling (3*x = x + 2x, which is the same as x | (x << 1) as long as no two set bits are adjacent, and after the squaring step they never are), and that it's applied to each part individually instead of once at the end. Doing it per part doesn't actually change performance, so the compiler probably noticed that as well and put the operation in the optimal spot.

pub fn double_array_sisd_laura(array: &Vec<u8>) -> Vec<u8> {
    let size = array.len();
    let mut doubled_array = vec![0; size * 2];

    let n1 = 0b10001000u8;
    let n2 = 0b01000100u8;
    let n3 = 0b00100010u8;
    let n4 = 0b00010001u8;

    let m1 = 0b0100000001000000u16;
    let m2 = 0b0001000000010000u16;
    let m3 = 0b0000010000000100u16;
    let m4 = 0b0000000100000001u16;

    for i in 0..size {
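        // isolate every fourth bit (nX), square to move each bit to twice its index,
        // mask away the cross terms (mX), then *3 copies each bit into the position above it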
        let mut a = (((array[i] & n1) as u16).pow(2) & m1) * 3;
        a += (((array[i] & n2) as u16).pow(2) & m2) * 3;
        a += (((array[i] & n3) as u16).pow(2) & m3) * 3;
        a += (((array[i] & n4) as u16).pow(2) & m4) * 3;

        doubled_array[i * 2] += (a >> 8) as u8;
        doubled_array[i * 2 + 1] += a as u8;
    }
    doubled_array
}

This manages just under a millisecond in my test and so 30 GiB/s. Looking at the assembly output, the compiler auto-vectorized this code using the 9950X's AVX512 capabilities.

Two more multiplication-based approaches

The first works on 32-bit chunks, spreading each into 64 bits with six mask-and-square passes; the second doubles one byte at a time using a multiply-based bit-spreading trick.

pub fn double_array_benk(array: &[u8]) -> Vec<u8> {
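    // Each MASK1[i] selects a subset of the input bits spaced so that the cross terms
    // of the squaring can't collide or carry into a wanted position; MASK2[i] then
    // keeps only the bits that landed at twice their original index.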
    const L: usize = 6;
    const MASK1: [u64; L] = [290456853, 580913706, 1140936768, 2281873536, 262144, 524288];
    const MASK2: [u64; L] = [
        72357760713425169,
        289431042853700676,
        1157425108814401536,
        4629700435257606144,
        68719476736,
        274877906944,
    ];

    pub fn double(x: u32) -> u64 {
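        // spread the 32 input bits out over 64 bits (bit i -> bit 2*i), then
        // duplicate each spread bit into its neighbour with res | (res << 1)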
        let mut res = 0;
        for i in 0..L {
            let y = (x as u64) & MASK1[i];
            res |= (y * y) & MASK2[i];
        }

        res | (res << 1)
    }

    let mut doubled_array = vec![0; array.len() * 2];

    for i in (0..array.len()).step_by(4) {
        let num = u32::from_le_bytes([array[i], array[i + 1], array[i + 2], array[i + 3]]);
        let num = double(num);
        let num_array = num.to_le_bytes();
        // doubled_array[i * 2..i * 2 + 8].copy_from_slice(&num_array);
        doubled_array[i * 2] = num_array[1];
        doubled_array[i * 2 + 1] = num_array[0];
        doubled_array[i * 2 + 2] = num_array[3];
        doubled_array[i * 2 + 3] = num_array[2];
        doubled_array[i * 2 + 4] = num_array[5];
        doubled_array[i * 2 + 5] = num_array[4];
        doubled_array[i * 2 + 6] = num_array[7];
        doubled_array[i * 2 + 7] = num_array[6];
    }
    doubled_array
}

This one gets 3.5 GiB/s.

pub fn double_array_ben(array: &[u8]) -> Vec<u8> {
    fn double(x: u8) -> u16 {
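        // the classic "spread the bits of a byte with 64-bit multiplies" trick:
        // x * 0x0101... makes eight copies of the byte, the first mask keeps bit i of
        // copy i, and the second multiply gathers those bits so that, after the shift,
        // bit i of x sits at bit 2*i; `a` holds the even output bits, `b` the odd ones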
        let a = (((((x as u64) * 0x0101010101010101u64) & 0x8040201008040201u64)
            * 0x0102040810204081u64)
            >> 49)
            & 0x5555;
        let b = (((((x as u64) * 0x0101010101010101u64) & 0x8040201008040201u64)
            * 0x0102040810204081u64)
            >> 48)
            & 0xAAAA;
        (a | b) as u16
    }

    let mut doubled_array = vec![0; array.len() * 2];

    for i in 0..array.len() {
        let num = double(array[i]);
        doubled_array[i * 2 + 1] = (num & 0b0000_0000_1111_1111) as u8;
        doubled_array[i * 2] = ((num & 0b1111_1111_0000_0000) >> 8) as u8;
    }
    doubled_array
}

And this one gets 21.5 GiB/s.

Doing something clever (bitwise divide and conquer)

Manually unrolling the inner loop reveals a pattern of bit shifts and ORs. Using this, we can build every required shift between 0 and 7 bits out of the right combination of shifts by 1, 2, and 4. After those three steps, the original bits are spread out to their target positions, and the last remaining step is to duplicate each bit by doing num | num << 1;.
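
To make the masks concrete, here is one byte traced through the three steps (illustration only, not part of the implementation):

fn main() {
    let x: u16 = 0b1010_0001; // bits 7, 5 and 0 set; they should end up at bits 14, 10 and 0
    let s1 = x  & 0b0000_1111_0000_1111 | (x  & 0b1111_0000_1111_0000) << 4;
    let s2 = s1 & 0b0011_0011_0011_0011 | (s1 & 0b1100_1100_1100_1100) << 2;
    let s3 = s2 & 0b0101_0101_0101_0101 | (s2 & 0b1010_1010_1010_1010) << 1;
    assert_eq!(s1, 0b0000_1010_0000_0001); // high nibble moved up by 4
    assert_eq!(s2, 0b0010_0010_0000_0001); // bit pairs spread apart by 2
    assert_eq!(s3, 0b0100_0100_0000_0001); // every bit now at twice its original index
    assert_eq!(s3 | s3 << 1, 0b1100_1100_0000_0011); // and finally doubled
}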

pub fn double_array_sisd_opt_iter(array: &[u8]) -> Vec<u8> {
    array
        .iter()
        .flat_map(|&x| {
            let num = x as u16;
            let num = num & 0b0000_1111_0000_1111 | (num & 0b1111_0000_1111_0000) << 4;
            let num = num & 0b0011_0011_0011_0011 | (num & 0b1100_1100_1100_1100) << 2;
            let num = num & 0b0101_0101_0101_0101 | (num & 0b1010_1010_1010_1010) << 1;
            let num = num | num << 1;
            [
                ((num & 0b1111_1111_0000_0000) >> 8) as u8,
                (num & 0b0000_0000_1111_1111) as u8,
            ]
        })
        .collect()
}

420µs and 70 GiB/s, now that's what I'm talking about!

AVX512

I've mentioned above that the 9950X has AVX512 support, so let's make use of that.

In general, vectorization goes as follows: load a full vector register's worth of input, apply the same operations to all elements at once, and store the whole register back to the output; any unaligned prefix and leftover suffix of the input are handled with the scalar code.

Laura AVX

The first algorithm to get vectorized is the one by Laura, because it only uses bitwise operations and multiplication, so it is relatively simple to vectorize. Each operation in the original code has an equivalent AVX512 version.

There are, however, three interesting parts:

1. The constants have to be spread, or "broadcast", over the vector to be applicable to all data points at the same time.
2. We need to convert the 8-bit numbers to 16-bit inside the register, and so use _mm512_cvtepu8_epi16 to convert the 256-bit input register into the 512-bit one we'll use in the rest of the algorithm.
3. To store the data in the order defined by the trivial implementation, it needs to be rearranged before storing. This is done using two shifts and an or instruction.

#[cfg(all(
    any(target_arch = "x86_64"),
    target_feature = "avx512f",
))]
pub fn double_array_simd_laura(array: &Vec<u8>) -> Vec<u8> {
    let mut doubled_array = vec![0; array.len() * 2];

    let n1 = 0b10101000u8;
    let n2 = 0b01000010u8;
    let n3 = 0b00010101u8;

    let m1 = 0b0100010001000000u16;
    let m2 = 0b0001000000000100u16;
    let m3 = 0b0000000100010001u16;

    let (pre, array, post) = unsafe { array.align_to::<__m256i>() };

    for i in 0..pre.len() {
        let mut a = (((pre[i] & n1) as u16).pow(2) & m1) * 3;
        a += (((pre[i] & n2) as u16).pow(2) & m2) * 3;
        a += (((pre[i] & n3) as u16).pow(2) & m3) * 3;

        doubled_array[i * 2] += (a >> 8) as u8;
        doubled_array[i * 2 + 1] += a as u8;
    }

    unsafe {
        let avx_n1 = _mm512_set1_epi16(n1 as i16);
        let avx_n2 = _mm512_set1_epi16(n2 as i16);
        let avx_n3 = _mm512_set1_epi16(n3 as i16);

        let avx_m1 = _mm512_set1_epi16(m1 as i16);
        let avx_m2 = _mm512_set1_epi16(m2 as i16);
        let avx_m3 = _mm512_set1_epi16(m3 as i16);

        let avx_zero = _mm512_set1_epi16(0);

        for i in 0..array.len() {
            // convert the input to 16 bit
            let input = _mm512_cvtepu8_epi16(array[i]);

            // AND the input with n1
            let mut a = _mm512_and_si512(input, avx_n1);
            // multiply that with itself
            a = _mm512_mullo_epi16(a, a);
            // and AND that with m1
            a = _mm512_and_si512(a, avx_m1);

            let mut b = _mm512_and_si512(input, avx_n2);
            b = _mm512_mullo_epi16(b, b);
            b = _mm512_and_si512(b, avx_m2);

            let mut c = _mm512_and_si512(input, avx_n3);
            c = _mm512_mullo_epi16(c, c);
            c = _mm512_and_si512(c, avx_m3);

            // combine all three results
            let out = _mm512_or_si512(_mm512_or_si512(a, b), c);
            // then do the actual doubling
            let out = _mm512_or_si512(out, _mm512_shldi_epi64::<1>(out, avx_zero));

            let swapped = {
                // swap adjacent bytes within each 16-bit lane:
                // e.g. [b1 b0 b3 b2 ...] -> [b0 b1 b2 b3 ...]
                let left = _mm512_slli_epi16::<8>(out);
                let right = _mm512_srli_epi16::<8>(out);
                _mm512_or_si512(left, right)
            };

            // store the resulting 512 bit vector in the output array
            _mm512_storeu_si512(
                doubled_array.as_mut_ptr().add(i * 64 + pre.len() * 2) as *mut __m512i,
                swapped,
            );
        }
    }

    for i in 0..post.len() {
        let mut a = (((post[i] & n1) as u16).pow(2) & m1) * 3;
        a += (((post[i] & n2) as u16).pow(2) & m2) * 3;
        a += (((post[i] & n3) as u16).pow(2) & m3) * 3;

        doubled_array[(i + pre.len() + array.len() * 32) * 2] += (a >> 8) as u8;
        doubled_array[(i + pre.len() + array.len() * 32) * 2 + 1] += a as u8;
    }
    doubled_array
}

Huh, that only manages 35 GiB/s and takes 840µs. Looking at the throughput and latency of a vector multiply, this makes sense though: they are quite expensive, and we need three of them.

AVX512 128-bit LUT

We have discussed earlier how a smaller lookup table from u4 to u8 might be faster, because it fits in a single cache line. Well, that also means that it fits in a single AVX512 register! And there, no slow additional memory operations are needed at all.

#[cfg(all(
    any(target_arch = "x86_64"),
    target_feature = "avx512f",
))]
pub fn double_array_lookup_avx_u4(array: &[u8]) -> Vec<u8> {
    let mut doubled_array: Vec<u8> = Vec::with_capacity(array.len() * 2);

    #[rustfmt::skip]
    const LOOKUP: [u8; 16] = [
        0b00000000,
        0b00000011,
        0b00001100,
        0b00001111,
        0b00110000,
        0b00110011,
        0b00111100,
        0b00111111,
        0b11000000,
        0b11000011,
        0b11001100,
        0b11001111,
        0b11110000,
        0b11110011,
        0b11111100,
        0b11111111,
    ];

    let (pre, array, rest) = unsafe { array.align_to::<__m128i>() };

    for i in pre {
        doubled_array.push(LOOKUP[(i >> 4) as usize]);
        doubled_array.push(LOOKUP[(i & 0b1111) as usize]);
    }

    unsafe {
        // store LUT in a vector
        let lookup = _mm_load_si128(LOOKUP.as_ptr() as *const __m128i);

        let mut_ptr = doubled_array.as_mut_ptr();
        let pre_len_x2 = pre.len() * 2;
        for i in 0..array.len() {
            let input = *array.get_unchecked(i);
            // mask for low nibble
            let mask = _mm_set1_epi8(0x0f as i8);
            // isolate low nibble
            let input_lo = _mm_and_si128(input, mask);
            // get high nibble by shifting right 4 bits (use 16-bit shift to avoid cross-byte shifts), then mask
            let input_hi = _mm_and_si128(_mm_srli_epi16(input, 4), mask);
            // lookup doubled bytes for low/high nibbles using byte shuffle
            let output_lo = _mm_shuffle_epi8(lookup, input_lo);
            let output_hi = _mm_shuffle_epi8(lookup, input_hi);
            // interleave hi/lo bytes per input byte: [hi0, lo0, hi1, lo1, ...] into two 16-byte vectors
            let out0 = _mm_unpacklo_epi8(output_hi, output_lo);
            let out1 = _mm_unpackhi_epi8(output_hi, output_lo);
            // store the two 16-byte vectors (total 32 bytes)
            _mm_storeu_si128(mut_ptr.add(i * 32 + pre_len_x2) as *mut __m128i, out0);
            _mm_storeu_si128(mut_ptr.add(i * 32 + pre_len_x2 + 16) as *mut __m128i, out1);
        }
        doubled_array.set_len(array.len() * 32 + pre.len() * 2);
    }

    // deal with the rest of the array
    for i in rest {
        doubled_array.push(LOOKUP[(i >> 4) as usize]);
        doubled_array.push(LOOKUP[(i & 0b1111) as usize]);
    }

    doubled_array
}

85 GiB/s at 345µs!

Using only very simple and therefore fast vector instructions to work on 16 bytes at a time does actually give the best results so far.

Quick side-note: this same algorithm can (probably, I haven't actually checked) be implemented using AVX or ARM Neon, as they also have 128-bit instructions. If anyone is interested enough, feel free to try to implement a version for those and send it to me. I will add it to this post.

But what if we could do 64 bytes with the same kind of instructions?

AVX512 128-bit LUT with 512-bit input

That should be possible by just replacing the 128-bit instructions with 512-bit ones, right?

#[cfg(all(
    any(target_arch = "x86_64"),
    target_feature = "avx512f",
))]
pub fn double_array_lookup_avx512_u4(array: &[u8]) -> Vec<u8> {
    let mut doubled_array: Vec<u8> = Vec::with_capacity(array.len() * 2);

    let (pre, array, rest) = unsafe { array.align_to::<__m512i>() };

    // <The same lookup table as before>

    for i in pre {
        doubled_array.push(LOOKUP[(i >> 4) as usize]);
        doubled_array.push(LOOKUP[(i & 0b1111) as usize]);
    }

    unsafe {
        // load the LUT into a 128-bit vector, then broadcast it to all four 128-bit lanes
        let lookup = _mm_load_si128(LOOKUP.as_ptr() as *const __m128i);
        let lookup = _mm512_broadcast_i32x4(lookup);
        // mask for low nibble
        let mask = _mm512_set1_epi8(0x0f as i8);

        let mut_ptr = doubled_array.as_mut_ptr();
        let pre_len_x2 = pre.len() * 2;
        for i in 0..array.len() {
            let input = *array.get_unchecked(i);
            // isolate low nibble
            let input_lo = _mm512_and_si512(input, mask);
            // get high nibble by shifting right 4 bits (use 16-bit shift to avoid cross-byte shifts), then mask
            let input_hi = _mm512_and_si512(_mm512_srli_epi16(input, 4), mask);
            // lookup doubled bytes for low/high nibbles using byte shuffle
            let output_lo = _mm512_shuffle_epi8(lookup, input_lo);
            let output_hi = _mm512_shuffle_epi8(lookup, input_hi);
            // interleave hi/lo bytes per input byte: [hi0, lo0, hi1, lo1, ...] into two 64-byte vectors
            let out0 = _mm512_unpacklo_epi8(output_hi, output_lo);
            let out1 = _mm512_unpackhi_epi8(output_hi, output_lo);

            // store the two 64-byte vectors (total 128 bytes)
            _mm512_storeu_si512(mut_ptr.add(i * 128 + pre_len_x2) as *mut __m512i, out0);
            _mm512_storeu_si512(mut_ptr.add(i * 128 + pre_len_x2 + 64) as *mut __m512i, out1);
        }
        _mm_sfence();
        doubled_array.set_len(array.len() * 128 + pre.len() * 2);
    }

    // deal with the rest of the array
    for i in rest {
        doubled_array.push(LOOKUP[(i >> 4) as usize]);
        doubled_array.push(LOOKUP[(i & 0b1111) as usize]);
    }

    doubled_array
}

96 GiB/s (307µs). That's quite a bit faster still, and more than half the bandwidth of the L3 cache. Reducing the test size to fit in the L1 cache (10KiB of input) speeds up the test significantly, to over 250 GiB/s. It also speeds up the other tests, but not by much, suggesting this is the only implementation where cache bandwidth is a significant bottleneck (double_array_sisd_opt_iter, for example, only goes from 70 GiB/s to 79 GiB/s).

Unfortunately, the result is not quite right. The _mm512_unpacklo_epi8 intrinsic doesn't do what I expected: I thought it would unpack the low half of the 512-bit register, but instead it unpacks the low halves of each of the four 128-bit lanes. The output is therefore correct within each 128-bit lane, but the lanes end up in the wrong order. I've briefly looked into how to fix this but haven't found a solution yet. That search did, however, lead me to another discovery:

You may have wondered what the _mm_sfence is doing after the loop. It increases performance. Why? No clue. The fence instruction makes sure that all non-temporal stores are actually written to memory before continuing. I found it while looking for different store intrinsics that would let me put the lanes in the correct location. I didn't find any (other than scatter, but that's way too slow), but I did find _mm512_stream_si512, which is supposed to be faster than a regular store because it doesn't guarantee that anything is actually written any time soon, which is why you need a fence before reading the memory back. It turns out, however, that here the stream instructions are quite a bit slower than the stores, yet keeping the fence together with the store instructions is much faster, even though the fence should be a no-op when there are no non-temporal stores.

Conclusion

I've had a lot of fun writing, benchmarking, and, most importantly, sharing this problem with friends who contributed their own ideas, using unusual instructions and techniques to solve a very simple problem very quickly.

In the end, I was able to achieve speeds above the memory bandwidth of the system on a single core using a fairly simple algorithm and vectorization.

If this were a serious project rather than just a fun exploration, I'd use a more stable platform than a consumer 9950X at dynamic clock speeds and run more rigorous tests. I guess, if I am motivated enough, I can still configure that system to run at a stable clock speed and re-run the tests.