Compile-time Method Overloading Using Template Specialization

Over the past couple of years I have found myself writing a number of file decoders for use in my own personal codebase (PNG, TTF, BMP, WAV, OGG, and most recently FLAC). Decoding compressed file formats in particular involves reading an arbitrary number of bits, which calls for a bit buffer. Over the course of implementing the aforementioned decoders I tended to write a new bit buffer implementation each time, or copy a previous one over, but when it came time to write a FLAC decoder I wanted to clean these up into a single reusable bit buffer implementation.

The issue I ended up having to deal with was how to handle little-endian vs. big-endian bit buffers. Performance was of primary importance, because decoding large image or audio files means reading literally millions of bytes of data, so retrieving bits from the stream has to be as fast as possible. For this reason I immediately ruled out options such as inheritance and virtual functions. I figured I would probably end up writing basic functions for each byte ordering that operate on a bit buffer parameter, e.g.:

int get_bits_le(BitBuffer *bb, int bits);
int get_bits_be(BitBuffer *bb, int bits);

However, I was curious to investigate some metaprogramming options. I admit I like metaprogramming, but I am not really a big fan of C++ templates. I find anything but restrained use of them leads to unnecessarily obtuse code that is hard to read, hard to follow, and a headache to debug, not to mention the subtle side effects such as code bloat and portability issues that have to be kept in mind. Being a fan of metaprogramming, though, I can tolerate some amount of template usage.

My goal was to be able to declare a bit buffer with a given byte order, and then have the appropriate methods for that byte order chosen at compile time without incurring a performance penalty, i.e.:

fsBitBuffer<fsByteOrder::littleEndian> bitBuffer;
...
int value = bitBuffer.get_bits(4);

This is what I ended up with:

enum struct fsByteOrder {
    littleEndian,
    bigEndian
};

template <fsByteOrder>
struct fsBitBuffer {
    uint8_t *stream;
    uint64_t buffer;
    int32_t bitCount;

    uint64_t get_bits(int32_t bits);
    void refill(int32_t bits);
    ...
};

template<> void
fsBitBuffer<fsByteOrder::bigEndian>::refill(int32_t bits) {
    while (bitCount < bits) {
        buffer <<= 8;
        buffer |= *stream++;
        bitCount += 8;
    }
}

template<> void
fsBitBuffer<fsByteOrder::littleEndian>::refill(int32_t bits) {
    while (bitCount < bits) {
        uint64_t byte = *stream++;
        buffer |= (byte << bitCount);
        bitCount += 8;
    }
}

template<> uint64_t
fsBitBuffer<fsByteOrder::bigEndian>::get_bits(int32_t bits) {
    refill(bits);
    bitCount -= bits;
    uint64_t value = buffer >> bitCount;
    value &= (1ULL << bits) - 1;
    
    return value;
}

template<> uint64_t
fsBitBuffer<fsByteOrder::littleEndian>::get_bits(int32_t bits) {
    refill(bits);
    uint64_t value = buffer;
    value &= (1ULL << bits) - 1;
    bitCount -= bits;
    buffer >>= bits;

    return value;
}
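
Usage then looks just like the goal above. Here is a minimal sketch of driving the little-endian buffer; the byte values and the manual field setup are purely illustrative, and the members elided from the struct above may need their own initialization in practice:

#include <cstdint>

// Minimal usage sketch: point the buffer at a byte stream and pull bits out.
// The byte values here are arbitrary, and the elided members of fsBitBuffer
// are assumed not to need any extra setup.
int main() {
    uint8_t data[] = { 0x8F, 0x2C, 0xA1, 0x40 };

    fsBitBuffer<fsByteOrder::littleEndian> bitBuffer = {};
    bitBuffer.stream   = data;  // encoded byte stream to read from
    bitBuffer.buffer   = 0;     // no bits cached yet
    bitBuffer.bitCount = 0;

    uint64_t a = bitBuffer.get_bits(4);  // low 4 bits of 0x8F == 0xF
    uint64_t b = bitBuffer.get_bits(1);  // next bit up

    return (int)(a + b);
}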

The next step was to make sure this approach was competitive with other methods. The methods I compared against were a simple branching implementation, function overloading, and individual functions for each byte ordering. Here are brief examples of each for the sake of clarity:

// Branching
uint64_t get_bits(int32_t bits) {
    refill(bits);
    
    switch (byteOrder) {
        case fsByteOrder::bigEndian:
        {
            bitCount -= bits;
            uint64_t value = buffer >> bitCount;
            value &= (1ULL << bits) - 1;
            
            return value;
        } break;
        
        case fsByteOrder::littleEndian:
        {
            uint64_t value = buffer;
            value &= (1ULL << bits) - 1;
            bitCount -= bits;
            buffer >>= bits;
            
            return value;
        } break;
    }

    return 0;  // unreachable for a valid byteOrder
}

// Function overloading
uint64_t
fs_get_bits(fsBitBuffer<fsByteOrder::bigEndian> *buffer, int32_t bits) {
    fs_refill(buffer, bits);
    buffer->bitCount -= bits;
    uint64_t value = buffer->buffer >> buffer->bitCount;
    value &= (1ULL << bits) - 1; 

    return value; 
} 

// Individual functions 
uint64_t 
fs_get_bits_be(fsBitBuffer *buffer, int32_t bits) { 
    fs_refill_be(buffer, bits); 
    buffer->bitCount -= bits;
    uint64_t value = buffer->buffer >> buffer->bitCount;
    value &= (1ULL << bits) - 1;
    
    return value;
}

To measure the performance of each of these approaches, I ran all four through three different scenarios of reading arbitrary bits and bytes (each iteration read 1 MB of data). Here are the results:
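
The exact harness isn't important, but to give a sense of its shape, here is a hedged sketch of how test case C might be driven; the function name and structure are assumptions rather than the actual benchmark code:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical harness in the spirit of test case C: read a (4, 1, 2, 1) bit
// pattern until the input is consumed, timing the whole pass. The name and
// structure are assumed; the real benchmark code is not shown in this post.
static double time_mixed_reads(uint8_t *data, size_t size) {
    fsBitBuffer<fsByteOrder::bigEndian> bb = {};
    bb.stream = data;

    auto start = std::chrono::high_resolution_clock::now();

    uint64_t sink = 0;
    for (size_t bitsRead = 0; bitsRead + 8 <= size * 8; bitsRead += 8) {
        sink ^= bb.get_bits(4);
        sink ^= bb.get_bits(1);
        sink ^= bb.get_bits(2);
        sink ^= bb.get_bits(1);
    }

    auto end = std::chrono::high_resolution_clock::now();

    // Use the accumulated value so the compiler cannot optimize the reads away.
    std::printf("checksum: %llu\n", (unsigned long long)sink);

    return std::chrono::duration<double, std::milli>(end - start).count();
}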

[Chart: Test case A, reading single bytes]

[Chart: Test case A, excluding the branching method]

It should come as no surprise that the branching method clearly loses out to the other three, which are quite close together.

[Chart: Test case B, reading uint16_t values]

Here we see a similar pattern with the three non-branching implementations being very close in execution time.

[Chart: Test case C, reading an arbitrary mixture of bits (4, 1, 2, 1)]

[Chart: Test case C, excluding the branching method]

While the three non-branching methods remain close in execution time in this last test case, the template specialization method edges out the other two here after the initial spike. I've run these test cases several more times and observed similar results, so I'm rather pleased to see that a fairly simple, straightforward use of template specialization is a worthy approach for compile-time method overloading when performance is of primary concern.

For the sake of completeness, the compiler used was Clang with Apple’s LLVM 9.0 backend (Xcode 9.4.1) with -Os optimization.
