Tag Archives: C/C++

Custom Game Engine on iOS: The Platform Layer

Video games are what got me interested in programming. I suspect that is true for many programmers. And although I’m not programming games professionally in my career, I still find time to work on my own game as a hobby project. It’s truly rewarding and very enjoyable to have a fun side project to work on that challenges me in different ways than my job does. This particular game has been keeping me busy for the past few years (working on something so complicated part-time is quite time-consuming!), but it’s at a stage now that feels like an actual playable game. As such I’ve been thinking about platforms it would be well-suited for, and touch screens like tablets and phones are a great fit. All development up to this point has been on Mac and Windows, but I became more and more curious to get a sense of what it would feel like on a touch screen, so I finally decided to get the game running on iOS.

Now that I’ve completed the first pass of my iOS platform layer, I thought it would be interesting to detail what went in to making it work, and how a game engine might look on iOS. Of course there are many different ways this can be done; this is just how I did it, and what works for me and the game I am making. Hopefully this knowledge can be helpful and inspiring in some way.

This first part will cover setting up the larger platform layer architecture and what was needed in order to have my game talk to iOS and vice versa. Following parts will go more into detail on using Metal for graphics and Core Audio for, well.. audio. First, however, it’s important to get a view of how the game itself fits into the iOS layer.

The entire game architecture can be split into two main layers: the game layer and the platform layer. The iOS version adds a third intermediate layer as we will shortly see, but broadly speaking, this is how I see the overall structure. The vast bulk of the code exists in the game layer with a minimal amount of code existing in the platform layer that is required for the game to talk to the operating system and vice versa. This has made the initial port to iOS relatively easy and quick.

One of the first things any application needs on any OS is a window, and it’s no different for a game. On iOS we can create a window in the AppDelegate:

class AppDelegate: UIResponder, UIApplicationDelegate {
    var window: UIWindow?

    func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        let screen = UIScreen.main
        window = UIWindow(frame: screen.bounds)
        window?.rootViewController = ViewController()

        return true

This should look familiar to any iOS developer. We create a window with the same bounds as the main screen, set the root ViewController, and then basically show the window. In the ViewController we need to create the view itself, which is what the game will draw into, and is ultimately what is displayed to the user. To do this, we override loadView:

class ViewController: UIViewController {
    let metalDevice = MTLCreateSystemDefaultDevice()

    override func loadView() {
        var metalView: MetalView?
        if let metalDevice = metalDevice {
            metalView = MetalView(metalDevice: metalDevice)

        view = metalView ?? UIView(frame: UIScreen.main.bounds)
        view.backgroundColor = UIColor.clear

I use Metal to display the bitmap for each frame, so for that I set up a Metal-backed view and assign it to the ViewController‘s view property (in case that fails or the device doesn’t support Metal, a normal UIView can be set instead to at least prevent the game from crashing). I’ll be going into more detail on how the Metal view works in the next part of this series.

Just as a small little detail, I don’t want the status bar shown in my game, so I override a property on the ViewController to hide it:

override var prefersStatusBarHidden: Bool { true }

override func viewWillAppear(_ animated: Bool) {

With those core elements set up (a window and a view that will contain the game), it’s time to look into how to integrate the game layer into the platform layer.

The first order of business is dealing with the fact that the platform layer is written in Swift while the game layer is written entirely in C++. Interoperability between Swift and C is relatively painless, but Swift cannot interoperate with C++, so this means I need a C interface as a bridge between the game and the platform. This is what I meant by the extra intermediate layer I mentioned above being required for the iOS version. I didn’t want to convert any of my existing game code into a C interface, so instead I created a .h/.cpp file pair in  my iOS project where the .h file is a pure C interface and the .cpp file wraps the calls to the actual game as well as implementing some of the lower-level platform functionality like threads and file I/O.

Here is what part of the .h interface file looks like:

#ifdef __cplusplus
extern "C" {

enum TouchPhase {

#pragma mark - Platform
bool ios_platform_initialize(uint32_t threadCount);
void ios_platform_shutdown();

#pragma mark - Game
void ios_game_startup();
void ios_game_shutdown();
void ios_begin_frame();
void ios_end_frame();

#pragma mark - Input
void ios_input_reset();
void ios_input_add_touch(float x, float y, enum TouchPhase phase);

#ifdef __cplusplus

Here is a sample of the corresponding .cpp file:

#include <dispatch/dispatch.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <unistd.h>

#include "bullet_beat_ios.h"
#include "../../Source/bullet_beat.h"

static bbGameMemory gameMemory;
static bbPlatform platform;
static bbThreadPool threadPool;

ios_platform_initialize(uint32_t threadCount) {
    vm_address_t baseAddress = 0;
    size_t memorySize = fsMegabytes(256);
    kern_return_t result = vm_allocate((vm_map_t)mach_task_self(), &baseAddress, memorySize, VM_FLAGS_ANYWHERE);
    if (result == KERN_SUCCESS) {
        gameMemory.init((void *)baseAddress, memorySize);
    mach_timebase_info_data_t machTimebaseInfo;
    timebaseFreq = ((float)machTimebaseInfo.denom / (float)machTimebaseInfo.numer * 1.0e9f);
    // Additional setup
    if (ios_initialize_thread_pool(&threadPool, threadCount)) {
        platform.threadPool = &threadPool;
    } else {
        return false;
    return gameMemory.is_valid();

ios_platform_shutdown() {

ios_game_startup() {

ios_game_shutdown() {

In addition to including some system header files and the corresponding .h file for the C interface, I also include the main header file for the game so I can call directly into the game from this .cpp file. This file also statically defines some game layer structures that the platform layer needs access to, made possible by including the game’s main header file. The ios_platform_initialize function allocates memory, initializes the worker thread pool, as well as some additional work that we will see later. The ios_game_startup and ios_game_shutdown functions just forward their calls to the actual game. In order to expose the C interface to Swift, we make use of a module map. The Clang documentation defines module maps as “the crucial link between modules and headers,” going on to say that they describe how headers map onto the logical structure of a module. C/C++ code is treated as a module by Swift, and a module map describes the mapping of the interface into that module. The file itself is actually very simple in my case:

module Game {
    header "bullet_beat_ios.h"
    export *

That’s it! The file is required to have a .modulemap suffix, and it is placed in the same directory as the .h interface file (the header declaration specifies the relative path to the location of the module map file). Xcode also needs to know about this file, so the directory were it is located is specified in the “Import Paths” build setting under “Swift Compiler – Search Paths”. I can now import the “Game” module into any Swift file, which allows me to call any C function exposed by the .h file in that Swift file.

For instance, the AppDelegate‘s launch method can now be expanded to include platform initialization:

import UIKit
import Game

func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
    let processInfo = ProcessInfo.processInfo
    let threadCount = max(processInfo.processorCount - 2, 2)
    if ios_platform_initialize(UInt32(threadCount)) {
        if let exeUrl = Bundle.main.executableURL {
            let appDir = exeUrl.deletingLastPathComponent()
            let fileManager = FileManager.default
            if fileManager.changeCurrentDirectoryPath(appDir.path) {
                print("[BulletBeat] Working directory: ", fileManager.currentDirectoryPath)
    // Additional initialization
    let screen = UIScreen.main
    window = UIWindow(frame: screen.bounds)
    window?.rootViewController = ViewController()
    return true

The shutdown work goes into the applicationWillTerminate callback:

func applicationWillTerminate(_ application: UIApplication) {
    // Additional shutdown work

With this, the platform can talk to the game, but what about the game talking to the platform? The game needs several services from the platform layer, including getting the path to save game location, disabling/enabling the idle timer (iOS only), hiding/showing the cursor (desktops only), file I/O, etc. A good way of doing this is through function pointers that are assigned by the platform layer. As we saw in the .cpp file above, it defines a static bbPlatform struct, and it contains a bunch of function pointers that the game requires in order to talk to the platform. These are assigned during platform initialization.

For example, here are some of the services declared in the bbPlatform struct that need to be assigned functions:

typedef FILE* bbFileHandle;
typedef HANDLE bbFileHandle;

typedef void(*platform_call_f)(void);
typedef const char*(*platform_get_path_f)(void);
typedef bbFileHandle(*platform_open_file_f)(const char*, const char*);
typedef bool(*platform_close_file_f)(bbFileHandle);

struct bbPlatform {
    bbThreadPool *threadPool;
    platform_call_f hide_cursor;
    platform_call_f show_cursor;
    platform_call_f disable_sleep;
    platform_call_f enable_sleep;
    platform_get_path_f get_save_path;
    platform_open_file_f open_file;
    platform_close_file_f close_file;

Some of these are not applicable to iOS (like hide/show cursor), so they will just be assigned stubs. The open/close file functions can be declared directly in the .cpp file:

static FILE*
ios_open_file(const char *path, const char *mode) {
    FILE *f = fopen(path, mode);
    return f;

static bool
ios_close_file(FILE *file) {
    int result = fclose(file);
    return (result == 0);

They are then assigned in the ios_platform_initialize function seen above:

platform.open_file = ios_open_file;
platform.close_file = ios_close_file;

Others like those dealing with the idle timer and getting the save game directory path need to be defined in the Swift platform layer (in my case in the AppDelegate file):

fileprivate func ios_disable_sleep() {
    let app = UIApplication.shared
    app.isIdleTimerDisabled = true

fileprivate func ios_enable_sleep() {
    let app = UIApplication.shared
    app.isIdleTimerDisabled = false

fileprivate let saveUrl: URL? = {
    let fileManager = FileManager.default
    do {
        var url = try fileManager.url(for: .applicationSupportDirectory, in: .userDomainMask, appropriateFor: nil, create: true)
        url = url.appendingPathComponent("Bullet Beat", isDirectory: true)
        if !fileManager.fileExists(atPath: url.path) {
            try fileManager.createDirectory(at: url, withIntermediateDirectories: true, attributes: nil)
        return url
    } catch {
        return nil

fileprivate func ios_get_save_path() -> UnsafePointer<Int8>? {
    return saveUrl?.path.utf8CString.withUnsafeBytes({ ptr -> UnsafePointer<Int8>? in
        return ptr.bindMemory(to: Int8.self).baseAddress

They are assigned just after the call to initialize the platform by calling C functions through the bridging interface:

if ios_platform_initialize(UInt32(threadCount)) {
    // Other initialization work

The actual assignment to the bbPlatform struct is handled in the .cpp file:

void ios_platform_set_get_save_path_function(const char*(*func)(void)) { 
    platform.get_save_path = func;

void ios_platform_set_disable_sleep_function(void(*func)(void)) { 
    platform.disable_sleep = func;

void ios_platform_set_enable_sleep_function(void(*func)(void)) { 
    platform.enable_sleep = func;

Once assigned, these function pointers are simply called within the game layer like this:

bbPlatform *platform = ...
const char *savePath = platform->get_save_path();
bbFileHandle *file = platform->open_file("/path/to/file", "r");

Before ending this part, there are two more critical pieces that the platform layer needs to do: input and the game loop. Touch gestures need to be recorded and mapped to the game’s input struct. And lastly, the platform layer needs to set up the game loop — synchronized to 60fps — that will call the game’s update function to simulate one frame of the game.

Input is (so far) very straightforward in my case; I just need a touch down and touch up event. I look for these by overriding the touches methods of the ViewController:

override func touchesBegan(_ touches: Set<UITouch>, with event: UIEvent?) {
    for touch in touches {
        let point = touch.location(in: view)
        ios_input_add_touch(Float(point.x), Float(point.y), BEGIN)

override func touchesMoved(_ touches: Set<UITouch>, with event: UIEvent?) {
    for touch in touches {
        let point = touch.location(in: view)
        ios_input_add_touch(Float(point.x), Float(point.y), MOVE)

override func touchesEnded(_ touches: Set<UITouch>, with event: UIEvent?) {
    for touch in touches {
        let point = touch.location(in: view)
        ios_input_add_touch(Float(point.x), Float(point.y), END)

override func touchesCancelled(_ touches: Set<UITouch>, with event: UIEvent?) {

These methods just call a function in the bridging platform layer to add touch events as they come in. The BEGIN/MOVE/END enums here are declared in the .h file. The touches are placed into an input buffer which is reset at the end of each frame:

struct TouchInput {
    struct Touch {
        float x, y;
        TouchPhase phase;
    Touch touches[4];
    fsi32 index;

static TouchInput touchInput;

ios_input_reset() {
    touchInput.index = -1;

ios_input_add_touch(float x, float y, enum TouchPhase phase) {
    if (++touchInput.index < ARRAY_COUNT(touchInput.touches)) { TouchInput::Touch *touch = &touchInput.touches[touchInput.index]; touch->x = x;
        touch->y = y;
        touch->phase = phase;

For now, I just grab the first touch recorded and map it to the game’s input struct. For the time, I don’t need anything more complicated, so I’m just keeping it as simple as possible for now. If, or when, I need to expand it, I can make use of the input buffer to keep touches around for some number of frames and then determine different gestures that way.

switch (touch->phase) {
    case BEGIN:
        input.leftMouseWentDown = true;

    case MOVE:

    case END:
        input.leftMouseWentUp = true;


input.mouse.x = touch->x;
input.mouse.y = touch->y;

And finally, the game loop. For this I use a CADisplayLink from the CoreAnimation framework that lets me specify a callback that is called at the rate of the screen’s refresh rate — 60Hz, or 60fps.

class ViewController: UIViewController {
    weak var displayLink: CADisplayLink?

    override func viewDidLoad() {
        // Other initialization work

    func startGame() {
        displayLink = UIScreen.main.displayLink(withTarget: self, selector: #selector(getFrame(_:)))
        displayLink?.add(to: RunLoop.current, forMode: .default)
    func stopGame() {
    @objc private func getFrame(_ sender: CADisplayLink) {
        let dt = Float(sender.targetTimestamp - sender.timestamp)
        // Simulate one game frame and then render it to the view

The startGame and stopGame methods of the ViewController are also called in the AppDelegate as part of the lifecycle of the app. When the user sends it to the background, stop the game, and when it’s coming back to the foreground (i.e. the game still exists in memory and is not being freshly launched), start it up again.

func applicationWillEnterForeground(_ application: UIApplication) {    
    if let vc = window?.rootViewController as? ViewController {

func applicationDidEnterBackground(_ application: UIApplication) {
    if let vc = window?.rootViewController as? ViewController {

Finally, here is a visual representation of the overall structure:

Structural layers

Structural layers of the overall architecture of porting my custom game engine to iOS

That’s it for the overall architecture of how I ported my game to iOS. In the next part I will be discussing graphics and how the game renders each frame to the screen. Following that I will cover how the game sends audio to the platform for output through the speakers.


Compile-time Method Overloading Using Template Specialization

Over the past couple of years I have found myself writing a number of file decoders for use in my own personal codebase (PNG, TTF, BMP, WAV, OGG, and most recently FLAC). Decoding compressed file formats in particular will involve reading an arbitrary number of bits, which necessitates the need for a bit buffer. Over the course of implementing the aforementioned decoders, I tended to just write a new bit buffer implementation each time or copy a previous one over, but when it came time to write a FLAC decoder I wanted to clean these up and write one reusable bit buffer implementation.

The issue that I ended up having to deal with was how to handle little-endian vs. big-endian bit buffers. Of primary importance was the performance of the bit buffers, because reading large image or audio files leads to reading literally millions of bytes of data, so retrieving bits from the data stream has to be as fast as possible. For this reason I immediately discarded options such as inheritance & virtual functions from being considered. I figured I would probably end up writing basic functions for each byte ordering that would operate on a bit buffer parameter. e.g.:

int get_bits_le(BitBuffer *bb, int bits);
int get_bits_be(BitBuffer *bb, int bits);

However, I was curious to investigate some meta programming options. I admit, I like meta programming, but I am not really a big fan of C++ templates. I find anything but a restrained use of them leads to unnecessarily obtuse code that is hard to read, hard to follow, and a headache to debug. Not to mention having to keep in mind subtle side effects such as code bloat and portability issues. Being a fan of meta programming though, I can tolerate some amount of template usage.

My goal was to be able to declare a bit buffer with a given byte order, and then having the appropriate methods for that byte order be chosen at compile time without incurring a performance penalty. i.e.:

fsBitBuffer<fsByteOrder:littleEndian> bitBuffer;
int value = bitBuffer.get_bits(4);

This is what I ended up with:

enum struct fsByteOrder {

template <fsByteOrder>
struct fsBitBuffer {
    uint8_t *stream;
    uint64_t buffer;
    int32_t bitCount;

    uint64_t get_bits(int32_t bits);
    void refill(int32_t bits);

template<> void
fsBitBuffer<fsByteOrder::bigEndian>::refill(int32_t bits) {
    while (bitCount < bits) {
        buffer <<= 8;
        buffer |= *stream++;
        bitCount += 8;

template<> void
fsBitBuffer<fsByteOrder::littleEndian>::refill(int32_t bits) {
    while (bitCount < bits) {
        uint64_t byte = *stream++;
        buffer |= (byte << bitCount);
        bitCount += 8;

template<> uint64_t
fsBitBuffer<fsByteOrder::bigEndian>::get_bits(int32_t bits) {
    bitCount -= bits;
    uint64_t value = buffer >> bitCount;
    value &= (1ULL << bits) - 1;
    return value;

template<> uint64_t
fsBitBuffer<fsByteOrder::littleEndian>::get_bits(int32 bits) {
    uint64_t value = buffer;
    value &= (1ULL << bits) - 1;
    bitCount -= bits;
    buffer >>= bits;

    return value;

The next step was to make sure this approach was comparatively fast with other methods. Other methods I compared against included a simple branching implementation, function overloading, and individual functions for each byte ordering. Here are brief examples of each for the sake of clarity:

// Branching
uint64_t get_bits(int32_t bits) {
    switch (byteOrder) {
        case fsByteOrder::bigEndian:
            bitCount -= bits;
            uint64_t value = buffer >> bitCount;
            value &= (1ULL << bits) - 1;
            return value;
        } break;
        case fsByteOrder::littleEndian:
            uint64_t value = buffer;
            value &= (1ULL << bits) - 1;
            bitCount -= bits;
            buffer >>= bits;
            return value;
        } break;

// Function overloading
fs_get_bits(fsBitBuffer<fsByteOrder::bigEndian> *buffer, int32_t bits) {
    fs_refill(buffer, bits);
    buffer->bitCount -= bits;
    uint64_t value = buffer->buffer >> buffer->bitCount;
    value &= (1ULL << bits) - 1; 

    return value; 

// Individual functions 
fs_get_bits_be(fsBitBuffer *buffer, int32_t bits) { 
    fs_refill_be(buffer, bits); 
    buffer->bitCount -= bits;
    uint64_t value = buffer->buffer >> buffer->bitCount;
    value &= (1ULL << bits) - 1;
    return value;

To measure the performance time of each of these test cases, I ran all four through three different scenarios of reading arbitrary bits & bytes (each iteration read 1MB of data). Here are the results:

Test case A. Reading single bytes.

Test case A (without branching method)

It should come as no surprise that the branching method clearly loses out to the other three, which are quite close together.

Test case B. Reading uint16_t’s.

Here we see a similar pattern with the three non-branching implementations being very close in execution time.

Test case C. Reading an arbitrary mixture of bits (4, 1, 2, 1).

Test case C (without branching method)

While the three non-branching methods remain close in execution time in this last test case, the template specialization method edges out the other two here after the initial spike. I’ve run these test cases several other times and observed similar findings, so I’m rather pleased to see a fairly simple, straightforward use of template specialization is a worthy approach for compile-time method overloading when performance is of primary concern.

For the sake of completeness, the compiler used was Clang with Apple’s LLVM 9.0 backend (Xcode 9.4.1) with -Os optimization.

DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding.  Among its many features, it supports 2D and 3D graphics through OpenGL.  As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements.  We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days.  To illustrate, let’s take an example of a magic spell; both its visual and aural element will have randomness, but will these sync up together?  In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic?  If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element is accurately reflective of the aural part.  It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this?  Analysing a sound in the time domain (represented by the wave form) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum.  The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides.  Instead of displaying amplitude (y axis) vs. time (x axis) in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis).  This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but its due to the FFT (Fast Fourier Transform) and dramatically increased performance with floating point operations in modern computers that using the DFT for analysis in real time has become practical.  The DFT is calculated using convolution, which we know is very slow.  The FFT uses a highly optimized algorithm (the Cooley-Turkey is the most widely used, but there are several others) to calculate the DFT of a signal.  Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.

The DFT works by breaking a signal down into its individual components.  This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal.  Each sine wave represents a single frequency component; its amplitude is the strength of that frequency.  The DFT results in a complex signal with both cosine and sine wave components.  This information isn’t terribly useful to us in this form.  From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form.  In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value).  The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( x2 + y2 )

This value, plotted against the frequency (which would be the index of the complex number corresponding to the index of the real-valued array) gives us the spectrum of the signal.  In other words, if we take the FFT of an N-point signal, we get N pairs of complex values as output.  Each N in the complex signal represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N being the sampling rate.  The corresponding magnitude of the complex number gives us the amplitude of that frequency component in the signal.  A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2).  Here is a sample plot of a piano chord I recorded, using 5 512-sample buffers from the audio:

Frequency plot of 5 512-sample buffers of a piano chord.

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (assuming a sampling rate of 48kHz, this would equate to approximately 0 – 1920 Hz).  Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.

There are a number of parameters in generating particles that I considered.  The number of particles, how they’re generated/constructed, the radius, and their color.  Particles are only generated during transients, and so I check a previous RMS value against the new one to determine if particles should be constructed.  This method isn’t foolproof, but with my tests it’s been working quite well for my purposes.  The number of particles generated is also related to the RMS of the signal — the louder it is, the more particles are made.  This is evenly divided across the whole spectrum, so that at least a few particles from each frequency range is present.  The radius of each particle is determined by the frequency component (the N value of the frequency plot above).  The lower the frequency, the larger the particle.  Finally, the color, or more specifically the brightness of the particle, is determined by the frequency amplitude.  Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using Portaudio for real-time playback, so all the audio processing happens in a callback function):

int audioCallback (const void* input, void* output, unsigned long samples, const PaStreamCallbackTimeInfo *timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
    const float* in = (const float*)input;
    float* out = (float*)output;
    PaAudioData* data = (PaAudioData*)userData;

    memset(data->buffer, 0, sizeof(double) * SAMPLE_BLOCK * 2);
    float rms = CFDsp::cf_SSE_calcRms(in, samples);

    for (int i = 0; i < samples; ++i) {
        data->buffer[i*2] = *in * hammingWindow(i, SAMPLE_BLOCK);

        *out++ = *in++;
        *out++ = *in++;

    // Calculate DFT using FFT.
    fft(data->buffer, SAMPLE_BLOCK);

    data->amp_prev = data->amp_now;
    data->amp_now = rms;

    return paContinue;

Since the FFT function I’m using requires interleaved data, it needs to be zeroed out first and then each real-valued sample stored in the even-numbered indices.  The RMS value is acquired using a function I wrote using SSE intrinsics for optimal speed.

float cf_SSE_calcRms (const float* inBuffer, const int length)
    assert(length % 4 == 0);

    __m128* mbuf = (__m128*)inBuffer;
    __m128 m1, m2;
    __m128 macc = _mm_set1_ps(0.f);
    __m128 _1oN = _mm_set1_ps(1.f / static_cast<float>(length));

    for (int i = 0; i < length; i+=4) {
        m1 = _mm_mul_ps(*mbuf, *mbuf);
        m2 = _mm_mul_ps(m1, _1oN);
        macc = _mm_add_ps(m2, macc);

    __declspec(align(16)) float result[4];
    _mm_store_ps(result, macc);

    return sqrtf(result[0] + result[1] + result[2] + result[3]);

Inside the for loop, a Hamming window is applied to the signal prior to sending it to the FFT to bound it along the edges, minimizing or eliminating frequency leakage in the spectrum.  In the application’s update method, I test for the presence of a transient in the current buffer.  If one exists, particles are added.

if (audioData.isTransient())
    Vec2i randPos;

    // 50 pixel border.
    randPos.x = Rand::randInt(50, getWindowWidth() - 50);
    randPos.y = Rand::randInt(50, getWindowHeight() - 50);
    mParticleController.addParticles(randPos, audioData);

The adding of particles is handled by the ParticleController class:

void ParticleController::addParticles (const Vec2i& loc, const PaAudioData& data)
    int stride = static_cast<int>((64.f - powf((64.f * data->amp_now), 1.32f)) + 0.5f;
    stride = (stride < 1 ? 1 : stride);

    // 512 == sample block size; represents the horizontal frequency axis in the spectral analysis.
    for (int i = 0; i < 512; i+=stride) {
        Vec2f randVec = Rand::randVec2f() * 6.5f;
        mParticles.push_back(Particle(loc + randVec, i/1024.f, data.mag(i) * 18.f));

First I calculate a stride value that controls the speed at which we execute the following for loop (this evenly divides the number of particles across the x axis).  As you can see, this is determined by the volume (calculated as RMS) of the current audio buffer.  When creating each individual particle, the i/1024 argument determines the radius — as I mentioned this is based on the frequency value along the x axis.  To constrain along reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )

The color of each particle is first determined using HSV.  The hue is randomly chosen, but the saturation and value (brightness) is based on the data.mag(i) * 18 argument — i.e. the amplitude of the frequency component.  This is then converted into an RGB value to set the color of the individual particle.

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and RMS to usable values in the particle engine.  That’s the reason for several of the “magic numbers” in the function above (1024 is not one of them; it’s double our sample length, which I said previously we only want from 0 – 0.5, and since the data stored is in interleaved format, the buffer is actually 1024 samples long).  Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further.  I feel that this is a good first step in this area, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):

DFT Spectral Particle Engine from Christian on Vimeo.

The Different “Sides” of Convolution

We all know how much convolution is used in DSP, and what a significant part it has in making many effects and analysis techniques a reality.  Often we hear about FFT, or fast convolution, but in actuality these algorithms are only faster than straight convolution with a kernel length (of the impulse response) greater than around 60 – 64 or so.  Anything shorter than that is best handled with normal convolution, which can be implemented using either an input side or output side algorithm.

The input side algorithm is normally the one that we learn first (at least it was for me, and judging from many texts and books on it, it seems to be common).  However, the output side algorithm is perhaps slightly easier to code, and we’ll see why later on.  For this post I’m going to evaluate the effects each algorithm has on the resulting audio signal because each one could have an advantage in some situations over the other.  I’ll also share a little trick that will speed up the convolution process for certain types of filters.

The output side algorithm performs the convolution, as one might expect, from the viewpoint of the output (or the result) of the operation, while the input side algorithm implements convolution from the viewpoint of the input signal.  To look at it another way, recall that convolving an input signal with an impulse response results in an output signal that is the length of the input + the length of the impulse response – 1.  With the input side algorithm, this “extra” bit of length is appended to the resulting signal, and is probably the easiest one to theoretically grasp.  The output side algorithm looks at what points from the input we need in order to get our resulting signal.

The reason why the output side method feels more natural to code is that we often write in terms of the output or result: y(n) = ____ — the output is equal to some kind of expression, for example.  Approaching convolution this way requires input samples on the negative side (x[-2], x[-1], etc.), which of course don’t exist.  To get around that little problem, one common way is to implement a delay line that contains all zeroes at first, then shifting in the input signal as we go, doing the calculating with the impulse response and the delay line instead of direclty with the input.

I’ve been busy redesigning and overhauling my DSP filter class, which has entailed some experimentation with these two convolution methods on FIR filters.  Here we can see visually the difference between them.  First up is the output side convolution of a windowed-sinc FIR filter using a Kaiser window of order 162 on a simple triangle wave (using test oscillators is a good way to visually see what’s happening).

Output side convolution on a triangle wave.

Output side convolution on a triangle wave.

I purposely made the filter kernel (impulse response) fairly long so it would illustrate the delay and shift that’s happening at the start of the signal due to the output side algorithm.  Compare that with the image of the input side algorithm using exactly the same filter and order.

Input side convolution on a triangle wave.

Input side convolution on a triangle wave.

Huge difference.  It’s clearly audible as well in this test signal (samples below).  With the output side algorithm there is a clear popping sound as the audio abruptly starts in contrast to the much smoother beginning of the test signal filtered with the input side algorithm.

Here you can hear the difference (to hear it clearly you may have to download the files and listen in an audio editor to avoid the glitchy start of web browser plug-ins):

Triangle wave filtered with output side

Triangle wave filtered with input side

Now to return to the point about the output side method being perhaps a little more natural to code.  The little wrinkle we face in implementing the input side algorithm is that with block processing (which is so common in DSP), how do we account for the longer output as a result of the convolution?  In many cases we don’t have control over the size of the audio buffers we’re given or have to work with.  The solution is to use the overlap-add method.

It’s really not more difficult than the output side algorithm.  All we have to do is calculate the convolution fully (into our own internal buffer if we have to), save the extra length at the end (which will be equal to the impulse response length), then add that to the start of the next block.  Rinse and repeat.  Here is C++ code that implements both of these convolution methods.

Output side convolution process.

Output side convolution process.

Input side convolution process.

Input side convolution process.

Don’t worry, I didn’t forget to share the trick I mentioned at the beginning.  Many FIR filters have linear phase response, which means that their kernels are symmetrical.  That provides us a great opportunity to eliminate extra calculations that aren’t needed.  So notice in the above code each ‘j’ loop (the kernel or impulse response loop) only traverses half the kernel length, as the value of input[i] * mKernel[j] is the same as input[i] * mKernel[mKernelLength-j-1].  At the end of the kernel’s loop we calculate the midpoint value.

Again, the calculations involving symmetry is perhaps easier to see in the output side algorithm, because we naturally gravitate towards expressions that result in a single output value.  If you work a small convolution problem out on paper, however, it will help you see the input side algorithm and why it works.

Audio Resampling: Part 2

This post has been edited to clarify some of the details of implementing the polyphase resampler [May 14, 2013].

Time to finish up this look at resampling.  In Part 1 we introduced the need for resampling to avoid aliasing in signals, and its implementation by windowed sinc FIR filters.  This is a costly operation, however, especially for real-time processing.  Let’s consider the case of upsampling-downsampling a 1 second audio signal at a sampling rate of 44.1 kHz and a resampling factor of 4X.  Going about this with the brute-force method that we saw in Part 1 would result in first upsampling the signal by 4X.  This results in a buffer size that has now grown to be 4 x 44100 = 176,400 that we now have to filter, which will obviously take roughly 4 times as long to compute.  And not only once, but twice, because the decimation filter also operates at this sample rate.  The way around this is to use polyphase interpolating filters.

A polyphase filter is a set of subfilters where the filter kernel has been split up into a matrix with each row representing a subfilter.  The input samples are passed into each subfilter that are then summed together to produce the output.  For example, given the impulse response of the filter


we can separate it into two subfilters, E0 and E1



where E0 contains the even-numbered kernel coefficients and E1 contains the odd ones.  We can then express H(z) as


This can of course be extended for any number of subfilters.  In fact, the number of subfilters in the polyphase interpolating/decimating filters is exactly the same as the resampling factor.  So if we’re upsampling-downsampling by a factor of 4X, we use 4 subfilters.  However, we still have the problem of filtering on 4 times the number of samples as a result of the upsampling.  Here is where the Noble Identity comes in.  It states that for multirate signal processing, filtering before upsampling and filtering after downsampling is equivalent to filtering after upsampling and before downsampling.

Noble Identities for upsampling and downsampling.

Noble Identities for upsampling and downsampling.

When you think about it, it makes sense.  Recall that upsampling involves zero-insertion between the existing samples.  These 0 values, when passed through the filter, simply evaluate out to 0 and have no effect on the resulting signal, so they are wasted calculations.  We are now able to save a lot of computational expense by filtering prior to upsampling the signal, and then filtering after downsampling.  In other words, the actual filtering takes place at the original sample rate.  The way this works in with the polyphase filter is quite clever: through a commutator switch.

Let’s take the case of decimation first because it’s the easier one to understand.  Before the signal’s samples enter the polyphase filter, a commutator selects every Mth sample (M is the decimation factor) to pass into the filter while discarding the rest.  The results of each subfilter is summed together to produce the signal, back to its original sample rate and without aliasing components.  This is illustrated in the following diagram:

Polyphase decimator with a commutator switch that selects the input.

Polyphase decimator with a commutator switch that selects the input.

Interpolation works much the same, but obviously on the other end of the filter.  Each input sample from the signal passes through the polyphase filter, but instead of summing together the subfilers, a commutator switch selects the outputs of the subfilters that make up the resulting upsampled signal.  Think of it this way: one sample passes into each of the subfilters that then results in L outputs (L being the interpolation factor and the number of subfilters).  The following diagram shows this:

Polyphase interpolating filter with a commutator switch.

Polyphase interpolating filter with a commutator switch that selects the output.

We now have a much more efficient resampling filter.  There are other methods that exist as well to implement a resampling filter, including Fast Fourier Transform, which is a fast and efficient way of doing convolution, and would be a preferred method of implementing FIR filters.  At lower orders however, straight convolution is just as fast (if not even slightly faster at orders less than 60 or so) than FFT; the huge gain in efficiency really only occurs with a kernel length greater than 80 – 100 or so.

Before concluding, let’s look at some C++ code fragments that implement these two polyphase structures.  Previously I had done all the work inside a single method that contained all the for loops to implement the convolution.  Since we’re dealing with a polyphase structure, it naturally follows that the code should be refactored into smaller chunks since each filter branch can be throught of as an individual filter.

First, given the prototype filter’s kernel, we break it up into the subfilter branches.  The number of subfilters (branches) we need is simply equal to the resampling factor.  Each filter branch will then have a length equal to the prototype filter’s kernel length divided by the factor, then +1 to take care of rounding error.  i.e.

branch order = (prototype filter kernel length / factor) + 1

The total order of the polyphase structure will then be equal to the branch order x number of branches, which will be larger than the prototype kernel, but any extra elements should be initialized to 0 so they won’t affect the outcome.

The delay line, z, for the interpolator will have a length equal to the branch order.  Again, each branch can be thought of as a separate filter.  First, here is the decimating resampling code:



Example of decimating resampler code.

As can be seen, calculating each polyphase branch is handled by a separate object with its own method of calculating the subfilter (processDownsample).  We index the input signal with variable M, advances at the rate of the resampling factor.  The gain adjust can be more or less ignored depending on how the resampling is implemented.  In my case, I have precalculated the prototype filter kernels to greatly improve efficiency.  However, the interpolation process decreases the level of the signal by an amount equal to the resampling factor in decibels.  In other words, if our factor is 3X, we need to amplify the interpolated signal by 3dB.  I’ve done this by amplifying the prototype filter kernel so I don’t need to adjust the gain during interpolation.  But this means I need to compensate for that in decimation by reducing the level of the signal by the same amount.

Here is the interpolator code:

Example interpolation resampling code.

Example of interpolation resampler code.

As we can see, it’s quite similar to the decimation code, except that the output selector requires an additional for loop to distribute the results of the polyphase branches.  Similarly though, it uses the same polyphase filter object to calculate each filter branch, using the delay line as input instead of the input signal directly.  Here is the code for the polyphase branches:

Code implementing the polyphase branches.

Code implementing the polyphase branches.

Again, quite similar, but with a few important differences.  The decimation/downsampling MACs the input sample by each kernel value whereas interpolation/upsampling MACs the delay line with the branch kernel.

Hopefully this clears up a bit of confusion regarding the implementation of the polyphase filter.  Though this method splits up and divides the tasks of calculating the resampling into various smaller objects than before, it is much easier to understand and maintain.

Resampling, as we have seen, is not a cheap operation, especially if a strong filter is required.  However, noticeable aliasing will render any audio unusable, and once it’s in the signal it cannot be removed.  Probably the best way to avoid aliasing is to prevent it in the first place by using band-limited oscillators or other methods to keep all frequencies below the Nyquist limit, but this isn’t always possible as I pointed out in Part 1 with ring modulation, distortion effects, etc.  There is really no shortage of challenges to deal with in digital audio!

Building a Comb Filter in Audio Units

Now as I am looking into and learning more about digital reverberation, including its implementation and theory, I decided to build a simple comb filter plug-in using Audio Units.  Previously all the plug-in work I’ve done has been using VST, but I was anxious to learn another side of plug-in development, hence Apple’s Audio Units.  It is, truth be told, very similar to VST development in that you derive your plug-in as a subclass of Audio Unit’s AUEffectBase class, inheriting and overwriting functions accordingly to the needs of your effect.  There are some notable differences, however, that are worth pointing out.  In addition, I’ve put up the plug-in available for download on the Downloads page.

The structure of an Audio Unit differs from VST in that within the main interface of the plug-in, a kernel object that is derived from AUKernelBase handles the actual DSP processing.  The outer interface as subclassed from AUEffectBase handles the view, parameters, and communication with the host.  What’s interesting about this method is that the Audio Unit automatically handles multichannel audio streams by initializing new kernels.  This means that the code you write within the Process() function of the kernel object is written as if to handle mono audio data.  When the plug-in detects stereo data it simply initializes another kernel to process the additional channel.  For n-to-n channel effects, this works well.  Naturally options are available for effects or instruments that require n-to-m channel output.

Another benefit of this structure is the generally fast load times of Audio Unit plug-ins.  The plug-in’s constructor, invoked during its instantiation, should not contain any code that requires heavy lifting.  Instead this should be placed within the kernel’s constructor, the initialization, so that any heavy processing will only occur when the user is ready for it.  Acquring the delay buffer in the comb filter happens in the kernel’s constructor, as indicated below, while the plug-in’s constructor only sets up the initial parameter values and presets.

Comb Filter kernel constructor

Comb Filter base constructor

The parameters in Audio Units also differ from VST in that they are not forced to be floating point values that the programmer is responsible for mapping for the purpose of displaying in the UI.  Audio Units comes with built-in categories for parameters which allow you to declare minimum and maximum values for in addition to a default value that is used for when the plug-in instantiates.

Declaring parameters in GetParameterInfo()

Like VST, Audio Units contains a function called Reset() that is called whenever the user starts or stops playback.  This is where you would clear buffers or reset any variables needed to return the plug-in to an initialized state to avoid any clicks, pops, or artifacts when playback is resumed.

Performing clean-up in Reset()

Because a comb filter is essentially a form of delay, a circular buffer is used (mDelayBuf) to hold the delayed audio samples.  In real-time processing where the delay time can change, however, this has repercussions on the size of the buffer used, as it would normally be allocated to the exact number of samples needed to hold the data.  But rather than deallocating and reallocating the delay buffer every time the delay time changes (requiring multiple memory accesses), I allocate the buffer to its maximum possible size as given by the maximum value allowed for the delay time.  As the delay time changes, I keep track of its size with the curBufSize variable, and it is this value that I use to wrap around the buffer’s cursor position (mPos).  This happens within the Process() function.

Comb Filter’s Process() function

Every time Process() is called (which is every time the host sends a new block of samples to the plug-in), it updates the current size of the buffer and checks to make sure that mPos does not exceed it.  The unfortunate consequence of varying the delay time of an effect such as this is that it results in pops and artifacting when it is changed in real time.  The reason being that when the delay time is changed in real time, samples are lost or skipped over, resulting in non-contiguous samples causing artifacting.  This could be remedied by implementing the Comb Filter as a variable delay, meaning when the delay time changes in real time, interpolation is used to fill in the gaps.  As it stands, however, the delay time is not practically suited for automation.

Yet another distinction with Audio Units is the requirement for validation to be usable in a host.  Audio Units are managed by OS X’s Component Manager, and this is where hosts check for Audio Unit plug-ins.  To validate an Audio Unit, a tool called “auval” is used.  This method has both pros and cons to it.  The testing procedure helps to ensure any plug-in behaves well in a host, it shouldn’t cause crashes or result in memory leaks.  While I doubt this method is foolproof, it is definitely useful to make sure your plug-in is secure.

Correction: Audio Units no longer use the Component Manager in OS X 10.7+. Here is a technical note from Apple on adapting to the new AUPlugIn entry point.

The downside to it is that some hosts, especially Logic, can be really picky with which plug-ins it accepts.  I had problems loading the Comb Filter plug-in for the simple reason that version numbers didn’t match (since I was going back and forth between debug and release versions), and so it failed Logic’s validation process.  To remedy this, I had to clear away the plug-in from its location in /Library/Audio/Plug-Ins/Components and then, after reinstalling it, open the AU Manager in Logic to force it to check the new version.  This got to be a little frustrating after having to add/remove versions of the plug-in for testing, especially since it passed successfully in auval.  Fortunately it is all up and running now, though!

Comb Filter plug-in in Logic 8

Finally, I’ll end this post with some examples of me “monkey-ing” around with the plug-in in Logic 8, using some of the factory presets I built into it.

Comb Filter, metallic ring preset

Comb Filter, light delay preset

Comb Filter, wax comb preset

Digital Reverberation

In continuing to explore the many areas of digital signal processing, reverb has cropped up many times as an area of great interest, so I’ve decided to dedicate a series of future posts on this topic.  I’m going to start at the beginning, looking at Schroeder’s design, the first digital reverberator solution, and proceed forward looking at how it’s design was improved upon by Moorer, leading eventually to Feedback Delay Networks (FDN) and other types of artificial reverbs.  All of these stages will include actual implementation, with code/algorithms, and possibly some plug-ins as a result.  However, my goal is not to develop any kind of high-end, competetive product at this point, as some commercial reverb algorithms are closely guarded secrets.  Moreover, digital reverb remains as one of the foremost challenges in DSP.  This process will, however, provide greater understanding of digital audio in addition to honing my skills in DSP coding and design.

Reverberation is of course just a dense series of echoes.  There is also a loss of energy in particular frequency ranges that depend on the material the sound bounces off of.  When all the complexities of natural reverb are accounted for, calculations to simulate this reach into the hundreds of billions or more per second!  Human ears cannot fully perceive the full compelxity of natural reverb, however, so this makes the calculations required much more manageable for many reverb designs (convolution is still very computationally expensive, though).

One of the fundamental building blocks of digital reverb is the comb filter, which Schroeder used in his design.  It circulates a signal through a delay line, adding the delayed version, scaled with a constant, g, to the original.

Comb filter design

The constant g is given by the formula:

where tau (t) is the delay time, or loop time, of the comb filter and RVT is the reverb time desired, which is defined as the time it takes for the delayed signal to reach -60dB (considered silence).

When analyzing the impulse response of natural reverberation, however, we see many dense series of echoes that are not equally spaced out with apparently random amplitudes.  Additionally, the echoes become more diffuse as the amplitudes decrease as the delayed signals build up in the space.  This leads to one of the most important properties of good reverb design, which is the diffusion of the delayed signal’s echoes — in other words it would be unnatural to hear individual pulses as the signal becomes reverberated.  Schroeder proposed the use of four comb filters (in parallel) as one of his solutions to this problem, each with it’s own distinct loop time.  To further ensure the diffusion of echoes, the four loop times should be relatively prime, otherwise the delayed signals would match up too frequently in phase to create a pumping or puffing sound, especially noticeable in the decay.

Another important property of reverb is for the decay to be exponential.  This is satisfied by the comb filter, as can be seen in the above diagram, whereby the impulse response will start out at 1 (assuming an impulse at amplitude 1) and then subsequently being scaled by g, then g2, g3, etc.

To further thicken up the sound of his reverberator, Schroeder fed the summed signals from the four comb filters through two all-pass filters in series.  These filters allow all frequencies to pass, but alter the phase of varying frequencies.  Their design is very much like a comb filter but with a feed-forward section, as can be seen below.

All-pass filter design

The two all-pass filters Schroeder uses also have their own unique loop times just as the comb filters. Unlike the comb filters, however, the reverb time specified for the all-pass filters are different because their purpose is to thicken and diffuse the echoes of the signal, not to apply additional reverberation.

Schroeder accompanied his design with suggested values to simulate a concert hall.  These values are given below (source: Dodge & Jerse, “Computer Music”, pg. 301):

Values for Schroeder’s Reverberator, simulating a concert hall

The RVT value of the comb filters is variable and can be specified by the user, but is normally around the order of 1.o second.

The actual implementation of these two filters is fairly straightforward in C++.  The code is given below:

Code implementing a comb filter

Code implementing an all-pass filter

Now let’s look at some audio samples to hear how this all sounds.  All the code was written by me, including implementation of the comb filters and all-pass filters as well as the mix.  Furthermore, I implemented a wet/dry option into the mixing stage as well as an output level due to the fact that the processed audio can increase in levels quite a bit depending on the source audio.  As far as mixing goes, at its most basic it is just adding signals together, but when mixing several audio buffers (as in the four parallel comb filters) it is a good idea to scale each sample by a factor of 1/N, where N = number of audio buffers being mixed ( 1/(sqrt(N) can also be used in some cases).

Guitar strum, original audio

Guitar strum, single comb filter

In the above example with the single comb filter applied (with a loop time of 29.7 msec) we can hear the distinct echoes/delays of the signal at the beginning.  As the audio decays we can also hear some unnatural pulsation happening (some pulsation is present in the original audio, but the comb filter augments it).

Guitar strum, 4 comb filters & 2 all-pass filters, 100% wet

Adding in all the comb filters and the 2 all-pass networks as per Schroeder’s design diffuses the echoes noticeably and the tail sounds a little more natural as well.  But for a more realistic sound we of course need the dry signal in the mix as well.

Guitar strum, 30% wet mix

It’s worth listening to a more percussive sound to hear the reverb’s effect on it.  Here is a short piano riff and a single comb filter applied to it, and the echo effect is very noticeable and quite disturbing.

Piano riff, original audio

Piano riff, single comb filter

Now applying the reverb in its entirety onto the piano riff with a 30% wet mix results in a more natural reverb.

Piano riff, 30% wet mix

It is, however, not perfect by any means.  We can still hear a slight echo after each attack, and the reverb sound is a little bright and metallic sounding.  As stated at the beginning, the echoes from reverberation lose energy as well as amplitude as they reflect off surfaces and travel through air, and this has not been accounted for in this design.  To improve on this, adding in a simple low-pass filter in the comb filters was used as a solution.  This will be one of the things I’ll be looking at going forward as well as more elaborate reverb designs that attempt to more realistically simulate natural reverberation.

The Making of a Plug-In: Part 2

This entry in my making of a plug-in series will detail what went into finalizing the prototype program for the Match Envelope plug-in.  A prototype of this kind is usually a command-line program wherein much of the code is actually written to implement functionality and features, and then later transferred into a plug-in’s SDK (in my case, the VST SDK).  Plug-ins, by their very nature, are not self-executable programs and need a host to run, so it is more efficient to test the fundamental code structure within a command-line program.

Having now completed my prototype, I want to first share some of the things I improved upon as well as new features I implemented.  One of the main features of the plug-in that I mentioned in part 1 was the match % parameter.  This effectively lets you control how strongly the envelope you’re matching affects the audio, and rather than being a linear effect, it is proportional to the difference between the amplitude of the envelope and the amplitude of the audio.  Originally this was the formula I used (from part 1):

We could see that this mostly gave me the results I was after, but if we look closely at the resulting waveform, there is some asymmetry in comparing it to the original (look at the bottom of the waveforms).  One of the mistakes in this formula was in comparing the interpolated value ‘ival’ with the actual sample amplitude of the ‘buffer’.  To remedy this, I now also extract the envelope of the destination audio that we are applying the envelope on to (with the same window width used to extract the source envelope) and use this value to compare the difference with ‘ival’.  This ensures a more consistent and accurate comparison of amplitudes.

The other mistake in the original formula was to linearly affect ‘a‘, the alpha value that is input by the user in %, by the term that calculates the difference between the amplitudes. So while the a value does affect the resulting ival proportionally, a itself was not.  The final equation then, just became:

I use two strategies when deciding on an appropriate mathematical formula for what I need.  One is considering how I want a value to change over time, or over some range of values, and turning to a kind of equation that does that (i.e. should it be a linear change, exponential, logarithmic, cyclical, etc.).  This leads to the second method, and that is to use a graph to visualize the shape of change I am after; this leads to an equation that defines that graph.

Here is a quick graphic and audio to illustrate these changes using the same flute source as the envelope and triangle wave as its destination from part1:

Flute envelope applied with 80msec window size at 100% match

Shortly, we will be seeing some much more interesting musical examples of the plug-in at work.  But before that, we can see another feature at work above that I implemented since last time: junction smoothing.

In addition to specifying the length of the envelope, the user also specifies a value (in msec) to smooth the transition from the envelope match to the original, unmodified audio.  Longer values will obviously make the transition more gradual, while shorter makes it more abrupt.  The process of implementing this feature turned out to be reasonably simple.  This is the basic equation:

where ‘ival’ is the interpolated value and ‘jpos’ is the current position within the bounds of the junction smoothing specified by the user.  ‘jpos’ starts at 0, and once the smoothing begins, it increments (within a normalized value) until it hits 1 at the end of the smoothing.  The larger ‘jpos’ gets, the less of the actual interpolated value we end up with in our ‘jval’ result, which is used to scale the audio buffer (just as ‘ival’ does outside of junction smoothing).  In other words, when ‘jpos’ hits 1, ‘jval’ will be 1 and so we multiply our audio buffer by 1; then we have reached the end of our process and the original audio continues on unmodified.

Before we move on, here is a musical example.  This very famous opening of Debussy’s “Prelude to the Afternoon of a Faun” seemed like a good excerpt to test my plug-in on.  This is the original audio (the opening is very very soft as most classical recordings are of quiet moments to preserve dynamic range, so I had to amplify it which is why there is some audible low noise):

Opening of “Prelude to the Afternoon of a Faun”

Using this as the source envelope, I used a window size of 250msec at 90% match to apply on to this flute line that I recorded in Logic, doubling the original from the audio above (the flute sample is from Vienna Special Edition).

Unmodified flute doubling of the Prelude opening

The result:

Flute doubling after Match Envelope

And here is the result mixed with the original audio:

Doubled flute line mixed with original audio

Junction smoothing was of course applied during the process to let the flute line fade out as in its original incarnation.  Without smoothing it would have abruptly cut off.  This gives us a seamless transition from the end of the flute solo into the orchestral answer.

It was very important in this example to specify a fairly large window duration, because we don’t want to capture the tremolo of the original flute solo as this would fight against the tremolo of the sampled recording that we are applying the envelope on to.  We can hear a little bit of this in places even with a 250msec window, so this will be something I intend to test further to see how this may be avoided or at least minimized.

This brings me to the other major challenge I faced in developing this plug-in since part 1: stereo handling.  Dealing with stereo files isn’t complicated in itself, but there were a few complexities I encountered along the way specific to how I wanted the plug-in to behave. Instead of only allowing a 1-to-1 correspondence (i.e. only supporting mono to mono, or stereo to stereo), I decided to allow for the two additional situations of mono to stereo and stereo to mono.

The first two cases are easy enough to deal with, but what should happen if the source envelope is mono and the destination audio is stereo, and vice versa?  I decided to allocate 2-dimensional arrays for the envelope buffers to hold the mono/stereo amplitude data and then use bitwise flags to store the states of each envelope:

This saves on having multiple variables representing channel states for each envelope, so I only have to pass around one variable that contains all of this information that is then parsed in the appropriate places to retrieve this information.

As we can see, only one variable (‘envFlags’) is used in the extraction of the source envelope, and we can find out mono/stereo information by using bitwise AND with the corresponding enum definition of the flag we’re after.  Furthermore, in the case of the source envelope being stereo but destination audio being mono, I combine the amplitude data of the two channels into one, according to either average or peak extraction method (also specified by user).

The difficulty in implementing this feature wasn’t so much in how to get it done, but how to get it done more efficiently, without a massive number of parameters passing around and a whole lot of conditional statements within the main processing loop to determine how many channels each envelope has.  We can see some of this at work within the main process:

I use another variable (‘stereo_src’) that extracts some information from the bitwise flags to take care of the case where the source envelope is mono but destination audio is stereo.  Since the loop covers the channels of the destination audio (in interleaved format), I needed a way to restrict out of bounds indexing of the source envelope.  If the source envelope is mono, ‘stereo_src’ will be 0, so the indexing of it will not exceed its limit.  If both envelopes are stereo, ‘stereo_src’ will be 1, so it will effectively “follow” the same indexing as the destination audio.

For the next, and last, musical example of this entry, we change things up a whole lot.  I’m going to show the application of this plug-in to electronic dance music.  This match envelope plug-in can emulate, or function as a kind of side chain compressor, which is quite commonly found in EDM.  Here is a simple kick drum pattern and a synth patch that goes on top:

Kick drum pattern

Synth pattern

By applying the match envelope plug-in to the synth pattern using the kick drum pattern above as the source envelope (window size of 100msec and 65% match), we achieve the kind of pumping pattern in the synth so characteristic of this kind of music.  The result, and mix, are as follows:

Match Envelope applied to synth pattern

Modified synth pattern mixed with percussion and bass

This part has really covered the preliminary features of what I’m planning to include in the Match Envelope plug-in.  As I stated at the start, the next step is to transfer the code into the VST SDK (that’ll be part 3), but this also comes with its share of considerations and complications, mainly dealing with UI.  How should this appear to the user?  How do you neatly package it all together to make it easy and efficient to use?  How should be parameters be presented so that they are intuitive?

Most all plug-ins/hosts offer up a default UI, which is what I’ll be working with initially, but eventually a nice graphic custom GUI will be needed (part 4? 5? 42?).

Shaking it up with Vibrato

Let’s start it off with some music:

Vibrato has always been an essential technique in making music feel more alive, rich, and full of expression.  Whether it is string, wind, or brass players in an orchestra, a singer, or a synthesized waveform in an old 8-bit NES game, vibrato ensures those long notes and phrases connect with us in a more meaningful way by giving them character and shape.

Unlike tremolo (which was the subject of the previous blog entry), vibrato modulates pitch, not amplitude, using an LFO.  When generating a waveform using synthesis, this is a trivial matter as we have direct access to the frequency component.  But with prerecorded audio, the vibrato effect is a achieved through the use of modulated variable delay.  To better understand this, let’s start off by looking at a basic delay effect implemented in C++ code.

The way a simple delay works is by creating a buffer with a length equal to the delay time (making sure to initialize it to contain all zeroes) and then as we process the audio buffer, we transfer each sample from it into the delay buffer while extracting values from the delay buffer and mixing it with the original audio buffer.  Since the delay buffer is initialized to contain all zeroes, the first pass through it will do nothing to the original audio, but upon completing the first pass the delay buffer will contain the samples from the audio that will then be mixed in, creating the delay.  By using a delay time of 0.5 seconds (which would require the delay buffer to contain 22050 samples assuming a sample rate of 44.1kHz), and a ‘depth’ of 45% or so, the following code would generate a single half-second slap-back delay, or echo, at 45% of the original amplitude:

Adapting this code to create a vibrato effect isn’t too complex, but it does require a few steps that might seem a bit hard to grasp at first.  We need to create a variable delay and this requires two pointers to our delay buffer — a writing pointer that will proceed sample by sample as in the basic delay above, and a reading pointer that will be calculated in relation to the writing pointer and modulated by the LFO.  The reading position will almost always fall between buffer positions, so interpolation is required to achieve more accurate output.  With these points considered, the variable delay code becomes:

It was here that I first encountered a big roadblock in writing my vibrato effect.  Upon testing it on a number of soundfiles, I was getting a moderate amount of distortion, or sample noise, in my output.  Having already learned from similar challenges in writing the tremolo effect previously, I was fairly certain this was a new issue I had to tackle.  The test that led me to the source of the problem was using a constant delay time in the code above (no modulation by the sine wave) and that produced a clean output.  From here, I knew the problem had to lie in how I was calculating the offset using the sine wave modulator.  Originally I calculated it like this:

offset = (delay time * sine wave(phase)) * sample rate,

where the phase of the sine wave increments by the value of 2 * pi * freq / SR.  After doing some research (and hard thinking on the matter), it became clear that this was the wrong mathematical operation because multiplying the modulator with the delay time scales it; we want to move “around” it (i.e. vibrato fluctuates pitch by a small amount around a central pitch).  That eventually led me to come up with the following base equation:

offset = (delay time + sine wave(phase) * delay time) * sample rate.

This equation needs a couple more modifications since it isn’t modulating “around” the delay time yet, just adding to it.  A depth modifier needs to be included in here as well so that we can change the intensity of the vibrato effect (by modifying the magnitude of the sine wave).  The final equation then becomes:

offset = (delay time/2 + (sine wave(phase) *depth) * delay time/2) * sample rate,

which simplifies to:

offset = (delay time/2 * (1 + sine wave(phase) * depth)) * sample rate.

This finally created the expected output I was after!  It’s such a great feeling to solve logical programming challenges!  Here is an example of the output with a vibrato rate of 8.6Hz at 32% depth:

Terra’s theme with vibrato rate of 8.6Hz at 32% depth

One other important element to discuss is the actual delay time used to generate the vibrato effect.  I experiemented around with many values before settling on a delay time of 0.004 seconds, which is the value that we “delay around” using the sine wave.  I found as the values got smaller than 0.004 seconds that the sound of the effect degraded, and actually resulted in some sample noise because the delay buffer became so small (nearing as few as only 30 samples).  As the delay time increases, the pitch of the audio begins to vary so much that we actually lose almost all pitch in the original audio.

This is not necessarily a bad thing.  This opens up vibrato to be used as a sound effect rather than purely a musical expression tool.  By setting the delay time to 0.03 seconds for example, the vibrato effect generates an output not unlike a record-scratch or something resembling flanging (which is actually also achieved through the use of variable delay).  See if you can recognize the source music in this sample:

Vibrato effect at 9.0Hz and 75% depth

Of course a more subtle effect is often desired for musical purposes and this is controlled by the depth modifier.  Here is a sample of a more subtle vibrato effect (back to the delay time of 0.004 seconds):

Zelda with vibrato rate of 6.4Hz at a 13% depth

One final thing to mention in regards to applying the vibrato effect onto prerecorded audio is that it can distort the sound somewhat when the audio used is a fully realized composition.  The vibrato is of course being applied on to the entire file (i.e. every instrument, every sound).  A more practical application would be to use vibrato on a single instrument source; a flute for example (please excuse my horrible flute playing):

Flute with vibrato rate of 6.0Hz at a 40% depth

Last, but not least, it is important to consider the implementation and design of the code that applies the effect.  I have continued to code these effects as C++ classes using object-oriented design as it makes the implementation of them very easy and efficient.  For example, calling the effect in the main loop of the program is as trivial as:

Here we can see that first we read sample data in from the soundfile and store it in ‘buffer’.  Then the ‘buffer’ is passed, along with the LFO modulator, into the process that applies the variable delay (vibrato in this case), and this is then written to the output soundfile.  The LFO modulator used for the vibrato is just a new instance of the oscillator class I developed for the tremolo effect previously.  I just initialize a new instance of it for use in the vibrato effect, and done!

This is an example of the benefits of object-oriented design and how adaptable it is.  We’ll be seeing much more of this to come as well.  For example, it would require a few trivial code changes to set up multi-tap delays, each with their own depth, and even to incorporate filters into the delays once I get into developing them.  And finally, allowing the use of envelopes to further shape these effects will be an important step to be taken in the future.  With so many tantalizing possibilities, there’s no stopping now!

Coding some Tremolo

The adventure continues.  This time we occupy the world of tremolo as a digital signal processing effect; also known as amplitude modulation.  My studies into the area of audio programming has progressed quite far I must say, covering the likes of filters and delays (get your math hats ready), reverb, and even plug-in development.  In order to really solidify what I’ve been learning though, I decided to go back and create a program from scratch that will apply tremolo and vibrato to an existing audio file, and that’s where this blog entry comes in.  For now I am just covering the tremolo effect as there is plenty to discuss here on that, while vibrato will be the subject of the next blog entry.

Tremolo in and of itself is pretty straightforward to implement, both on generating signals and on existing soundfiles.  (Vibrato on the other hand is easy enough to apply on to signals being generated, but a bit more complex when it comes to sound input).  Nonetheless, several challenges were met along the way that required a fair amount of research, experimentation and problem solving to overcome, but in doing so I’ve only expanded my knowledge in the area of DSP and audio programming.  I think this is why I enjoy adventure games so much — chasing down solutions and the feeling you get when you solve a problem!

The tremolo effect is simply implemented by multiplying a signal by an LFO (low frequency oscillator).  While LFOs are normally between 0 – 20 Hz, a cap of 10 Hz works well for tremolo.  The other specification we need is depth — the amount of modulation the LFO will apply on to the original signal — specified in percent.  A modulation depth of 100%, for example, will alternate between full signal strength to complete suppression of the signal at the frequency rate of the LFO.  For a more subtle effect, a depth of around 30% or so will result in a much smoother amplitude variance of the signal.  With this information we can develop a mathematical formula for deriving the modulating signal in which we can base our code on.  This is also where I encountered one of my first big challenges.  The formula I used at first (from the book Audio Programming) was:

ModSignal = 1 + DEPTH * sin(w * FREQ)

where w = 2 * pi / samplerate.  This signal, derived from the LFO defined by the sine operation, would be used to modulate the incoming sound signal:

Signal = Signal * ModSignal

This produced the desired tremolo effect quite nicely.  But when the original signal approached full amplitude, overmodulation would occur resulting in a nasty digital distortion.  As can be seen in the above equation for the modulating signal, it will exceed 1 for values of sine > 0.   Essentially this equation is a DC offset, which takes a normally bipolar signal and shifts it up or down.  This is what we want to create the tremolo effect, but after realizing what was causing distortion in the output, I set about finding a new equation to calculate the modulating signal.  After some searching, I found this:

ModSignal = (1 – DEPTH) + DEPTH * (sin(w * FREQ))2

This equation was much better in that it never exceeds 1, so it won’t result in overmodulation of the original signal.  I did however make one personal modification to it; I decided not to square the sine operation after experimenting around with it in the main processing loop.  Ideally we want to perform as few calculations (especially costly ones) within loops as possible.  This is especially important in audio where responsiveness and efficiency are so important in real-time applications.  To compensate for this I scale the DEPTH parameter from a percentage to a range of 0 – 0.5.  From here we can now get into the code.  First, initialization occurs:

Then the main processing loop:

With expandability and flexibility in mind, I began creating my own “oscillator” class which can be seen here:

This is where the power of C++ and object-oriented programming start to show itself.  It affords the programmer much needed flexibility and efficiency in creating objects that can be portable between different programs and functions for future use, and this is definitely important for me as I can utilize these for upcoming plug-ins or standalone audio apps.  Furthermore, by designing it with flexibility in mind, this will allow for the modulation of the modulator so-to-speak.  In other words, we can time-vary the modulation frequency or depth through the use of envelopes or other oscillators.  Values extracted from an envelope or oscillator can be passed into the “oscillator” class which processes and updates its internal data with the proper function calls.  This will allow for anything from ramp ups of the tremolo effect to entirely new and more complex effects derived from amplitude modulation itself!

But now let’s get on to the listening part!  For this demonstration I extracted a short segment of the Great Fairy Fountain theme from the Zelda 25th Anniversary CD release, probably my favorite theme from all of Zelda.

Zelda theme

Here it is after being modulated with a frequency of 4.5 Hz at a depth of 40%:

Tremolo at 4.5 Hz and 40% depth

And for a little more extreme tremolo, we can modulate it at 7.0 Hz at a depth of 85%:

Tremolo at 7.0 Hz and 85% depth

This brings up another challenge that had to be overcome during the development of this program.  Prior to this most of the work I had been studying in the book “Audio Programming” dealt with mono soundfiles.  For this I really wanted to get into handling stereo files and this presented a few problems as I had to learn exactly how to properly process the buffer that holds all the sound data for stereo files.  I am using libsndfile (http://www.mega-nerd.com/libsndfile/) to handle I/O on the actual soundfile being processed and this required me to search around and further adapt my code to work properly with this library.  At one point I was getting very subtle distortion in all of my outputs as well as tremolo rates that were double (or even quadruple) the rates that I had specified.  It took a lot of investigation and trial & error before I discovered the root of the problem lie in how I was handling the stereo files.

In closing off this blog entry, here is a further processing I did on the Zelda sample.  After applying tremolo to it using the program I wrote, I put it through the pitch shifter VST plug-in I implemented to come up with a very eerie result.  ‘Till next time!

Eerie Zelda