Building a Tone Generator for iOS using Audio Units

Over the past month or so, I’ve been working with a friend and colleague, George Hufnagl, on building an iOS app for audiophiles that includes several useful functions and references for both linear and interactive media.  The app is called “Pocket Audio Tools” and will be available sometime in mid-August for iPhone, with iPad native and OS X desktop versions to follow.  Among the functions included in the app is one that displays frequency values for all pitches in the MIDI range with adjustable A4 tuning.  As an extension of this, we decided to include playback of a pure sine wave for any pitch selected as further reference.  While standard audio playback on iOS is quite straightforward, building a tone generator requires manipulation of the actual sample data, and for that level of control, you need to access the lowest audio layer on iOS — Audio Units.

To do this, we set up an Objective-C class (simply derived from NSObject) that takes care of all the operations we need to initialize and play back our sine tone.  To keep things reasonably short, I’m just going to focus on the specifics of initializing and using Audio Units.

Here is what we need to initialize our tone generator; we’re setting up an output unit with input supplied by a callback function.

AudioComponentDescription defaultOutputDescription = { 0 };
defaultOutputDescription.componentType = kAudioUnitType_Output;
defaultOutputDescription.componentSubType = kAudioUnitSubType_RemoteIO;
defaultOutputDescription.componentManufacturer = kAudioUnitManufacturer_Apple;
defaultOutputDescription.componentFlags = 0;
defaultOutputDescription.componentFlagsMask = 0;

AudioComponent defaultOutput = AudioComponentFindNext(NULL, &defaultOutputDescription);
OSStatus error = AudioComponentInstanceNew(defaultOutput, &_componentInstance);

Above, we set the type and subtype to an output Audio Unit whose I/O interfaces directly with iOS (specified by RemoteIO).  Currently on iOS, no third-party AUs are allowed, so the only manufacturer is Apple.  The two flag fields are set to 0 as per the Apple documentation.  Next we search for the default Audio Component by passing NULL as the first argument to AudioComponentFindNext(), and then we create the instance.

Once an Audio Unit instance has been created, its properties are customized through calls to AudioUnitSetProperty().  Here we set up the callback function used to populate the buffers with the sine wave for our tone generator.

AURenderCallbackStruct input;
input.inputProc = RenderTone;
input.inputProcRefCon = (__bridge void *)(self);
error = AudioUnitSetProperty(_componentInstance, kAudioUnitProperty_SetRenderCallback, kAudioUnitScope_Input, 0, &input, sizeof(input));

After setting the callback function RenderTone (we’ll define this function later) and our user data (self in this case, because we want the instance of the class we’re defining to be the user data in the callback routine), we set this property on the AU instance.  The scope of the Audio Unit refers to the context in which the property applies, in this case input.  The “0” argument is the element (bus) number; we want the first one, so we specify “0”.

Next we need to specify the type of stream the audio callback function will be expecting.  Again we use the AudioUnitSetProperty() function call, this time passing in an instance of  AudioStreamBasicDescription.

AudioStreamBasicDescription streamFormat = { 0 };
streamFormat.mSampleRate = _sampleRate;
streamFormat.mFormatID = kAudioFormatLinearPCM;
streamFormat.mFormatFlags = kAudioFormatFlagsNativeFloatPacked | kAudioFormatFlagIsNonInterleaved;
streamFormat.mBytesPerPacket = 4; // 4 bytes for 'float'
streamFormat.mBytesPerFrame = 4; // sizeof(float) * 1 channel
streamFormat.mFramesPerPacket = 1; // 1 frame per packet for uncompressed audio
streamFormat.mChannelsPerFrame = 1; // 1 channel (mono)
streamFormat.mBitsPerChannel = 8 * 4; // 8 bits/byte * sizeof(float)
error = AudioUnitSetProperty(_componentInstance, kAudioUnitProperty_StreamFormat, kAudioUnitScope_Input, 0,
                                     &streamFormat, sizeof(AudioStreamBasicDescription));

The format identifier in our case is linear PCM since we’ll be dealing with non-compressed audio data.  We’ll also be using native floating point as our sample data, and because we’ll be using only one channel, our buffers don’t need to be interleaved.  The bytes per packet field is set to 4 since we’re using floating point numbers (which is of course 4 bytes in size), and bytes per frame is the same because in mono format, 1 frame is equal to 1 sample.  In uncompressed formats, we need 1 frame per packet, and we specify 1 channel for mono.  Bits per channel is just calculated as 8 * sizeof our data type, which is float.

This completes the setup of our Audio Unit instance, and we then just initialize it with a call to

AudioUnitInitialize(_componentInstance);

Before looking at the callback routine, there is one extra feature that I included in this tone generator.  Simply playing back a sine wave results in noticeable clicks and pops in the audio due to the abrupt amplitude changes, so I added fade ins and fade outs to the playback.  This is accomplished simply by storing arrays of fade in and fade out amplitude values that shape the audio based on the current playback state.  For our purposes, setting a base amplitude level of 0.85 and using linear interpolation to calculate the arrays using this value is sufficient.  Now we can examine the callback function.

OSStatus RenderTone (void* inRefCon, AudioUnitRenderActionFlags* ioActionFlags,
                     const AudioTimeStamp* inTimeStamp,
                     UInt32 inBusNumber,
                     UInt32 inNumberFrames,
                     AudioBufferList* ioData)
{
    PAToneGenerator *toneGenerator = (__bridge PAToneGenerator*)(inRefCon);
    Float32 *buffer = (Float32*)ioData->mBuffers[0].mData;

    for (UInt32 frame = 0; frame < inNumberFrames; ++frame)
    {

We only need to consider the following arguments: inRefCon, inNumberFrames, and ioData.  First we need to cast our user data to the right type (our tone generator class), and then get our buffer that we will be filling with data.  From here, we need to determine the amplitude of our sine wave based on the playback state.

switch ( toneGenerator->_state ) {
    case TG_STATE_IDLE:
        toneGenerator->_amplitude = 0.f;
        break;

    case TG_STATE_FADE_IN:
        if ( toneGenerator->_fadeInPosition < kToneGeneratorFadeInSamples ) {
            toneGenerator->_amplitude = toneGenerator->_fadeInCurve[toneGenerator->_fadeInPosition];
            ++toneGenerator->_fadeInPosition;
        } else {
            toneGenerator->_fadeInPosition = 0;
            toneGenerator->_state = TG_STATE_SUSTAINING;
        }
        break;

    case TG_STATE_FADE_OUT:
        if ( toneGenerator->_fadeOutPosition < kToneGeneratorFadeOutSamples ) {
            toneGenerator->_amplitude = toneGenerator->_fadeOutCurve[toneGenerator->_fadeOutPosition];
            ++toneGenerator->_fadeOutPosition;
        } else {
            toneGenerator->_fadeOutPosition = 0;
            toneGenerator->_state = TG_STATE_IDLE;
        }
        break;

    case TG_STATE_SUSTAINING:
        toneGenerator->_amplitude = kToneGeneratorAmplitude;
        break;

    default:
        toneGenerator->_amplitude = 0.f;
        break;
}
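As a side note, the post doesn’t show how the fade curve arrays are filled.  A rough sketch, assuming kToneGeneratorAmplitude is the 0.85 base level and that the array and constant names match those used in the switch above, might look like this:

// Sketch: linearly interpolated fade curves toward the base amplitude.
// kToneGeneratorFadeInSamples, kToneGeneratorFadeOutSamples, _fadeInCurve and
// _fadeOutCurve are assumed to be defined elsewhere in the class.
for (UInt32 i = 0; i < kToneGeneratorFadeInSamples; ++i) {
    _fadeInCurve[i] = kToneGeneratorAmplitude * ((Float32)i / (Float32)kToneGeneratorFadeInSamples);
}

for (UInt32 i = 0; i < kToneGeneratorFadeOutSamples; ++i) {
    _fadeOutCurve[i] = kToneGeneratorAmplitude * (1.0f - (Float32)i / (Float32)kToneGeneratorFadeOutSamples);
}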

Once we have the amplitude, we simply fill the buffer with the sine wave.

buffer[frame] = sinf(toneGenerator->_phase) * toneGenerator->_amplitude;

toneGenerator->_phase += toneGenerator->_phase_incr;
toneGenerator->_phase = ( toneGenerator->_phase > kTwoPi ? toneGenerator->_phase - kTwoPi : toneGenerator->_phase );
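The phase increment used above comes from the selected pitch’s frequency and the sample rate, and would be recalculated whenever the pitch changes.  A quick sketch (where _frequency is an assumed instance variable holding the frequency in Hz, and kTwoPi is 2π):

// Sketch: radians to advance per sample for the selected pitch.
_phase_incr = kTwoPi * _frequency / _sampleRate;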

One important thing to bear in mind is that the callback function is continually called after we start the Audio Unit with

AudioOutputUnitStart(_componentInstance);

We don’t need to start and stop the entire instance each time a tone is played because that could very well introduce some delays in playback, so the callback continually runs in the background while the Audio Unit is active.  That way, calls to play and stop the tone are as simple as changing the playback state.

- (void)playTone
{
    if ( _isActive) {
        _state = TG_STATE_FADE_IN;
    }
}

- (void)stopTone
{
    if ( _isActive ) {
        if ( _state == TG_STATE_FADE_IN ) {
            _state = TG_STATE_IDLE;
        } else {
            _state = TG_STATE_FADE_OUT;
        }
    }
}

The reason for the extra check in stopTone is to prevent clicks from occurring if the playback time is very short.  In other words, if the playback state has not yet reached its sustain point, we don’t want any sound output.

To finish off, we can stop the Audio Unit by calling

AudioOutputUnitStop(_componentInstance);

and free the resources with calls to

AudioUnitUninitialize(_componentInstance);
AudioComponentInstanceDispose(_componentInstance);
_componentInstance = nil;

That completes this look at writing a tone generator for output on iOS.  Apple’s Audio Units layer is a sophisticated and flexible system that can be used for all sorts of audio needs including effects, analysis, and synthesis.  Keep an eye out for Pocket Audio Tools, coming soon to iOS!

DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding.  Among its many features, it supports 2D and 3D graphics through OpenGL.  As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements.  We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days.  To illustrate, let’s take an example of a magic spell; both its visual and aural element will have randomness, but will these sync up together?  In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic?  If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element is accurately reflective of the aural part.  It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this?  Analysing a sound in the time domain (represented by the wave form) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum.  The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides.  Instead of displaying amplitude (y axis) vs. time (x axis) in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis).  This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but it’s thanks to the FFT (Fast Fourier Transform) and the dramatically increased floating point performance of modern computers that using the DFT for analysis in real time has become practical.  Computing the DFT directly is very slow (it’s an O(N^2) operation, much like straight convolution).  The FFT uses a highly optimized algorithm (the Cooley-Tukey is the most widely used, but there are several others) to calculate the DFT of a signal.  Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.

The DFT works by breaking a signal down into its individual components.  This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal.  Each sine wave represents a single frequency component; its amplitude is the strength of that frequency.  The DFT results in a complex signal with both cosine and sine wave components.  This information isn’t terribly useful to us in this form.  From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form.  In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value).  The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( a^2 + b^2 )

This value, plotted against the frequency (given by the index of the complex value), gives us the spectrum of the signal.  In other words, if we take the FFT of an N-point signal, we get N complex values (real/imaginary pairs) as output.  Each index in the complex signal represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N corresponding to the sampling rate.  The magnitude of the complex value at each index gives us the amplitude of that frequency component in the signal.  A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2).  Here is a sample plot of a piano chord I recorded, using 5 512-sample buffers from the audio:

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (assuming a sampling rate of 48kHz, this would equate to approximately 0 – 1920 Hz).  Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.
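For concreteness, here is a small sketch (not the actual code used for this project) of how the per-bin magnitudes can be pulled out of an interleaved complex FFT buffer of the kind used below:

#include <cmath>
#include <vector>

// Sketch: convert an interleaved complex FFT result (re, im, re, im, ...)
// into magnitudes for bins 0..N/2. 'fftData' holds N complex pairs.
std::vector<double> computeMagnitudes (const double* fftData, int N)
{
    std::vector<double> mags(N / 2 + 1);
    for (int k = 0; k <= N / 2; ++k) {
        double re = fftData[2 * k];
        double im = fftData[2 * k + 1];
        mags[k] = std::sqrt(re * re + im * im); // m = sqrt(a^2 + b^2)
    }
    return mags;
}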

There are a number of parameters I considered in generating particles: the number of particles, how they’re generated/constructed, their radius, and their color.  Particles are only generated during transients, so I check a previous RMS value against the new one to determine if particles should be constructed.  This method isn’t foolproof, but in my tests it’s been working quite well for my purposes.  The number of particles generated is also related to the RMS of the signal — the louder it is, the more particles are made.  This is evenly divided across the whole spectrum, so that at least a few particles from each frequency range are present.  The radius of each particle is determined by the frequency component (the N value of the frequency plot above).  The lower the frequency, the larger the particle.  Finally, the color, or more specifically the brightness of the particle, is determined by the frequency amplitude.  Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using Portaudio for real-time playback, so all the audio processing happens in a callback function):

int audioCallback (const void* input, void* output, unsigned long samples, const PaStreamCallbackTimeInfo *timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
    const float* in = (const float*)input;
    float* out = (float*)output;
    PaAudioData* data = (PaAudioData*)userData;

    memset(data->buffer, 0, sizeof(double) * SAMPLE_BLOCK * 2);
    float rms = CFDsp::cf_SSE_calcRms(in, samples);

    for (int i = 0; i < samples; ++i) {
        data->buffer[i*2] = *in * hammingWindow(i, SAMPLE_BLOCK);

        *out++ = *in++;
        *out++ = *in++;
    }

    // Calculate DFT using FFT.
    fft(data->buffer, SAMPLE_BLOCK);

    data->amp_prev = data->amp_now;
    data->amp_now = rms;

    return paContinue;
}
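The hammingWindow function used above isn’t shown here; a standard Hamming window (assuming the usual 0.54/0.46 coefficients) would look something like this:

#include <cmath>

// Standard Hamming window value for sample index i in a block of length n.
inline float hammingWindow (const int i, const int n)
{
    const float pi = 3.14159265358979f;
    return 0.54f - 0.46f * cosf(2.f * pi * (float)i / (float)(n - 1));
}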

Since the FFT function I’m using requires interleaved complex data (real and imaginary parts alternating), the buffer needs to be zeroed out first and then each real-valued sample stored in the even-numbered indices.  The RMS value is acquired using a function I wrote with SSE intrinsics for optimal speed.

float cf_SSE_calcRms (const float* inBuffer, const int length)
{
    // Process 4 samples per iteration; the buffer length must be a multiple of 4,
    // and inBuffer is assumed to be 16-byte aligned for the __m128 cast below.
    assert(length % 4 == 0);

    __m128* mbuf = (__m128*)inBuffer;
    __m128 m1, m2;
    __m128 macc = _mm_set1_ps(0.f);
    __m128 _1oN = _mm_set1_ps(1.f / static_cast<float>(length));

    for (int i = 0; i < length; i+=4) {
        m1 = _mm_mul_ps(*mbuf, *mbuf);   // square the samples
        m2 = _mm_mul_ps(m1, _1oN);       // scale by 1/N
        macc = _mm_add_ps(m2, macc);     // accumulate in 4 parallel lanes
        mbuf++;
    }

    // Sum the 4 lanes and take the square root to get the RMS value.
    __declspec(align(16)) float result[4];
    _mm_store_ps(result, macc);

    return sqrtf(result[0] + result[1] + result[2] + result[3]);
}

Inside the for loop, a Hamming window is applied to the signal prior to sending it to the FFT to bound it along the edges, minimizing or eliminating frequency leakage in the spectrum.  In the application’s update method, I test for the presence of a transient in the current buffer.  If one exists, particles are added.

if (audioData.isTransient())
{
    Vec2i randPos;

    // 50 pixel border.
    randPos.x = Rand::randInt(50, getWindowWidth() - 50);
    randPos.y = Rand::randInt(50, getWindowHeight() - 50);
    mParticleController.addParticles(randPos, audioData);
}

The adding of particles is handled by the ParticleController class:

void ParticleController::addParticles (const Vec2i& loc, const PaAudioData& data)
{
    int stride = static_cast<int>((64.f - powf((64.f * data.amp_now), 1.32f)) + 0.5f);
    stride = (stride < 1 ? 1 : stride);

    // 512 == sample block size; represents the horizontal frequency axis in the spectral analysis.
    for (int i = 0; i < 512; i+=stride) {
        Vec2f randVec = Rand::randVec2f() * 6.5f;
        mParticles.push_back(Particle(loc + randVec, i/1024.f, data.mag(i) * 18.f));
    }
}

First I calculate a stride value that controls the step size of the for loop above (this evenly divides the number of particles across the x axis).  As you can see, this is determined by the volume (calculated as RMS) of the current audio buffer.  When creating each individual particle, the i/1024 argument determines the radius — as I mentioned, this is based on the frequency value along the x axis.  To constrain it within reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )

The color of each particle is first determined using HSV.  The hue is randomly chosen, but the saturation and value (brightness) is based on the data.mag(i) * 18 argument — i.e. the amplitude of the frequency component.  This is then converted into an RGB value to set the color of the individual particle.

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and RMS to usable values in the particle engine.  That’s the reason for several of the “magic numbers” in the function above (1024 is not one of them; it’s double our sample length, which I said previously we only want from 0 – 0.5, and since the data stored is in interleaved format, the buffer is actually 1024 samples long).  Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further.  I feel that this is a good first step in this area, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):

DFT Spectral Particle Engine from Christian on Vimeo.

Upgrading Dyner with Parameter Smoothing

It’s been just over a month since I released my free ring modulator plug-in, Dyner.  Since then I’ve gotten some very nice feedback on it, and thankfully no crash reports (well.. that I know of). Additionally, I’ve been tweaking a few things like the GUI controls and doing some minor optimizations, but more importantly I just finished adding smoothing to the depth parameter, so I figured it was time to upgrade to version 1.0.4 and put that out.  Also, since parameter smoothing is a pretty important feature in DSP for achieving good quality sound, I thought I’d give a brief overview of what I did.

First, let’s hear the difference.  Previously, artifacts were most easily heard using a low frequency setting with the sine wave while changing the depth somewhat rapidly.  Other waveforms, due to their rich harmonic content, obscured most of the artifacts that resulted from automating the depth.  So here then, are two audio samples “before” and “after”:

Before parameter smoothing

After parameter smoothing

In writing the parameter smoother class, there were a few features I felt were important to include:

  • Lightweight and easy to use,
  • Flexible in terms of how the interpolated smoothing is calculated,
  • Reusable for other plug-ins or DSP effects.

One thing I wanted to avoid was having to insert a whole bunch of new code into my plug-in to “check for this”, “update that”, “initialize”, etc.  The more operations that are handled inside the parameter smoother class, the better.  To accomplish this, I decided to have the class hold a reference to the parameter value that’s receiving the smoothing.  Once the object is initialized with this reference, it automatically sees the new value whenever the user changes the parameter, and comparing that to the previous value stored within the class signals the change.  In other words,

Constructor signature

Instantiating the class; first argument is passed by reference.

Since the first argument (initialValue) is passed by reference and stored internally as such, changing paramDepth outside the object automatically changes the referenced value inside as well, since they point to the same address.  Once this connection is made, there is no need to call an external function to check whether the parameter value has changed, or to explicitly set the change.  This minimizes function calls and makes it very easy to use and implement in any project; just instantiate the object as above, and then call one method within the audio’s process loop to fetch the parameter value.

Parameter smoothing itself is typically handled by interpolating from the current value to the target value that has just been set by the user, making the change more gradual to avoid the artifacts that can result from an abrupt jump (other methods might include passing parameter values through a low-pass filter).  Linear interpolation is often good enough for this task; it’s very light on computation and easy to implement.  But I wanted to leave it open to implement different curve shapes that the interpolation can be mapped onto, especially for cases where the smoothing takes place over a longer span of samples, where a specific curve would become more noticeable.  This was easily handled through a callback function (also seen in the above image of the constructor), which allows me to define a custom function that the interpolation follows.
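Since the actual class appears only as screenshots in this post, here is a stripped-down sketch of the general idea with hypothetical names: a reference to the parameter, a curve callback, a smoothing duration, and a stride.  This is only an approximation of the approach described, not the actual Dyner code.

class ParameterSmoother
{
public:
    typedef float (*CurveFn)(float); // maps a normalized position 0..1 onto a curve shape

    ParameterSmoother (float& parameter, int durationSamples, int stride, CurveFn curve)
        : mParam(parameter), mPrev(parameter), mCurrent(parameter), mStart(parameter),
          mDuration(durationSamples), mStride(stride), mPos(durationSamples), mCurve(curve) {}

    // Called once per processing frame in place of reading the parameter directly.
    float getParameterValue ()
    {
        if (mParam != mPrev) {          // the host/user changed the parameter
            mStart = mCurrent;          // start from wherever we are (handles changes mid-ramp)
            mPrev  = mParam;
            mPos   = 0;
        }
        if (mPos < mDuration) {
            if (mPos % mStride == 0) {  // only recompute every 'stride' samples
                float t = static_cast<float>(mPos) / static_cast<float>(mDuration);
                mCurrent = mStart + (mParam - mStart) * mCurve(t);
            }
            ++mPos;
        } else {
            mCurrent = mParam;          // ramp finished; track the parameter directly
        }
        return mCurrent;
    }

private:
    float&  mParam;    // reference to the plug-in's parameter value
    float   mPrev;     // last seen value, used to detect changes
    float   mCurrent;  // smoothed value handed to the DSP
    float   mStart;    // value the current ramp started from
    int     mDuration, mStride, mPos;
    CurveFn mCurve;
};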

Callback function signature.

For example, to define an exponential curve, I could write the following function, and then pass it as an argument (which is basically a function pointer) to the constructor:

Exponential function (x^1.84).

Initializing with function pointer.
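With the hypothetical class sketched earlier, the exponential curve and its use might look something like this (the names and the 128-sample duration are illustrative; the screenshots show the actual code):

#include <cmath>

// Hypothetical curve matching the x^1.84 exponential shown above.
static float expCurve (float x) { return powf(x, 1.84f); }

// Smooth paramDepth over 128 samples, recalculating every 4 samples.
ParameterSmoother depthSmoother(paramDepth, 128, 4, expCurve);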

Here we can see the difference graphically:

Smoothing with linear interpolation.

Smoothing with exponential interpolation.

Being able to specify the number of samples over which the parameter smoothing takes place is also a fairly obvious customization that I included.  This way, I can adapt the smoothing to be equal to the block size that the host passes to my plug-in, for example.  Another option I wanted to include was the ability to calculate new interpolated values at a different rate than the sampling rate.  In other words, instead of calculating new values every sample in the processing loop, I can have it calculate every 2, 4, 12, 48, etc. samples by setting the stride variable that you might have noticed above.  It’s rarely necessary to calculate a new value every sample, and in fact, the smoothing in Dyner happens every 4 samples.  This could be increased as well if a custom callback is used that is of greater complexity than simple linear interpolation (e.g. mapping to an S-curve).  The resulting interpolation looks like this with a stride value of 24:

Linear smoothing with a stride of 24 samples.

Lastly, you might have noticed that the above graph (apart from the staircases) looks different from the preceding ones.  This illustrates a critical feature that needed attention: allowing the parameter smoothing to change when a new value is set while smoothing is already taking place due to a previous change.  In such a case, the current state needs to be saved and a new ramp begins towards the new target parameter value.  I did this by defining four states that I keep track of using bitwise flags.

State-handling for the parameter smoother.

Updating the state flags using bitwise operations.

As can be seen above, comparison between the parameter reference value and its previous setting determines whether a new change has occurred.  mPos is the position index along the interpolation line/curve, so when it is non-zero it means smoothing is currently processing due to a previous value change.
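As a rough illustration of that check (the flag and variable names here are hypothetical; the actual code is in the screenshot above):

// Hypothetical flags; the real names in Dyner may differ.
enum {
    kSmoothIdle      = 0,
    kSmoothChanged   = 1 << 0, // the parameter reference no longer matches the stored previous value
    kSmoothRamping   = 1 << 1, // mPos != 0: an interpolation ramp is already in progress
    kSmoothRestarted = 1 << 2  // a change arrived mid-ramp: save the current state and restart
};

unsigned int updateState (float paramRef, float prevValue, int mPos)
{
    unsigned int state = kSmoothIdle;
    if (paramRef != prevValue) state |= kSmoothChanged;  // a new value was set by the user
    if (mPos != 0)             state |= kSmoothRamping;  // smoothing from a previous change is running
    if ((state & kSmoothChanged) && (state & kSmoothRamping))
        state |= kSmoothRestarted;                       // restart toward the new target value
    return state;
}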

It’s also worth mentioning that the actual method that retrieves the smoothed parameter value (getParameterValue) needs to be called every frame of processing so it can properly update itself.  It replaces any direct use of the original parameter for processing calculations.

Example of using the class method instead of the actual parameter inside the processing loop.

New Plug-In: Dyner (ring modulator)

Finally I have gotten the chance to polish up and release a new plug-in.  This one is a ring modulator called Dyner (ring modulation is basically a type of heterodyning).  I’ve spoken about this effect in a previous blog post, but just to quickly summarize, ring modulation is accomplished by multiplying a modulator signal (usually an oscillator of some kind) with the carrier signal (the audio).  By itself it’s a very easy and straightforward operation, but unless you’re only modulating with a sine wave, aliasing becomes a big issue to deal with.  Many free ring modulators that I’ve tried do not tackle this problem in any way, so I was determined to do so in Dyner.

This meant writing my own resampler to handle the oversampling of the operation.  As I outlined in previous posts on resampling, it was no mean feat making sure the process was accurate and efficient.  However, oversampling alone, without addressing the oscillators themselves, would require a very strong anti-aliasing filter, severely impacting the CPU load of the effect.  The next task was then to create fully band-limited oscillators.  In this plug-in I’m using the wavetable approach, with each oscillator using 10 different tables calculated with a unique number of harmonics appropriate for its octave range so they don’t alias.  Now that the oscillators themselves are guaranteed to obey the sampling theorem, oversampling by 2X with a light filter is so far giving good results.
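As an illustration of the table-selection idea (purely a sketch; the 20 Hz base, the ten-octave split, and the names are assumptions, not Dyner’s actual code):

#include <algorithm>
#include <cmath>

// Hypothetical helper: pick one of ten per-octave wavetables for a given
// fundamental frequency, so that only harmonics below Nyquist are present.
int tableIndexForFrequency (float freq, float lowestFundamental = 20.f)
{
    int octave = static_cast<int>(std::floor(std::log2(freq / lowestFundamental)));
    return std::min(std::max(octave, 0), 9);
}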

The plug-in (VST 2.4, for both Mac and Windows, 32-bit & 64-bit) can be downloaded for free in Downloads.

Dyner GUI

Features:

  • Five oscillators (sine, sawtooth up, sawtooth down, triangle, square)
  • Oscillators are fully band-limited and FFT generated for optimum performance
  • 2X oversampling to avoid aliasing effects
  • Smooth knob controls (mouse/mouse wheel, type in value manually) capable of fine granularity for parameter settings

Requires:

  • VST 2.4 capable host
  • SSE3 processor support
  • Mac: OS X 10.6.8+
  • Windows: Only tested on Windows 7 so far, but should be compatible with Vista/XP at least

The Different “Sides” of Convolution

We all know how much convolution is used in DSP, and what a significant part it has in making many effects and analysis techniques a reality.  Often we hear about FFT, or fast convolution, but in actuality these algorithms are only faster than straight convolution with a kernel length (of the impulse response) greater than around 60 – 64 or so.  Anything shorter than that is best handled with normal convolution, which can be implemented using either an input side or output side algorithm.

The input side algorithm is normally the one that we learn first (at least it was for me, and judging from many texts and books on it, it seems to be common).  However, the output side algorithm is perhaps slightly easier to code, and we’ll see why later on.  For this post I’m going to evaluate the effects each algorithm has on the resulting audio signal because each one could have an advantage in some situations over the other.  I’ll also share a little trick that will speed up the convolution process for certain types of filters.

The output side algorithm performs the convolution, as one might expect, from the viewpoint of the output (or the result) of the operation, while the input side algorithm implements convolution from the viewpoint of the input signal.  To look at it another way, recall that convolving an input signal with an impulse response results in an output signal that is the length of the input + the length of the impulse response – 1.  With the input side algorithm, this “extra” bit of length is appended to the resulting signal; this view is probably the easiest one to grasp theoretically.  The output side algorithm looks at what points from the input we need in order to get our resulting signal.

The reason why the output side method feels more natural to code is that we often write in terms of the output or result: y(n) = ____ — the output is equal to some kind of expression, for example.  Approaching convolution this way requires input samples on the negative side (x[-2], x[-1], etc.), which of course don’t exist.  To get around that little problem, one common way is to implement a delay line that contains all zeroes at first, then shift in the input signal as we go, doing the calculations with the impulse response and the delay line instead of directly with the input.

I’ve been busy redesigning and overhauling my DSP filter class, which has entailed some experimentation with these two convolution methods on FIR filters.  Here we can see visually the difference between them.  First up is the output side convolution of a windowed-sinc FIR filter using a Kaiser window of order 162 on a simple triangle wave (using test oscillators is a good way to visually see what’s happening).

Output side convolution on a triangle wave.

I purposely made the filter kernel (impulse response) fairly long so it would illustrate the delay and shift that’s happening at the start of the signal due to the output side algorithm.  Compare that with the image of the input side algorithm using exactly the same filter and order.

Input side convolution on a triangle wave.

Huge difference.  It’s clearly audible as well in this test signal (samples below).  With the output side algorithm there is a clear popping sound as the audio abruptly starts in contrast to the much smoother beginning of the test signal filtered with the input side algorithm.

Here you can hear the difference (to hear it clearly you may have to download the files and listen in an audio editor to avoid the glitchy start of web browser plug-ins):

Triangle wave filtered with output side

Triangle wave filtered with input side

Now to return to the point about the output side method being perhaps a little more natural to code.  The little wrinkle we face in implementing the input side algorithm is that with block processing (which is so common in DSP), how do we account for the longer output as a result of the convolution?  In many cases we don’t have control over the size of the audio buffers we’re given or have to work with.  The solution is to use the overlap-add method.

It’s really not more difficult than the output side algorithm.  All we have to do is calculate the convolution fully (into our own internal buffer if we have to), save the extra length at the end (which will be equal to the impulse response length minus one), then add that to the start of the next block.  Rinse and repeat.  Here is C++ code that implements both of these convolution methods.

Output side convolution process.

Input side convolution process.
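Since that code appears only as images, here is a minimal sketch of both methods in the same spirit, assuming an odd-length, symmetric (linear phase) kernel and caller-managed delay/overlap buffers; it also uses the half-kernel shortcut described next:

#include <cassert>
#include <cstddef>
#include <vector>

// Output side: a delay line (initialized to zeroes, persistent across blocks)
// is shifted each sample, and each output value is computed from it.
void convolveOutputSide (const float* input, float* output, size_t numSamples,
                         const std::vector<float>& kernel, std::vector<float>& delay)
{
    assert(kernel.size() % 2 == 1 && delay.size() == kernel.size());
    const size_t mid = kernel.size() / 2;

    for (size_t i = 0; i < numSamples; ++i) {
        for (size_t j = delay.size() - 1; j > 0; --j)  // shift the delay line
            delay[j] = delay[j - 1];
        delay[0] = input[i];

        float acc = 0.f;
        for (size_t j = 0; j < mid; ++j)               // symmetric taps share a coefficient
            acc += kernel[j] * (delay[j] + delay[kernel.size() - 1 - j]);
        acc += kernel[mid] * delay[mid];               // midpoint value
        output[i] = acc;
    }
}

// Input side: each input sample scatters into kernel.size() output positions;
// the tail that spills past the block is saved and added to the next block
// (overlap-add). 'overlap' holds kernel.size() - 1 zeroes initially, and
// numSamples is assumed to be at least kernel.size() - 1.
void convolveInputSide (const float* input, float* output, size_t numSamples,
                        const std::vector<float>& kernel, std::vector<float>& overlap)
{
    assert(kernel.size() % 2 == 1 && overlap.size() == kernel.size() - 1);
    const size_t mid = kernel.size() / 2;
    std::vector<float> scratch(numSamples + kernel.size() - 1, 0.f);

    for (size_t i = 0; i < numSamples; ++i) {
        for (size_t j = 0; j < mid; ++j) {             // symmetric taps share a coefficient
            scratch[i + j] += input[i] * kernel[j];
            scratch[i + kernel.size() - 1 - j] += input[i] * kernel[j];
        }
        scratch[i + mid] += input[i] * kernel[mid];    // midpoint value
    }

    for (size_t i = 0; i < numSamples; ++i)            // add the tail saved from the last block
        output[i] = scratch[i] + (i < overlap.size() ? overlap[i] : 0.f);

    for (size_t i = 0; i < overlap.size(); ++i)        // save the new tail for the next block
        overlap[i] = scratch[numSamples + i];
}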

Don’t worry, I didn’t forget to share the trick I mentioned at the beginning.  Many FIR filters have linear phase response, which means that their kernels are symmetrical.  That provides us a great opportunity to eliminate extra calculations that aren’t needed.  So notice in the above code each ‘j’ loop (the kernel or impulse response loop) only traverses half the kernel length, as the value of input[i] * mKernel[j] is the same as input[i] * mKernel[mKernelLength-j-1].  At the end of the kernel’s loop we calculate the midpoint value.

Again, the use of symmetry is perhaps easier to see in the output side algorithm, because we naturally gravitate towards expressions that result in a single output value.  If you work a small convolution problem out on paper, however, it will help you see the input side algorithm and why it works.

Audio Resampling: Part 2

This post has been edited to clarify some of the details of implementing the polyphase resampler [May 14, 2013].

Time to finish up this look at resampling.  In Part 1 we introduced the need for resampling to avoid aliasing in signals, and its implementation with windowed sinc FIR filters.  This is a costly operation, however, especially for real-time processing.  Let’s consider the case of upsampling-downsampling a 1 second audio signal at a sampling rate of 44.1 kHz with a resampling factor of 4X.  Going about this with the brute-force method that we saw in Part 1 would mean first upsampling the signal by 4X.  The buffer we have to filter has now grown to 4 x 44100 = 176,400 samples, which will obviously take roughly 4 times as long to compute.  And not only once, but twice, because the decimation filter also operates at this higher sample rate.  The way around this is to use polyphase interpolating filters.

A polyphase filter is a set of subfilters where the filter kernel has been split up into a matrix, with each row representing a subfilter.  The input samples are passed into each subfilter, and the subfilter outputs are then summed together to produce the output.  For example, given the impulse response of the filter

H(z) = h(0) + h(1)z^-1 + h(2)z^-2 + h(3)z^-3 + … + h(N-1)z^-(N-1)

we can separate it into two subfilters, E0 and E1

E0(z) = h(0) + h(2)z^-1 + h(4)z^-2 + …

E1(z) = h(1) + h(3)z^-1 + h(5)z^-2 + …

where E0 contains the even-numbered kernel coefficients and E1 contains the odd ones.  We can then express H(z) as

H(z) = E0(z^2) + z^-1 E1(z^2)

This can of course be extended for any number of subfilters.  In fact, the number of subfilters in the polyphase interpolating/decimating filters is exactly the same as the resampling factor.  So if we’re upsampling-downsampling by a factor of 4X, we use 4 subfilters.  However, we still have the problem of filtering on 4 times the number of samples as a result of the upsampling.  Here is where the Noble Identity comes in.  It states that for multirate signal processing, filtering before upsampling and filtering after downsampling is equivalent to filtering after upsampling and before downsampling.

Noble Identities for upsampling and downsampling.

When you think about it, it makes sense.  Recall that upsampling involves zero-insertion between the existing samples.  These 0 values, when passed through the filter, simply evaluate to 0 and have no effect on the resulting signal, so they are wasted calculations.  We are now able to save a lot of computational expense by filtering prior to upsampling the signal, and filtering after downsampling.  In other words, the actual filtering takes place at the original sample rate.  The way this works with the polyphase filter is quite clever: through a commutator switch.

Let’s take the case of decimation first because it’s the easier one to understand.  Before the signal’s samples enter the polyphase filter, a commutator selects every Mth sample (M is the decimation factor) to pass into the filter while discarding the rest.  The results of each subfilter are summed together to produce the signal, back at its original sample rate and without aliasing components.  This is illustrated in the following diagram:

Polyphase decimator with a commutator switch that selects the input.

Interpolation works much the same, but obviously on the other end of the filter.  Each input sample from the signal passes through the polyphase filter, but instead of summing together the subfilters, a commutator switch selects the outputs of the subfilters that make up the resulting upsampled signal.  Think of it this way: one sample passes into each of the subfilters, which then results in L outputs (L being the interpolation factor and the number of subfilters).  The following diagram shows this:

Polyphase interpolating filter with a commutator switch that selects the output.

We now have a much more efficient resampling filter.  There are other methods as well to implement a resampling filter, including the Fast Fourier Transform, which is a fast and efficient way of doing convolution and would be a preferred method of implementing FIR filters.  At lower orders, however, straight convolution is just as fast as FFT convolution (if not slightly faster at orders less than 60 or so); the huge gain in efficiency really only occurs with a kernel length greater than 80 – 100 or so.

Before concluding, let’s look at some C++ code fragments that implement these two polyphase structures.  Previously I had done all the work inside a single method that contained all the for loops to implement the convolution.  Since we’re dealing with a polyphase structure, it naturally follows that the code should be refactored into smaller chunks, since each filter branch can be thought of as an individual filter.

First, given the prototype filter’s kernel, we break it up into the subfilter branches.  The number of subfilters (branches) we need is simply equal to the resampling factor.  Each filter branch will then have a length equal to the prototype filter’s kernel length divided by the factor, then +1 to take care of rounding error.  i.e.

branch order = (prototype filter kernel length / factor) + 1

The total order of the polyphase structure will then be equal to the branch order x number of branches, which will be larger than the prototype kernel, but any extra elements should be initialized to 0 so they won’t affect the outcome.
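Splitting the prototype kernel into zero-padded branches along these lines could be sketched like this (names are illustrative, not the code shown below):

#include <vector>

// Split a prototype filter kernel into 'factor' polyphase branches.
// Each branch has length (kernelLength / factor) + 1; unused slots stay 0.
std::vector<std::vector<float>> makePolyphaseBranches (const std::vector<float>& kernel, int factor)
{
    const int branchOrder = static_cast<int>(kernel.size()) / factor + 1;
    std::vector<std::vector<float>> branches(factor, std::vector<float>(branchOrder, 0.f));

    for (size_t i = 0; i < kernel.size(); ++i)
        branches[i % factor][i / factor] = kernel[i];

    return branches;
}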

The delay line, z, for the interpolator will have a length equal to the branch order.  Again, each branch can be thought of as a separate filter.  First, here is the decimating resampling code:


Example of decimating resampler code.

As can be seen, calculating each polyphase branch is handled by a separate object with its own method for calculating the subfilter (processDownsample).  We index the input signal with the variable M, which advances at the rate of the resampling factor.  The gain adjust can be more or less ignored depending on how the resampling is implemented.  In my case, I have precalculated the prototype filter kernels to greatly improve efficiency.  However, the interpolation process decreases the level of the signal by a factor equal to the resampling factor.  In other words, if our factor is 3X, we need to amplify the interpolated signal by a factor of 3.  I’ve done this by amplifying the prototype filter kernel so I don’t need to adjust the gain during interpolation.  But this means I need to compensate for that in decimation by reducing the level of the signal by the same amount.

Here is the interpolator code:

Example of interpolation resampler code.

As we can see, it’s quite similar to the decimation code, except that the output selector requires an additional for loop to distribute the results of the polyphase branches.  Similarly though, it uses the same polyphase filter object to calculate each filter branch, using the delay line as input instead of the input signal directly.  Here is the code for the polyphase branches:

Code implementing the polyphase branches.

Again, quite similar, but with a few important differences.  The decimation/downsampling MACs the input sample by each kernel value whereas interpolation/upsampling MACs the delay line with the branch kernel.

Hopefully this clears up a bit of confusion regarding the implementation of the polyphase filter.  Though this method divides the work of the resampling among several smaller objects, it is much easier to understand and maintain than before.

Resampling, as we have seen, is not a cheap operation, especially if a strong filter is required.  However, noticeable aliasing will render any audio unusable, and once it’s in the signal it cannot be removed.  Probably the best way to avoid aliasing is to prevent it in the first place by using band-limited oscillators or other methods to keep all frequencies below the Nyquist limit, but this isn’t always possible as I pointed out in Part 1 with ring modulation, distortion effects, etc.  There is really no shortage of challenges to deal with in digital audio!

Audio Resampling: Part 1

Resampling in digital audio has two main uses.  One is to convert audio into a sampling rate needed by a particular system or engine (e.g. converting 48kHz audio to the 44.1kHz required by CDs).  The second is to avoid aliasing during signal processing by raising the Nyquist limit.  I will be discussing the latter.

Lately I’ve been very busy working on improving and enhancing the sound of ring modulation for a fairly basic plug-in being developed by AlgoRhythm Audio (coming soon).  I say basic because, as far as ring modulation goes, there are few DSP effects that are simpler in theory and in execution.  Simply take some input signal, multiply it by a carrier signal (usually some kind of oscillator like a sine wave), and we have ring modulation.  The problem that arises, however, and how this connects with resampling, is that this creates new frequencies in the resulting output that were not present in either signal prior to processing.  These new frequencies could very well violate the Nyquist limit of the current sampling rate during processing, and that leads us to resampling as a way to clamp down on aliasing frequencies that can be introduced as a result.

Aliasing is an interesting phenomenon that occurs in digital audio, and in every sense of the word is an undesired noise that we need to make sure does not pollute our audio.  There are many resources around that go into more detail on aliasing, but I will give a brief overview of it with some audio and visual samples.

Aliasing occurs when there are frequencies present in a signal that are greater than the Nyquist limit (half of the sampling rate).  What happens in such a case is that the sampling rate is not high enough to properly capture (sample) the high frequency of the signal, and so the frequency “folds over” and creates aliases that are mirror images of the original frequency.  Here, for example, is a square wave at 4000Hz created using 20 harmonics at a sampling rate of 44.1kHz (keep in mind that 4000Hz is the fundamental frequency, and that square waves contain many additional frequencies above that depending on how many harmonics were used to create it, so in this case Nyquist is still being violated):

4000Hz square wave with aliasing

Notice the low tone below the actual 4000Hz frequency.  Here is the resulting waveform of this square wave that shows us visually that we’re not sampling fast enough to accurately reproduce the waveform.  Notice the inconsistencies in the waveform.

Waveform of the aliased 4000Hz square wave.

Now, going to the extreme a bit, here is the same square wave sampled at a rate of 192kHz.

4000Hz square wave with no aliasing

It’s a pure 4000Hz square wave tone.  Examining the waveform of this square wave shows us that the sampling rate was more than adequate to reproduce this signal digitally:

Waveform of the 4000Hz square wave with no aliasing.

Not all signal processing effects are susceptible to aliasing, and certainly not to the same degree.  Because ring modulation produces additional inharmonic frequencies, it is a prime example of a process that is easily affected by aliasing.  Others include distortion and various other kinds of modulation techniques (especially when taken to the extreme).  However, ring modulation with a sine wave is generally safe as long as the frequency of the sine wave is kept  low enough because sine waves have no harmonics to them, only the fundamental.  Introducing other wavetypes into this process, however, can quite easily bring about aliasing.

Here is an example of ring modulation with a sawtooth wave sweeping up from about 200Hz to 5000Hz.  As it glides up, you will be able to hear the aliasing kicking in at around 0:16 or so.  The example with no aliasing has been upsampled by 3X before processing, then downsampled by 3X back to its original rate.

Ring modulation with a sawtooth sweep, with aliasing

Ring modulation with sawtooth sweep, with no aliasing

So how does upsampling and downsampling work?  In theory, and even to some extent in practice, it’s very straightforward.  The issue, as we will see, is in making it efficient and fast.  In DSP we’re always concerning ourselves with speedy execution times to avoid latency or audio dropouts or running out of memory, etc.

To upsample, we insert L - 1 zero-valued samples after each original sample, where L is our upsampling factor.  (i.e. upsampling by 3X, [1, 2, 3, 4] becomes [1,0,0,2,0,0,3,0,0…])  However, this introduces unwanted spectral images into our audio, so we need to interpolate these values so that they “join” the sample values of the original waveform.  This is accomplished by using an interpolating low-pass filter.  This entire process is known as interpolation.

Downsampling is much the same.  We keep every M-th sample (M being our downsampling factor) and discard the rest.  (i.e. downsampling by 2X, [1,2,3,4,5] becomes [1,3,5…])  This process can introduce aliasing, however, so we need to make sure the Nyquist limit at the new sampling rate is adhered to by low-pass filtering, with a cutoff frequency at the new Nyquist limit, before discarding samples.  This entire process is known as decimation.
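As a bare-bones illustration of the two operations before any filtering, in line with the examples above:

#include <vector>

// Upsample by L: insert L - 1 zeros after every input sample (zero-stuffing).
std::vector<float> zeroStuff (const std::vector<float>& in, int L)
{
    std::vector<float> out(in.size() * L, 0.f);
    for (size_t i = 0; i < in.size(); ++i)
        out[i * L] = in[i];
    return out;
}

// Downsample by M: keep every Mth sample, discard the rest.
std::vector<float> decimate (const std::vector<float>& in, int M)
{
    std::vector<float> out;
    for (size_t i = 0; i < in.size(); i += M)
        out.push_back(in[i]);
    return out;
}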

Fairly straightforward.  Proceeding with this method, however, would be known as brute force — generally not a good way to go.  The reason why becomes clearer when we consider what kind of low-pass filter we need for this operation.  The ideal filter would be one that would brick-wall attenuate all frequencies higher than Nyquist and leave everything else untouched (thus preserving all the frequencies and tonal content of our original audio).  This is, alas, impossible, as it would require an infinitely long filter kernel.  The function that would implement this ideal filter is the rectangular function.

The rectangular function.

By taking the Fourier transform of the rectangular function we end up with the sinc function, which is given by:

y(x) = sin (x) / x, which becomes y(x) = sin (πx) / πx for signal processing.

Graph of the sinc function, sin (πx) / πx

The sinc function trails on to infinity in both directions, which can be seen in the graph above, so we need to enforce bounds around it by applying a window function.  Windowing is a method of designing FIR filters by essentially “surrounding” a function (in our case the sinc) with the window, which enforces bounds so that we can properly derive a filter kernel for use in calculations.  The rectangular function shown above is a type of window, but as I mentioned, its infinite slope is a deal-breaker in audio.

The Blackman window, given by the function

w(i) = 0.42 – 0.5 cos(2πi / M) + 0.08 cos(4πi / M),

where M is the filter order (the kernel contains M + 1 points), is a good choice for resampling because it offers a good stop-band attenuation of -74dB with good rolloff.  Putting this together with the sinc function, we can derive the filter kernel with the following formula:

h(i) = K [ sin( 2πfc(i – M/2) ) / ( π(i – M/2) ) ] [ 0.42 – 0.5 cos(2πi / M) + 0.08 cos(4πi / M) ]

Windowed-sinc kernel formula*

where fc is the normalized cutoff frequency.  When i = M/2, to avoid a divide by zero, h(i) = 2fc.  K is a constant value used to achieve unity gain at zero frequency and can be ignored while calculating the kernel coefficients.  After all coefficients have been calculated, K can be found by summing together all the coefficients, h(i), and then dividing each by the resulting sum.

* Source: Smith, Steven W., “The Scientist and Engineer’s Guide to Digital Signal Processing”, Chapter 16.
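Putting the window and the sinc together, computing the kernel coefficients and the normalization constant K might be sketched as follows (following the description above; M is assumed to be even so the center tap falls on i = M/2):

#include <cmath>
#include <vector>

// Sketch: Blackman-windowed sinc kernel of length M + 1 with normalized cutoff
// fc (a fraction of the sampling rate, 0 < fc < 0.5). The coefficients are
// normalized so they sum to 1 (unity gain at DC), which takes care of K.
std::vector<double> windowedSincKernel (int M, double fc)
{
    const double pi = 3.14159265358979323846;
    std::vector<double> h(M + 1);

    for (int i = 0; i <= M; ++i) {
        double blackman = 0.42 - 0.5 * std::cos(2.0 * pi * i / M)
                               + 0.08 * std::cos(4.0 * pi * i / M);
        if (i == M / 2)
            h[i] = 2.0 * fc * blackman;  // avoid the divide by zero at the center
        else
            h[i] = (std::sin(2.0 * pi * fc * (i - M / 2)) / (pi * (i - M / 2))) * blackman;
    }

    double sum = 0.0;
    for (size_t i = 0; i < h.size(); ++i) sum += h[i];
    for (size_t i = 0; i < h.size(); ++i) h[i] /= sum;  // apply K: unity gain at zero frequency

    return h;
}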

Now that we have the filter design, let’s consider the properties of the FIR filter and compare them to IIR filter designs.  IIR filters give us better performance and attenuation at lower orders, meaning that they execute faster and perform better with fewer calculations than FIR filters.  However, IIR filters are still not powerful enough, even at slightly higher orders, to give us the performance we need for resampling, and trying to push IIR filters into very high orders can make them unstable and/or susceptible to quantization error due to the nature of recursion.  IIR filters also do not offer linear phase response.  FIR filters are the better choice for these reasons, but the unfortunate drawback is that they execute slowly due to being implemented by convolution.  In addition, they need to be pushed to high orders to give us the performance needed for attenuating aliasing frequencies.

However, the order of the interpolating low-pass filter can be negotiated based on the frequency content of the audio signal(s) involved.  If the audio is sufficiently oversampled, it will not contain enough frequencies near Nyquist, and as such a lower order filter can be used with a gentler rolloff without adversely affecting the audio and attenuating actual frequencies in the signal.  There are plenty of cases where we just don’t know the frequency content of the audio signals involved in processing, however, so a strong filter may be needed in these cases.  Here is a graph of a 264-order window-sinc filter (in other words, a filter kernel of length 265 including the sample x(0)):

264-order window-sinc low-pass filter frequency response (cutoff frequency at 10kHz, resulting in a transition band of 882Hz)

With this in consideration, it can be easy to see that convolving a signal with a 264-order FIR filter is computationally costly for real-time processing.  There are a number of ways to improve upon this when it comes to resampling.  One is using the FFT to apply the filter.  Another interesting solution is to combine the upsampling/downsampling process into the filter itself, which can further be optimized by turning it into a polyphase filter.

The theory and implementation of a polyphase filter is a fairly long and involved topic on its own, so that will be forthcoming in Part 2, where we look at how to implement resampling efficiently.