DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding.  Among its many features, it supports 2D and 3D graphics through OpenGL.  As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements.  We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days.  To illustrate, let’s take an example of a magic spell; both its visual and aural element will have randomness, but will these sync up together?  In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic?  If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element is accurately reflective of the aural part.  It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this?  Analysing a sound in the time domain (represented by the wave form) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum.  The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides.  Instead of displaying amplitude (y axis) vs. time (x axis) in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis).  This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but its due to the FFT (Fast Fourier Transform) and dramatically increased performance with floating point operations in modern computers that using the DFT for analysis in real time has become practical.  The DFT is calculated using convolution, which we know is very slow.  The FFT uses a highly optimized algorithm (the Cooley-Turkey is the most widely used, but there are several others) to calculate the DFT of a signal.  Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.

The DFT works by breaking a signal down into its individual components.  This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal.  Each sine wave represents a single frequency component; its amplitude is the strength of that frequency.  The DFT results in a complex signal with both cosine and sine wave components.  This information isn’t terribly useful to us in this form.  From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form.  In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value).  The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( x2 + y2 )

This value, plotted against the frequency (which would be the index of the complex number corresponding to the index of the real-valued array) gives us the spectrum of the signal.  In other words, if we take the FFT of an N-point signal, we get N pairs of complex values as output.  Each N in the complex signal represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N being the sampling rate.  The corresponding magnitude of the complex number gives us the amplitude of that frequency component in the signal.  A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2).  Here is a sample plot of a piano chord I recorded, using 5 512-sample buffers from the audio:

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (assuming a sampling rate of 48kHz, this would equate to approximately 0 – 1920 Hz).  Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.

There are a number of parameters in generating particles that I considered.  The number of particles, how they’re generated/constructed, the radius, and their color.  Particles are only generated during transients, and so I check a previous RMS value against the new one to determine if particles should be constructed.  This method isn’t foolproof, but with my tests it’s been working quite well for my purposes.  The number of particles generated is also related to the RMS of the signal — the louder it is, the more particles are made.  This is evenly divided across the whole spectrum, so that at least a few particles from each frequency range is present.  The radius of each particle is determined by the frequency component (the N value of the frequency plot above).  The lower the frequency, the larger the particle.  Finally, the color, or more specifically the brightness of the particle, is determined by the frequency amplitude.  Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using Portaudio for real-time playback, so all the audio processing happens in a callback function):

```int audioCallback (const void* input, void* output, unsigned long samples, const PaStreamCallbackTimeInfo *timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
const float* in = (const float*)input;
float* out = (float*)output;
PaAudioData* data = (PaAudioData*)userData;

memset(data->buffer, 0, sizeof(double) * SAMPLE_BLOCK * 2);
float rms = CFDsp::cf_SSE_calcRms(in, samples);

for (int i = 0; i < samples; ++i) {
data->buffer[i*2] = *in * hammingWindow(i, SAMPLE_BLOCK);

*out++ = *in++;
*out++ = *in++;
}

// Calculate DFT using FFT.
fft(data->buffer, SAMPLE_BLOCK);

data->amp_prev = data->amp_now;
data->amp_now = rms;

return paContinue;
}
```

Since the FFT function I’m using requires interleaved data, it needs to be zeroed out first and then each real-valued sample stored in the even-numbered indices.  The RMS value is acquired using a function I wrote using SSE intrinsics for optimal speed.

```float cf_SSE_calcRms (const float* inBuffer, const int length)
{
assert(length % 4 == 0);

__m128* mbuf = (__m128*)inBuffer;
__m128 m1, m2;
__m128 macc = _mm_set1_ps(0.f);
__m128 _1oN = _mm_set1_ps(1.f / static_cast<float>(length));

for (int i = 0; i < length; i+=4) {
m1 = _mm_mul_ps(*mbuf, *mbuf);
m2 = _mm_mul_ps(m1, _1oN);
mbuf++;
}

__declspec(align(16)) float result[4];
_mm_store_ps(result, macc);

return sqrtf(result[0] + result[1] + result[2] + result[3]);
}
```

Inside the `for` loop, a Hamming window is applied to the signal prior to sending it to the FFT to bound it along the edges, minimizing or eliminating frequency leakage in the spectrum.  In the application’s update method, I test for the presence of a transient in the current buffer.  If one exists, particles are added.

```if (audioData.isTransient())
{
Vec2i randPos;

// 50 pixel border.
randPos.x = Rand::randInt(50, getWindowWidth() - 50);
randPos.y = Rand::randInt(50, getWindowHeight() - 50);
}
```

The adding of particles is handled by the `ParticleController` class:

```void ParticleController::addParticles (const Vec2i& loc, const PaAudioData& data)
{
int stride = static_cast<int>((64.f - powf((64.f * data->amp_now), 1.32f)) + 0.5f;
stride = (stride < 1 ? 1 : stride);

// 512 == sample block size; represents the horizontal frequency axis in the spectral analysis.
for (int i = 0; i < 512; i+=stride) {
Vec2f randVec = Rand::randVec2f() * 6.5f;
mParticles.push_back(Particle(loc + randVec, i/1024.f, data.mag(i) * 18.f));
}
}
```

First I calculate a stride value that controls the speed at which we execute the following for loop (this evenly divides the number of particles across the x axis).  As you can see, this is determined by the volume (calculated as RMS) of the current audio buffer.  When creating each individual particle, the i/1024 argument determines the radius — as I mentioned this is based on the frequency value along the x axis.  To constrain along reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )

The color of each particle is first determined using HSV.  The hue is randomly chosen, but the saturation and value (brightness) is based on the data.mag(i) * 18 argument — i.e. the amplitude of the frequency component.  This is then converted into an RGB value to set the color of the individual particle.

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and RMS to usable values in the particle engine.  That’s the reason for several of the “magic numbers” in the function above (1024 is not one of them; it’s double our sample length, which I said previously we only want from 0 – 0.5, and since the data stored is in interleaved format, the buffer is actually 1024 samples long).  Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further.  I feel that this is a good first step in this area, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):

The Different “Sides” of Convolution

We all know how much convolution is used in DSP, and what a significant part it has in making many effects and analysis techniques a reality.  Often we hear about FFT, or fast convolution, but in actuality these algorithms are only faster than straight convolution with a kernel length (of the impulse response) greater than around 60 – 64 or so.  Anything shorter than that is best handled with normal convolution, which can be implemented using either an input side or output side algorithm.

The input side algorithm is normally the one that we learn first (at least it was for me, and judging from many texts and books on it, it seems to be common).  However, the output side algorithm is perhaps slightly easier to code, and we’ll see why later on.  For this post I’m going to evaluate the effects each algorithm has on the resulting audio signal because each one could have an advantage in some situations over the other.  I’ll also share a little trick that will speed up the convolution process for certain types of filters.

The output side algorithm performs the convolution, as one might expect, from the viewpoint of the output (or the result) of the operation, while the input side algorithm implements convolution from the viewpoint of the input signal.  To look at it another way, recall that convolving an input signal with an impulse response results in an output signal that is the length of the input + the length of the impulse response – 1.  With the input side algorithm, this “extra” bit of length is appended to the resulting signal, and is probably the easiest one to theoretically grasp.  The output side algorithm looks at what points from the input we need in order to get our resulting signal.

The reason why the output side method feels more natural to code is that we often write in terms of the output or result: y(n) = ____ — the output is equal to some kind of expression, for example.  Approaching convolution this way requires input samples on the negative side (x[-2], x[-1], etc.), which of course don’t exist.  To get around that little problem, one common way is to implement a delay line that contains all zeroes at first, then shifting in the input signal as we go, doing the calculating with the impulse response and the delay line instead of direclty with the input.

I’ve been busy redesigning and overhauling my DSP filter class, which has entailed some experimentation with these two convolution methods on FIR filters.  Here we can see visually the difference between them.  First up is the output side convolution of a windowed-sinc FIR filter using a Kaiser window of order 162 on a simple triangle wave (using test oscillators is a good way to visually see what’s happening).

Output side convolution on a triangle wave.

I purposely made the filter kernel (impulse response) fairly long so it would illustrate the delay and shift that’s happening at the start of the signal due to the output side algorithm.  Compare that with the image of the input side algorithm using exactly the same filter and order.

Input side convolution on a triangle wave.

Huge difference.  It’s clearly audible as well in this test signal (samples below).  With the output side algorithm there is a clear popping sound as the audio abruptly starts in contrast to the much smoother beginning of the test signal filtered with the input side algorithm.

Here you can hear the difference (to hear it clearly you may have to download the files and listen in an audio editor to avoid the glitchy start of web browser plug-ins):

Triangle wave filtered with output side

Triangle wave filtered with input side

Now to return to the point about the output side method being perhaps a little more natural to code.  The little wrinkle we face in implementing the input side algorithm is that with block processing (which is so common in DSP), how do we account for the longer output as a result of the convolution?  In many cases we don’t have control over the size of the audio buffers we’re given or have to work with.  The solution is to use the overlap-add method.

It’s really not more difficult than the output side algorithm.  All we have to do is calculate the convolution fully (into our own internal buffer if we have to), save the extra length at the end (which will be equal to the impulse response length), then add that to the start of the next block.  Rinse and repeat.  Here is C++ code that implements both of these convolution methods.

Output side convolution process.

Input side convolution process.

Don’t worry, I didn’t forget to share the trick I mentioned at the beginning.  Many FIR filters have linear phase response, which means that their kernels are symmetrical.  That provides us a great opportunity to eliminate extra calculations that aren’t needed.  So notice in the above code each ‘j’ loop (the kernel or impulse response loop) only traverses half the kernel length, as the value of input[i] * mKernel[j] is the same as input[i] * mKernel[mKernelLength-j-1].  At the end of the kernel’s loop we calculate the midpoint value.

Again, the calculations involving symmetry is perhaps easier to see in the output side algorithm, because we naturally gravitate towards expressions that result in a single output value.  If you work a small convolution problem out on paper, however, it will help you see the input side algorithm and why it works.