
DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding.  Among its many features, it supports 2D and 3D graphics through OpenGL.  As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements.  We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days.  To illustrate, let’s take the example of a magic spell; both its visual and aural elements will have randomness, but will these sync up together?  In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic?  If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element accurately reflects the aural part.  It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this?  Analysing a sound in the time domain (represented by the waveform) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum.  The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides.  Instead of displaying amplitude (y axis) vs. time (x axis) as in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis).  This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but it’s thanks to the FFT (Fast Fourier Transform) and the dramatically increased floating-point performance of modern computers that using the DFT for analysis in real time has become practical.  Computed directly (by correlating the signal against each frequency’s sinusoid), the DFT is very slow.  The FFT uses a highly optimized algorithm (the Cooley-Tukey is the most widely used, but there are several others) to calculate the DFT of a signal.  Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.

The DFT works by breaking a signal down into its individual components.  This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal.  Each sine wave represents a single frequency component; its amplitude is the strength of that frequency.  The DFT results in a complex signal with both cosine and sine wave components.  This information isn’t terribly useful to us in this form.  From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form.  In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value).  The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( x² + y² )
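
In code, computing this magnitude for each bin of an interleaved (real, imaginary) FFT buffer might look something like the sketch below.  This is my own illustration; the data.mag() helper used further down presumably does something along these lines.

#include <cmath>

// Magnitude of frequency bin i from an interleaved FFT buffer, where
// buffer[2*i] is the real part and buffer[2*i + 1] the imaginary part.
inline double binMagnitude (const double* buffer, int i)
{
    const double re = buffer[2 * i];
    const double im = buffer[2 * i + 1];
    return std::sqrt(re * re + im * im);
}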

This value, plotted against the frequency (which would be the index of the complex number corresponding to the index of the real-valued array), gives us the spectrum of the signal.  In other words, if we take the FFT of an N-point signal, we get N complex values (real/imaginary pairs) as output.  Each index n in the complex signal represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N corresponding to the sampling rate.  The magnitude of the complex number at that index gives us the amplitude of that frequency component in the signal.  A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2).  Here is a sample plot of a piano chord I recorded, using 5 512-sample buffers from the audio:

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (assuming a sampling rate of 48kHz, this would equate to approximately 0 – 1920 Hz).  Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.

There are a number of parameters I considered in generating particles: the number of particles, how they’re generated/constructed, their radius, and their color.  Particles are only generated during transients, and so I check a previous RMS value against the new one to determine if particles should be constructed.  This method isn’t foolproof, but in my tests it’s been working quite well for my purposes.  The number of particles generated is also related to the RMS of the signal: the louder it is, the more particles are made.  These are evenly divided across the whole spectrum, so that at least a few particles from each frequency range are present.  The radius of each particle is determined by the frequency component (the n value of the frequency plot above).  The lower the frequency, the larger the particle.  Finally, the color, or more specifically the brightness of the particle, is determined by the frequency amplitude.  Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using Portaudio for real-time playback, so all the audio processing happens in a callback function):

int audioCallback (const void* input, void* output, unsigned long samples, const PaStreamCallbackTimeInfo *timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
    const float* in = (const float*)input;
    float* out = (float*)output;
    PaAudioData* data = (PaAudioData*)userData;

    // Clear the interleaved (real, imaginary) FFT buffer.
    memset(data->buffer, 0, sizeof(double) * SAMPLE_BLOCK * 2);
    float rms = CFDsp::cf_SSE_calcRms(in, samples);

    for (int i = 0; i < samples; ++i) {
        // Window the sample and store it as the real part (even index);
        // the imaginary part (odd index) stays zero.
        data->buffer[i*2] = *in * hammingWindow(i, SAMPLE_BLOCK);

        // Pass the stereo input straight through to the output.
        *out++ = *in++;
        *out++ = *in++;
    }

    // Calculate DFT using FFT.
    fft(data->buffer, SAMPLE_BLOCK);

    // Keep the previous and current RMS values around for transient detection.
    data->amp_prev = data->amp_now;
    data->amp_now = rms;

    return paContinue;
}

Since the FFT function I’m using requires interleaved data, the buffer needs to be zeroed out first, with each real-valued sample then stored in the even-numbered indices.  The RMS value is acquired using a function I wrote with SSE intrinsics for optimal speed.

float cf_SSE_calcRms (const float* inBuffer, const int length)
{
    // Four samples are processed per iteration, so the buffer length must be
    // a multiple of 4 (and inBuffer is assumed to be 16-byte aligned).
    assert(length % 4 == 0);

    __m128* mbuf = (__m128*)inBuffer;
    __m128 m1, m2;
    __m128 macc = _mm_set1_ps(0.f);
    __m128 _1oN = _mm_set1_ps(1.f / static_cast<float>(length));

    for (int i = 0; i < length; i+=4) {
        // Accumulate x^2 / N in each of the four lanes.
        m1 = _mm_mul_ps(*mbuf, *mbuf);
        m2 = _mm_mul_ps(m1, _1oN);
        macc = _mm_add_ps(m2, macc);
        mbuf++;
    }

    // Sum the four lanes and take the square root to get the RMS.
    __declspec(align(16)) float result[4];
    _mm_store_ps(result, macc);

    return sqrtf(result[0] + result[1] + result[2] + result[3]);
}
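
As an aside, the hammingWindow function called in the audio callback isn’t listed in the post; a standard Hamming window along these lines is presumably what it does (this is my sketch, not the original helper):

#include <cmath>

// Standard Hamming window coefficient for index i of a window of 'length' samples.
inline float hammingWindow (int i, int length)
{
    const float pi = 3.14159265f;
    return 0.54f - 0.46f * cosf(2.f * pi * i / (length - 1));
}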

Inside the for loop, a Hamming window is applied to the signal prior to sending it to the FFT, tapering it at the edges to minimize spectral leakage.  In the application’s update method, I test for the presence of a transient in the current buffer.  If one exists, particles are added.

if (audioData.isTransient())
{
    Vec2i randPos;

    // 50 pixel border.
    randPos.x = Rand::randInt(50, getWindowWidth() - 50);
    randPos.y = Rand::randInt(50, getWindowHeight() - 50);
    mParticleController.addParticles(randPos, audioData);
}

The adding of particles is handled by the ParticleController class:

void ParticleController::addParticles (const Vec2i& loc, const PaAudioData& data)
{
    // Map the current RMS to a stride: louder audio -> smaller stride -> more particles.
    int stride = static_cast<int>((64.f - powf(64.f * data.amp_now, 1.32f)) + 0.5f);
    stride = (stride < 1 ? 1 : stride);

    // 512 == sample block size; represents the horizontal frequency axis in the spectral analysis.
    for (int i = 0; i < 512; i+=stride) {
        Vec2f randVec = Rand::randVec2f() * 6.5f;
        mParticles.push_back(Particle(loc + randVec, i/1024.f, data.mag(i) * 18.f));
    }
}

First I calculate a stride value that controls the step size of the following for loop (this evenly divides the particles across the x axis).  As you can see, it is determined by the volume (calculated as RMS) of the current audio buffer.  When creating each individual particle, the i/1024 argument determines the radius; as I mentioned, this is based on the frequency value along the x axis.  To constrain it within reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )

The color of each particle is first determined in HSV.  The hue is randomly chosen, but the saturation and value (brightness) are based on the data.mag(i) * 18 argument, i.e. the amplitude of the frequency component.  This is then converted into an RGB value to set the color of the individual particle.

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and the RMS to usable values in the particle engine.  That’s the reason for several of the “magic numbers” in the function above.  (1024 is not one of them: since the FFT data is stored in interleaved format, the buffer for a 512-sample block is actually 1024 values long, and as I said previously we only want the range 0 – 0.5, the lower half of the spectrum.)  Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further.  I feel that this is a good first step, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):

DFT Spectral Particle Engine from Christian on Vimeo.


The Different “Sides” of Convolution

We all know how much convolution is used in DSP, and what a significant part it has in making many effects and analysis techniques a reality.  Often we hear about FFT, or fast, convolution, but in actuality it is only faster than straight convolution once the kernel (impulse response) length grows beyond around 60 – 64 points or so.  Anything shorter than that is best handled with normal convolution, which can be implemented using either an input side or an output side algorithm.

The input side algorithm is normally the one that we learn first (at least it was for me, and judging from many texts and books on it, it seems to be common).  However, the output side algorithm is perhaps slightly easier to code, and we’ll see why later on.  For this post I’m going to evaluate the effects each algorithm has on the resulting audio signal because each one could have an advantage in some situations over the other.  I’ll also share a little trick that will speed up the convolution process for certain types of filters.

The output side algorithm performs the convolution, as one might expect, from the viewpoint of the output (or the result) of the operation, while the input side algorithm implements convolution from the viewpoint of the input signal.  To look at it another way, recall that convolving an input signal with an impulse response results in an output signal whose length is the length of the input + the length of the impulse response – 1.  With the input side algorithm, this “extra” bit of length is appended to the resulting signal, and it is probably the easier of the two to grasp theoretically.  The output side algorithm looks at which points from the input we need in order to get our resulting signal.

The reason why the output side method feels more natural to code is that we often write in terms of the output or result: y(n) = ____, i.e. the output is equal to some kind of expression.  Approaching convolution this way requires input samples on the negative side (x[-2], x[-1], etc.), which of course don’t exist.  To get around that little problem, one common way is to implement a delay line that contains all zeroes at first, then shift in the input signal as we go, doing the calculations with the impulse response and the delay line instead of directly with the input.

I’ve been busy redesigning and overhauling my DSP filter class, which has entailed some experimentation with these two convolution methods on FIR filters.  Here we can see visually the difference between them.  First up is the output side convolution of a windowed-sinc FIR filter using a Kaiser window of order 162 on a simple triangle wave (using test oscillators is a good way to visually see what’s happening).

Output side convolution on a triangle wave.

I purposely made the filter kernel (impulse response) fairly long so it would illustrate the delay and shift that’s happening at the start of the signal due to the output side algorithm.  Compare that with the image of the input side algorithm using exactly the same filter and order.

Input side convolution on a triangle wave.

Huge difference.  It’s clearly audible as well in this test signal (samples below).  With the output side algorithm there is a clear popping sound as the audio abruptly starts in contrast to the much smoother beginning of the test signal filtered with the input side algorithm.

Here you can hear the difference (to hear it clearly you may have to download the files and listen in an audio editor to avoid the glitchy start of web browser plug-ins):

Triangle wave filtered with output side

Triangle wave filtered with input side

Now to return to the point about the output side method being perhaps a little more natural to code.  The little wrinkle we face in implementing the input side algorithm is that, with block processing (which is so common in DSP), we have to account for the longer output that results from the convolution.  In many cases we don’t have control over the size of the audio buffers we’re given or have to work with.  The solution is to use the overlap-add method.

It’s really not more difficult than the output side algorithm.  All we have to do is calculate the convolution fully (into our own internal buffer if we have to), save the extra length at the end (which will be equal to the impulse response length minus one), then add that to the start of the next block.  Rinse and repeat.  Here is C++ code that implements both of these convolution methods.

Output side convolution process.
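
The post shows the code as images; below is my own sketch of what an output side convolution with a symmetric (linear-phase) kernel can look like, using a pre-zeroed delay line as described above.  The original uses member variables mKernel and mKernelLength; here they are plain parameters.

#include <vector>

// Output side convolution of one block with a symmetric FIR kernel.
// 'delay' holds the kernelLength most recent input samples (newest first)
// and starts out as all zeroes.
void convolveOutputSide (const float* input, float* output, int numSamples,
                         const std::vector<float>& kernel, std::vector<float>& delay)
{
    const int kernelLength = static_cast<int>(kernel.size());
    const int half = kernelLength / 2;

    for (int i = 0; i < numSamples; ++i) {
        // Shift the newest sample into the front of the delay line.
        delay.insert(delay.begin(), input[i]);
        delay.pop_back();

        float acc = 0.f;
        // Exploit the symmetry: kernel[j] == kernel[kernelLength - j - 1],
        // so each coefficient is applied to a pair of delayed samples.
        for (int j = 0; j < half; ++j)
            acc += (delay[j] + delay[kernelLength - j - 1]) * kernel[j];

        // Midpoint value of an odd-length kernel.
        if (kernelLength & 1)
            acc += delay[half] * kernel[half];

        output[i] = acc;
    }
}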

Input side convolution process.
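
And a corresponding sketch of the input side version with overlap-add, again using the symmetric-kernel trick; 'overlap' carries the kernelLength – 1 tail samples over to the next block.

#include <vector>

// Input side convolution of one block with a symmetric FIR kernel, using
// overlap-add to carry the convolution tail into the next block.
void convolveInputSide (const float* input, float* output, int numSamples,
                        const std::vector<float>& kernel, std::vector<float>& overlap)
{
    const int kernelLength = static_cast<int>(kernel.size());
    const int half = kernelLength / 2;

    // Full convolution result for this block: numSamples + kernelLength - 1 values.
    std::vector<float> full(numSamples + kernelLength - 1, 0.f);

    for (int i = 0; i < numSamples; ++i) {
        // Each product input[i] * kernel[j] lands in two output positions
        // thanks to the symmetric kernel.
        for (int j = 0; j < half; ++j) {
            const float prod = input[i] * kernel[j];
            full[i + j] += prod;
            full[i + kernelLength - j - 1] += prod;
        }
        // Midpoint value of an odd-length kernel.
        if (kernelLength & 1)
            full[i + half] += input[i] * kernel[half];
    }

    // Overlap-add: fold in the tail saved from the previous block...
    for (int i = 0; i < kernelLength - 1; ++i)
        full[i] += overlap[i];

    // ...copy out this block's worth of samples...
    for (int i = 0; i < numSamples; ++i)
        output[i] = full[i];

    // ...and save the new tail for the next call.
    for (int i = 0; i < kernelLength - 1; ++i)
        overlap[i] = full[numSamples + i];
}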

Don’t worry, I didn’t forget to share the trick I mentioned at the beginning.  Many FIR filters have linear phase response, which means that their kernels are symmetrical.  That gives us a great opportunity to eliminate calculations that aren’t needed.  Notice in the code above that each ‘j’ loop (the kernel, or impulse response, loop) only traverses half the kernel length, since input[i] * mKernel[j] is the same as input[i] * mKernel[mKernelLength-j-1].  At the end of the kernel loop we calculate the midpoint value.

Again, the symmetry calculations are perhaps easier to see in the output side algorithm, because we naturally gravitate towards expressions that result in a single output value.  Working a small convolution problem out on paper, however, will help you see how the input side algorithm works and why.

Audio Resampling: Part 1

Resampling in digital audio has two main uses.  One is to convert audio into a sampling rate needed by a particular system or engine (e.g. converting 48kHz audio to the 44.1kHz required by CDs).  The second is to avoid aliasing during signal processing by raising the Nyquist limit.  I will be discussing the latter.

Lately I’ve been very busy working on improving and enhancing the sound of ring modulation for a fairly basic plug-in being developed by AlgoRhythm Audio (coming soon).  I say basic because, as far as ring modulation goes, there are few DSP effects that are simpler in theory and in execution.  Simply take some input signal, multiply it by a carrier signal (usually some kind of oscillator like a sine wave), and we have ring modulation.  The problem that arises, however, and how this connects with resampling, is that this creates new frequencies in the resulting output that were not present in either signal prior to processing.  These new frequencies could very well violate the Nyquist limit of the current sampling rate during processing, and that leads us to resampling as a way to clamp down on the aliasing frequencies that can be introduced as a result.

Aliasing is an interesting phenomenon that occurs in digital audio, and in every sense of the word is an undesired noise that we need to make sure does not pollute our audio.  There are many resources around that go into more detail on aliasing, but I will give a brief overview of it with some audio and visual samples.

Aliasing occurs when there are frequencies present in a signal that are greater than the Nyquist limit (half of the sampling rate).  What happens in such a case is that the sampling rate is not high enough to properly capture (sample) the high frequency of the signal, and so the frequency “folds over” and creates aliases that are mirror images of the original frequency.  Here, for example, is a square wave at 4000Hz created using 20 harmonics at a sampling rate of 44.1kHz (keep in mind that 4000Hz is the fundamental frequency, and that square waves contain many additional frequencies above that depending on how many harmonics were used to create it, so in this case Nyquist is still being violated):

4000Hz square wave with aliasing

Notice the low tone below the actual 4000Hz frequency.  Here is the resulting waveform of this square wave, which shows us visually that we’re not sampling fast enough to accurately reproduce it.  Notice the inconsistencies in the waveform.

Waveform of the aliased 4000Hz square wave.

Now, going to the extreme a bit, here is the same square wave sampled at a rate of 192kHz.

4000Hz square wave with no aliasing

It’s a pure 4000Hz square wave tone.  Examining the waveform of this square wave shows us that the sampling rate was more than adequate to reproduce this signal digitally:

Waveform of the 4000Hz square wave sampled at 192kHz (no aliasing).

Not all signal processing effects are susceptible to aliasing, and certainly not to the same degree.  Because ring modulation produces additional inharmonic frequencies, it is a prime example of a process that is easily affected by aliasing.  Others include distortion and various other kinds of modulation techniques (especially when taken to the extreme).  However, ring modulation with a sine wave is generally safe as long as the frequency of the sine wave is kept low enough, because sine waves have no harmonics, only the fundamental.  Introducing other wave types into this process, however, can quite easily bring about aliasing.

Here is an example of ring modulation with a sawtooth wave sweeping up from about 200Hz to 5000Hz.  As it glides up, you will be able to hear the aliasing kicking in at around 0:16 or so.  The example with no aliasing has been upsampled by 3X before processing, then downsampled by 3X back to its original rate.

Ring modulation with a sawtooth sweep, with aliasing

Ring modulation with sawtooth sweep, with no aliasing

So how does upsampling and downsampling work?  In theory, and even to some extent in practice, it’s very straightforward.  The issue, as we will see, is in making it efficient and fast.  In DSP we’re always concerning ourselves with speedy execution times to avoid latency or audio dropouts or running out of memory, etc.

To upsample by a factor of L, we insert L – 1 zero-valued samples after each original sample (i.e. upsampling by 3X, [1, 2, 3, 4] becomes [1,0,0,2,0,0,3,0,0…]).  However, this zero-stuffing introduces unwanted spectral images into our audio, so we need to interpolate these values so that they “join” the sample values of the original waveform.  This is accomplished by using an interpolating low-pass filter.  The entire process is known as interpolation.

Downsampling is much the same, in reverse.  We keep only every M-th (our downsampling factor) sample from the original signal and discard the rest (i.e. downsampling by 2X, [1,2,3,4,5] becomes [1,3,5…]).  This can introduce aliasing if the signal still contains frequencies above the new, lower Nyquist limit, so we need to low-pass filter first with a cutoff frequency at the new Nyquist rate.  This entire process is known as decimation.
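
As a bare-bones illustration of the two operations (my sketch, with the low-pass filtering discussed below still to be applied in each case):

#include <vector>

// Zero-stuffing (upsampling by L) and sample-dropping (downsampling by M).
std::vector<float> upsample (const std::vector<float>& in, int L)
{
    std::vector<float> out(in.size() * L, 0.f);
    for (size_t i = 0; i < in.size(); ++i)
        out[i * L] = in[i];           // original sample followed by L-1 zeros
    return out;
}

std::vector<float> downsample (const std::vector<float>& in, int M)
{
    std::vector<float> out;
    for (size_t i = 0; i < in.size(); i += M)
        out.push_back(in[i]);         // keep every M-th sample, discard the rest
    return out;
}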

Fairly straightforward.  Proceeding with this method directly, however, would be brute force, and generally not a good way to go.  The reason why becomes clearer when we consider what kind of low-pass filter we need for this operation.  The ideal filter would be one that brick-wall attenuates all frequencies higher than Nyquist and leaves everything else untouched (thus preserving all the frequencies and tonal content of our original audio).  This is, alas, impossible, as it would require an infinitely long filter kernel.  The frequency response of this ideal filter is the rectangular function.

The rectangular function.

By taking the Fourier transform of the rectangular function we end up with the sinc function, which is given by:

y(x) = sin(x) / x, which becomes y(x) = sin(πx) / (πx) for signal processing.

Graph of the sinc function, sin(πx) / πx

The sinc function trails on to infinity in both directions, as can be seen in the graph above, so we need to enforce bounds around it by applying a window function.  Windowing is a method of designing FIR filters by essentially “surrounding” a function (in our case the sinc) with the window, which enforces bounds so that we can properly derive a filter kernel for use in calculations.  The rectangular function shown above is itself a type of window, but as I mentioned, its infinitely steep edges are a deal-breaker in audio.

The Blackman window, given by the function

w(i) = 0.42 – 0.5 cos(2πi/M) + 0.08 cos(4πi/M),

where M is the filter order (one less than the kernel length), is a good choice for resampling because it offers a good stop-band attenuation of -74dB with good rolloff.  Putting this together with the sinc function, we can derive the filter kernel with the following formula:

Windowed-sinc kernel formula*

where fc is the normalized cutoff frequency.  When i = M/2, to avoid a divide by zero, h(i) = 2fc.  K is a constant used to achieve unity gain at zero frequency and can be ignored while calculating the kernel coefficients.  After all the coefficients have been calculated, the same effect is achieved by summing the coefficients h(i) and then dividing each one by the resulting sum.

* Source: Smith, Steven W., “The Scientist and Engineer’s Guide to Digital Signal Processing”, Chapter 16.
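
The formula itself appears as an image in the post; translating the description above into code, a sketch of the kernel calculation might look like this (my own version, following Smith’s windowed-sinc recipe):

#include <vector>
#include <cmath>

// Windowed-sinc low-pass kernel: normalized cutoff fc (0..0.5), filter order M,
// giving a kernel of length M + 1.  Blackman-windowed, then normalized for
// unity gain at DC (which takes the place of K in the formula).
std::vector<double> makeWindowedSincKernel (double fc, int M)
{
    const double pi = 3.14159265358979323846;
    std::vector<double> h(M + 1);
    double sum = 0.0;

    for (int i = 0; i <= M; ++i) {
        if (i == M / 2)
            h[i] = 2.0 * fc;     // avoid the divide by zero at the centre of the sinc
        else
            h[i] = std::sin(2.0 * pi * fc * (i - M / 2)) / (pi * (i - M / 2));

        // Blackman window.
        h[i] *= 0.42 - 0.5 * std::cos(2.0 * pi * i / M) + 0.08 * std::cos(4.0 * pi * i / M);
        sum += h[i];
    }

    for (int i = 0; i <= M; ++i)
        h[i] /= sum;             // unity gain at zero frequency

    return h;
}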

Now that we have the filter design, let’s consider the properties of the FIR filter and compare them to IIR filter designs.  IIR filters give us better performance and attenuation at lower orders, meaning that they execute faster and perform better with fewer calculations than FIR filters.  However, IIR filters are still not powerful enough, even at slightly higher orders, to give us the performance we need for resampling, and trying to push IIR filters into very high orders can make them unstable and/or susceptible to quantization error due to the nature of recursion.  IIR filters also do not offer linear phase response.  FIR filters are the better choice for these reasons, but the unfortunate drawback is that they execute slowly due to being implemented by convolution.  In addition, they need to be pushed to high orders to give us the performance needed for attenuating aliasing frequencies.

However, the order of the interpolating low-pass filter can be negotiated based on the frequency content of the audio signal(s) involved.  If the audio is sufficiently oversampled, it will not contain significant frequency content near Nyquist, and as such a lower order filter with a gentler rolloff can be used without adversely affecting the audio by attenuating actual frequencies in the signal.  There are plenty of cases where we just don’t know the frequency content of the audio signals involved in processing, however, so a strong filter may be needed.  Here is a graph of a 264-order windowed-sinc filter (in other words, a filter kernel of length 265, including the coefficient at index 0):

264-order window-sinc low-pass filter frequency response (cutoff frequency at 10kHz, resulting in a transition band of 882Hz)

With this in consideration, it can be easy to see that convolving a signal with a 264-order FIR filter is computationally costly for real-time processing.  There are a number of ways to improve upon this when it comes to resampling.  One is using the FFT to apply the filter.  Another interesting solution is to combine the upsampling/downsampling process into the filter itself, which can further be optimized by turning it into a polyphase filter.

The theory and implementation of a polyphase filter is a fairly long and involved topic on its own, so that will be forthcoming in part 2, where we look at how to implement resampling efficiently.

Additive Synthesis

For this blog entry, I felt a good topic to cover would be additive synthesis and its implementation in C.  Based on Fourier theory, additive synthesis is the technique of combining simple periodic waveforms (such as sine or cosine waves) to produce more complex sounds.  Put another way, any waveform can be expressed as a sum of sine waves, following from the principle of the harmonic series, which states that any sound contains a (theoretically infinite) series of overtones, or partials.  These overtones, each with a certain frequency and amplitude, when expressed as simple sinusoidal waves, form the building blocks of a more complex waveform.  Mathematically this can be defined (for partials k = 1 to N, ω = (2πf)/sample rate, and ø as the phase offset) as:
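
x(n) = Σ (k = 1 to N) aₖ · sin(k·ω·n + ø), where aₖ is the amplitude of the k-th partial.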

We can now, for example, create a sawtooth wave using this equation.  But first, here is a snippet of C code that generates a simple sine wave and outputs it to a raw binary data file (these can be read by most audio editing software like Audacity) as well as a text file that can be plotted (I used Gnuplot):
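
The original snippet isn’t reproduced here, but a minimal version along these lines (my reconstruction, in plain C-style code) generates the sine wave and writes both output files:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int    sampleRate = 44100;
    const double duration   = 3.0;      /* seconds */
    const double freq       = 440.0;    /* Hz */
    const double amp        = 0.7;
    const double pi         = 3.14159265358979;
    const long   numSamples = (long)(duration * sampleRate);

    FILE* raw = fopen("sine.raw", "wb");  /* raw 32-bit float samples */
    FILE* txt = fopen("sine.txt", "w");   /* time/amplitude pairs for Gnuplot */
    if (!raw || !txt) return 1;

    for (long n = 0; n < numSamples; ++n) {
        float sample = (float)(amp * sin(2.0 * pi * freq * n / sampleRate));
        fwrite(&sample, sizeof(float), 1, raw);
        fprintf(txt, "%f\t%f\n", (double)n / sampleRate, sample);
    }

    fclose(raw);
    fclose(txt);
    return 0;
}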

Here is the graph of the sine wave as well as the soundfile (converted to .wav from .raw):

Sine wave at 440Hz

Now let’s move on to generating the sawtooth wave, and we’ll see how the simple sinusoidal wave above is transformed into a more complex waveform through the use of the Fourier series equation.  I set the total number of partials to 30 for this demonstration; sawtooth waves contain all partials of the harmonic series, with the amplitude of each equal to 1/partial (aₖ = 1/k).  We also set the phase offset to ø = -π/2.
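
In code, the inner summing loop might look something like this sketch (again mine, not the post’s listing):

#include <math.h>

/* Fill 'out' with a sawtooth approximated by summing numPartials partials,
   each with amplitude 1/k and a phase offset of -pi/2, as described above. */
void generateSawtooth (float* out, long numSamples, double freq, int sampleRate, int numPartials)
{
    const double pi = 3.14159265358979;
    const double phase = -pi / 2.0;

    for (long n = 0; n < numSamples; ++n) {
        double sample = 0.0;
        for (int k = 1; k <= numPartials; ++k) {
            double w = 2.0 * pi * (freq * k) / sampleRate;  /* angular increment of partial k */
            sample += (1.0 / k) * sin(w * n + phase);
        }
        out[n] = (float)sample;  /* normalize/scale before writing to disk */
    }
}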

Here is the graph and the soundfile output of the sawtooth wave:

Sawtooth wave at 440Hz

Now for something really funky!  If we go back to the Fourier series equation above as we defined it, both amplitude and frequency were fixed.  This is, as we know, not how sound behaves, and it really isn’t very interesting.  To generate much more interesting and complex waveforms, we set both the amplitude (a) and the frequency (f) as finite, time-varying functions.  I set up my code to read in a breakpoint file for both amplitude and frequency, each containing a few random values at arbitrary time locations to illustrate this.  (A breakpoint file contains data set up like coordinates, or points, that can be plotted.  In this case we have amplitude vs. time and frequency vs. time.)  Here is the plot of the amplitude stream …

… and the frequency stream …

… and the C code that implements this:
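
The full listing isn’t reproduced here; the key piece, finding a breakpoint value at an arbitrary time by linear interpolation, could be sketched like this (breakpoint_t and the function name are my own, the original code’s names may differ):

#include <stddef.h>

/* Hypothetical breakpoint type: a (time, value) pair read from the breakpoint file. */
typedef struct { double time; double value; } breakpoint_t;

/* Linearly interpolate a breakpoint stream at time t (seconds). */
double breakpoint_value_at (const breakpoint_t* points, size_t count, double t)
{
    if (count == 0) return 0.0;
    if (t <= points[0].time) return points[0].value;
    if (t >= points[count - 1].time) return points[count - 1].value;

    /* Find the segment containing t and interpolate within it. */
    for (size_t i = 1; i < count; ++i) {
        if (t < points[i].time) {
            double span = points[i].time - points[i - 1].time;
            double frac = (t - points[i - 1].time) / span;
            return points[i - 1].value + frac * (points[i].value - points[i - 1].value);
        }
    }
    return points[count - 1].value;  /* not reached */
}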

The code that reads the breakpoint data uses linear interpolation to find the value at time n / sample rate.  The normalize_buffer routine adjusts the resulting amplitude to fit within the values of -1 and +1 and then scales it to the desired amplitude set by the user (in this case 0.7).  Following is a snapshot of the resulting waveform (just before the 1 second mark) in Gnuplot as well as the soundfile output; the result of just summing together simple sinusoidal waves as seen in the first Gnuplot above!

Crazy wave at.. ??Hz

This is a short and simple example of how flexible and powerful the Fourier series can be at creating textures and complex waveforms through additive synthesis.  The possibilities just escalate from here, depending on the time-varying functions that determine the amplitude and frequencies applied to the individual sinusoidal waves, i.e. the harmonics.

It must be pointed out that the code used to implement these routines is brute force and not optimal.  For instance, calculating these 3-second waves at a 44.1 kHz sampling rate with 30 partials requires 30 iterations through 132,300 (3 * 44,100) samples of data, for a total of 3,969,000 loop iterations!  Calculating sine and cosine values is quite a costly operation, so this is unacceptable for most practical purposes.  One alternative is wavetable synthesis, whereby the data for one period of a basic wave is stored in a table that the program can then look up.  This is much more efficient (though depending on the size of the table, there can be a loss in precision) as redundant trigonometric values don’t have to be repeatedly calculated.  That is potentially a topic for another time, however.
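
As a quick illustration of the wavetable idea (a sketch with a plain truncating lookup; a real implementation would likely interpolate between table entries):

#include <math.h>

#define TABLE_SIZE 2048

/* Precompute one period of a sine, then read it back with a phase value
   instead of calling sin() for every sample. */
static double table[TABLE_SIZE];

void wavetable_init (void)
{
    const double pi = 3.14159265358979;
    for (int i = 0; i < TABLE_SIZE; ++i)
        table[i] = sin(2.0 * pi * i / TABLE_SIZE);
}

/* phase is advanced by freq * TABLE_SIZE / sampleRate each sample (kept >= 0). */
double wavetable_read (double phase)
{
    int idx = (int)phase % TABLE_SIZE;  /* truncating lookup */
    return table[idx];
}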

For now, hope this was an interesting look into additive synthesis using the Fourier series and how with relatively few lines of code we can make some pretty cool things happen.