Tag Archives: RMS

Dynamics processing: Compressor/Limiter, part 1

Lately I’ve been busy developing an audio-focused game in Unity, whose built-in audio engine is notorious for being extremely basic and lacking in features. (As of this writing, Unity 5, which overhauls the entire built-in audio engine, has not yet been released.) For this project I have created all the DSP effects myself as script components, whose behavior is driven by Unity’s coroutines. To get more control over the final mix of these elements, it became clear that I needed a compressor/limiter. This particular post is written with Unity/C# in mind, but the theory and code are easy enough to adapt to other uses. In this first part we’ll be looking at writing the envelope detector, which is needed by the compressor to do its job.

An envelope detector (also called a follower) extracts the amplitude envelope from an audio signal based on three parameters: an attack time, release time, and detection mode. The attack/release times are fairly straightforward, simply defining how quickly the detection responds to rising and falling amplitudes. There are typically two modes of calculating the envelope of a signal: by its peak value or its root mean square value. A signal’s peak value is just the instantaneous sample value while the root mean square is measured over a series of samples, and gives a more accurate account of the signal’s power. The root mean square is calculated as:

rms = sqrt( (1/n) * (x₁² + x₂² + … + xₙ²) ),

where n is the number of data values. In other words, we sum together the squares of all the sample values in the buffer, find the average by dividing by n, and then take the square root. In audio processing, however, we normally bound the sample size (n) to some fixed number (called windowing). This effectively means that we calculate the RMS value over the past n samples.

(As an aside, multiplying by 1/n effectively assigns equal weight to every term, making it a rectangular window. Other window functions can be used instead that favor terms in the middle of the window. This gives an even more accurate RMS value, since brand-new samples (or old ones at the end of the window) have less influence over the signal’s power.)
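
To make the windowed calculation concrete, here is a minimal, unoptimized sketch (my addition, not from the original post) that computes the RMS of the last n samples of a buffer with a rectangular window:

// Minimal, unoptimized sketch: RMS of the last windowSize samples of a buffer,
// using a rectangular window (every term weighted equally by 1/n).
public static float CalcWindowedRms (float[] buffer, int windowSize)
{
    int start = Mathf.Max(0, buffer.Length - windowSize);
    int n = buffer.Length - start;

    float sumOfSquares = 0f;
    for (int i = start; i < buffer.Length; ++i) {
        sumOfSquares += buffer[i] * buffer[i];
    }

    return Mathf.Sqrt(sumOfSquares / n);
}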

Now that we’ve seen the two modes of detecting a signal’s envelope, we can move on to look at the role of the attack/release times. These values are used in calculating coefficients for a first-order recursive filter (also called a leaky integrator) that processes the values we get from the audio buffer (through one of the two detection methods). Simply stated, we get the sample values from the audio signal and pass them through a low-pass filter to smooth out the envelope.

We calculate the coefficients using the time-constant equation:

g = e ^ ( -1 / (time * sample rate) ),

where time is in seconds, and sample rate in Hz. Once we have our gain coefficients for attack/release, we put them into our leaky integrator equation:

out = in + g * (out - in),

where in is the input sample we detected from the incoming audio, g is either the attack or release gain, and out is the envelope sample value. Here it is in code:

public void GetEnvelope (float[] audioData, out float[] envelope)
{
    envelope = new float[audioData.Length];

    m_Detector.Buffer = audioData;

    for (int i = 0; i < audioData.Length; ++i) {
        float envIn = m_Detector[i];

        // Attack when the detected value rises above the current envelope, release when it falls.
        if (m_EnvelopeSample < envIn) {
            m_EnvelopeSample = envIn + m_AttackGain * (m_EnvelopeSample - envIn);
        } else {
            m_EnvelopeSample = envIn + m_ReleaseGain * (m_EnvelopeSample - envIn);
        }

        envelope[i] = m_EnvelopeSample;
    }
}

(Source: code is based on “Envelope detector” from http://www.musicdsp.org/archive.php?classid=2#97, with detection modes added by me.)
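
One thing the snippet above doesn’t show is where m_AttackGain and m_ReleaseGain come from. A minimal sketch (my addition, not from the original post), applying the time-constant equation to the m_AttackTime, m_ReleaseTime, and m_SampleRate fields of the EnvelopeDetector class shown further below, could look like this:

// Sketch: derive the attack/release coefficients from their time constants (in
// seconds) and the sample rate, using g = e^(-1 / (time * sample rate)).
private void CalculateGains ()
{
    m_AttackGain = Mathf.Exp(-1f / (m_AttackTime * m_SampleRate));
    m_ReleaseGain = Mathf.Exp(-1f / (m_ReleaseTime * m_SampleRate));
}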

The envelope sample is calculated based on whether the current audio sample is rising or falling, with the envIn sample coming from one of the two detection modes. The detection modes are implemented similarly to what is known as a functor in C++. I prefer this to having another branching structure inside the loop because, among other things, it’s more extensible, modular, and results in cleaner code. It could be implemented using delegates/function pointers, but the advantage of a functor is that it retains its own state, which is useful for the RMS calculation as we will see. Here is how the interface for the detection modes is declared:

public interface IEnvelopeDetection
{
    float[] Buffer { set; get; }
    float this [int index] { get; }

    void Reset ();
}

We then have two classes that implement this interface, one for each mode:

public class DetectPeak : IEnvelopeDetection
{
    private float[] m_Buffer;

    /// <summary>
    /// Sets the buffer to extract envelope data from. The original buffer data is held by reference (not copied).
    /// </summary>
    public float[] Buffer
    {
        set { m_Buffer = value; }
        get { return m_Buffer; }
    }

    /// <summary>
    /// Returns the envelope data at the specified position in the buffer.
    /// </summary>
    public float this [int index]
    {
        get { return Mathf.Abs(m_Buffer[index]); }
    }

    public DetectPeak () {}
    public void Reset () {}
}

This class performs the rather trivial operation of returning the absolute value of a signal’s sample. The RMS detection class is more involved.

/// <summary>
/// Calculates and returns the root mean square value of the buffer. A circular buffer is used to simplify the calculation, which avoids
/// the need to sum up all the terms in the window each time.
/// </summary>
public float this [int index]
{
    get {
        float sampleSquared = m_Buffer[index] * m_Buffer[index];
        float total = 0f;
        float rmsValue;

        if (m_Iter < m_RmsWindow.Length-1) {
            total = m_LastTotal + sampleSquared;
            // Use m_Iter (not index) so the running average stays correct across audio blocks.
            rmsValue = Mathf.Sqrt((1f / (m_Iter+1)) * total);
        } else {
            total = m_LastTotal + sampleSquared - m_RmsWindow.Read();
            rmsValue = Mathf.Sqrt((1f / m_RmsWindow.Length) * total);
        }

        m_RmsWindow.Write(sampleSquared);
        m_LastTotal = total;
        m_Iter++;

        return rmsValue;
    }
}

public DetectRms ()
{
    m_Iter = 0;
    m_LastTotal = 0f;
    // Set a window length to an arbitrary 128 for now.
    m_RmsWindow = new RingBuffer<float>(128);
}

public void Reset ()
{
    m_Iter = 0;
    m_LastTotal = 0f;
    m_RmsWindow.Clear(0f);
}

The RMS calculation in this class is an optimization of the general equation I stated earlier. Instead of continually summing together all the values in the window for each new sample, a ring buffer is used to store each squared sample value as it arrives. Since there is only ever one new term per sample, we can add it to a running total and subtract the oldest term read back from the ring buffer. This leaves us with just a multiply and a square root, instead of having to redundantly add together 128 terms (or however big n is). An iterator variable ensures that the state of the detector remains consistent across successive audio blocks.
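
The RingBuffer<T> class used by DetectRms isn’t shown in this post; a minimal sketch (my addition) supporting just the operations used above (Write, Read, Clear, and Length) might look like this:

// Minimal sketch of a fixed-size FIFO ring buffer: Write appends the newest
// value, Read returns (and advances past) the oldest one.
public class RingBuffer<T>
{
    private T[] m_Data;
    private int m_ReadPos;
    private int m_WritePos;

    public RingBuffer (int length)
    {
        m_Data = new T[length];
    }

    public int Length { get { return m_Data.Length; } }

    public void Write (T value)
    {
        m_Data[m_WritePos] = value;
        m_WritePos = (m_WritePos + 1) % m_Data.Length;
    }

    public T Read ()
    {
        T value = m_Data[m_ReadPos];
        m_ReadPos = (m_ReadPos + 1) % m_Data.Length;
        return value;
    }

    public void Clear (T value)
    {
        for (int i = 0; i < m_Data.Length; ++i) {
            m_Data[i] = value;
        }
        m_ReadPos = 0;
        m_WritePos = 0;
    }
}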

In the envelope detector class, the detection mode is selected by assigning an instance of the corresponding class to the m_Detector field:

public class EnvelopeDetector
{
    protected float m_AttackTime;
    protected float m_ReleaseTime;
    protected float m_AttackGain;
    protected float m_ReleaseGain;
    protected float m_SampleRate;
    protected float m_EnvelopeSample;

    protected DetectionMode m_DetectMode;
    protected IEnvelopeDetection m_Detector;

    // Continued...
public DetectionMode DetectMode
{
    get { return m_DetectMode; }
    set {
        m_DetectMode = value;

        switch (m_DetectMode) {
            case DetectionMode.Peak:
                m_Detector = new DetectPeak();
                break;

            case DetectionMode.Rms:
                m_Detector = new DetectRms();
                break;
        }
    }
}
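
As a quick usage sketch (my addition): the EnvelopeDetector constructor used here, taking attack time, release time, and sample rate, is an assumption and isn’t shown in the snippets above; only DetectMode and GetEnvelope appear in them.

// Usage sketch; the EnvelopeDetector constructor used here (attack time,
// release time, sample rate) is assumed and not shown in the post's snippets.
public float[] ExtractEnvelope (float[] monoSamples)
{
    EnvelopeDetector detector = new EnvelopeDetector(0.005f, 0.1f, 48000f);
    detector.DetectMode = DetectionMode.Rms;

    float[] envelope;
    detector.GetEnvelope(monoSamples, out envelope);

    // envelope[i] is the smoothed amplitude corresponding to monoSamples[i].
    return envelope;
}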

Now that we’ve seen how to extract the envelope from an audio signal, we can use it to create a compressor/limiter component for Unity. That will be coming up in part 2.

DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding.  Among its many features, it supports 2D and 3D graphics through OpenGL.  As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements.  We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days.  To illustrate, let’s take the example of a magic spell; both its visual and aural elements will have randomness, but will these sync up together?  In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic?  If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element accurately reflects the aural part.  It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this?  Analysing a sound in the time domain (represented by the waveform) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum.  The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides.  Instead of displaying amplitude (y axis) vs. time (x axis) as in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis).  This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but it’s thanks to the FFT (Fast Fourier Transform) and the dramatically increased floating-point performance of modern computers that using the DFT for analysis in real time has become practical.  Calculated directly, the DFT requires on the order of N² operations, which we know is very slow.  The FFT uses a highly optimized algorithm (the Cooley-Tukey is the most widely used, but there are several others) to calculate the DFT of a signal.  Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.

The DFT works by breaking a signal down into its individual components.  This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal.  Each sine wave represents a single frequency component; its amplitude is the strength of that frequency.  The DFT results in a complex signal with both cosine and sine wave components.  This information isn’t terribly useful to us in this form.  From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form.  In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value).  The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( a² + b² )
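
In code, this amounts to taking the square root of the summed squares of the real and imaginary parts of each output bin. Here is a minimal sketch (my addition, not from the original code), assuming the FFT output is stored as interleaved real/imaginary pairs, which is the layout used later in this post:

#include <cmath>

// Sketch: compute the magnitude of each frequency bin from FFT output stored as
// interleaved real/imaginary pairs. Only bins 0..N/2 are kept, since the
// spectrum mirrors around Nyquist (magnitudes must hold N/2 + 1 values).
void calcMagnitudes (const double* fftData, float* magnitudes, const int numSamples)
{
    for (int k = 0; k <= numSamples / 2; ++k) {
        const double re = fftData[k * 2];
        const double im = fftData[k * 2 + 1];
        magnitudes[k] = static_cast<float>(std::sqrt(re * re + im * im));
    }
}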

This value, plotted against the frequency (which is simply the index of the complex value in the output array), gives us the spectrum of the signal.  In other words, if we take the FFT of an N-point signal, we get N complex values (real/imaginary pairs) as output.  Each index in the complex output represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N being the sampling rate.  The magnitude of the corresponding complex value gives us the amplitude of that frequency component in the signal.  A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2).  Here is a sample plot of a piano chord I recorded, using 5 512-sample buffers from the audio:

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (assuming a sampling rate of 48kHz, this would equate to approximately 0 – 1920 Hz).  Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.

There are a number of parameters I considered in generating particles: the number of particles, how they’re generated/constructed, their radius, and their color.  Particles are only generated during transients, so I check a previous RMS value against the new one to determine whether particles should be constructed.  This method isn’t foolproof, but in my tests it’s been working quite well for my purposes.  The number of particles generated is also related to the RMS of the signal: the louder it is, the more particles are made.  These are evenly divided across the whole spectrum, so that at least a few particles from each frequency range are present.  The radius of each particle is determined by its frequency component (its position along the x axis of the frequency plot above).  The lower the frequency, the larger the particle.  Finally, the color, or more specifically the brightness, of each particle is determined by the frequency amplitude.  Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using PortAudio for real-time playback, so all the audio processing happens in a callback function):

int audioCallback (const void* input, void* output, unsigned long samples, const PaStreamCallbackTimeInfo *timeInfo, PaStreamCallbackFlags statusFlags, void* userData)
{
    const float* in = (const float*)input;
    float* out = (float*)output;
    PaAudioData* data = (PaAudioData*)userData;

    memset(data->buffer, 0, sizeof(double) * SAMPLE_BLOCK * 2);
    float rms = CFDsp::cf_SSE_calcRms(in, samples);

    // Window the left-channel samples into the even (real) slots of the FFT buffer,
    // while passing the two-channel input straight through to the output.
    for (int i = 0; i < samples; ++i) {
        data->buffer[i*2] = *in * hammingWindow(i, SAMPLE_BLOCK);

        *out++ = *in++;
        *out++ = *in++;
    }

    // Calculate DFT using FFT.
    fft(data->buffer, SAMPLE_BLOCK);

    data->amp_prev = data->amp_now;
    data->amp_now = rms;

    return paContinue;
}

Since the FFT function I’m using requires interleaved data, the buffer needs to be zeroed out first and then each real-valued sample is stored in the even-numbered indices.  The RMS value is acquired using a function I wrote with SSE intrinsics for optimal speed.

float cf_SSE_calcRms (const float* inBuffer, const int length)
{
    assert(length % 4 == 0);

    __m128* mbuf = (__m128*)inBuffer;
    __m128 m1, m2;
    __m128 macc = _mm_set1_ps(0.f);
    __m128 _1oN = _mm_set1_ps(1.f / static_cast<float>(length));

    // Accumulate (x * x) / N four samples at a time; the four partial sums are combined below.
    for (int i = 0; i < length; i+=4) {
        m1 = _mm_mul_ps(*mbuf, *mbuf);
        m2 = _mm_mul_ps(m1, _1oN);
        macc = _mm_add_ps(m2, macc);
        mbuf++;
    }

    __declspec(align(16)) float result[4];
    _mm_store_ps(result, macc);

    return sqrtf(result[0] + result[1] + result[2] + result[3]);
}
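
The hammingWindow helper called in the audio callback isn’t shown above; a sketch (my addition) of the standard Hamming window form it presumably implements would be:

#include <cmath>

// Sketch of the hammingWindow helper used in the callback (assumed form): the
// window coefficient for sample i of a length-sample block.
inline double hammingWindow (const int i, const int length)
{
    const double pi = 3.14159265358979323846;
    return 0.54 - 0.46 * std::cos(2.0 * pi * i / (length - 1));
}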

Inside the for loop, a Hamming window is applied to the signal prior to sending it to the FFT, tapering it along the edges to minimize or eliminate spectral leakage.  In the application’s update method, I test for the presence of a transient in the current buffer.  If one exists, particles are added.

if (audioData.isTransient())
{
    Vec2i randPos;

    // 50 pixel border.
    randPos.x = Rand::randInt(50, getWindowWidth() - 50);
    randPos.y = Rand::randInt(50, getWindowHeight() - 50);
    mParticleController.addParticles(randPos, audioData);
}
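
The isTransient method isn’t shown either; based on the description above (comparing the previous RMS value against the current one), it might look something like this (my sketch), with the threshold being a made-up value that would need tuning:

// Sketch: flag a transient when the RMS rises noticeably from one buffer to the
// next. The 0.02f threshold is illustrative only.
bool PaAudioData::isTransient () const
{
    return (amp_now - amp_prev) > 0.02f;
}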

The adding of particles is handled by the ParticleController class:

void ParticleController::addParticles (const Vec2i& loc, const PaAudioData& data)
{
    int stride = static_cast<int>((64.f - powf((64.f * data.amp_now), 1.32f)) + 0.5f);
    stride = (stride < 1 ? 1 : stride);

    // 512 == sample block size; represents the horizontal frequency axis in the spectral analysis.
    for (int i = 0; i < 512; i+=stride) {
        Vec2f randVec = Rand::randVec2f() * 6.5f;
        mParticles.push_back(Particle(loc + randVec, i/1024.f, data.mag(i) * 18.f));
    }
}

First I calculate a stride value that sets the step size of the following for loop (this evenly divides the particles across the x axis).  As you can see, it is determined by the volume (calculated as RMS) of the current audio buffer: the louder the buffer, the smaller the stride and the more particles are created.  When creating each individual particle, the i/1024 argument determines the radius; as I mentioned, this is based on the frequency value along the x axis.  To constrain it within reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )

The color of each particle is first determined using HSV.  The hue is randomly chosen, but the saturation and value (brightness) are based on the data.mag(i) * 18 argument, i.e. the amplitude of the frequency component.  This is then converted into an RGB value to set the color of the individual particle.

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and RMS to usable values in the particle engine.  That’s the reason for several of the “magic numbers” in the function above (1024 is not really one of them: it’s double our sample block of 512, since the data is stored in interleaved format the buffer is actually 1024 values long, and dividing i by it keeps us in the 0 – 0.5 frequency range I mentioned previously).  Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further.  I feel that this is a good first step, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):

DFT Spectral Particle Engine from Christian on Vimeo.