Tag Archives: interpolation

Audio Resampling: Part 2

This post has been edited to clarify some of the details of implementing the polyphase resampler [May 14, 2013].

Time to finish up this look at resampling.  In Part 1 we introduced the need for resampling to avoid aliasing in signals, and its implementation by windowed sinc FIR filters.  This is a costly operation, however, especially for real-time processing.  Let’s consider the case of upsampling-downsampling a 1 second audio signal at a sampling rate of 44.1 kHz and a resampling factor of 4X.  With the brute-force method that we saw in Part 1, we would first upsample the signal by 4X.  The buffer grows to 4 x 44,100 = 176,400 samples that we now have to filter, which will take roughly 4 times as long to compute.  And not only once, but twice, because the decimation filter also operates at this sample rate.  The way around this is to use polyphase interpolating filters.

A polyphase filter is a set of subfilters where the filter kernel has been split up into a matrix with each row representing a subfilter.  The input samples are passed into each subfilter, and the subfilter outputs are then summed together to produce the output.  For example, given a filter with impulse response h(n), whose transfer function is

H(z) = h(0) + h(1) z^-1 + h(2) z^-2 + h(3) z^-3 + ...

we can separate it into two subfilters, E0 and E1

E0(z) = h(0) + h(2) z^-1 + h(4) z^-2 + ...
E1(z) = h(1) + h(3) z^-1 + h(5) z^-2 + ...

where E0 contains the even-numbered kernel coefficients and E1 contains the odd ones.  We can then express H(z) as

H(z) = E0(z^2) + z^-1 E1(z^2)

This can of course be extended to any number of subfilters.  In fact, the number of subfilters in the polyphase interpolating/decimating filters is exactly the same as the resampling factor.  So if we’re upsampling-downsampling by a factor of 4X, we use 4 subfilters.  However, we still have the problem of filtering 4 times the number of samples as a result of the upsampling.  Here is where the Noble Identity comes in.  It states that, for multirate signal processing, filtering with H(z) before upsampling by L is equivalent to filtering with H(z^L) after upsampling, and likewise that filtering with H(z) after downsampling by M is equivalent to filtering with H(z^M) before downsampling.

Noble Identities for upsampling and downsampling.

When you think about it, it makes sense.  Recall that upsampling involves zero-insertion between the existing samples.  These 0 values, when passed through the filter, simply evaluate out to 0 and have no effect on the resulting signal, so they are wasted calculations.  We are now able to save a lot of computational expense by filtering prior to upsampling the signal, and then filtering after downsampling.  In other words, the actual filtering takes place at the original sample rate.  The way this works in with the polyphase filter is quite clever: through a commutator switch.

Let’s take the case of decimation first because it’s the easier one to understand.  As the signal’s samples arrive, a commutator hands them out to the subfilters in turn, so each of the M subfilters (M is the decimation factor) receives every Mth sample and no work is spent on outputs that would only be discarded.  The results of the subfilters are summed together to produce the signal, back at its original sample rate and without aliasing components.  This is illustrated in the following diagram:

Polyphase decimator with a commutator switch that selects the input.

Interpolation works much the same, but obviously on the other end of the filter.  Each input sample from the signal passes through the polyphase filter, but instead of summing together the subfilter outputs, a commutator switch selects the outputs of the subfilters in turn to make up the resulting upsampled signal.  Think of it this way: one sample passes into each of the subfilters, which then produce L outputs (L being the interpolation factor and the number of subfilters).  The following diagram shows this:

Polyphase interpolating filter with a commutator switch that selects the output.

We now have a much more efficient resampling filter.  There are other methods that exist as well to implement a resampling filter, including the Fast Fourier Transform, which is a fast and efficient way of doing convolution and would be a preferred method of implementing long FIR filters.  At lower orders, however, straight convolution is just as fast as the FFT (if not slightly faster at orders less than 60 or so); the huge gain in efficiency really only occurs with a kernel length greater than 80 – 100 or so.

Before concluding, let’s look at some C++ code fragments that implement these two polyphase structures.  Previously I had done all the work inside a single method that contained all the for loops to implement the convolution.  Since we’re dealing with a polyphase structure, it naturally follows that the code should be refactored into smaller chunks, since each filter branch can be thought of as an individual filter.

First, given the prototype filter’s kernel, we break it up into the subfilter branches.  The number of subfilters (branches) we need is simply equal to the resampling factor.  Each filter branch will then have a length equal to the prototype filter’s kernel length divided by the factor, then +1 to make up for the truncation of the integer division.  That is,

branch order = (prototype filter kernel length / factor) + 1

The total order of the polyphase structure will then be equal to the branch order x number of branches, which will be larger than the prototype kernel, but any extra elements should be initialized to 0 so they won’t affect the outcome.
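The splitting step described above can be sketched roughly as follows.  This is only an illustration under assumptions: the function name splitIntoBranches and the use of std::vector are not from the original code.  Branch p takes every factor-th coefficient starting at index p, and any unused slots stay at 0.

```cpp
#include <cstddef>
#include <vector>

// Split a prototype FIR kernel into `factor` polyphase branches (factor > 0).
// Branch p holds coefficients h[p], h[p + factor], h[p + 2*factor], ...
// Slots past the end of the prototype kernel stay at 0.
std::vector<std::vector<double>> splitIntoBranches(const std::vector<double>& kernel,
                                                   std::size_t factor)
{
    // Branch order as given in the text: (kernel length / factor) + 1.
    const std::size_t branchLen = kernel.size() / factor + 1;
    std::vector<std::vector<double>> branches(factor,
                                              std::vector<double>(branchLen, 0.0));

    for (std::size_t n = 0; n < kernel.size(); ++n)
        branches[n % factor][n / factor] = kernel[n];

    return branches;
}
```

For a 7-tap kernel and a factor of 3, this gives 3 branches of length 3, with the last slot of the second and third branches left at 0.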

The delay line, z, for the interpolator will have a length equal to the branch order.  Again, each branch can be thought of as a separate filter.  First, here is the decimating resampling code:



Example of decimating resampler code.

As can be seen, calculating each polyphase branch is handled by a separate object with its own method of calculating the subfilter (processDownsample).  We index the input signal with the variable M, which advances at the rate of the resampling factor.  The gain adjust can be more or less ignored depending on how the resampling is implemented.  In my case, I have precalculated the prototype filter kernels to greatly improve efficiency.  However, the interpolation process decreases the amplitude of the signal by a factor equal to the resampling factor, because only one in every L samples of the zero-stuffed signal is non-zero.  In other words, if our factor is 3X, we need to amplify the interpolated signal by a factor of 3 (about 9.5dB).  I’ve done this by amplifying the prototype filter kernel so I don’t need to adjust the gain during interpolation.  But this means I need to compensate for that in decimation by reducing the level of the signal by the same amount.
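For readers who want something concrete to follow along with, here is a minimal sketch of a polyphase decimator.  It is not the listing discussed above: the function name polyphaseDecimate and the vector-of-branches layout are assumptions, each branch is computed inline rather than by a separate object, and the gain compensation is left out.

```cpp
#include <cstddef>
#include <vector>

// Decimate `x` by M = branches.size(), where branch p holds h[p], h[p + M], ...
// Equivalent to filtering x with the full kernel h and keeping every M-th
// output, except that only the outputs that are kept are ever computed.
std::vector<double> polyphaseDecimate(const std::vector<double>& x,
                                      const std::vector<std::vector<double>>& branches)
{
    const std::size_t M = branches.size();
    std::vector<double> y;
    if (M == 0) return y;

    for (std::size_t m = 0; m * M < x.size(); ++m) {   // input index advances by M
        double acc = 0.0;
        for (std::size_t p = 0; p < M; ++p) {          // one pass per branch
            for (std::size_t j = 0; j < branches[p].size(); ++j) {
                const std::size_t k = j * M + p;       // tap index in the prototype kernel
                if (m * M >= k)                        // samples before x[0] count as 0
                    acc += branches[p][j] * x[m * M - k];
            }
        }
        y.push_back(acc);                              // sum of all branch outputs
    }
    return y;
}
```

With the kernel {1, 1, 1} split into the branches {1, 1} and {1, 0}, decimating {1, 2, 3, 4, 5, 6} by 2 gives the same values as filtering first and then keeping samples 0, 2 and 4.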

Here is the interpolator code:

Example of interpolation resampler code.

As we can see, it’s quite similar to the decimation code, except that the output selector requires an additional for loop to distribute the results of the polyphase branches.  Similarly though, it uses the same polyphase filter object to calculate each filter branch, using the delay line as input instead of the input signal directly.  Here is the code for the polyphase branches:

Code implementing the polyphase branches.

Again, quite similar, but with a few important differences.  The decimation/downsampling MACs the input sample by each kernel value whereas interpolation/upsampling MACs the delay line with the branch kernel.
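The interpolating side can be sketched in the same spirit.  Again this is an illustration, not the original listing: polyphaseInterpolate is a hypothetical name, all branches are assumed to have the same length, and no gain adjustment is applied.  Each input sample is pushed into the delay line z, and every branch then MACs the delay line with its own coefficients to produce one output sample.

```cpp
#include <cstddef>
#include <vector>

// Interpolate `x` by L = branches.size(), where branch p holds h[p], h[p + L], ...
// All branches are assumed to have the same length.
std::vector<double> polyphaseInterpolate(const std::vector<double>& x,
                                         const std::vector<std::vector<double>>& branches)
{
    const std::size_t L = branches.size();
    std::vector<double> y;
    if (L == 0) return y;

    const std::size_t branchLen = branches[0].size();
    std::vector<double> z(branchLen, 0.0);     // delay line, newest sample at z[0]
    y.reserve(x.size() * L);

    for (double sample : x) {
        // Shift the delay line by one and store the new input sample.
        for (std::size_t j = branchLen; j-- > 1; )
            z[j] = z[j - 1];
        if (branchLen > 0) z[0] = sample;

        // Commutator: each branch contributes one output sample in turn.
        for (std::size_t p = 0; p < L; ++p) {
            double acc = 0.0;
            for (std::size_t j = 0; j < branchLen; ++j)
                acc += branches[p][j] * z[j];
            y.push_back(acc);
        }
    }
    return y;
}
```

With the kernel {1, 2, 3, 4} split into the branches {1, 3} and {2, 4}, the input {1, 2} produces {1, 2, 5, 8}, which matches convolving the zero-stuffed signal {1, 0, 2, 0} with the full kernel.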

Hopefully this clears up a bit of confusion regarding the implementation of the polyphase filter.  Though this method splits up and divides the tasks of calculating the resampling into various smaller objects than before, it is much easier to understand and maintain.

Resampling, as we have seen, is not a cheap operation, especially if a strong filter is required.  However, noticeable aliasing will render any audio unusable, and once it’s in the signal it cannot be removed.  Probably the best way to avoid aliasing is to prevent it in the first place by using band-limited oscillators or other methods to keep all frequencies below the Nyquist limit, but this isn’t always possible as I pointed out in Part 1 with ring modulation, distortion effects, etc.  There is really no shortage of challenges to deal with in digital audio!

Audio Resampling: Part 1

Resampling in digital audio has two main uses.  One is to convert audio into a sampling rate needed by a particular system or engine (e.g. converting 48kHz audio to the 44.1kHz required by CDs).  The second is to avoid aliasing during signal processing by raising the Nyquist limit.  I will be discussing the latter.

Lately I’ve been very busy working on improving and enhancing the sound of ring modulation for a fairly basic plug-in being developed by AlgoRhythm Audio (coming soon).  I say basic because as far as ring modulation goes, there are few DSP effects that are simpler in theory and in execution.  Simply take some input signal, multiply it by a carrier signal (usually some kind of oscillator like a sine wave), and we have ring modulation.  The problem that arises, however, and how this connects with resampling, is that this creates new frequencies in the resulting output that were not present in either signal prior to processing.  These new frequencies could very well violate the Nyquist limit of the current sampling rate during processing, and that leads us to resampling as a way to clamp down on aliasing frequencies that can be introduced as a result.
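To make the multiplication concrete, here is a minimal sketch of ring modulation with a sine carrier.  The function name ringModulate and its parameters are assumptions for illustration, not the plug-in’s code.  Multiplying two sinusoids at frequencies f1 and f2 produces components at f1 + f2 and |f1 - f2|, which is where the new frequencies come from.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Ring modulation: multiply each input sample by a sine carrier.
std::vector<double> ringModulate(const std::vector<double>& input,
                                 double carrierHz, double sampleRate)
{
    const double twoPi = 6.283185307179586;
    std::vector<double> out(input.size());
    for (std::size_t n = 0; n < input.size(); ++n) {
        const double carrier =
            std::sin(twoPi * carrierHz * static_cast<double>(n) / sampleRate);
        out[n] = input[n] * carrier;
    }
    return out;
}
```

For example, a 3000Hz input tone with a 1000Hz carrier yields components at 2000Hz and 4000Hz, neither of which was in the original signal.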

Aliasing is an interesting phenomenon that occurs in digital audio, and in every sense of the word is an undesired noise that we need to make sure does not pollute our audio.  There are many resources around that go into more detail on aliasing, but I will give a brief overview of it with some audio and visual samples.

Aliasing occurs when there are frequencies present in a signal that are greater than the Nyquist limit (half of the sampling rate).  What happens in such a case is that the sampling rate is not high enough to properly capture (sample) the high frequency of the signal, and so the frequency “folds over” and creates aliases that are mirror images of the original frequency.  Here, for example, is a square wave at 4000Hz created using 20 harmonics at a sampling rate of 44.1kHz (keep in mind that 4000Hz is the fundamental frequency, and that square waves contain many additional frequencies above that depending on how many harmonics were used to create it, so in this case Nyquist is still being violated):

4000Hz square wave with aliasing

Notice the low tone below the actual 4000Hz frequency.  Here is the resulting waveform of this square wave that shows us visually that we’re not sampling fast enough to accurately reproduce the waveform.  Notice the inconsistencies in the waveform.
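Where an aliased component ends up can be worked out by folding its frequency back around the Nyquist limit.  The helper below is a small sketch (the name aliasedFrequency is an assumption) for non-negative frequencies.

```cpp
#include <cmath>

// Frequency at which a sampled sinusoid of frequency f (f >= 0) actually
// appears: frequencies reflect ("fold") around multiples of the Nyquist limit.
double aliasedFrequency(double f, double sampleRate)
{
    const double r = std::fmod(f, sampleRate);            // position within one sample-rate span
    return (r <= sampleRate / 2.0) ? r : sampleRate - r;  // reflect the upper half back down
}
```

A square wave contains only odd harmonics, so a 4000Hz square wave has a component at 11 x 4000 = 44,000Hz.  At a 44.1kHz sampling rate this folds down to 44,100 - 44,000 = 100Hz, which is plausibly the low tone heard in the example; likewise the 7th harmonic at 28,000Hz lands at 16,100Hz.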


Now, going to the extreme a bit, here is the same square wave sampled at a rate of 192kHz.

4000Hz square wave with no aliasing

It’s a pure 4000Hz square wave tone.  Examining the waveform of this square wave shows us that the sampling rate was more than adequate to reproduce this signal digitally:


Not all signal processing effects are susceptible to aliasing, and certainly not to the same degree.  Because ring modulation produces additional inharmonic frequencies, it is a prime example of a process that is easily affected by aliasing.  Others include distortion and various other kinds of modulation techniques (especially when taken to the extreme).  However, ring modulation with a sine wave is generally safe as long as the frequency of the sine wave is kept low enough, because sine waves have no harmonics, only the fundamental.  Introducing other wave types into this process, however, can quite easily bring about aliasing.

Here is an example of ring modulation with a sawtooth wave sweeping up from about 200Hz to 5000Hz.  As it glides up, you will be able to hear the aliasing kicking in at around 0:16 or so.  The example with no aliasing has been upsampled by 3X before processing, then downsampled by 3X back to its original rate.

Ring modulation with a sawtooth sweep, with aliasing

Ring modulation with sawtooth sweep, with no aliasing

So how do upsampling and downsampling work?  In theory, and even to some extent in practice, it’s very straightforward.  The issue, as we will see, is in making it efficient and fast.  In DSP we’re always concerning ourselves with speedy execution times to avoid latency or audio dropouts or running out of memory, etc.

To upsample by a factor of L, we insert L - 1 zero-valued samples after each original sample.  (e.g. upsampling by 3X, [1, 2, 3, 4] becomes [1,0,0,2,0,0,3,0,0…])  However, this introduces unwanted images of the original spectrum into our audio, so we need to interpolate these values so that they “join” the sample values of the original waveform.  This is accomplished by using an interpolating low-pass filter.  This entire process is known as interpolation.

Downsampling is much the same.  To downsample by a factor of M, we keep every M-th sample of the original signal and discard the rest.  (e.g. downsampling by 2X, [1,2,3,4,5] becomes [1,3,5…])  Any content above the Nyquist limit of the new sampling rate will alias, so before discarding samples we low-pass filter with a cutoff frequency at the new Nyquist limit.  This entire process is known as decimation.
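The two sample-shuffling steps on their own, without any filtering, can be sketched as follows.  The function names zeroStuff and keepEveryMth are assumptions for illustration; the low-pass filter would run after the first and before the second.

```cpp
#include <cstddef>
#include <vector>

// Upsampling step: follow every sample with L - 1 zeros (no filtering yet).
std::vector<double> zeroStuff(const std::vector<double>& x, std::size_t L)
{
    std::vector<double> y(x.size() * L, 0.0);
    for (std::size_t n = 0; n < x.size(); ++n)
        y[n * L] = x[n];
    return y;
}

// Downsampling step: keep every M-th sample, starting with the first.
// The input is assumed to have been low-pass filtered already.
std::vector<double> keepEveryMth(const std::vector<double>& x, std::size_t M)
{
    std::vector<double> y;
    if (M == 0) return y;
    for (std::size_t n = 0; n < x.size(); n += M)
        y.push_back(x[n]);
    return y;
}
```

Zero-stuffing [1, 2, 3, 4] by 3 gives [1,0,0,2,0,0,3,0,0,4,0,0], and keeping every 2nd sample of [1,2,3,4,5] gives [1,3,5].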

Fairly straightforward.  Proceeding with this method, however, would be known as brute force, which is generally not a good way to go.  The reason why becomes clearer when we consider what kind of low-pass filter we need for this operation.  The ideal filter would be one that would brick-wall attenuate all frequencies higher than Nyquist and leave everything else untouched (thus preserving all the frequencies and tonal content of our original audio).  This is, alas, impossible, as it would require an infinitely long filter kernel.  The frequency response of this ideal filter is the rectangular function.


The rectangular function.

By taking the Fourier transform of the rectangular function we end up with the sinc function, which is given by:

y(x) = sin(x) / x, which becomes y(x) = sin(πx) / (πx) for signal processing.

Graph of the sinc function, sin (πx) / πx

The sinc function trails on for infinity in both directions, which can be seen in the graph above, so we need to enforce bounds around it by applying a window function.  Windowing is a method of designing FIR filters by essentially “surrounding” a function (in our case the sinc) by the window, which enforces bounds so that we can properly derive a filter kernel for use in calculations.  The rectangular function shown above is a type of window, but as I mentioned, infinite slope is a deal-breaker in audio.

The Blackman window, given by the function

w(i) = 0.42 – 0.5 cos(2πi / M) + 0.08 cos(4πi / M),

where M is the order of the filter (the kernel has M + 1 points, i = 0 to M), is a good choice for resampling because it offers a good stop-band attenuation of -74dB with good rolloff.  Putting this together with the sinc function, we can derive the filter kernel with the following formula:

Windowed-sinc kernel formula*

where fc is the normalized cutoff frequency.  When i = M/2, to avoid a divide by zero, h(i) = 2fc.  K is a constant value used to achieve unity gain at zero frequency and can be ignored while calculating the kernel coefficients.  After all coefficients have been calculated, K can be found by summing together all the coefficients, h(i), and then dividing each by the resulting sum.

* Source: Smith, Steven W., “The Scientist and Engineer’s Guide to Digital Signal Processing”, Chapter 16.
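As a rough sketch of how the kernel calculation might look in code, the function below assumes the formula takes the form h(i) = K · sin(2π fc (i – M/2)) / (π (i – M/2)) · w(i), which is consistent with the centre value of 2fc given above.  It also assumes M is even and greater than 0, so that the kernel has M + 1 points.  The name windowedSincKernel is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Blackman-windowed sinc low-pass kernel with M + 1 points (M even, M > 0).
// fc is the cutoff as a fraction of the sample rate (0 < fc < 0.5).
std::vector<double> windowedSincKernel(std::size_t M, double fc)
{
    const double pi = 3.141592653589793;
    std::vector<double> h(M + 1);
    double sum = 0.0;

    for (std::size_t i = 0; i <= M; ++i) {
        const double t = static_cast<double>(i) - static_cast<double>(M) / 2.0;
        const double x = static_cast<double>(i) / static_cast<double>(M);

        // Sinc term; at the centre tap use the limit 2*fc to avoid dividing by zero.
        const double sinc = (t == 0.0) ? 2.0 * fc
                                       : std::sin(2.0 * pi * fc * t) / (pi * t);
        // Blackman window.
        const double w = 0.42 - 0.5 * std::cos(2.0 * pi * x)
                              + 0.08 * std::cos(4.0 * pi * x);
        h[i] = sinc * w;
        sum += h[i];
    }

    // Apply K: divide by the sum so the gain at zero frequency is 1.
    for (double& c : h)
        c /= sum;
    return h;
}
```

The coefficients of the returned kernel sum to 1, and the kernel is symmetric about its centre tap, which is what gives the filter its linear phase.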

Now that we have the filter design, let’s consider the properties of the FIR filter and compare them to IIR filter designs.  IIR filters give us better performance and attenuation at lower orders, meaning that they execute faster and perform better with fewer calculations than FIR filters.  However, IIR filters are still not powerful enough, even at slightly higher orders, to give us the performance we need for resampling, and trying to push IIR filters into very high orders can make them unstable and/or susceptible to quantization error due to the nature of recursion.  IIR filters also do not offer linear phase response.  FIR filters are the better choice for these reasons, but the unfortunate drawback is that they execute slowly due to being implemented by convolution.  In addition, they need to be pushed to high orders to give us the performance needed for attenuating aliasing frequencies.

However, the order of the interpolating low-pass filter can be negotiated based on the frequency content of the audio signal(s) involved.  If the audio is sufficiently oversampled, it will contain little energy near Nyquist, and as such a lower order filter with a gentler rolloff can be used without attenuating actual frequencies in the signal or otherwise adversely affecting the audio.  There are plenty of cases where we just don’t know the frequency content of the audio signals involved in processing, however, so a strong filter may be needed in these cases.  Here is a graph of a 264-order window-sinc filter (in other words, a filter kernel of length 265 including the sample x(0)):

264-order window-sinc low-pass filter frequency response (cutoff frequency at 10kHz, resulting in a transition band of 882Hz)

With this in consideration, it can be easy to see that convolving a signal with a 264-order FIR filter is computationally costly for real-time processing.  There are a number of ways to improve upon this when it comes to resampling.  One is using the FFT to apply the filter.  Another interesting solution is to combine the upsampling/downsampling process into the filter itself, which can further be optimized by turning it into a polyphase filter.

The theory and implementation of a polyphase filter is a fairly long and involved topic on its own, so that will be forthcoming in part 2, where we look at how to implement resampling efficiently.

The Making of a Plug-In: Part 5 (Beta & new features)

In this post I’m going to discuss two new features I have added to the Match Envelope plug-in that I’m pretty excited about.  And with some additional bug fixing, it’s in a good workable state, so I can offer up the plug-in as a Beta version.

The first of the new features I added is an option to invert the envelope, so instead of matching the source audio that you extract the envelope from, the resulting audio is opposite in shape.  Of course this can further be tweaked by use of the “match strength” and “gain” parameters.  The actual process of doing this is very simple: take the interpolated value (from cubic interpolation) ‘ival’ and subtract it from 1 (digital waveforms in floating point representation all have sample values between -1 and 1).  As simple as this was, it did introduce a bug that took a long time to find.

Occasionally I would get artifacts after applying the process when inverting the envelope, and I discovered that the resulting interpolated value was nan in these cases (nan = not a number).  When the source audio consisted of sharp attacks, and thus sharp rises in the waveform, the interpolated value exceeded 1 by a small amount.  Then, when it gets to this point in the formula for calculating match strength:

pow(ival, matchStr),

it will result in an imaginary number when matchStr is around 0.5, because the inverted value, 1 - ival, is slightly negative.  In essence, this part of the formula ends up trying to square-root a negative number.  The easy fix for this was just to take the absolute value of ‘ival’ such that pow(fabs(ival), matchStr) will then never result in nan.  This is a better solution than just clamping ival to 0 if it is negative, because clamping would actually alter the audio slightly by messing with the interpolation.
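The behaviour is easy to reproduce in isolation.  In the sketch below, applyMatchStrength is a hypothetical helper name, not the plug-in’s actual function; it only shows that std::pow with a negative base and a non-integer exponent returns NaN, and that taking the magnitude first avoids it.

```cpp
#include <cmath>

// Hypothetical helper: raise the envelope value to the match-strength exponent.
// std::pow with a negative base and a non-integer exponent returns NaN,
// so the magnitude of ival is taken first.
double applyMatchStrength(double ival, double matchStr)
{
    return std::pow(std::fabs(ival), matchStr);
}
```

With ival = -0.01 and matchStr = 0.5, a plain std::pow(ival, matchStr) is NaN, whereas the helper returns the square root of 0.01.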

The second feature I added is a “window offset” parameter, which shifts the envelope left or right.  Oddly enough, this one took less time to implement and required less testing/fixing than the invert feature even though it sounds more complex to code.  In fact, it’s pretty simple as well.  Similar to a circular buffer, instead of shifting the elements around in the buffers that contain the envelope data, I just offset the cursor that points to the location in the buffer.

If I want to shift the envelope to the left (the result will in effect anticipate the source audio’s envelope), the cursor needs to be offset by a positive number.  If I want to shift the envelope to the right (emulating a delay of the envelope), the cursor needs to be offset by a negative number.  This may seem a bit unintuitive at first, but we need to consider how the offset/placement of the cursor affects how the processing begins. i.e. if it is negatively offset, it will be delayed in processing the envelope data, thus shifting the envelope to the right.
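A minimal sketch of the idea, assuming the envelope lives in a plain array: the data stays where it is and only the read position moves.  The function name envelopeAt, the signed offset parameter and the fallback value used outside the buffer are all assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Read the envelope at `pos + offset` without moving any data.
// A positive offset reads ahead (the envelope shifts left / anticipates);
// a negative offset reads behind (the envelope shifts right / is delayed).
// Positions outside the buffer return `fallback` (an assumed policy).
double envelopeAt(const std::vector<double>& env, std::size_t pos,
                  long offset, double fallback = 1.0)
{
    const long idx = static_cast<long>(pos) + offset;
    if (idx < 0 || idx >= static_cast<long>(env.size()))
        return fallback;
    return env[static_cast<std::size_t>(idx)];
}
```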

Before moving on, here is a new video showcasing these new features, this time with some slightly more exciting audio for demonstration.

Lastly, in terms of bug fixing, I had to deal with an issue that arose in Soundforge due to the fact that it does not save the state of a plug-in once it’s invoked.  By contrast, when a plug-in is opened in Audacity, it is not “destroyed” until the program is exited; it only switches to a suspend state while it’s not active.  This means that a plug-in’s constructor will only be called once, thus saving the state of parameters/variables between uses as long as the host isn’t closed.  Soundforge does not do this, however, and this caused problems with parameter and variable reinitialization.

Fortunately a fix was found, but it has further reaffirmed that VST is not really built for offline processing.  While we can certainly coax VST plug-ins into it, I’ve hit several limitations in terms of what I can accomplish within the bounds of the VST SDK, as well as inadequate support from hosts for the offline capabilities it does provide.

As such, this might be one of the last entries in this particular making-of series on the Match Envelope plug-in.  I have learned a ton through this process, and through sticking it out when it became clear that this particular process is probably better suited to a standalone app or command-line program where I could have had much more control over things.

I very much look forward to developing my next plug-in, however, which will most assuredly be a real-time process of some kind, and I am excited to learn and to tackle the challenges that await!  Without further ado, here are links to the beta of the Match Envelope VST plug-in:

Match Envelope beta v0.9.2.23 — Mac

Match Envelope beta v0.9.2.23 — Win


The Making of a Plug-In: Part 2

This entry in my making of a plug-in series will detail what went into finalizing the prototype program for the Match Envelope plug-in.  A prototype of this kind is usually a command-line program wherein much of the code is actually written to implement functionality and features, and then later transferred into a plug-in’s SDK (in my case, the VST SDK).  Plug-ins, by their very nature, are not self-executable programs and need a host to run, so it is more efficient to test the fundamental code structure within a command-line program.

Having now completed my prototype, I want to first share some of the things I improved upon as well as new features I implemented.  One of the main features of the plug-in that I mentioned in part 1 was the match % parameter.  This effectively lets you control how strongly the envelope you’re matching affects the audio, and rather than being a linear effect, it is proportional to the difference between the amplitude of the envelope and the amplitude of the audio.  Originally this was the formula I used (from part 1):

We could see that this mostly gave me the results I was after, but if we look closely at the resulting waveform, there is some asymmetry in comparing it to the original (look at the bottom of the waveforms).  One of the mistakes in this formula was in comparing the interpolated value ‘ival’ with the actual sample amplitude of the ‘buffer’.  To remedy this, I now also extract the envelope of the destination audio that we are applying the envelope on to (with the same window width used to extract the source envelope) and use this value to compare the difference with ‘ival’.  This ensures a more consistent and accurate comparison of amplitudes.

The other mistake in the original formula was to scale ‘a’, the alpha value that is input by the user in %, linearly by the term that calculates the difference between the amplitudes.  So while ‘a’ did affect the resulting ‘ival’ proportionally, ‘a’ itself was not being affected proportionally.  The final equation, then, just became:

I use two strategies when deciding on an appropriate mathematical formula for what I need.  One is considering how I want a value to change over time, or over some range of values, and turning to a kind of equation that does that (i.e. should it be a linear change, exponential, logarithmic, cyclical, etc.).  This leads to the second method, and that is to use a graph to visualize the shape of change I am after; this leads to an equation that defines that graph.

Here is a quick graphic and audio to illustrate these changes using the same flute source as the envelope and triangle wave as its destination from part 1:

Flute envelope applied with 80msec window size at 100% match

Shortly, we will be seeing some much more interesting musical examples of the plug-in at work.  But before that, we can see another feature at work above that I implemented since last time: junction smoothing.

In addition to specifying the length of the envelope, the user also specifies a value (in msec) to smooth the transition from the envelope match to the original, unmodified audio.  Longer values will obviously make the transition more gradual, while shorter makes it more abrupt.  The process of implementing this feature turned out to be reasonably simple.  This is the basic equation:

where ‘ival’ is the interpolated value and ‘jpos’ is the current position within the bounds of the junction smoothing specified by the user.  ‘jpos’ starts at 0, and once the smoothing begins, it increments (within a normalized value) until it hits 1 at the end of the smoothing.  The larger ‘jpos’ gets, the less of the actual interpolated value we end up with in our ‘jval’ result, which is used to scale the audio buffer (just as ‘ival’ does outside of junction smoothing).  In other words, when ‘jpos’ hits 1, ‘jval’ will be 1 and so we multiply our audio buffer by 1; then we have reached the end of our process and the original audio continues on unmodified.

Before we move on, here is a musical example.  This very famous opening of Debussy’s “Prelude to the Afternoon of a Faun” seemed like a good excerpt to test my plug-in on.  This is the original audio (the opening is very soft, as quiet moments are in most classical recordings to preserve dynamic range, so I had to amplify it, which is why there is some audible low noise):

Opening of “Prelude to the Afternoon of a Faun”

Using this as the source envelope, I used a window size of 250msec at 90% match to apply on to this flute line that I recorded in Logic, doubling the original from the audio above (the flute sample is from Vienna Special Edition).

Unmodified flute doubling of the Prelude opening

The result:

Flute doubling after Match Envelope

And here is the result mixed with the original audio:

Doubled flute line mixed with original audio

Junction smoothing was of course applied during the process to let the flute line fade out as in its original incarnation.  Without smoothing it would have abruptly cut off.  This gives us a seamless transition from the end of the flute solo into the orchestral answer.

It was very important in this example to specify a fairly large window duration, because we don’t want to capture the tremolo of the original flute solo as this would fight against the tremolo of the sampled recording that we are applying the envelope on to.  We can hear a little bit of this in places even with a 250msec window, so this will be something I intend to test further to see how this may be avoided or at least minimized.

This brings me to the other major challenge I faced in developing this plug-in since part 1: stereo handling.  Dealing with stereo files isn’t complicated in itself, but there were a few complexities I encountered along the way specific to how I wanted the plug-in to behave. Instead of only allowing a 1-to-1 correspondence (i.e. only supporting mono to mono, or stereo to stereo), I decided to allow for the two additional situations of mono to stereo and stereo to mono.

The first two cases are easy enough to deal with, but what should happen if the source envelope is mono and the destination audio is stereo, and vice versa?  I decided to allocate 2-dimensional arrays for the envelope buffers to hold the mono/stereo amplitude data and then use bitwise flags to store the states of each envelope:

This saves on having multiple variables representing channel states for each envelope, so I only have to pass around one variable that contains all of this information that is then parsed in the appropriate places to retrieve this information.

As we can see, only one variable (‘envFlags’) is used in the extraction of the source envelope, and we can find out mono/stereo information by using bitwise AND with the corresponding enum definition of the flag we’re after.  Furthermore, in the case of the source envelope being stereo but destination audio being mono, I combine the amplitude data of the two channels into one, according to either average or peak extraction method (also specified by user).
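The flag approach might look something like the sketch below.  Only the variable name ‘envFlags’ comes from the text; the enum values and helper functions are hypothetical names chosen for illustration.

```cpp
// Hypothetical flag names; only `envFlags` appears in the text.
enum EnvFlags : unsigned {
    kSrcStereo = 1u << 0,   // source envelope has two channels
    kDstStereo = 1u << 1    // destination audio has two channels
};

// Pack the channel layout of both envelopes into a single variable.
inline unsigned makeEnvFlags(int srcChannels, int dstChannels)
{
    unsigned envFlags = 0;
    if (srcChannels == 2) envFlags |= kSrcStereo;
    if (dstChannels == 2) envFlags |= kDstStereo;
    return envFlags;
}

// Query a state with a bitwise AND against the corresponding flag.
inline bool isSrcStereo(unsigned envFlags) { return (envFlags & kSrcStereo) != 0; }
inline bool isDstStereo(unsigned envFlags) { return (envFlags & kDstStereo) != 0; }
```

A mono source with a stereo destination then packs into a single value with only the destination bit set, and each state can be read back independently.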

The difficulty in implementing this feature wasn’t so much in how to get it done, but how to get it done more efficiently, without a massive number of parameters passing around and a whole lot of conditional statements within the main processing loop to determine how many channels each envelope has.  We can see some of this at work within the main process:

I use another variable (‘stereo_src’) that extracts some information from the bitwise flags to take care of the case where the source envelope is mono but destination audio is stereo.  Since the loop covers the channels of the destination audio (in interleaved format), I needed a way to restrict out of bounds indexing of the source envelope.  If the source envelope is mono, ‘stereo_src’ will be 0, so the indexing of it will not exceed its limit.  If both envelopes are stereo, ‘stereo_src’ will be 1, so it will effectively “follow” the same indexing as the destination audio.
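The indexing trick can be sketched as follows, assuming the envelope is stored as env[channel][frame] and the destination audio is interleaved.  Apart from ‘stereo_src’, the names and the function itself are assumptions, and the real processing loop does considerably more than a plain multiply.

```cpp
#include <cstddef>
#include <vector>

// Scale interleaved destination audio by a source envelope.
// env[channel][frame] must have at least one row; dstChannels must be > 0.
// stereo_src is 1 for a stereo source envelope and 0 for a mono one.
void applyEnvelope(std::vector<float>& interleaved, std::size_t dstChannels,
                   const std::vector<std::vector<float>>& env,
                   std::size_t stereo_src)
{
    const std::size_t frames = interleaved.size() / dstChannels;
    for (std::size_t f = 0; f < frames && f < env[0].size(); ++f) {
        for (std::size_t ch = 0; ch < dstChannels; ++ch) {
            // Mono source: ch * 0 always selects row 0, so indexing stays in bounds.
            // Stereo source: ch * 1 follows the destination channel.
            interleaved[f * dstChannels + ch] *= env[ch * stereo_src][f];
        }
    }
}
```

Multiplying the channel index by ‘stereo_src’ keeps the conditional out of the inner loop, which is the efficiency concern raised above.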

For the next, and last, musical example of this entry, we change things up a whole lot.  I’m going to show the application of this plug-in to electronic dance music.  This match envelope plug-in can emulate, or function as, a kind of side-chain compressor, which is quite commonly found in EDM.  Here is a simple kick drum pattern and a synth patch that goes on top:

Kick drum pattern

Synth pattern

By applying the match envelope plug-in to the synth pattern using the kick drum pattern above as the source envelope (window size of 100msec and 65% match), we achieve the kind of pumping pattern in the synth so characteristic of this kind of music.  The result, and mix, are as follows:

Match Envelope applied to synth pattern

Modified synth pattern mixed with percussion and bass

This part has really covered the preliminary features of what I’m planning to include in the Match Envelope plug-in.  As I stated at the start, the next step is to transfer the code into the VST SDK (that’ll be part 3), but this also comes with its share of considerations and complications, mainly dealing with UI.  How should this appear to the user?  How do you neatly package it all together to make it easy and efficient to use?  How should the parameters be presented so that they are intuitive?

Almost all plug-ins/hosts offer up a default UI, which is what I’ll be working with initially, but eventually a nice custom graphical GUI will be needed (part 4? 5? 42?).

The Making of a Plug-In: Part 1

Well, it’s finally time to do something useful with all this stuff.  Not that command-line programs aren’t useful, but they have their limitations, especially these days.  Making a plug-in is a great way to apply all the things I’ve been doing, which really culminates in making a deliverable product that has a use.  I have thus decided to make a Match Envelope plug-in, inspired by a suggestion from my good friend Igor (thanks man!).  This first part of the “making of” blog will cover some of the initial conception and development of the prototype program as well as introducing some of the planned features and parameters of the plug-in.  Focus will not be on actual C++ code at this point, but more on the math and the concepts behind it.

This Match Envelope plug-in will be similar in many ways to an Envelope Follower.  It extracts the envelope from a source audio file and applies it to a destination audio file.  Envelope followers tend to be more geared towards MIDI and there are not too many (to my knowledge) stand-alone plug-ins that give you what you need.  They can also be found as features in filters and other kinds of plug-ins.

Their usefulness can vary quite a bit, as we will see in more detail.  Commonly we see Envelope Followers used to sync up a sound to a drum loop, for example.  Another benefit of using this plug-in involves layering sounds together.  If a seamless blend is desired, the envelopes of the different layers must match fairly closely or else we will hear the distinct layers.  With a mix of more percussive sounds with sustained ones, an envelope matcher can be quite useful.

Furthermore, Match Envelope can be used to approximate the attack or release of instruments in an orchestra that have been recorded.  Let’s say you wish to double the flute line, or even the bass line with something; the Match Envelope plug-in can assist in blending the two layers together.  As will be discussed below, there will be parameters and features to control the effect because there are situations where we certainly don’t want a “lifeless” 100% match.

In addition to its uses in music, it can also be used in sound design, where layers upon layers of different sound sources are often combined to great effect.  Given some of the cool and varied applications of this plug-in, and considering that there are not a great many of them out there, I felt this was an exciting project to take on!

So how does it work and where do we start?  To extract the envelope shape from the source audio, we use windows of a certain size that will either take the peak amplitude or the average amplitude within that window and store it in a buffer.  We then take those amplitude values and apply them onto the destination audio, effectively recreating its amplitude shape.  Fairly straightforward in concept, but there’s more to it than that.
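As a rough sketch of the extraction step: the names `extractEnvelope` and `WindowMode` are hypothetical, and "average amplitude" is assumed here to mean the mean of the absolute sample values (RMS would be another reasonable reading).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class WindowMode { Peak, Average };

// Splits the source into consecutive windows of windowSize samples and stores
// one amplitude value per window: either the peak or the mean absolute value.
std::vector<float> extractEnvelope(const std::vector<float>& source,
                                   std::size_t windowSize,
                                   WindowMode mode)
{
    assert(windowSize > 0);
    std::vector<float> envelope;

    for (std::size_t start = 0; start < source.size(); start += windowSize) {
        // The last window may be shorter than windowSize.
        const std::size_t end = std::min(start + windowSize, source.size());
        float peak = 0.0f;
        float sum = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            const float magnitude = std::fabs(source[i]);
            peak = std::max(peak, magnitude);
            sum += magnitude;
        }
        const float average = sum / static_cast<float>(end - start);
        envelope.push_back(mode == WindowMode::Peak ? peak : average);
    }
    return envelope;
}
```

The returned buffer holds one point per window, which is why applying it directly produces a stepped result.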

If we just apply the extracted envelope values to the audio, we’ll get a staircase.  Thus we turn to interpolation.  At first I went with linear interpolation, given by the formula

$$y(x) = y_0 + (x - x_0)\,\frac{y_1 - y_0}{x_1 - x_0}$$

where we want y(x) with position x between points (x0, y0) and (x1, y1), and this resulted in a fairly good and accurate match.  However, we need a better quality interpolation; one that will give us a smoother, more accurate curve that will result in better quality audio.  For this we turn to cubic interpolation.  The cubic equation will look familiar to most;

$$y(x) = ax^3 + bx^2 + cx + d$$

but as it has four unknowns (the coefficients a, b, c, d) we need four points in order to interpolate or solve it (remember that from math class?).  Solving this equation isn’t terribly complicated; we take the derivative of y(x) and then solve both equations for x = 0 and x = 1.  This assumes that the distance between successive x values is 1.  For a more detailed explanation of how to solve this equation for the coefficients go here: http://www.paulinternet.nl/?page=bicubic
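Both interpolators fit in a few lines. The sketch below uses hypothetical function names, and the cubic is the Catmull-Rom style form that, to my understanding, the derivation linked above arrives at. It may not match the prototype's code exactly.

```cpp
#include <array>
#include <cassert>

// Linear interpolation between (x0, y0) and (x1, y1), evaluated at x.
double linearInterpolate(double x0, double y0, double x1, double y1, double x)
{
    assert(x1 != x0);
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0);
}

// Cubic interpolation through four equally spaced points p[0]..p[3].
// x is the position between p[1] (x = 0) and p[2] (x = 1); the slopes at
// those two points are taken from their neighbours, (p[2] - p[0]) / 2 and
// (p[3] - p[1]) / 2.
double cubicInterpolate(const std::array<double, 4>& p, double x)
{
    return p[1] + 0.5 * x * (p[2] - p[0]
              + x * (2.0 * p[0] - 5.0 * p[1] + 4.0 * p[2] - p[3]
              + x * (3.0 * (p[1] - p[2]) + p[3] - p[0])));
}
```

At x = 0 the cubic returns p[1] and at x = 1 it returns p[2], so consecutive segments join without a jump as the four-point window slides along the envelope.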

There was one problem I encountered after implementing this, however, that is worth mentioning.  My window sizes were not of the unit value (1) and, what’s more, can be changed by the user, so I did not have a constant for the distance between x values.  The solution is pretty simple (almost too much so, as I was heavily focused on workarounds that were far too complex).  Basically we just scale the x increment value, which keeps track of our position, by the inverse of the window duration.

Let’s say we set our window duration to 20 milliseconds.  That gives us our window size of 882 samples (assuming a sample rate of 44.1 kHz).  Our x increment value is then 2.268×10⁻⁵ (0.02 / 882) based on a window size of 0.02.  But we need a window size that is effectively 1, so 1 / 0.02 is 50, and this is our scaling factor.  The increment value is now 1.1338×10⁻³ (that is, 1 / 882).
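The arithmetic can be checked with a short sketch. The struct and function names are hypothetical, and rounding the window length to the nearest whole sample is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct WindowStep {
    long windowSamples;  // window length in samples
    double increment;    // per-sample x step, scaled so one window spans 1.0
};

WindowStep computeWindowStep(double windowSeconds, double sampleRate)
{
    assert(windowSeconds > 0.0 && sampleRate > 0.0);
    // Assumption: round to the nearest sample, with at least one sample per window.
    const long windowSamples = std::max(1L, std::lround(windowSeconds * sampleRate));
    // Unscaled step: seconds of window covered by each sample.
    const double rawIncrement = windowSeconds / static_cast<double>(windowSamples);
    // Scale by the inverse of the window duration so the window has unit length.
    const double scale = 1.0 / windowSeconds;
    return { windowSamples, rawIncrement * scale };
}
```

For a 20 ms window at 44.1 kHz this gives 882 samples and an increment of 1 / 882, roughly 1.1338e-3. The scaled increment is simply the reciprocal of the window length in samples.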

One final detail that caused a small issue was rounding error.  At the boundaries of some windows, I would end up with an x value of 3.999999 or similar.  This caused sample error that did not sound very good, but the solution was a simple matter of adding a very small value to the x position at the end of each window loop (0.0000001 for instance).  Some additional testing with varying window sizes will be done to make sure no more sample/rounding error occurs.
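One alternative to nudging x by a small constant is to avoid accumulating a floating-point increment at all, and instead derive the position from integer sample counters. The sketch below shows that different technique, not the fix described above; the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct EnvelopePosition {
    std::size_t window;  // index of the window the sample falls in
    double fraction;     // position inside that window, in [0, 1)
};

// Computes the position from integers, so a window boundary always lands on an
// exact whole number instead of drifting to something like 3.999999.
EnvelopePosition envelopePosition(std::size_t sampleIndex, std::size_t windowSamples)
{
    assert(windowSamples > 0);
    return { sampleIndex / windowSamples,
             static_cast<double>(sampleIndex % windowSamples)
                 / static_cast<double>(windowSamples) };
}
```

With 882-sample windows, sample 3528 maps to window 4 with a fraction of exactly 0, while sample 3527 maps to window 3 with a fraction just below 1.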

While on the topic of windows, their function is to affect the smoothness of the extracted envelope as well as its accuracy.  Smaller window sizes will result in a closer match to the source audio’s envelope, and larger ones will be more of an approximation.  Before moving on, let’s get a visual of how I’m testing out the functionality of the plug-in.

To really get a good idea of how the program is working, using something simple like a triangle wave is great for visualizing the outcome.

As I mentioned previously, there are two ways of taking the amplitude within each window: taking the maximum (peak) value, or taking the average.  Just below we see the difference between them, and at this point I intend to let the user choose which one to use.

Here we can see a number of things already at work.  The difference between peak and average amplitude is not very much.  But in this case, the source audio is quite smooth and the destination waveform is completely constant.  A following example will demonstrate the difference between these better.  We also see the effects of different window sizes in the visual above.  A very large window size (perhaps around 500 ms to 1 s) would retain much more of the shape of the destination audio and might be more useful for longer sustained sounds, while a shorter window length is better for capturing percussive sounds.  The example below uses a different audio as the source, with more percussive attacks.

There is a clear difference between peak and average, at least visually.  We will hear at the end of this post that they don’t differ a huge amount to our ears (at least not with these examples).

Before I wrap up this post, I want to discuss two additional features I’m planning at this point.  One is just a simple gain, that scales the result of the envelope match.  An extension of this feature will be to add an option where the user can specify that the amplitude at the end of the envelope will match the amplitude of the rest of the audio (i.e. at the junction point where the envelope ends, and the rest of the audio continues unmodified).  This would be useful if the user just wants to match a section from a source audio file.

Secondly, I have implemented a parameter called “match strength %” that will control how strongly the envelope will modify the destination audio.  The cool thing about this is that it’s proportional to the difference in amplitude between the envelope and the destination audio.  To accomplish this, I scale the interpolated value, ival, by the equation

where buffer is the amplitude of the destination audio and a is a value in % specified by the user.  We can see it visually in the image below.

The bigger the distance between the two amplitudes, the stronger the envelope affects the audio.  This ensures that the general shape of the envelope remains, while retaining more of the original shape in relation to the specified % by the user.  The middle waveform shows a gain factor of 1.6, but this was incorrectly implemented, as the gain modifies the interpolated value, ival, instead of the result of ival * buffer.  As it is in the image above, gain is also proportional, but I intend it to be a linear effect.

Here are a couple of examples of how this process sounds at this stage.  These audio samples use just the triangle wave that I have been using to test the program layered on top of the audio I extracted the envelope from.  Listen to how the triangle wave follows the shape of the audio and see if you can hear a difference between the peak version and the average version (the peak version has slightly stronger attacks from the triangle wave).

Peak windowing: layered triangle wave over source audio

Average windowing: layered triangle wave over source audio

This has been a long and wordy post, so it’s time to wrap up.  I’m pleased so far with the functionality and behavior of the plug-in and the parameters and features I am planning on implementing.  It should provide a good amount of flexibility to shape a given waveform/audio file from an extracted envelope source, and should have some fun and productive uses.