Basics of the vocoder

The vocoder is a kind of simulation of our vocal organ, which consists primarily of the vocal cords and the mouth. Also involved are the nasal cavity and the lungs, which supply the air. The vocal cords vibrate (oscillate) when air streams past them and create a sound that varies mainly in frequency (aha, the oscillator). The mouth and nasal cavity form a filter whose characteristic depends on many factors, such as mouth aperture, lip shape and tongue position while speaking (aha again, the VCF). Well, this filter is a bit more complex than the ones we know from synthesizers. There are many complex filter models that try to simulate the one we have onboard, and they still can’t match it. OK, they sound synthetic, but that is just what we want. But getting back to the vocoder, let’s call things by their proper names: the oscillating part is the CARRIER section and the complex filtering part is the FORMANT section.

In the vocoder, the formant section is implemented with a set of band-pass filters that extract the spectral information of the signal applied to it (usually our speech). The resulting level of each formant band is then applied as a gain value to the corresponding band of a second set of band-pass filters, which process the carrier signal. The “quality” of the vocoder depends primarily on the characteristics of these band-pass filters. I write “quality” in quotes intentionally because it is hard to define what we mean by this term. If we want a faithful reproduction of our voice, then we need good “quality”, but this is not what I expect from a vocoder. So what should the quality be? For me, the quality (without quotes) of a vocoder is good intelligibility with a sound that is very different from my voice. Another important aspect of a vocoder’s quality is the ability to detect and reproduce unvoiced consonants like “s”, “t”, “h” etc. If this feature is poor or doesn’t exist, there are methods to improve it a little.
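To make the signal flow described above concrete, here is a minimal sketch of a channel vocoder in plain Python. It is only an illustration under my own assumptions, not how any particular hardware does it: the band-pass filters are simple second-order biquads (textbook audio-EQ-cookbook form), the level detector is a rectifier followed by a one-pole smoother, and all names and default values (`vocode`, `q`, `smoothing`, the 16 kHz sample rate) are hypothetical.

```python
import math

def bandpass_coeffs(center_hz, q, sample_rate):
    """Second-order band-pass (biquad) coefficients with 0 dB peak gain."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(signal, coeffs):
    """Run a list of samples through one biquad filter."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, smoothing):
    """Level detector: rectify, then smooth with a one-pole low-pass."""
    level = 0.0
    out = []
    for x in signal:
        level += smoothing * (abs(x) - level)
        out.append(level)
    return out

def vocode(formant, carrier, centers_hz, sample_rate=16000, q=4.0, smoothing=0.01):
    """Apply the band levels of the formant as gains to the carrier bands."""
    n = min(len(formant), len(carrier))
    out = [0.0] * n
    for center in centers_hz:
        coeffs = bandpass_coeffs(center, q, sample_rate)
        gains = envelope(biquad(formant[:n], coeffs), smoothing)  # formant side
        band = biquad(carrier[:n], coeffs)                        # carrier side
        for i in range(n):
            out[i] += gains[i] * band[i]
    return out
```

Both filter banks use the same center frequencies here, and a real vocoder would use steeper filters than a single biquad per band. The `smoothing` value plays the role of the envelope follower speed: larger values react faster.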

Returning to the filters, the quality depends on the number of band-pass filters and their characteristics. How many filters there should be and what slope (dB/octave) they should have will not be discussed here. A vocoder with 16 bands and 24 or 36 dB/octave would be enough for me. So far I don’t know the specs of the R3’s vocoder (maybe someone can tell me), so I will assume nothing and try to get the best effect quality (a new quality term?) out of it. I remember testing a vocoder with 256 bands, and it sounded nearly identical to my voice, which is not what I am looking for.

Effect quality?

Yes. Once the discussion about the number and characteristics of the filters in the bank is settled, and we accept that we have to live with what we get, let’s see how to get the best out of it.

Remember: these hints are useful for every kind of vocoder, which is why I am not getting specific to the R3 for now.

To achieve the best results it is important to select the appropriate sources for the formant and the carrier.

The carrier:

The carrier signal should be sharp and crisp, with a wide frequency spectrum, so that there is something to feed each filter. Warm or soft sounds may sound nice, but they decrease intelligibility and leave some filters with nothing to do. Try using dry sounds. If you want, you can apply a compressor or EQ. Avoid effects like chorus, flanger or delay in this section; save them for the post-vocoder signal. Played dry, such sounds may sound dreadful, but trust me, these are the best ones.

The best waveforms are pulse and saw. You can add a little noise to them to increase intelligibility. For the VCF, set the cutoff to maximum and the resonance to a low or zero level. The carrier can be played as chords or single notes. To improve single notes (solo and robot voices), you can play the same note one or more octaves higher or lower at the same time, or use a sub-oscillator.
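As a rough illustration of such a carrier, the sketch below builds a saw wave, doubles it one octave higher and adds a little noise. The oscillators are naive (not band-limited), and the function names and mix amounts (`noise_amount`, `octave_amount`) are my own assumptions, so treat the numbers as starting points only.

```python
import random

def saw(freq_hz, n_samples, sample_rate=16000):
    """Naive sawtooth in the range [-1, 1)."""
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0
            for i in range(n_samples)]

def pulse(freq_hz, n_samples, sample_rate=16000, width=0.5):
    """Naive pulse wave; width is the fraction of the cycle spent at +1."""
    return [1.0 if (i * freq_hz / sample_rate) % 1.0 < width else -1.0
            for i in range(n_samples)]

def carrier(freq_hz, n_samples, sample_rate=16000,
            noise_amount=0.1, octave_amount=0.5, seed=0):
    """Saw plus the same note one octave up plus a little white noise."""
    rng = random.Random(seed)  # seeded so the result is repeatable
    base = saw(freq_hz, n_samples, sample_rate)
    upper = saw(2.0 * freq_hz, n_samples, sample_rate)
    mix = [b + octave_amount * u + noise_amount * rng.uniform(-1.0, 1.0)
           for b, u in zip(base, upper)]
    peak = max(abs(v) for v in mix) or 1.0
    return [v / peak for v in mix]  # normalize to a peak of 1.0
```

No filter is applied on purpose: this corresponds to the cutoff fully open and no resonance.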

The formant: 

The formant is the more active and important part of the vocoder, so a good formant signal is essential for satisfactory results. Here are some hints:

  • The signal should be clear and strong. If speech is used, a good microphone is the best option. If you can improve the dynamics and spectral width (e.g. with a compressor, EQ etc.) before the signal reaches the formant input, the results will be better. Just as a demonstration, try this: record the newsreader of your favourite radio station and apply the recording to the vocoder as the formant signal, and you will get an idea of what I mean. Why? Because their speech is processed to the max for the best intelligibility.

  • Another characteristic of newsreaders is that they don’t sing. They talk! If we play a melody on the carrier and sing the same melody on the formant, the filtering is centered on the melody’s notes and has a narrower effect than desired. This can decrease intelligibility. Instead of singing, try to speak the text you want to vocode. I know it is very hard to play a melody without singing along, but try it. Whispering instead of talking can also give good results.

Intelligibility:

Part of the intelligibility of the text depends on the ability of the vocoder to reproduce unvoiced consonants like “s”, “t”, “h” etc. Some vocoders offer special detection of these unvoiced sounds. If not, they should at least offer a high-pass filter (HPF) that lets part of the formant signal bypass the vocoder. As these unvoiced sounds are rich in high frequencies, we use this HPF to add them from the original signal to the vocoded one. Setting the amount of this HPF is a balance between hearing the unvoiced consonants and not hearing the original formant. For me, a good balance is when I can understand the text without being able to identify the speaker (in this case, my voice).
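The HPF bypass can be sketched in a few lines. This is only an illustration under my own assumptions: the filter is a simple one-pole (6 dB/octave) high-pass, and the names and defaults (`add_unvoiced`, `amount`, a 5 kHz cutoff) are hypothetical, not the settings of any particular vocoder.

```python
import math

def high_pass(signal, cutoff_hz, sample_rate=16000):
    """One-pole (6 dB/octave) high-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for x in signal:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

def add_unvoiced(vocoded, formant, amount=0.3, cutoff_hz=5000.0, sample_rate=16000):
    """Mix the high-passed formant (the "s", "t", "h" range) into the vocoded signal."""
    highs = high_pass(formant, cutoff_hz, sample_rate)
    return [v + amount * h for v, h in zip(vocoded, highs)]
```

Raise `amount` until the consonants become audible, and back off as soon as the original voice starts to be recognizable.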

Other settings:

Other common settings for vocoders are the envelope followers, band shifter (or band patch matrix) and band panorama.

  • The envelope followers define how fast changes in the levels of the formant filters are applied to the gains of the carrier filters. They can be global or per band. Fast settings give good intelligibility; slower settings are more suited to a sort of dynamic equalizing or complex sound filtering.

  • The band shifter lets you shift the assignment of the formant filters to the carrier filters. Some vocoders can route each individual band from the formant to the carrier. This is usually done with patch cords or a matrix.

  • The band panorama specifies the stereo position of each band’s output. A good stereo image is achieved by setting the bands alternately to right and left, but all combinations are possible depending on your taste.
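The band shifter, the patch matrix and the band panorama all act on per-band values, so they are easy to sketch. The code below is an illustration under my own assumptions: the function names are hypothetical, bands shifted past the end are simply dropped, and the panning uses an equal-power law, which real vocoders may or may not use.

```python
import math

def shift_bands(formant_levels, shift):
    """Route formant band i to carrier band i + shift; bands pushed off the end are dropped."""
    n = len(formant_levels)
    routed = [0.0] * n
    for i, level in enumerate(formant_levels):
        j = i + shift
        if 0 <= j < n:
            routed[j] = level
    return routed

def patch_bands(formant_levels, patch):
    """Patch matrix: patch[j] is the index of the formant band driving carrier band j."""
    return [formant_levels[src] for src in patch]

def alternating_pans(n_bands):
    """Alternate bands hard left (-1.0) and hard right (+1.0)."""
    return [-1.0 if i % 2 == 0 else 1.0 for i in range(n_bands)]

def pan_gains(pan):
    """Equal-power (left, right) gains for a pan position in [-1, 1]."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)

def mix_stereo(band_samples, pans):
    """Sum one output sample per band into a (left, right) pair."""
    left = right = 0.0
    for sample, pan in zip(band_samples, pans):
        gain_l, gain_r = pan_gains(pan)
        left += gain_l * sample
        right += gain_r * sample
    return left, right
```

For example, `patch_bands(levels, [2, 1, 0])` reverses a three-band assignment, so the lowest formant band drives the highest carrier band.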

If you want to know more about vocoders, search for them on the internet. There are a lot of articles about them. I wrote some articles too, but that was years ago, before the internet was accessible to me.