Simplified Phase Vocoder
Implementation of a simplified phase vocoder system in MATLAB to tune vocal input to the C Major scale. BE 3010 (Signals & Systems).
Project Overview & Motivation
Pitch correction systems (e.g. Antares Auto-Tune) are widely used in modern audio processing to refine intonation in vocal recordings. At their core, these systems estimate the fundamental frequency of the input vocal and map it to the nearest note in a target musical scale, stabilizing a singer’s pitch for greater accuracy and stylistic effect.
From a signals and systems perspective, vocal recordings are non-stationary signals: their frequency content changes over time. A conventional Fourier Transform (FT), which gives a single global frequency representation with no time localization, is therefore insufficient for analysis. Each time-localized segment must be analyzed individually, which is what the Short-Time Fourier Transform (STFT) does.
The aim of this project was to implement a simplified phase vocoder system in MATLAB, designed to tune a vocal input to the C Major scale. The system corrects pitches of a C Major scale sung a cappella with intentional errors, such as singing some notes flat (slightly lower frequency) and some notes sharp (slightly higher frequency).
Check out the project report here.
Methods
The pitch correction system was implemented using an STFT framework and spectral analysis akin to the analysis done by a Phase Vocoder.
Signal Analysis:
The input audio signal was first converted to mono and segmented into overlapping frames. An STFT window size of N=2048 was selected as a practical trade-off between frequency resolution and time resolution. At a sampling rate of fs=44.1 kHz, a 2048-sample window spans approximately 46 ms, short enough to treat the signal as locally stationary within a frame. The corresponding frequency resolution is fs/N ≈ 21.5 Hz, which is sufficient to resolve harmonic structure in vocal signals. A Hanning window was applied to each frame before applying the Fast Fourier Transform (FFT) to reduce spectral leakage. A hop size of N/4 (512 samples), giving 75% overlap between consecutive frames, was chosen to ensure smooth reconstruction.
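The framing numbers above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not the project's MATLAB code. The names `hann_periodic` and `frame_starts` are hypothetical helpers, and the periodic form of the Hann window is an assumption (the text does not say which variant was used).

```python
import math

FS = 44_100        # sampling rate in Hz (from the text)
N = 2048           # STFT window length in samples (from the text)
HOP = N // 4       # hop size in samples (from the text)

window_ms = 1000.0 * N / FS   # duration of one frame in milliseconds
freq_res_hz = FS / N          # spacing between adjacent FFT bins in Hz
overlap = 1.0 - HOP / N       # fraction of a frame shared with the next one


def hann_periodic(n: int) -> list[float]:
    """Periodic Hann window, w[k] = 0.5 * (1 - cos(2*pi*k / n)) (assumed variant)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n)]


def frame_starts(num_samples: int, n: int = N, hop: int = HOP) -> list[int]:
    """Start index of every full frame that fits inside the signal."""
    return list(range(0, num_samples - n + 1, hop))


if __name__ == "__main__":
    print(f"window: {window_ms:.1f} ms")
    print(f"resolution: {freq_res_hz:.2f} Hz")
    print(f"overlap: {overlap:.0%}")
```

Doubling N would halve the bin spacing but also double the frame duration, which is the trade-off the window size balances.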
Pitch Detection & Modification Pipeline:
The pitch correction logic was carried out through three steps:
- Peak Detection: The algorithm identifies the bin index with the maximum magnitude in the spectrum and converts it to a frequency (bin index × fs/N), which is taken as the current pitch.
- Quantization: The system compares the current frequency against a predefined array of target frequencies corresponding to the C Major scale starting at C4 (261.63 Hz, 293.66 Hz, …). The nearest target frequency is selected, and a shift factor is derived.
- Spectral Remapping: The magnitude spectrum is frequency-warped by mapping the energy from original bins to new bins based on the shift factor. Crucially, the original phase information is preserved and recombined with the shifted magnitude. This simplified phase-vocoder approach avoids complex phase unwrapping but introduces phase incoherence artifacts.
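The three steps above can be sketched for a single frame as follows. This is an illustrative Python sketch, not the project's MATLAB code. Several details are assumptions because the text does not specify them: the scale is taken as C4 to C5 in equal temperament, the DC bin is skipped during peak detection, the shift factor is the ratio of target to detected frequency, and shifted bins are rounded to the nearest integer bin. All function names are hypothetical.

```python
FS = 44_100   # sampling rate in Hz (from the text)
N = 2048      # FFT length (from the text)

# C major scale, C4 to C5, equal temperament with A4 = 440 Hz.
# The text gives the first two values; the rest are standard note frequencies.
C_MAJOR_HZ = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25]


def peak_frequency(magnitude, fs=FS, n=N):
    """Frequency of the largest-magnitude bin; bin 0 (DC) is skipped (assumption)."""
    k = max(range(1, len(magnitude)), key=lambda i: magnitude[i])
    return k * fs / n


def nearest_target(freq_hz, targets=C_MAJOR_HZ):
    """Target frequency with the smallest absolute distance to freq_hz."""
    return min(targets, key=lambda t: abs(t - freq_hz))


def remap_magnitude(magnitude, shift):
    """Move the energy in bin k to bin round(k * shift); out-of-range energy is dropped."""
    out = [0.0] * len(magnitude)
    for k, m in enumerate(magnitude):
        j = round(k * shift)
        if 0 <= j < len(out):
            out[j] += m
    return out


if __name__ == "__main__":
    mag = [0.0] * (N // 2 + 1)   # toy one-sided magnitude spectrum
    mag[13] = 1.0                # single peak near 280 Hz
    detected = peak_frequency(mag)
    target = nearest_target(detected)
    shift = target / detected    # assumed definition of the shift factor
    shifted = remap_magnitude(mag, shift)
    print(detected, target, shift, shifted.index(max(shifted)))
```

In this toy case the peak moves from bin 13 (about 279.9 Hz) to bin 14 (about 301.5 Hz) and not to exactly 293.66 Hz, because whole-bin remapping in this sketch can only land on multiples of fs/N. Only the magnitudes are moved; the phase of each output bin would be taken unchanged from the original frame, as described above.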
Noise Reduction Strategy:
A two-stage filtering strategy was implemented to reduce artifacts and background noise:
- Harmonic Masking: A “comb filter” mask was generated in the spectral domain. Based on the target frequency, the mask preserves a band of ±60 Hz around the fundamental frequency and each of its harmonics (up to 8 harmonics) and attenuates all other frequencies.
- Global Band-Pass Filter: A 4th-order Butterworth band-pass filter with cutoffs at 100 Hz and 3500 Hz was applied to preserve the fundamental frequency and its harmonics while attenuating background noise.
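The harmonic mask can be sketched as a per-bin gain vector. This is a minimal Python illustration, not the project's MATLAB code. It assumes that the fundamental counts as the first of the 8 harmonics and that bins outside the pass bands are scaled by a fixed gain; the `floor` parameter and its default value are hypothetical, since the text does not state the attenuation used. The Butterworth band-pass stage is not shown.

```python
FS = 44_100   # sampling rate in Hz (from the text)
N = 2048      # FFT length (from the text)


def harmonic_mask(f0_hz, n_harmonics=8, half_width_hz=60.0, floor=0.1, fs=FS, n=N):
    """Gain for bins 0..n/2: 1.0 within half_width_hz of a harmonic of f0_hz,
    `floor` everywhere else (hypothetical attenuation value)."""
    harmonics = [f0_hz * h for h in range(1, n_harmonics + 1)]
    mask = []
    for k in range(n // 2 + 1):
        f = k * fs / n   # center frequency of bin k
        near = any(abs(f - h) <= half_width_hz for h in harmonics)
        mask.append(1.0 if near else floor)
    return mask


def apply_mask(spectrum, mask):
    """Scale each bin of a one-sided spectrum by the matching mask gain."""
    return [x * g for x, g in zip(spectrum, mask)]


if __name__ == "__main__":
    mask = harmonic_mask(261.63)   # mask for a C4 target
    kept = sum(1 for g in mask if g == 1.0)
    print(kept, "of", len(mask), "bins kept at full gain")
```

Because the mask is real-valued, multiplying a complex spectrum by it changes only the magnitudes and leaves each bin's phase untouched.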
Signal Reconstruction:
The modified frequency spectrum was converted back to the time domain using the Inverse FFT. A second Hanning window was applied to the output frame before reconstructing the continuous signal via the overlap-add method.
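When a Hann window is applied at both analysis and synthesis, each sample is effectively weighted by the sum of the squared windows that overlap it. The sketch below, in Python and not the project's MATLAB code, checks that sum for a hop of N/4. It assumes the periodic Hann variant; `overlap_add` and `window_gain` are hypothetical helper names, and whether the original system normalized by this gain is not stated in the text.

```python
import math


def hann_periodic(n):
    """Periodic Hann window, w[k] = 0.5 * (1 - cos(2*pi*k / n)) (assumed variant)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n)]


def overlap_add(frames, hop):
    """Sum equal-length frames into one signal, each offset by `hop` samples."""
    n = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for i, frame in enumerate(frames):
        start = i * hop
        for k, value in enumerate(frame):
            out[start + k] += value
    return out


def window_gain(n, hop, num_frames=8):
    """Overlap-added squared window, trimmed to the region where the
    overlap is complete (the first and last n samples are discarded)."""
    w2 = [w * w for w in hann_periodic(n)]
    total = overlap_add([w2] * num_frames, hop)
    return total[n:len(total) - n]


if __name__ == "__main__":
    gain = window_gain(16, 4)   # small N for readability, same N/4 ratio
    print(min(gain), max(gain))
```

For a periodic Hann window and a hop of N/4, the steady-state gain is constant at 1.5, so dividing the overlap-added output by 1.5 would restore unity gain for an unmodified spectrum. A symmetric window variant would give a gain that is only approximately constant.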
Results
The pitch correction system was evaluated with spectral analysis and by inspection of the processed vocal recording. The spectrograms of the input signal and the pitch-corrected output demonstrated successful tuning of the input to the target C Major scale, with signal attenuation between harmonic partials.
The hybrid noise reduction process effectively reduced background noise but left vertical striations across the spectrum, likely corresponding to phase discontinuities at STFT frame boundaries. The processed audio had noticeably lower quality and clarity than the original, with “robotic” artifacts. These artifacts are attributed to the simplified approach of preserving the original phase information rather than explicitly propagating phase from frame to frame.
Key Insights & Future Work
This work demonstrates that frequency-domain pitch correction using an STFT-based framework is effective for correcting discrete pitch errors. However, the simplified implementation reveals the need for phase-aware approaches to achieve more natural vocal processing.
To enhance the naturalness of the output in future iterations, a standard phase vocoder algorithm could be implemented to explicitly calculate and propagate the phase of each frequency bin from frame to frame, which would enforce continuous phase across frame boundaries and likely reduce the striations visible in the output spectrogram and the associated “robotic” artifacts. Additionally, more precise adjustments could be made by estimating the spectral envelope (e.g. with Linear Predictive Coding) and shifting only the fine harmonic structure, leaving the envelope in place. Finally, data-driven machine learning models could be incorporated to learn pitch trajectories and correction strengths from real vocal performances, enabling more adaptive and natural pitch correction.
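As an illustration of the phase-aware analysis a standard phase vocoder relies on, the Python sketch below estimates the true frequency inside one bin from the phase change between two consecutive frames. It is not part of the implemented system and not MATLAB code. It uses the textbook relation in which the measured phase advance is compared with the advance expected for the bin center frequency, and the wrapped difference is converted to a frequency offset. All function names are hypothetical, and a single-bin DFT is used so that no external library is needed.

```python
import cmath
import math

FS = 44_100   # sampling rate in Hz (from the text)
N = 2048      # window length (from the text)
HOP = 512     # hop size (from the text)


def windowed_bin(signal, start, k, n=N):
    """Bin k of the DFT of one Hann-windowed frame starting at `start`."""
    acc = 0j
    for i in range(n):
        w = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        acc += w * signal[start + i] * cmath.exp(-2j * math.pi * k * i / n)
    return acc


def wrap_phase(phi):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def instantaneous_frequency(phase_prev, phase_curr, k, fs=FS, n=N, hop=HOP):
    """Frequency in Hz implied by the phase advance of bin k over one hop."""
    expected = 2.0 * math.pi * k * hop / n   # advance of the bin center frequency
    deviation = wrap_phase(phase_curr - phase_prev - expected)
    return (k + deviation * n / (2.0 * math.pi * hop)) * fs / n


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * 440.0 * t / FS) for t in range(N + HOP)]
    k = round(440.0 * N / FS)   # bin nearest to 440 Hz
    x0 = windowed_bin(tone, 0, k)
    x1 = windowed_bin(tone, HOP, k)
    print(k * FS / N, instantaneous_frequency(cmath.phase(x0), cmath.phase(x1), k))
```

For a steady 440 Hz tone, the bin center frequency is about 430.7 Hz, while the phase-based estimate comes out very close to 440 Hz. A full phase vocoder would use such estimates to accumulate a consistent output phase for each bin after shifting, which is the step the simplified system omits.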
Skills Used
- Programming: MATLAB, Signal Processing Toolbox
- Signal Processing Concepts: STFT, Phase Vocoder, FFT/IFFT, Windowing, Overlap-Add Method
- Audio Processing: Pitch Detection, Spectral Analysis, Harmonic Filtering, Band-Pass Filtering