Audio Feature Extraction: Designing an Audio Metrics Pipeline
This post explores techniques for extracting quantitative metrics from human voice recordings. Not subjective ratings, but actual numbers: pitch variability, loudness, harmonic quality, pause patterns. The approach below shows how to pull a range of speech metrics out of WAV files using Parselmouth, librosa, and a handful of scipy calls.
The problem
Analyzing the human voice is one of those areas where you can get surprisingly far with classical signal processing. No deep learning needed. Upload a voice recording, run some FFTs and autocorrelations, and you get metrics that are actually interpretable.
When analyzing the human voice, the metrics tend to cluster around a few dimensions: loudness (how loud), frequency (pitch height and variation), voice quality (resonance and spectral balance), and timing (silence patterns and rhythm). Each breaks down further.
Loudness
- RMS Energy: frame-level amplitude
- LUFS: perceived loudness (ITU-R BS.1770)
Frequency
- F0 (pyin): probabilistic fundamental frequency
- F0 (Praat): autocorrelation-based
- Variability: standard deviation of F0
- Range: max F0 minus min F0
Voice Quality
- HNR: harmonic-to-noise ratio via HPSS
- Nasalance: band energy ratios
- VLHR: low-to-high frequency ratio
Timing
- Count: silences > threshold
- Rate: pauses per minute
- Avg Duration: mean pause length (ms)
- Variability: coefficient of variation
Loudness: RMS and LUFS
Loudness is the simplest dimension. Two metrics: frame-level RMS energy for a timeseries view, and integrated LUFS for a single-number loudness score.
RMS comes from librosa.feature.rms with a 512-sample frame length and 256-sample hop. That gives a smooth amplitude envelope you can plot over time. The hop length matters: too large and you lose transient detail, too small and the curve gets noisy without adding information.
import librosa
import soundfile as sf
import pyloudnorm as pyln

# Resampled mono signal (22050 Hz by default) for frame-level features
y, sr = librosa.load(path)
rms = librosa.feature.rms(y=y, frame_length=512, hop_length=256, center=True)
# LUFS needs the original sample rate, so load the file separately
data, rate = sf.read(path)
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

For perceived loudness, raw RMS isn't enough. LUFS (Loudness Units relative to Full Scale) follows the ITU-R BS.1770 standard, which applies K-weighting to approximate how human hearing perceives loudness. pyloudnorm handles this well. You feed it the audio signal loaded via soundfile (not librosa, because the meter should see the file at its original sample rate) and get back a single integrated loudness number. As a rough guide, a recording at -14 LUFS is conversational and -6 LUFS is someone yelling into the mic, though absolute values also depend on microphone gain and distance.
One gotcha: librosa.load resamples everything to 22050 Hz by default. That is fine for feature extraction, but LUFS needs the original sample rate. This means loading the audio twice: once with librosa (resampled) for most processing, once with soundfile (native rate) for LUFS and Parselmouth.
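To make the frame and hop parameters concrete, here is a minimal NumPy sketch of what a frame-level RMS computes. It is an illustration rather than librosa's implementation: it skips the centering and padding that librosa.feature.rms applies with center=True, and frame_rms is a hypothetical helper name.

```python
import numpy as np

def frame_rms(y, frame_length=512, hop_length=256):
    """RMS per frame, no padding (hypothetical helper, not librosa's code)."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        return np.array([])
    n_frames = 1 + (len(y) - frame_length) // hop_length
    out = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = y[start : start + frame_length]
        out[i] = np.sqrt(np.mean(frame ** 2))
    return out

# One second of a 220 Hz sine at amplitude 0.5, sampled at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
rms = frame_rms(tone)
```

Each frame covers about five periods of the tone, so every value sits close to the theoretical 0.5 / sqrt(2). Halving hop_length roughly doubles the number of frames without changing what any single frame measures.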
Frequency: two pitch extractors, and why both matter
Pitch extraction is one of the most studied problems in audio signal processing, and there is still no single algorithm that works well for every voice and recording condition. Running two pitch trackers on every file and returning both results gives the most flexibility.
librosa pyin
librosa.pyin implements probabilistic YIN (pYIN). It builds on YIN's difference function, a close relative of autocorrelation, but replaces YIN's single fixed threshold with a distribution over thresholds and then picks a smooth F0 track through the candidates with Viterbi decoding. Setting fmin=C2 (~65 Hz) and fmax=C5 (~523 Hz) covers the normal speaking range. The output contains NaN for unvoiced frames (breaths, silence, unvoiced consonants), which I filter out before computing statistics.
From the cleaned F0 series, three derived metrics: mean F0 (average speaking pitch), pitch variability (standard deviation of F0, higher means more expressive), and pitch range (max minus min, captures the full span of the voice).
import numpy as np

f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C5'), sr=sr
)
f0_clean = f0[~np.isnan(f0)]  # drop unvoiced (NaN) frames
mean_f0 = np.mean(f0_clean)
pitch_variability = np.std(f0_clean)
pitch_range = np.max(f0_clean) - np.min(f0_clean)

Parselmouth (Praat)
parselmouth is a Python wrapper around Praat, which has been the gold standard for phonetic analysis for decades. Praat's pitch algorithm uses autocorrelation with a different set of heuristics than pyin: it handles octave jumps better, handles noise differently, and uses 0 Hz instead of NaN for unvoiced frames.
import parselmouth

# Assumes mono input. sf.read returns (samples, channels) for multichannel
# files, while Parselmouth expects (channels, samples), so transpose those first.
sound = parselmouth.Sound(data, rate)
pitch = sound.to_pitch()
pm_f0 = pitch.selected_array['frequency']
pm_f0_voiced = pm_f0[pm_f0 > 0]  # drop unvoiced (0 Hz) frames
pm_variability = np.std(pm_f0_voiced)
pm_range = np.max(pm_f0_voiced) - np.min(pm_f0_voiced)

The results are noticeably different from librosa's. On the same recording, Parselmouth tends to give a smoother F0 contour with fewer spurious jumps, but it can also miss quieter voiced segments that pyin catches. Neither is "correct" in all cases. Returning both gives you more data to work with, and whichever contour looks more useful for a particular recording can be used downstream.
Voice quality: where it got tricky
The human voice carries a lot of information in its spectral characteristics beyond just pitch. A "chest voice" has more low frequency energy, a "head voice" has more high frequency energy, and nasal resonance sits in a specific mid-range band. Quantifying these qualities from a raw audio signal took some creative signal processing.
HNR via HPSS
Classical Harmonic-to-Noise Ratio measures how much of a voice signal is periodic (harmonic) versus aperiodic (noise). The standard approach uses autocorrelation, but an alternative is harmonic/percussive source separation (HPSS) on the STFT, then computing the energy ratio.
librosa.decompose.hpss splits the spectrogram into harmonic (sustained tones) and percussive (transient) components using median filtering. The HNR is then 10 * log10(E_harmonic / E_percussive). Higher values mean a cleaner, more resonant voice. This isn't the same as Praat's HNR (which uses autocorrelation in the time domain), and the percussive component is only a stand-in for noise, but it captures a similar quality dimension and works well in practice.
D = librosa.stft(y)
D_harmonic, D_percussive = librosa.decompose.hpss(D)
E_harmonic = np.sum(np.abs(D_harmonic) ** 2)
E_percussive = np.sum(np.abs(D_percussive) ** 2)
hnr = 10 * np.log10(E_harmonic / E_percussive)

Nasalance via band energy
Nasalance is usually measured with a physical device (a nasometer) that separates nasal and oral airflow. It can be approximated using frequency band energy ratios instead.
The nasal and oral resonance bands (commonly cited in the literature as roughly the low-to-mid hundreds through low thousands of Hz for nasal, and wider for oral) get isolated using a semitone filterbank. librosa provides librosa.filters.semitone_filterbank, which can generate a set of second-order-section (SOS) filters (flayout='sos') with center frequencies spaced by semitones across the band. Each filter is applied with scipy.signal.sosfiltfilt (zero-phase filtering) and the results are summed. The nasalance percentage is the filtered band energy divided by total signal energy, times 100.
This is an approximation. A physical nasometer would give different numbers. But the relative values across recordings are consistent and useful for comparison, which is what most voice analysis applications need.
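The band-energy ratio can be sketched as follows. To keep the example short it swaps the semitone filterbank for a single Butterworth bandpass from scipy, so it is a simplification of the approach described above, not the same filter design. The helper name band_energy_percent and the 300-1000 Hz band are placeholders chosen for the demo, not values taken from this post or from the literature.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_energy_percent(y, sr, f_lo, f_hi, order=4):
    """Percent of total signal energy that falls inside [f_lo, f_hi] Hz."""
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=sr, output="sos")
    band = sosfiltfilt(sos, y)  # zero-phase filtering, as in the text
    total = np.sum(y ** 2)
    return 100.0 * np.sum(band ** 2) / total if total > 0 else 0.0

# Synthetic check: equal-amplitude tones at 500 Hz and 3000 Hz
sr = 22050
t = np.arange(2 * sr) / sr
y = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 3000 * t)
pct = band_energy_percent(y, sr, f_lo=300.0, f_hi=1000.0)  # placeholder band
```

Only the 500 Hz tone falls inside the placeholder band, so pct should come out close to 50. On real speech the absolute number matters less than how it moves between recordings.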
VLHR
Voice Low-to-High Ratio measures the balance between low-frequency and high-frequency energy in the voice. You compute the power spectral density using Welch's method with a windowed FFT, then split the spectrum at a cutoff frequency.
The cutoff is adaptive: if the fundamental frequency F0 is available, the cutoff is set to a multiple of mean F0 (typically in the 4-5x range, based on voice science literature). If F0 isn't available, a fixed fallback around 600 Hz works as a reasonable default. The idea is to roughly separate the fundamental and lower harmonics from the upper harmonics and formant frequencies. Low band runs from around 65 Hz up to the cutoff, high band from the cutoff up to around 8000 Hz.
import numpy as np
from scipy.signal import welch

def calculate_vlhr(audio, sr, f0=None, cutoff=600, f0_mult=4.5):
    if f0 is not None:
        voiced = f0[f0 > 0]  # drops both 0 Hz and NaN unvoiced frames
        if voiced.size:
            cutoff = f0_mult * np.mean(voiced)
    freqs, power = welch(audio, fs=sr, nperseg=1024)
    low = np.sum(power[(freqs >= 65) & (freqs < cutoff)])
    high = np.sum(power[(freqs >= cutoff) & (freqs <= 8000)])
    return low / high if high > 0 else float('inf')

A high VLHR indicates a voice with more chest resonance. A low VLHR indicates more brightness or head resonance. The adaptive cutoff matters because a fixed cutoff is wrong for speakers at both extremes of the pitch range: it captures too many or too few harmonics depending on the speaker's fundamental frequency.
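A quick bit of arithmetic shows the effect. The sketch below counts how many harmonics land in the low band for a low-pitched and a high-pitched speaker, once with the fixed 600 Hz fallback and once with a 4.5x F0 cutoff. The two F0 values are made up for illustration, and harmonics_below is a hypothetical helper.

```python
def harmonics_below(f0_hz, cutoff_hz):
    """Count harmonics k * f0 (k = 1, 2, ...) that fall strictly below the cutoff."""
    count = 0
    k = 1
    while k * f0_hz < cutoff_hz:
        count += 1
        k += 1
    return count

for f0_hz in (100.0, 250.0):
    fixed = harmonics_below(f0_hz, 600.0)           # fixed fallback cutoff
    adaptive = harmonics_below(f0_hz, 4.5 * f0_hz)  # cutoff scales with F0
    print(f"F0={f0_hz:.0f} Hz: fixed={fixed}, adaptive={adaptive}")
# F0=100 Hz: fixed=5, adaptive=4
# F0=250 Hz: fixed=2, adaptive=4
```

With the fixed cutoff the low band holds five harmonics for one speaker and two for the other, so the ratio partly measures pitch. Scaling the cutoff with F0 keeps the split at the same harmonic for both.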
Timing and silence detection
Pausing patterns say a lot about how someone speaks. Fast talkers with no pauses sound anxious. Too many long pauses sound unprepared. The right rhythm depends on context, but you need the data first.
pydub's detect_silence function (in pydub.silence) works well for this. You configure a silence threshold in dBFS (something in the -20 to -30 range, depending on recording quality) and a minimum silence duration (300-500 ms is typical for catching real pauses without flagging natural inter-word gaps). Both parameters need tuning for your audio quality: noisier recordings need a higher (less negative) threshold, because the threshold has to sit above the noise floor before background noise counts as silence.
From the raw silence segments, four metrics: pause count, pauses per minute (normalized by recording length), average pause duration in milliseconds, and pause variability as the coefficient of variation (standard deviation divided by mean, times 100). The CV is more useful than raw standard deviation here because a 200ms standard deviation means very different things if your average pause is 500ms versus 3000ms.
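The four metrics can be derived from the segment list in a few lines. This sketch assumes the input is a list of [start_ms, end_ms] pairs, which is the shape pydub's detect_silence returns; pause_metrics and the dictionary keys are hypothetical names, and the example segments are made up.

```python
import numpy as np

def pause_metrics(silences_ms, total_duration_ms):
    """Summarize [start_ms, end_ms] silence segments into four timing metrics."""
    durations = np.array([end - start for start, end in silences_ms], dtype=float)
    count = len(durations)
    minutes = total_duration_ms / 60000.0
    mean_ms = float(durations.mean()) if count else 0.0
    # Coefficient of variation in percent: std / mean * 100
    cv = float(durations.std() / mean_ms * 100.0) if mean_ms > 0 else 0.0
    return {
        "pause_count": count,
        "pauses_per_minute": count / minutes if minutes > 0 else 0.0,
        "avg_pause_ms": mean_ms,
        "pause_variability_pct": cv,
    }

# Three pauses in a 30-second clip (made-up values)
segments = [[1000, 1400], [5000, 5600], [12000, 12800]]
metrics = pause_metrics(segments, total_duration_ms=30000)
```

Here the pauses last 400, 600, and 800 ms, giving 6 pauses per minute, a 600 ms average, and a CV of about 27%. The sketch uses the population standard deviation (NumPy's default); pass ddof=1 to std if you prefer the sample estimate.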
The double-load problem
If you're wrapping this in a FastAPI endpoint, there's an annoying practical issue. librosa's load resamples to 22050 Hz and returns mono float32. Fine for RMS, pitch, and spectral analysis. But pyloudnorm needs the native sample rate, and parselmouth.Sound works best with the original rate too.
The workaround: read the full file into a BytesIO buffer and rewind it for each library that needs a fresh read. Not pretty, but for typical speech recordings the memory overhead is nothing compared to the FFT computations.
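The rewind pattern looks roughly like this. To stay self-contained the sketch uses the standard library's wave module as a stand-in for both loaders; in the real pipeline the two reads would be the librosa and soundfile calls. make_wav_bytes and read_rate_and_frames are hypothetical helpers.

```python
import io
import struct
import wave

def make_wav_bytes(n_frames=8000, rate=16000):
    """Build a small silent mono 16-bit WAV in memory (stand-in for an upload)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<h", 0) * n_frames)
    return buf.getvalue()

def read_rate_and_frames(buf):
    """Stand-in for one library's loader: reads the whole file from a buffer."""
    buf.seek(0)  # rewind, since an earlier reader may have moved the cursor
    with wave.open(buf, "rb") as w:
        frames = w.readframes(w.getnframes())
        return w.getframerate(), len(frames) // w.getsampwidth()

upload = io.BytesIO(make_wav_bytes())
first = read_rate_and_frames(upload)   # e.g. the resampled load
second = read_rate_and_frames(upload)  # e.g. the native-rate load
```

Without the seek(0), the second reader would start at the end of the buffer and fail to find a WAV header. Putting the rewind inside each loader wrapper means the call order stops mattering.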
Also worth noting: pitch extraction (Parselmouth specifically) tends to be the performance bottleneck, not spectral decomposition. Adding per-stage timing early on is worth the effort.
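Per-stage timing does not need a profiler. A minimal sketch with a context manager is below; the stage names and the sleep calls are placeholders for the real extraction functions, and timed is a hypothetical helper.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock seconds spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

with timed("pitch"):
    time.sleep(0.05)  # placeholder for the real pitch extraction call
with timed("loudness"):
    time.sleep(0.01)  # placeholder for RMS / LUFS

slowest = max(timings, key=timings.get)
```

Returning the timings dictionary alongside the metrics in a debug field makes it easy to see which stage dominates on a given file.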
Making the numbers useful
Raw metrics by themselves aren't that helpful. The thing that makes them useful is comparison against a reference. If you have a "good" recording and a recording to evaluate, percent deviation per metric tells you exactly what's different. Pitch variability 40% lower than reference? Pause rate 25% higher? Those are concrete, actionable numbers.
A simple comparison table with percent deviation color-coded green (better) or red (worse) turns out to be more useful than overlaid charts. Someone can glance at the table and immediately know what to focus on.
| Metric | Reference | Sample | Deviation |
|---|---|---|---|
| LUFS | -18.0 | -15.5 | +13.9% |
| Pitch variability | 45.0 Hz | 30.0 Hz | -33.3% |
| Pitch range | 200.0 Hz | 120.0 Hz | -40.0% |
| HNR | 15.0 dB | 17.0 dB | +13.3% |
| Nasalance | 30.0% | 34.0% | +13.3% |
| VLHR | 3.00 | 3.50 | +16.7% |
| Pauses/min | 5.0 | 8.0 | +60.0% |
| Avg pause (ms) | 700 | 1200 | +71.4% |
| Pause variability | 25.0% | 50.0% | +100.0% |
Hypothetical comparison: reference vs sample recording (illustrative values)
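The deviation column above is consistent with (sample - reference) / |reference| x 100. The post does not spell the formula out, so treat the sketch below as an assumption that happens to reproduce the illustrative rows; percent_deviation is a hypothetical helper.

```python
def percent_deviation(reference, sample):
    """Signed deviation of sample from reference, in percent of |reference|."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (sample - reference) / abs(reference) * 100.0

rows = {
    "LUFS": (-18.0, -15.5),
    "Pitch variability (Hz)": (45.0, 30.0),
    "Pauses/min": (5.0, 8.0),
}
for name, (ref, sample) in rows.items():
    print(f"{name}: {percent_deviation(ref, sample):+.1f}%")
# LUFS: +13.9%
# Pitch variability (Hz): -33.3%
# Pauses/min: +60.0%
```

Dividing by the absolute value of the reference keeps the sign meaningful for negative quantities like LUFS: a louder sample shows up as a positive deviation.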
Common gotchas
- pydub expects file paths, not BytesIO. The AudioSegment.from_wav function accepts file-like objects in theory, but behavior is inconsistent across versions. Writing the buffer to a temp file for pause detection is clunky but reliable.
- Parselmouth isn't in most requirements.txt files. It needs to be installed separately and depends on compiled C++ binaries. On some deployment targets (serverless, minimal Docker images), getting it to install requires extra system libraries. Worth it for the pitch quality, but budget time for the deployment.
- Watch how you apply windowing with Welch's method. Welch internally breaks the signal into segments and windows each one. If you accidentally apply a window to the full signal before passing it to Welch, you get garbage results on anything longer than a few seconds.
- F0 = 0 and F0 = NaN mean different things. Parselmouth uses 0 Hz for unvoiced frames. librosa uses NaN. If you forget to handle this and mix the two, your mean pitch calculation drops to nonsense values. Always filter before computing statistics.
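For the last point, one way to avoid mixing conventions is to push every F0 array through the same filter before computing statistics. A minimal sketch with made-up contours; voiced_only is a hypothetical helper.

```python
import numpy as np

def voiced_only(f0):
    """Keep voiced frames, whether unvoiced ones are marked NaN or 0 Hz."""
    f0 = np.asarray(f0, dtype=float)
    return f0[np.isfinite(f0) & (f0 > 0)]

pyin_style = np.array([np.nan, 110.0, 112.0, np.nan, 108.0])
praat_style = np.array([0.0, 110.0, 112.0, 0.0, 108.0])

naive_mean = praat_style.mean()               # 66.0, zeros drag the mean down
clean_mean = voiced_only(praat_style).mean()  # 110.0
```

The same helper applied to the NaN-marked array also gives 110.0, so the two trackers can be compared on equal terms.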
Worth trying next
Run pitch extraction on native sample rate for both libraries. librosa's resample to 22050 Hz loses some pitch resolution for very low-pitched voices. Parselmouth already uses native rate, so the comparison is slightly unfair. In a future version I'd pass sr=None to librosa.load for the pitch stage specifically.
Add formant extraction. Formant frequencies (F1, F2, F3) carry vowel quality information that's useful for speech analysis. Praat can compute these via LPC analysis, and Parselmouth exposes the API. They're the obvious next feature for any voice analysis pipeline.
Use CREPE or FCPE for pitch. Neural pitch trackers like CREPE and FCPE are generally reported to outperform both pyin and Praat's autocorrelation on noisy recordings. They're heavier (CREPE loads a small CNN), but for a server-side API processing one file at a time, the latency is acceptable.
Stream the results. Each metric category is independent, so you could return partial results as they complete (loudness first since it's fast, Parselmouth pitch last since it's slow). WebSocket or SSE would make the experience much more responsive instead of waiting for everything to finish.
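A rough sketch of the completion-order idea, using only asyncio from the standard library (Python 3.9+ for asyncio.to_thread). The stage functions, their sleep times, and their return values are placeholders; a real endpoint would push each result over SSE or a WebSocket where this sketch just records the order.

```python
import asyncio
import time

def loudness_stage():
    time.sleep(0.01)           # placeholder for fast RMS / LUFS work
    return {"lufs": -18.0}     # made-up value

def pitch_stage():
    time.sleep(0.2)            # placeholder for slow pitch extraction
    return {"mean_f0": 120.0}  # made-up value

async def run_stage(name, fn):
    # Run the blocking stage in a worker thread so the others can proceed
    return name, await asyncio.to_thread(fn)

async def stream_results(stages):
    """Yield (name, result) pairs in completion order, not submission order."""
    tasks = [asyncio.create_task(run_stage(n, f)) for n, f in stages.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def collect():
    order = []
    stages = {"pitch": pitch_stage, "loudness": loudness_stage}
    async for name, result in stream_results(stages):
        order.append(name)  # a real endpoint would send a message here
    return order

order = asyncio.run(collect())
```

Even though pitch is submitted first, loudness is yielded first because it finishes sooner. Whether threads give a real speedup depends on how much of each stage runs outside the interpreter lock, so measure before committing to this design.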
Wrapping up
You don't need a deep learning model to do useful audio feature extraction. Classical signal processing (FFTs, autocorrelation, bandpass filters) gives you metrics people can actually work with. "Pitch variability is 30% below the reference" is actionable. "Latent embedding is 0.3 cosine distance from the reference" is not.
librosa handles most general audio work well. Parselmouth fills the gap for speech-specific analysis where Praat's decades of phonetics research matter. scipy for spectral analysis and pydub for silence detection round it out. Not the most sophisticated stack, but it ships and it's debuggable.
If you're working through similar problems, feel free to reach out.