The accuracy of a speech recognizer depends to a large degree on the quality of the speech being recognized.

Distorted speech, heavy background noise, cross-talk, and heavy audio processing will generally result in a high number of recognition errors.

It's also worth knowing that a lossy transformation of the audio – that is, a processing step that removes or alters part of the speech signal, such as MP3 compression or excessive down sampling – can't be undone by converting the lossy audio back to e.g. 16 kHz WAV.
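
One rough way to see this for yourself (a sketch assuming a SoX build with MP3 and spectrogram support plus lame; file names and the deliberately low bitrate are hypothetical) is to compare spectrograms before and after a lossy round trip – the band-limiting introduced by the lossy step is still visible in the "restored" 16 kHz WAV:

$ lame -b 32 input.wav lossy.mp3                  # deliberately low bitrate so the damage is easy to see
$ sox lossy.mp3 restored.wav rate 16000           # "restore" to 16 kHz WAV
$ sox input.wav -n spectrogram -o original.png
$ sox restored.wav -n spectrogram -o restored.png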

In order to optimize the audio for recognition you should therefore:

  • Avoid clipping the signal – e.g. by over-amplifying it into saturation

  • Keep one speaker per channel (stereo) if possible – see the channel-split example after this list

  • Disable (or reduce) any processing of the audio (like gating, filtering, noise reduction, etc.)

  • Don't down sample the audio to below 16 kHz if you can avoid it (audio that is natively 8 kHz, such as telephone audio, is fine)

  • Avoid compressing the audio unnecessarily.
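
A quick way to verify the sample rate, bit depth and channel layout of a recording, and to split a stereo file into one file per speaker channel (a sketch assuming SoX is installed; file names are hypothetical):

$ soxi input.wav                     # shows channels, sample rate, bit depth, duration
$ sox stereo.wav left.wav remix 1    # left channel / speaker 1
$ sox stereo.wav right.wav remix 2   # right channel / speaker 2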

Recommended codec settings are:

  1. Uncompressed PCM (WAV), 16 kHz, 16 bits/sample

  2. FLAC, 16 kHz, 16 bits/sample

  3. OPUS, complexity 10

  4. MP3, 64 kb/sec stereo or 32 kb/sec mono

Converting/compressing audio files

Avoid – if at all possible – down sampling the audio using lame, since it uses an extremely weak anti-aliasing filter (resulting in aliasing). Use e.g. sox if you need to down sample the signal before storing it as mp3.

Down sample using sox and convert to mp3 using lame:

$ sox input.wav -r 16000 -t wav - | lame -b 64 -q 0 - output.mp3
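
The corresponding pipeline for a mono recording, matching the 32 kb/sec mono recommendation above (a sketch with hypothetical file names):

$ sox input.wav -r 16000 -c 1 -t wav - | lame -b 32 -q 0 - output.mp3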

Convert to opus using opusenc (complexity == 10 by default):

$ opusenc input.wav output.opus
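
If you need 16 kHz, 16 bits/sample WAV or FLAC instead (recommendations 1 and 2 above), sox can produce both directly (a sketch assuming a sox build with FLAC support; file names are hypothetical):

$ sox input.wav -r 16000 -b 16 output.wav
$ sox input.wav -r 16000 -b 16 output.flac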