The accuracy of a speech recognizer depends to a large degree on the quality of the speech being recognized.

Distorted speech, heavy background noise, crosstalk, and heavy signal processing generally lead to a high number of recognition errors.

It's also worth knowing that a lossy transformation of the audio (that is, a processing step that removes or alters part of the speech signal, such as MP3 compression or excessive downsampling) can't be undone by converting the lossy audio back to, say, 16 kHz WAV.
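
To illustrate the point, here is a minimal sketch that downsamples a file to 8 kHz and then converts it back to 16 kHz. It assumes sox and soxi are installed; roundtrip_demo and the file names narrow.wav and restored.wav are hypothetical placeholders.

```shell
# Hypothetical demo: downsample to 8 kHz, then convert back to 16 kHz.
# Assumes sox and soxi are installed; output file names are placeholders.
roundtrip_demo() {
    in=$1
    # Lossy step: an 8 kHz sample rate cannot carry content above 4 kHz.
    sox "$in" -r 8000 narrow.wav
    # "Restore" the sample rate; the removed content does not come back.
    sox narrow.wav -r 16000 restored.wav
    echo "original: $(soxi -r "$in") Hz"
    echo "restored: $(soxi -r restored.wav) Hz"
}
```

Calling roundtrip_demo input.wav on a 16 kHz file reports 16000 Hz for both files, yet restored.wav still only contains the frequency content that survived the 8 kHz step.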

In order to optimize the audio for recognition you should therefore:

  • Avoid clipping the signal, e.g. by over-amplifying it into saturation

  • Keep each speaker on a separate channel (e.g. stereo) if possible

  • Disable (or reduce) any processing of the audio (like gating, filtering, noise reduction, etc.)

  • Don't downsample the audio to below 16 kHz if possible (audio that originates as 8 kHz telephone audio is fine)

  • Avoid compressing the audio unnecessarily
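
Some of the points above can be spot-checked from the command line. The following is a minimal sketch, assuming sox and soxi are installed; rate_verdict and inspect_audio are hypothetical helper names, and the verdict labels simply mirror the guidance in the list rather than any official threshold.

```shell
# Hypothetical helper: classify a sample rate (in Hz) against the list above.
rate_verdict() {
    rate=$1
    if [ "$rate" -ge 16000 ]; then
        echo "ok"
    elif [ "$rate" -eq 8000 ]; then
        echo "ok-telephone"   # acceptable if the source really is telephone audio
    else
        echo "low"
    fi
}

# Hypothetical helper: print sample rate, channel count and peak statistics.
# Assumes sox and soxi are installed.
inspect_audio() {
    file=$1
    rate=$(soxi -r "$file")
    channels=$(soxi -c "$file")
    echo "sample rate: $rate Hz ($(rate_verdict "$rate"))"
    echo "channels:    $channels"
    # The stats effect writes to stderr. A peak level at or near 0 dB
    # together with a high flat factor suggests clipping.
    sox "$file" -n stats 2>&1 | grep -E 'Pk lev dB|Flat factor'
}
```

For example, inspect_audio input.wav prints the sample rate with its verdict, the number of channels, and the peak statistics reported by sox.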

Recommended codec settings are:

  1. Uncompressed PCM (WAV), 16 kHz, 16 bits/sample

  2. FLAC, 16 kHz, 16 bits/sample

  3. Opus, complexity 10

  4. MP3, 64 kbit/s stereo or 32 kbit/s mono

Converting/compressing audio files

Avoid, if at all possible, downsampling the audio with lame, since it uses an extremely weak anti-aliasing filter, which results in aliasing. If you need to downsample the signal before storing it as MP3, use a tool such as sox instead.

Downsample using sox and convert to MP3 using lame:

$ sox input.wav -r 16000 -t wav - | lame -b 64 -q 0 - output.mp3

Convert to Opus using opusenc (complexity is 10 by default):

$ opusenc input.wav output.opus
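
The commands above can be combined with the WAV and FLAC settings from the recommended list in one small wrapper. This is only a sketch: convert_for_asr and the _16k output naming are hypothetical, sox, lame and opusenc are assumed to be installed, and writing FLAC assumes a sox build with FLAC support.

```shell
# Hypothetical wrapper: convert_for_asr INPUT FORMAT
# FORMAT is one of: wav, flac, mp3, opus
convert_for_asr() {
    in=$1
    fmt=$2
    out="${in%.*}_16k.$fmt"   # e.g. input.wav -> input_16k.flac
    case "$fmt" in
        wav|flac)
            # -r sets the output sample rate, -b the bits per sample;
            # sox infers the output format from the file extension.
            sox "$in" -r 16000 -b 16 "$out" ;;
        mp3)
            # Resample with sox, then encode with lame (as shown above).
            sox "$in" -r 16000 -t wav - | lame -b 64 -q 0 - "$out" ;;
        opus)
            # opusenc uses complexity 10 unless told otherwise.
            opusenc "$in" "$out" ;;
        *)
            echo "unsupported format: $fmt" >&2
            return 1 ;;
    esac
}
```

For example, convert_for_asr input.wav flac writes input_16k.flac. For mono recordings, the lame bitrate in the mp3 branch could be lowered to 32 to match the recommended mono setting.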