
Introduction

In this article we will give you an elementary introduction to quality in speech recognition and how we work with it, as well as some guiding steps on what you and we can do to improve the quality of a tracker, should you experience difficulties.

Working with trackers, and speech analytics in general, is often a fine and delicate balance between two primary factors:

  1. Precision - how often the “hits” are correct

  2. Recall - how many of the total actual hits the recogniser finds

We use these two factors as a way to measure tracker quality. If you aim for 100% precision you will sacrifice recall, and vice versa: if you aim for maximum recall you will sacrifice precision.

In other words, striking the optimal balance between these two factors can be tricky, but it is highly important and thus something we are continuously working to improve.
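
To make the two factors concrete, here is a minimal sketch in Python with invented counts (the function names are ours for illustration, not Capturi's API):

```python
# Minimal sketch of precision and recall, using invented counts.
# "true_hits" are correct detections, "false_hits" are incorrect detections,
# and "missed" are actual occurrences the recogniser did not find.

def precision(true_hits: int, false_hits: int) -> float:
    # Of everything marked as a hit, how large a share was correct?
    return true_hits / (true_hits + false_hits)

def recall(true_hits: int, missed: int) -> float:
    # Of everything actually said, how large a share was found?
    return true_hits / (true_hits + missed)

# Example: 80 correct hits, 20 false hits, 40 missed occurrences.
print(precision(80, 20))  # 0.8  -> 80% of the hits are correct
print(recall(80, 40))     # ~0.67 -> two thirds of occurrences were found
```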


Let’s start with a bit about quality in general, to get a basic idea of the factors that may influence your experience when working with speech recognition. 


Factors affecting the quality of speech recognition

When transforming speech to text, there are a lot of different factors that may affect the quality of the outcome.


Going into detail on this is not the objective of this article, but to give you a general understanding we have listed some of the factors below, as well as a brief comment about each.



The list below is neither exhaustive nor in any specific order but gives you insight into some of the external and internal factors. 


External factors

  • Recording quality
    Just like playing Mozart on a mobile phone makes it difficult to hear the nuances of the music, compared to listening on a proper set of hi-fi speakers, speech recognition software will struggle to hear the nuances and distinguish between words if the recording quality is poor.   

    In practice you can often identify poor quality when it sounds like your agents are speaking into a tin can, there are dropouts, and words are cut off or muffled.


  • Cross talk
    This is only a problem for recordings made in mono, which unfortunately is still the majority. In mono both speakers share the same audio track, so words get mixed acoustically when both speakers talk at the same time. It is not a problem on stereo recordings, where each speaker is on a separate track and the voices therefore never “overlap”.


  • Noise
    This could be all sorts of noise, like background talk, music, wind, traffic, heavy breathing etc. In an effort to mitigate the effect of the noise on the speech, we have an industrial research collaboration with Aalborg University where we are working on removing noise pollution on recordings.   


  • Pronunciation
    Given that Capturi’s main method for detecting a word is how that word “sounds”, the pronunciation is obviously highly important for the quality of the output. 


    • Incorrect pronunciation
      Often a problem with longer and more complex words, product names, words from foreign languages etc. A classic example is elderly people who have problems pronouncing modern product names.  


    • Dialects
      The same word can often be pronounced in various ways depending on the dialect. You probably know from experience that it can be difficult to understand people from other regions of your country.

      While you may struggle to interpret what your uncle from the other side of the country is saying, Capturi can make up for this variation by adding multiple phonetic transcriptions of the same word. If you find that Capturi does not catch these variations, please collect examples and contact support, and we will make sure to add them to the pronunciation dictionary.


    • Mumbling
      Let’s just be honest - a lot of us mumble, speak too fast or too low, cut off words and sentences etc. All of which adds to the difficulty of detecting what is actually being said.

  • Mic placement

    We often experience that the sound from agents is quite low or fluctuates across calls. This is often because they are not careful to place the boom correctly. The microphone should ideally face the mouth and be placed around 2.5 cm (1 inch) away from it. This helps reduce noise from the microphone touching the face and from breathing.

How to optimise conditions for speech analytics software

What you can do to ensure optimal conditions for the speech analytics software:

  • Instruct your agents to speak clearly. Not only is this highly important for your customers’ understanding; it will also improve the tracker quality in Capturi.

  • Get agents to repeat what customers are saying, if they can hear that the pronunciation is off. It helps alignment in the conversation as well as the output in Capturi.

  • Reduce the noise level in the call center to minimise background noise. No customer likes to be disturbed by background music or chatter from your colleagues anyway.


  • Make sure the agents use headsets with a boom, that the boom is placed correctly, and that the headset settings are correctly configured. Some headsets even have settings that are optimised for speech analytics.


  • If possible, make sure to check that the recording setting on your phone system is set to “high”. Don’t compromise quality because the phone system wishes to compress the file as much as possible to reduce storage space. Get maximum quality and manage storage by reducing the time you keep the recordings instead. Capturi has no extra cost on storage, and it is much easier to access the recordings there anyway.

Internal factors - software

This is where it might get a bit tricky to follow, but give it a try.

We continually test new methods for doing speech recognition and analysis and for the time being the best tool for the job is the speech recognition toolkit named Kaldi. It is highly adaptable and extendable and is widely used for speech recognition research.

The versatility comes at a cost, though, since one still must assemble the parts – choose a feature extraction frontend, acoustic model structure, training strategy etc.

Figuratively speaking, Kaldi is not a take-away restaurant, but rather a well-equipped kitchen.

Capturi makes use of three models:

  1. An acoustic model
    Assigns a likelihood to each phoneme of the given language for the “sounds” (phones) produced when a person utters a word or phrase.


  2. A pronunciation dictionary
    All words known to the model are listed in the pronunciation dictionary, which maps each word (the orthography) to the phoneme sequence that makes it up.


  3. A language model
    Assigns a probability to all (probable) word sequences.
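
As a toy illustration of how the three models cooperate - this is not Kaldi's actual decoder, and all words, phonemes and probabilities below are invented - consider two homophones that only the language model can tell apart:

```python
# Toy decoding sketch: pronunciation dictionary, acoustic model and
# language model working together. A real Kaldi decoder searches
# lattices of log-likelihoods instead of simple dictionaries.

# 2. Pronunciation dictionary: word (orthography) -> phoneme sequences.
#    A word may have several entries, e.g. to cover dialect variants.
PRONUNCIATIONS = {
    "sale": [("s", "ei", "l")],
    "sail": [("s", "ei", "l")],  # homophone: same phonemes as "sale"
}

# 1. Acoustic model (toy): how well do the observed phones match a
#    candidate phoneme sequence? Here simply the fraction that matches.
def acoustic_score(observed, phonemes):
    hits = sum(o == p for o, p in zip(observed, phonemes))
    return hits / max(len(observed), len(phonemes))

# 3. Language model (toy): probability of a word given the previous word.
BIGRAMS = {("summer", "sale"): 0.02, ("summer", "sail"): 0.0001}

def lm_score(word, previous):
    return BIGRAMS.get((previous, word), 1e-6)

# The decoder picks the word that maximises acoustic * language evidence.
observed = ("s", "ei", "l")
best = max(
    PRONUNCIATIONS,
    key=lambda w: max(acoustic_score(observed, p) for p in PRONUNCIATIONS[w])
                  * lm_score(w, "summer"),
)
print(best)  # -> "sale": the acoustics tie, so the language model decides
```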

There are a number of example setups for training acoustic models and language models in the Kaldi repository [2] – also one for Danish. However, the included setup is relatively basic in a number of ways and not anywhere near being state-of-the-art accuracy-wise.

Some of our improvements include:

  • Adding a domain relevant data augmentation strategy

  • Adding more domain relevant acoustic training data

  • Domain adaptation of language models

  • Upgrading the pronunciation dictionary

  • Upgrading the neural network topology

The first two bullets relate to the robustness of the recogniser. We wish for it to work in challenging settings like noisy or reverberant environments, but the available training data is clean read speech. We therefore augment (corrupt) the training data with noise and reverberation in order to harden the model. Moreover, since the target domain is phone calls, we’ve added a set of telephone recordings, and we’ve frequency-limited the non-telephone data set.
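
As a minimal sketch of the noise part of this augmentation (assuming numpy; the real pipeline also covers reverberation and telephone band-limiting), a clean training utterance can be mixed with noise at a chosen signal-to-noise ratio:

```python
import numpy as np

# Minimal sketch of noise augmentation: mix a clean training utterance
# with noise at a chosen signal-to-noise ratio (SNR) to harden the model.

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(clean)]          # trim noise to the utterance length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so the mixture has the requested SNR.
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + noise

# Example with synthetic signals (stand-ins for real recordings):
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone
noise = rng.normal(size=16000)
augmented = add_noise(clean, noise, snr_db=10.0)  # a fairly noisy copy
```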

We’ve used our combined knowledge as well as several iterations of experiments in order to choose the augmentation scheme best suited to our main domains:

  • Customer service

  • Support and service desk

  • Sales

This is of course something that needs frequent updating, as our knowledge of the domains increases and new tools and approaches are uncovered in the research community.

Adapting the language model

Speech recognisers use statistical language models to know which sequences of words are likely to occur and which are not, and these models should be trained on text as close to the target domain as possible.

For instance, in a sales conversation, it is far more likely to encounter the words "the price is ..." compared to "the rice is ...".
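
As a toy sketch of this with invented probabilities (real language models use far larger contexts and corpora), a simple bigram model scores the two phrases very differently:

```python
# Toy bigram language model: P(word | previous word), with invented
# probabilities. Real models use far larger contexts and corpora.
BIGRAM_PROB = {
    ("the", "price"): 0.030,
    ("the", "rice"): 0.0005,
    ("price", "is"): 0.200,
    ("rice", "is"): 0.150,
}

def phrase_probability(words):
    # Multiply the conditional probabilities along the phrase;
    # unseen word pairs get a tiny fallback probability.
    prob = 1.0
    for pair in zip(words, words[1:]):
        prob *= BIGRAM_PROB.get(pair, 1e-6)
    return prob

print(phrase_probability(["the", "price", "is"]))  # 0.006
print(phrase_probability(["the", "rice", "is"]))   # ~7.5e-05
```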

For that reason, we’ve created our own proprietary models.

Our existing language models are trained on three different corpora:


  1. A small corpus of highly curated text (KorpusDK from the Society for Danish Language and Literature)

  2. A small corpus of conversation-like text (subtitles)

  3. A huge corpus of non-curated text (Common Crawl)

We create a combined language model by interpolating the three corpora, using a development corpus of transcribed speech to determine the weights.
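
A minimal sketch of the interpolation, assuming unigram models for brevity and invented probabilities (real models condition on word history, and the weights are chosen to fit the development corpus best):

```python
# Minimal sketch of language-model interpolation over three corpora.
# Probabilities are invented; real models are n-gram or neural models.

curated  = {"price": 0.0010, "interest": 0.0020}   # KorpusDK-style text
subtitle = {"price": 0.0015, "interest": 0.0005}   # conversation-like text
crawl    = {"price": 0.0008, "interest": 0.0012}   # Common Crawl-style text

# Interpolation weights (summing to 1), tuned on transcribed speech.
WEIGHTS = (0.3, 0.5, 0.2)

def interpolated_prob(word: str) -> float:
    models = (curated, subtitle, crawl)
    return sum(w * m.get(word, 0.0) for w, m in zip(WEIGHTS, models))

print(interpolated_prob("price"))  # 0.3*0.0010 + 0.5*0.0015 + 0.2*0.0008
```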

Adapting Capturi to new customers

Every time a new customer joins Capturi, we do two preliminary things to ensure that the foundation for delivering a good experience (quality) for that specific customer is in place.


  1. We strengthen the Capturi language model by adding text scraped from the company’s website and potentially other websites in the same sector. This improves the statistical language model’s ability to know which sequences of words are likely to occur and which are not, by adding material from the customer’s own domain. A small sketch of the idea is shown after this list.


  2. We upgrade the pronunciation dictionary, adding domain- and customer-specific words that may be unknown to the Capturi dictionary (which already contains more than 75k active words). This is a manual process and requires linguistic knowledge to ensure that each word has the correct phoneme sequence.
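
Here is a small sketch of the idea in step 1, using the placeholder URL example.com and the requests/BeautifulSoup libraries; the real scraper is more thorough about crawling, deduplication and boilerplate removal:

```python
import requests
from bs4 import BeautifulSoup

# Minimal sketch of collecting website text for language-model adaptation.
# The URL is a placeholder; a real scraper would crawl many pages and
# clean the text far more carefully.

def page_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible content
        tag.decompose()
    # Collapse whitespace into single spaces.
    return " ".join(soup.get_text(separator=" ").split())

corpus_snippet = page_text("https://example.com")  # hypothetical page
```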

The perceived quality of speech recognition

We can use precision and recall as a basis for discussing how you perceive the quality of your trackers, depending on the use case you are working on. Sometimes recall is the most important part, and other times you want to favour precision. 


Let’s take a few examples: 


  1. Quality assurance
    Let’s say you have a standard phrase that should always be mentioned in a given context, and you want to monitor how well the agents remember this. In this case you will/should filter your data set to match the given context, and thus expect a hit on all the conversations you find. So you will want to favour recall over precision, as you have narrowed down the data set significantly and expect a high hit percentage.


  2. Finding examples for review, training and giving feedback 
    In this case you often want high precision, and don’t necessarily care too much about recall. You just want to go straight to the examples and not waste time listening through false hits. 


  3. Monitoring “behaviour”
    This may be twofold:

    1. Monitoring expected good behaviour, for example the correct introduction phrase: “welcome to Capturi”. Here you want high recall to catch as many examples as possible, as you expect a high hit percentage and thus a limited number of false hits.


    2. Monitoring unexpected and undesirable behaviour, for example inappropriate language. This should be expected to happen rarely, and you want to make sure that the hits you get are correct. Thus, you should favour precision over recall.

 

So, based on these examples the common rule of thumb should be: 


  1. If you expect a high volume of hits, you can accept lower precision and higher recall, as the number of false hits will be relatively limited, both percentage-wise and volume-wise.

  2. If you expect a low volume of hits, you want high precision, as favouring recall is more likely to result in a relatively large percentage (and possibly volume) of false hits. The worked example below illustrates why.
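
A worked example with invented numbers: assume a tracker that finds 95% of true occurrences (recall) and also fires falsely on 5% of the conversations where the phrase does not occur.

```python
# Worked example (invented numbers): why expected hit volume matters.

def hits(total_calls, occurrence_rate, recall=0.95, false_fire_rate=0.05):
    true_hits = total_calls * occurrence_rate * recall
    false_hits = total_calls * (1 - occurrence_rate) * false_fire_rate
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision

# Phrase said in 80% of 1000 calls: false hits are a small fraction.
print(hits(1000, 0.80))  # (760.0, 10.0, ~0.99)

# Phrase said in 1% of 1000 calls: false hits outnumber true hits,
# so the same recogniser behaviour yields very low precision.
print(hits(1000, 0.01))  # (9.5, 49.5, ~0.16)
```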

My tracker is off - what’s up?

When working with your own trackers you are, at some point, likely to experience four generic challenges:  


  1. Too many hits - Recall is good but the precision is off - you get most of the right hits but also a lot of false hits.  


  2. Too few hits - Precision is good but recall is too low - the hits are correct, but the system does not find enough of the possible hits.


  3. Correct hits - wrong context - Capturi finds the right words, but the context is wrong. 


  4. The wrong hits - both precision and recall are off.


 

To mitigate these challenges there are several things to try, some of which you can do yourself and others which the Capturi team can do.

Below is a list of the possibilities you/we have: 

Level 1: In-app adjustments

  • Check sound quality
    Review your conversations according to the various factors described earlier. Perhaps the external factors make it unnecessarily difficult for the recogniser.


  • Check segmentation - is it the correct speaker?
    Especially relevant if the tracker is limited to only listening to either the agent or the customer. Try setting it to listen to “all” and see if that changes the hit rate.

    (Why is segmentation an issue? If your recordings are in mono and Capturi is splitting the track into employee and customer tracks, there may be areas where words are “assigned” to the wrong speaker. This is especially a challenge when cross talk occurs and when the speakers switch from one to the other. In this transition, the first word of the next speaker may be recognised as being said by the former speaker. So, if the phrase you are looking for is often the first word in a switch between two speakers, it may be placed in the wrong “track”.)

  • Add words to the near and not near setting
    Adds context to the phrases. Especially good if you get the right hits, but the context is wrong. 


  • Narrow down the “search” by adjusting the “active in…”
    If you are looking for phrases that you know are primarily said at the beginning or end of conversations, then try adjusting the “active in” functionality.


  • Boost recall by adding multiple long phrases with the target word

    Most often there are multiple ways to phrase the same sentence. If you only use one or two examples, you are likely to sacrifice recall. Sometimes you need to add 10 variations of the same phrase to make sure you have it covered. There is no limit to how many phrases you can add, so be creative.


  • Boost phrases

    Your customer success manager can help boost a phrase to increase recall. Boost should be used with caution but is great for things that you know occur frequently at specific times in the conversation, or in specific scenarios.


  • Increasing precision
    The opposite of boosting. Increases precision by increasing the threshold of how certain the recogniser should be before marking a hit. Can be done by your customer success manager.