Anyone who’s ever used Siri has likely experienced the frustration of not being understood. There’s a fundamental – almost existential – panic that surfaces when someone else doesn’t know what you’re saying.
It’s an even bigger problem for people with accents other than “General American English”. Voice interfaces struggle with accents, from regional American accents to foreign-accented speech. But why?
It all comes back to data. Most modern A.I. applications are built using machine learning, which fits a predictive model to a large set of examples so that it learns to map inputs to outputs. In the case of speech recognition, these models are generally trained using audio files as inputs and transcriptions of those files as outputs. Thus, the model that’s generated depends heavily on the data used to train it: it learns the speech patterns present in the training set.
Unfortunately, the go-to training sets for speech recognition tend to be dominated by General American English speakers. So while the model might learn to recognize words produced in this accent very well, it will have a much harder time with, say, Southern American English or a Boston accent. A litany of different North American accents and their respective features is described here.
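To make this concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the two accents "A" and "B", the words "pen" and "pin", the one-dimensional "acoustic feature" values, and the nearest-centroid classifier standing in for a speech recognizer. Real systems are vastly more complex, but the same effect shows up: a model trained mostly on one accent draws its decision boundary where that accent needs it, and the under-represented accent pays the price.

```python
import random

# Hypothetical 1-D "acoustic feature" means for two words in two accents.
# The numbers are made up; accent B is simply shifted relative to accent A.
WORD_MEANS = {
    "A": {"pen": 0.0, "pin": 2.0},
    "B": {"pen": 1.5, "pin": 3.5},
}
NOISE = 0.5  # standard deviation of the feature around each mean


def sample(rng, accent, n_per_word):
    """Draw n_per_word (feature, word) pairs for each word in one accent."""
    return [
        (rng.gauss(mu, NOISE), word)
        for word, mu in WORD_MEANS[accent].items()
        for _ in range(n_per_word)
    ]


def train(examples):
    """Nearest-centroid 'model': the average feature value seen for each word."""
    by_word = {}
    for x, word in examples:
        by_word.setdefault(word, []).append(x)
    return {word: sum(xs) / len(xs) for word, xs in by_word.items()}


def predict(model, x):
    """Return the word whose centroid is closest to the feature value x."""
    return min(model, key=lambda word: abs(model[word] - x))


def accuracy(model, examples):
    """Fraction of examples whose word is predicted correctly."""
    correct = sum(predict(model, x) == word for x, word in examples)
    return correct / len(examples)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Held-out test data: 1,000 samples of each word, per accent.
test_sets = {accent: sample(rng, accent, 1000) for accent in WORD_MEANS}

# Training set dominated by accent A (95% A, 5% B).
skewed_model = train(sample(rng, "A", 475) + sample(rng, "B", 25))
skewed = {a: accuracy(skewed_model, test_sets[a]) for a in test_sets}

for accent, acc in skewed.items():
    print(f"accent {accent}: {acc:.0%} of words recognized")
```

In this sketch the learned centroids sit almost exactly where accent A puts them, so accent B's "pen" often lands on the wrong side of the boundary and gets heard as "pin".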
To some, this might not seem like a big deal. But there are both social and financial implications. As voice interfaces become more common – not just in phones, but cars, smart homes, etc. – people with other accents will be out of luck. Eventually, they’ll look for better options, which means that the best way to retain that share of the market is to build better speech recognition technology that understands a variety of accents. Companies such as Baidu, in China, have apparently recognized this problem, and are consciously building models that can recognize a wider variety of languages and accents.
There are a couple of possible solutions. One solution is to get ahold of a more diverse dataset. This should be done anyway, and it will almost certainly improve the accuracy and applicability of the technology. Unfortunately, gathering data is time-consuming and costly; researchers need to collect the audio, then painstakingly transcribe it. Given the amount of data required – often hundreds of hours per accent – this takes a while. Still, it’s necessary, so any voice interface company not doing this already needs to get to work.
Another solution, which can be pursued at the same time, is to look at how humans understand accented speech. People struggle with accents, to be sure, but we usually don’t need hundreds of hours to learn to understand someone we meet. Often, it seems like we can “calibrate” our language model to someone new, such that we infer what they were trying to say, given what they said; in other words, our expectations about how a particular speaker sounds constrain how we process their speech, and those expectations differ from one accent to the next. Of course, the research is far from settled on the matter, but labs such as the Language Acquisition and Sound Recognition Lab at UC San Diego are actively working on this question.
People can adapt to a new accent fairly quickly; most current speech recognition systems can’t. If voice interface companies want to remain competitive, they should start worrying about the growing population of customers whose accents differ from General American English.
One way to do this is to get more diverse data; another way is to invest more research into how it is that people adapt to new accents. Either way, it’s a problem that needs solving.
(Note: I should thank fellow UCSD Cognitive Science graduate student Reina Mizrahi for bringing this problem to my attention in the first place; a future post will discuss her research on how children understand accented speech in more detail.)
 More controversially called “Standard American English”, GAE isn’t meant to imply that all (or even most) Americans have this accent. Rather, it seems to be the accent preferred by public speakers such as newscasters or politicians (except, notably, when politicians are attempting to express cultural affinity with a particular accent). Many people with regional accents who go into a public speaking role work to reduce their accent, since certain accents sometimes lead listeners (wrongly) to make inferences about the speaker’s intelligence, class, status, and so on. This is certainly problematic and is, in fact, probably part of the reason why machine learning researchers haven’t thought to train their models on many regional accents.
 How this predictive model is generated depends on which machine learning algorithm is used. Algorithms range considerably in complexity, from simple approaches like linear regression to deep learning with neural networks.
 This is not unlike the problem of “racist algorithms”, in which machine learning models take on the biases of their datasets. Work being done to raise awareness of this problem and to combat it is discussed here and here.