Sorry, I Didn’t Understand That

Almost all visions of “the future” include computers that we can talk to. Something about language seems central to our understanding of intelligence, and often it is taken as a given that intelligent machines will be conversant with humans.

After early dialogue systems failed to scale up to anything genuinely useful, A.I. researchers largely gave up on them for many years. More recently, however, there’s been a resurgence of interest in spoken dialogue systems (SDS), in both academic and commercial spheres. Notable commercial applications include Siri (Apple), Alexa (Amazon), and Cortana (Microsoft)[1]. Google is also at work on a voice assistant, and many start-ups are working on language interfaces[2].

But as anyone who’s ever struggled with Siri will know, fully conversant language interfaces are still a long way off. Speech recognition has made great progress[3], but figuring out what to do with that speech-to-text transcription is another matter entirely. Humans, of course, converse with remarkable fluidity and ease, despite all the difficulties inherent in producing and processing speech at such a breakneck pace (Garrod and Pickering, 2004). But we also have access to an incredibly rich store of extra-linguistic knowledge – social customs and practices, embodied experience in the world, and so on[4] – which is very difficult to hand-code into a machine.

Thus, what we’re left with are machines that can understand only a small set of inputs – which raises the problem of habitability.

The Habitability Problem

Let’s say we build a machine that can carry out 3 different kinds of tasks (e.g. turn the lights on/off, turn the AC on/off, and open/close the garage door). For each of those tasks, the machine knows 2 ways the command can be phrased in English (“turn on the lights” and “can you turn the lights on?”). That means there are 6 sentence formats the machine understands. We’ll call this range of “accepted input” Gm. This is the machine’s model of what language is.

Now we have a person who wants to use this machine. We’ll assume the person is a native English speaker, and knows many more words and ways to say things than the machine does – including many more ways to request those 3 kinds of tasks. For example, the person might say “it’s getting pretty dark in here” as a request to turn on the lights. We’ll call this set of utterances GE, which is (roughly) the set of ways to say all the things one might want to say in English.

By definition, Gm is smaller than GE. The habitability problem is the problem of a person thinking that Gm is larger than it is – in other words, thinking that the machine will be able to understand something that it doesn’t (Watt, 1968; Bernstein & Kaufmann, 2006; Moore, 2016).

Ideally, the person will quickly learn the parameters of Gm, and so come to know what the machine does and doesn’t understand. But this can be difficult. Unlike humans, so-called “intelligent machines” are often pretty dumb, and fail on even a simple variation of an accepted input. For example, a machine might understand “turn on the lights”, but not “turn the lights on”.
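To make the habitability gap concrete, here’s a minimal sketch in Python, with invented example sentences, of a machine whose Gm is just a couple of exact sentence patterns. Ordinary English paraphrases from GE fall straight through:

```python
# A toy Gm: the exact sentence patterns the machine accepts (invented examples).
GM = {"turn on the lights", "can you turn the lights on?"}

def understands(utterance: str) -> bool:
    """True only if the utterance falls inside Gm (here, an exact string match)."""
    return utterance.strip().lower() in GM

print(understands("Turn on the lights"))                # True: inside Gm
print(understands("Turn the lights on"))                # False: a trivial reordering falls outside Gm
print(understands("It's getting pretty dark in here"))  # False: perfectly fine English (GE), but not in Gm
```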

People are understandably frustrated when a machine doesn’t understand them. So how do we get people to avoid using language the machine won’t understand?

Is Language All-Or-None?

Since a fully fluent machine is still very much intractable, designers focus on building systems that translate a restricted set of inputs (Gm) into a restricted set of actions (Am) – a setup sketched in code just after the list below. The implicit assumptions are that:

  1. Users will learn Gm quickly and without too much frustration.
  2. Even though Gm is small, it is still useful and desirable to have (as opposed to, say, just a touch interface).
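Concretely, such a system boils down to a lookup from Gm into Am, plus a fallback for everything else. The sketch below is hypothetical – the utterances, action labels, and exact-string matching are all invented for illustration:

```python
from typing import Optional

# Hypothetical mapping from accepted utterances (Gm) to a small set of actions (Am).
GM_TO_AM = {
    "turn on the lights": "LIGHTS_ON",
    "can you turn the lights on?": "LIGHTS_ON",
    "turn on the ac": "AC_ON",
    "can you turn the ac on?": "AC_ON",
    "open the garage door": "GARAGE_OPEN",
    "can you open the garage door?": "GARAGE_OPEN",
}

def interpret(utterance: str) -> Optional[str]:
    """Map an utterance in Gm to an action in Am; None means 'outside Gm'."""
    return GM_TO_AM.get(utterance.strip().lower())

for utterance in ["Can you turn the AC on?", "It's getting pretty dark in here"]:
    action = interpret(utterance)
    print(action if action else "Sorry, I didn't understand that.")
```

Everything hinges on how gracefully that fallback branch is handled – which is exactly where the two assumptions above start to matter.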

 

Moore (2016) suggests that these assumptions may not hold, and in fact argues that spoken language could be “all or none”. His argument is that language might be fundamentally different from other ways of interacting with a system. In order for a language interface to be functional and usable, he argues, it should be fully conversant – otherwise the advantages of allowing language input simply do not outweigh the frustration it causes.

He compares the problem to the uncanny valley: the phenomenon whereby making a robot more humanlike increases likability up to a certain point, at which the resemblance – but not equivalence – induces a sense of “creepiness” and unease. Similarly, as we add functionality to a language interface, it becomes more and more difficult for people to learn the parameters of what that interface can understand, resulting in many communicative missteps. This is illustrated in the figure below, taken from his paper. Note that the figure is not based on any actual data, and it’s odd that the curve dips “below” the box itself, suggesting that he perhaps should have opted for a visualization that includes negative y-values.

 

[Figure from Moore (2016), illustrating an uncanny-valley-like dip in usability as a language interface’s functionality increases.]

Not all hope is lost, however. Moore suggests that the way forward is to study how humans converse with “mismatched” partners – that is, interlocutors with less linguistic knowledge than their own. This happens in conversations with children (e.g. child-directed speech), with non-native speakers who don’t know the language very well, or even with dogs, who have a very limited model of language.

A Problem of Interfaces

Is Moore correct?

Some of you might recognize this problem as an instantiation of a more general issue I’ve discussed before – the problem of “Trust”, or building an accurate model of what a machine can and can’t do. It’s generally discussed in relation to A.I., but it’s actually a problem faced by anyone building any sort of technology.

Crucially, a good interface intuitively constrains the actions a user can and can’t take to interact with a piece of technology. For example, in order to make use of a car’s basic functionality, a driver simply has to learn that one pedal makes the car go and another makes it stop, that the wheel turns it in different directions, and that the car can be put in at least two modes (“park” and “drive”). A mechanic needs to know more, of course, but this level of abstraction generally suits a driver’s daily needs[5].

Moore seems to doubt the possibility that a language interface can be constrained in this way. While I agree that language is a difficult problem, I think people can still learn the kinds of language use allowed by any given machine.

As an example, I point to Google (and other search engines). People who grew up with early search engines learned that certain ways of phrasing a query were much more successful than others; in other words, users acquired a sort of “Google-language”. While people generally don’t make full use of Google’s functionality (and, of course, Google has gotten much better at understanding different kinds of queries), I think the point still holds: people adapted their language input to what they thought the system understood.

Another argument comes from research on how people talk to each other. Moore (2016) himself suggests we look to this research, so I doubt he’d disagree with this conclusion. A powerful piece of evidence comes from work on conversational alignment (Garrod and Pickering, 2004; Brennan and Clark, 1996; Branigan et al., 2007), in which people start to take on the speech patterns of the person they’re talking to, including their choice of words, their speaking rate, and even their syntax.

One solution, then, is to get people to learn Gm (the utterances a machine understands) by exploiting this phenomenon of alignment – for example, if a machine only understands a certain set of words, it should use exactly those words in its own responses, which makes the person more likely to use those words in turn (as opposed to some synonym)[6].
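As a rough sketch of how a designer might exploit this (the utterances and the word-overlap heuristic below are invented, not any particular system’s method): when the user strays outside Gm, the machine replies with phrasings drawn from Gm itself, so that its own wording primes the user’s next attempt.

```python
# Hypothetical sketch: the machine's replies reuse its own accepted phrasings,
# nudging users to align with Gm rather than sticking with their original wording.
GM = {
    "turn on the lights", "turn off the lights",
    "turn on the ac", "turn off the ac",
}

def respond(utterance: str) -> str:
    utterance = utterance.strip().lower()
    if utterance in GM:
        return f"OK: {utterance}."

    def overlap(accepted: str) -> int:
        # Crude similarity: how many words the accepted phrasing shares with the input.
        return len(set(accepted.split()) & set(utterance.split()))

    # Offer the closest in-Gm phrasings, so the user hears the machine's own wording.
    suggestions = sorted(GM, key=overlap, reverse=True)[:2]
    return "Sorry, I didn't understand that. You can say: " + " or ".join(
        f'"{s}"' for s in suggestions
    )

print(respond("turn on the lights"))
print(respond("switch the lamps on"))   # the reply suggests in-Gm phrasings to align to
```

A real system would need something smarter than word overlap, but the design principle is the same: the machine’s own language serves as a model of Gm for the user.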

The Takeaways

We’re still a long way off from general-purpose language interfaces. Someday, maybe, we’ll be able to talk to our machines about anything we please – but for now, they only understand a limited set of sentences.

This can be frustrating, but it doesn’t have to be. What’s important is that the machine makes its affordances clear – what it can do, and what it can’t. Some interfaces, like Siri, already recognize this: you can ask Siri “what can you do”, and it will display a set of supported actions and example sentences. Fundamentally, this is a problem of interface design. My argument here is that we can design language interfaces that elicit the right sort of language from the user – that is, language the machine will understand.
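In code, that kind of discoverability can be as simple as making “what can you do” part of Gm itself, with an answer that enumerates the rest of Gm. A hypothetical sketch (the command table and action labels are invented):

```python
# Hypothetical sketch: "what can you do" is itself part of Gm, and its answer
# enumerates the machine's affordances (an action label plus an example sentence).
GM_TO_AM = {
    "turn on the lights": "LIGHTS_ON",
    "turn off the lights": "LIGHTS_OFF",
    "open the garage door": "GARAGE_OPEN",
}

def what_can_you_do() -> str:
    lines = ["Here's what I can do:"]
    for sentence, action in sorted(GM_TO_AM.items()):
        lines.append(f'  - {action}: say "{sentence}"')
    return "\n".join(lines)

def handle(utterance: str) -> str:
    utterance = utterance.strip().lower().rstrip("?")
    if utterance in {"what can you do", "help"}:
        return what_can_you_do()
    return GM_TO_AM.get(utterance, "Sorry, I didn't understand that.")

print(handle("What can you do?"))
print(handle("turn off the lights"))   # -> LIGHTS_OFF
```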

And what does this mean for the average user of an interface like Siri or Alexa? Until these language interfaces incorporate such design considerations, we’ll have to pay closer attention to what they do and don’t understand. Importantly, we must recognize that, unlike with a person, small changes in how we say something can determine whether or not the machine understands it.


References:

Minsky, M. (1974). A Framework for Representing Knowledge. MIT AI Laboratory Memo 306.

Bernstein, A., & Kaufmann, E. (2006). GINO – A Guided Input Natural Language Ontology Editor. In The Semantic Web – ISWC 2006, LNCS 4273, 144–157. http://doi.org/10.1007/11926078_11

Moore, R. K. (2016). Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction. In Dialogues with Social Robots. Springer.

Watt, W. C. (1968). Habitability. American Documentation, 19(3).

Garrod, S., & Pickering, M. J. (2004). Why is conversation so easy? Trends in Cognitive Sciences, 8(1), 8–11. http://doi.org/10.1016/j.tics.2003.10.016

Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(6), 1482.

Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104(2), 163-197.


Footnotes:

[1] Another blog post ought to deal with the problematic phenomenon of making all these voice assistants female by default, which probably says something about our assumptions regarding gender and power structures.

[2] Examples include: Wit.ai, Meet Sally, Rasa, and more.

[3] Though, importantly, recognition of speech with accents other than what’s usually called “General American” is still difficult – again, another blog post will deal with this issue.

[4] These are sometimes called frames (Minsky, 1974).

[5] It’s only when a car breaks down that a driver realizes the extent to which they truly don’t understand how their car works.

[6] A future post will discuss in more detail a paper by Branigan et al. (2011) showing lexical alignment between people and machines.
