Many things in our lives have rhythms: music, poetry, the pace at which we walk, and even the rate at which we talk.
One of the marvels of everyday conversation – overlooked, perhaps, because it seems so obvious and so easy – is turn-taking. That is, when one speaker finishes talking, someone else usually starts talking right away. Not only is this transition between speakers relatively immediate and surprisingly free of overlaps (both speakers talking at the same time), but the new utterance is also typically related to the previous turn – speakers aren’t just producing random sequences of words. This phenomenon happens largely outside the realm of our conscious experience; despite this (or because of it), turn-taking is remarkably successful, and remarkably fast.
This observation raises several empirically testable questions. The one I’ll discuss in this blog post is: which factors affect the speed of turn-taking? We’ve all had the experience of talking to somebody whose responses seem to take eons; alternatively, perhaps we’ve felt like the slower one, while the other speaker runs circles around us. The speed of turn-taking is sometimes argued to be culturally dependent; people in New York City are said to talk (and transition between speakers) very quickly, while those in Texas are said to talk (and transition between speakers) more slowly. Are these alleged cultural differences real, or imagined?
Turn-Taking across Cultures
Recently, a group of linguists studied turn-taking in speakers of 10 languages, spanning 5 continents and ranging from major world languages to those spoken by indigenous peoples in Papua New Guinea (Stivers et al., 2009). They were interested in precisely the question asked above: to what extent do cultures vary in the speed at which speakers take turns talking?
Across languages, the mean response time – the average time it took for a speaker to respond to somebody else – was 208ms. Each of the ten languages surveyed had a mean response time falling within a ~250ms window around that mean.
Speakers in all ten cultures surveyed responded remarkably fast, with some (e.g. Japanese speakers) responding in under 10ms on average! Similarly, each of the cultures surveyed showed a unimodal distribution of response times with a peak around 200ms, demonstrating that the most frequent response time across languages was ~200ms (see Figure 2).
While there was still considerable variability in the means across cultures (see Figure 1), the authors argue that this variability is not evidence for “fundamentally different types of turn-taking systems in the different languages”, instead suggesting that these differences are essentially variations along a universal turn-taking system (Stivers et al., 2009, pg. 4). This argument is supported by the finding that turn delays – i.e., situations in which speakers take longer than usual to respond – are also caused by equivalent factors across languages. For example, speakers provide confirmation responses (e.g. “Yes”) about 100-500ms faster than disconfirmation responses. This is consistent with other research suggesting that certain responses, such as rejections, are “dispreferred”, and thus might be either more delayed or contain more explanations and justifications (Pomerantz & Heritage, 2013).
The speed at which humans take turns talking may appear, at first thought, to be a strange topic of investigation. But it’s important to anyone interested in the origins of language and human communication, or even, as the authors suggest, questions about the origins of human social dynamics. A fundamental feature of language is that it is used by humans to communicate with other humans, in the communicative system we call conversation. This system has many essential “rules” that we all implicitly learn and follow, one of which is that we take turns talking. From this perspective, the finding of commonalities in turn-taking timing across languages from disparate language families is groundbreaking because it suggests that this turn-taking system is quite old, evolutionarily speaking.
But of course, there are still questions that remain. For one, the authors surveyed 10 languages, across very diverse language families and cultures, but there are obviously many more than 10 languages in the world. A more comprehensive survey would include more languages, and also compare speakers of the same language across regions (e.g. NYC residents to Texans).
Perhaps even more intriguing is the question of how humans manage to transition between speakers so quickly at all. A mean turn-transition time of ~200ms (and as low as ~7ms in some languages!) does not leave room to first identify that the first speaker is done talking, process fully what they’ve said, then plan what you want to say next (and then actually say it, of course). All this points to a role for ongoing prediction in language comprehension and production. But which cues precisely help a listener identify that a speaker is almost done talking? How are these cognitive processes (speech comprehension, utterance planning, etc.) coordinated neurally? And what constraints do these time pressures place on the comprehension and production of language in real-time, dynamic conversations?
Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., … & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26), 10587-10592.
Pomerantz, A., & Heritage, J. (2013). Preference. The Handbook of Conversation Analysis, 210-228.
Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32(5), 429-448.
 I’m using commonalities instead of so-called “universals” because making claims about any sort of language universal is inherently fraught. See Evans & Levinson (2009) for more.