The Rhythm of Conversation (pt. 2)

People take turns talking during conversation. As discussed previously, these turn transitions are remarkably fast and happen largely outside our conscious awareness. This raises the obvious question: how do speakers manage to transition between turns so quickly, and so successfully?

Why Conversation is Remarkable

Conversations, for the most part, do not follow any explicit script. Unlike formal debates, speeches, or other prescriptive ceremonies, there are no predetermined rules about the topic of conversation, or who gets to speak (and when). Rather, these features of conversation are managed locally between the participants (Wilson & Wilson, 2005).

Consider an imagined conversation between two speakers, Paul and Sue. Before Paul starts talking, Sue is not told what Paul will talk about[1], nor when he will be finished talking. In order to say something suitable in reply, Sue needs to: 1) determine what Paul is talking about; 2) determine when Paul will be done speaking; 3) start constructing a response that’s somehow relevant to what Paul is talking about; and 4) produce that response soon after (but not before) Paul is done talking. It’s also possible that Sue will miscalculate when Paul is finished talking, and thus accidentally interrupt his conversational turn; in these cases, both Paul and Sue need to rapidly work out who gets to continue talking. All of these calculations are performed on the order of milliseconds, perhaps even while both speakers are thinking about what they plan to cook for dinner.

As further illustration of why this is remarkable, think about conversations you’ve had that seemed awkward or somehow “wrong”. Often, this sense of awkwardness comes from deviations from the standard turn-taking system. Maybe Sue takes too long to respond, or maybe she responds too quickly. Maybe Sue says something completely irrelevant to what Paul said. Maybe Sue’s response suggests that she wasn’t really listening, but was instead just thinking about what she would say next – and really, given the demands of conversation, can she be blamed?

Turn-taking and Oscillations

So how do listeners form such precise predictions about when a turn will end? Wilson & Wilson (2005) argue that listeners are likely opportunistic in which cues they use[2], but that ultimately, the fluidity of this process is enabled by entrainment between the two speakers.

Entrainment is the process whereby, over time, two or more rhythmic systems (e.g. oscillators) adopt the same or compatible frequencies. This occurs at many levels: a roomful of people clapping often converges on the same frequency, resulting in the emergent phenomenon of synchronized applause (Néda et al., 2000); two people walking side-by-side tend to adopt similar gaits (van Ulzen et al., 2009); and crucially, people conversing with one another entrain along a number of dimensions, including posture (Shockley et al., 2003), breathing patterns (McFarland, 2001), and even the words they use (Garrod & Pickering, 2004).
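To make the idea of entrainment concrete, here is a minimal sketch of two coupled phase oscillators. Everything in it is an illustrative assumption rather than part of Wilson & Wilson's account: the sine-coupling rule is a standard textbook model of synchronization, and the function name `simulate_pair`, the frequencies, and the coupling strengths are made up for the example.

```python
import math


def simulate_pair(freq_a_hz, freq_b_hz, coupling, duration_s=20.0, dt=0.001):
    """Euler-integrate two mutually coupled phase oscillators.

    Returns the average frequency (in Hz) of each oscillator over the
    second half of the run, after any initial transient has faded.
    All parameter values are illustrative assumptions.
    """
    omega_a = 2 * math.pi * freq_a_hz  # natural angular frequency of A
    omega_b = 2 * math.pi * freq_b_hz  # natural angular frequency of B
    phase_a, phase_b = 0.0, 0.0
    steps = round(duration_s / dt)
    half = steps // 2
    start_a, start_b = 0.0, 0.0
    for step in range(steps):
        if step == half:
            # remember the phases at the midpoint of the run
            start_a, start_b = phase_a, phase_b
        # each oscillator is nudged toward the other's current phase
        rate_a = omega_a + coupling * math.sin(phase_b - phase_a)
        rate_b = omega_b + coupling * math.sin(phase_a - phase_b)
        phase_a += rate_a * dt
        phase_b += rate_b * dt
    elapsed = (steps - half) * dt
    mean_a = (phase_a - start_a) / (2 * math.pi * elapsed)
    mean_b = (phase_b - start_b) / (2 * math.pi * elapsed)
    return mean_a, mean_b


if __name__ == "__main__":
    print("uncoupled:", simulate_pair(4.0, 5.0, coupling=0.0))
    print("coupled:  ", simulate_pair(4.0, 5.0, coupling=5.0))
```

With no coupling, the two oscillators simply keep their own rates (4 and 5 cycles per second here). With a strong enough coupling, both drift to a shared rate between the two, which is the sense of "adopting the same frequency" used above.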

In this case, Wilson & Wilson (2005) argue that listeners become attuned to the rate of syllable production in particular. Natural human speech is segmented into syllables, and each syllable has an onset and an offset – in other words, there are points in time at which any given speaker is maximally “ready” to begin a syllable, and points at which they are minimally “ready”. The authors suggest that these readiness cycles become inversely related in speakers and listeners: “the listener’s potential to initiate is at a minimum when the speaker begins a syllable and is at a maximum when the speaker is mid-syllable” (Wilson & Wilson, 2005, p. 962).

Put another way, speakers and listeners are constantly “alternating” in their readiness to speak; this alternation process should become more refined over the course of a conversation, as speakers and listeners align their speech rates (Street, 1984).
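The alternation described above can be sketched as two readiness curves that are exact mirror images of each other. The cosine shape, the 0-to-1 scale, and the function names are assumptions made purely for illustration; the source only specifies where the listener's readiness is lowest (syllable onset) and highest (mid-syllable).

```python
import math


def speaker_readiness(t, syllable_rate_hz):
    """Readiness in [0, 1], peaking at each syllable onset (t = 0, 1/rate, ...).

    The cosine shape is an assumed, illustrative choice.
    """
    return 0.5 * (1 + math.cos(2 * math.pi * syllable_rate_hz * t))


def listener_readiness(t, syllable_rate_hz):
    """Counter-phase readiness: lowest at syllable onset, highest mid-syllable."""
    return 1.0 - speaker_readiness(t, syllable_rate_hz)


def best_entry_times(syllable_rate_hz, n_syllables):
    """Times (in seconds) at which listener readiness peaks.

    Under the assumed cosine shape, this is the midpoint of each syllable cycle.
    """
    period = 1.0 / syllable_rate_hz
    return [(k + 0.5) * period for k in range(n_syllables)]
```

For example, at an assumed rate of 5 syllables per second, each cycle lasts 200 ms, so in this sketch the listener's readiness peaks 100 ms after each syllable onset and bottoms out at the onset itself.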

This model also predicts that conversations should become more difficult to sustain as the number of participants grows, presumably because every listener entrained to the same speaker becomes ready to speak at the same moments. Based on personal experience, I’ve found this to be largely true: the gaps between turns grow, and it becomes more likely that two people will begin speaking at the same time. Inevitably, a group of ten splinters into two or three smaller groups. And in larger groups, the conversational floor is usually dominated by one or two speakers, the “storytellers” of the group.
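As a back-of-the-envelope illustration of why simultaneous starts become more likely in bigger groups, suppose that all listeners share the same moments of peak readiness, and that each one independently tries to take the floor at such a moment with some fixed probability. Both suppositions, along with the function name and the probability used below, are assumptions made here for illustration, not claims from Wilson & Wilson (2005).

```python
def collision_probability(n_listeners, p_start):
    """Chance that two or more listeners start speaking at the same opportunity.

    Assumes each listener independently starts with probability p_start
    (a hypothetical parameter, not an empirical estimate).
    """
    p_none = (1 - p_start) ** n_listeners
    p_exactly_one = n_listeners * p_start * (1 - p_start) ** (n_listeners - 1)
    return 1 - p_none - p_exactly_one


if __name__ == "__main__":
    # group sizes of 2, 3, 5 and 10 people, i.e. one speaker plus the listeners
    for listeners in (1, 2, 4, 9):
        print(listeners, round(collision_probability(listeners, 0.2), 3))
```

With an assumed starting probability of 0.2, the chance of a clash is zero with one listener, 0.04 with two, and about 0.56 with nine. The exact numbers mean little, but the trend matches the experience of larger groups described above.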

The Takeaway

Wilson & Wilson (2005) proposed a model to explain why we are so successful at turn-taking. The model is impressive in the simplicity of its assumptions: it assumes only that speakers and listeners unconsciously entrain their rates of syllable production. Of course, there are other possible explanations. As the authors point out, listeners are likely opportunistic in the turn-taking cues they sample during conversation. Nonetheless, it’s quite possible that syllable rate is a consistent cue used by many speakers across cultures and situations – though this claim, of course, requires extensive experimentation and observation.

Again, perhaps the easiest way to understand the centrality of this turn-taking process, and its temporal precision, is to consider situations in which it breaks down. For example, Skype conversations often become “awkward” or disjointed when the Internet connection is poor; even small delays in the signal’s transmission affect the fluidity of the interaction. These delays disrupt our ability to easily understand and produce speech, often resulting in extreme frustration.

But this observation raises yet another question: given the difficulty we experience when the typical timing of turn-taking is adjusted even incrementally, to what extent is our ability to comprehend and produce language dependent on this process?


[1] Of course, Sue might have predictions about what Paul will say. These predictions could be conditioned on a whole slew of factors, including previous interactions with Paul or the situational context.

[2] These could range from what the speaker says (e.g. syntactic markers that a turn is almost done) to what the speaker does (e.g. eye movements, gestures, etc.). Furthermore, listeners might indicate their intention to speak through interjections, sharp intakes of breath, or other signals.

