Magic of synchronous conversation

Some quick unfiltered takes on what makes synchronous communication better than asynchronous communication.

I just read this pitch for a “hyperphone” by @TsviBT, an interface that enables high-bandwidth asynchronous conversation, branching/threads, etc. I’m pretty excited by this sort of thing, especially for enabling conversations with larger numbers of simultaneous participants.

It does, however, feel like something is inevitably lost with these interfaces, some kind of magic that only real-time (often in-person) communication can provide, and reading that post prompted me to think a bit about what exactly this magic is. A few thoughts come to mind:

Conversation is more collaborative than we realize.

We often describe conversation as fundamentally about taking turns sharing ideas, and it’s usually considered polite to give people the space to express themselves individually, one at a time: one person speaks, the listener processes their speech and then responds. There is, however, a lot of subtle back-and-forth coordination happening in conversations through small facial expressions or verbal interjections (called backchanneling in linguistics).

This first became really salient to me when I learned how it works in Japanese. There, listeners constantly interject while the speaker is talking (backchanneling about 3x as often as in English), and there is a concept called aizuchi (literally “mutual hammer”) for how a listener participates in collaboratively constructing a conversation. (I highly recommend this short video by a linguist summarizing some studies on the topic: Aizuchi: Why it’s impolite not to “chime in” in Japanese.)

The part of this that stands out to me is just how short these feedback loops are; a person gets some kind of response within a second or two of speaking. It seems likely to me that all this backchanneling does a lot more than communicate respect: it makes the information exchange far more efficient by signaling things like level of surprise, understanding, interest, readiness to respond, or much subtler mismatches between what is being said and what is being heard.

Synchrony prompts our inner simulator.

There is also a related predictive-processing / inner-simulator story for why this synchrony is important: speaking to someone who is actively listening forces you to maintain a good inner simulation of them in order to accurately predict their interjections and responses. Because of this, I think being physically immersed in an environment where you are likely to get immediate responses to your speech ends up prompting you in ways that can make communication even more efficient.

My go-to example for this “inner simulator” thing is the anecdote of being in school, getting stuck on a class problem, going up to the teacher to ask for help, and then having the answer magically pop into my head. Something about putting myself in an environment where the teacher was about to respond allowed me to instantly predict their responses, even though that prediction was inaccessible to me from my desk.

The prediction requires immersion, and that immersion seems really hard to replicate. I think there is something pretty different about speaking when you can really feel that someone is, in that moment, actively listening to you.

Hierarchical communication

To be unnecessarily speculative and theoretical, I think there is also a case that high-fidelity inner sims can effectively multiply the amount of communication taking place by a significant factor. If each participant is maintaining a high-fidelity model of their partner, we actually have a hierarchical system that looks a bit like this:

[Diagram: each participant’s self-model in constant exchange with their internal simulation of the other participant, with a much narrower spoken channel connecting the two participants.]

It could be that a large fraction of the “communication” taking place is actually happening between each participant’s self model and the simulation of their partner, saving inter-participant communication for areas of high uncertainty, or where the simulation is most likely to diverge from the ground truth.

To be even more unnecessarily speculative, there might be a direct analogy here to active learning on human preferences in ML. There, a system typically learns from a model of human preferences rather than from the human directly, requesting human feedback only for the samples where the reward model is most uncertain. This lets the full system make the most efficient use of the human’s time, which is generally the biggest bottleneck.
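As a rough sketch of that querying pattern (nothing here is from the hyperphone post; the ensemble, threshold, and helper names are all made up for illustration), you could gate queries to the human on how much a small ensemble of reward models disagrees about a sample:

```python
import random
import statistics

def label_with_active_queries(samples, ensemble, ask_human, uncertainty_threshold=0.2):
    """Label samples, asking the human only where the reward-model
    ensemble disagrees most (a stand-in for 'high uncertainty')."""
    labels = []
    for sample in samples:
        scores = [model(sample) for model in ensemble]
        uncertainty = statistics.pstdev(scores)     # disagreement across models
        if uncertainty > uncertainty_threshold:
            labels.append(ask_human(sample))        # expensive: real human feedback
        else:
            labels.append(statistics.mean(scores))  # cheap: trust the learned model
    return labels

# Toy usage: three noisy "reward models" and a stand-in human oracle.
ensemble = [lambda x, noise=n: x + random.gauss(0, noise) for n in (0.05, 0.1, 0.3)]
ask_human = lambda x: x  # pretend the human returns the true value
print(label_with_active_queries([0.1, 0.5, 0.9], ensemble, ask_human))
```

In the conversation analogy, the cheap branch is your inner sim of your partner, and the expensive ask_human branch is actually saying something out loud and waiting for a reply.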

In this analogy, in-person verbal communication is probably already highly optimized: the scarce spoken channel gets spent where the participants’ models of each other are most likely to be wrong, which increases the total amount of information being exchanged.

Final notes

I’m not sure how to go about replicating this magic offline, in more asynchronous ways. One first-pass idea is to try to replicate the feeling of “being listened to” via AI-generated simulations of active listening, but this still feels quite far from what you can get from a human listener, and it’s hard to get these simulations to stay tethered to a human-generated ground truth.

Overall though, it does seem like there is a big upside to making significant improvements to how people communicate. The tradeoffs of asynchronous communication interfaces don’t mean the payoff isn’t well worth it, though it seems especially interesting to think about how we might improve or expand on synchronous communication as well.