Time to Hang up on Voice

I recently spent some time playing with the new Moto 360 smartwatch to develop some intuitions about watches as a form factor for wearable technology. And it spurred some thinking about the utility of voice commands as an interface.  

Voice is not the future of interface design, and it won't become the future even as voice technology improves. Speaking is not a great way to interact with the world. So, while voice technology is worthwhile and has a role, I think it is one of the most over-hyped technologies in interface design today.

There are some meaningful applications for voice commands and dictation. Voice is convenient for people who have a hard time interfacing with computers for various reasons (literacy barriers, difficulty typing, age). It is also a good option for people who use languages that are difficult to type on a keyboard, like Chinese; voice notes, for example, are common in messaging products across Asia.

Voice can be convenient while driving, although let's not forget that this is a short-term use case, since we won't be driving ourselves much in the future.

However, I believe that the science-fiction world we grew up with, where Jean-Luc Picard and his friends on the USS Enterprise mostly interfaced with computers by speaking to them, is just wrong. Thanks to the growth of typing and photography, speech is less important than it has ever been in human history. Attempting to retrofit computers with speech input will not reverse the fact that, in the future, people are likely to speak less, not more.

Beyond that, voice isn’t the killer interface for interacting with technology for a few key reasons.

First, aural space is precious, and it doesn't multiplex: everyone within earshot shares a single channel, and you can't send many voice signals over it at once.

Think of a world where, instead of being on a train with everyone silently texting and reading, you were on a train full of people talking to their devices. It would be distracting and basically impossible to deal with. So much so that the second you wanted to use voice, you would have to introduce noise-canceling headphones or some sort of “cone of silence” technology right along with it, a hard barrier to cross in public spaces. You might argue that this problem is limited to a few settings, but as offices swap private spaces for open floor plans, it seems increasingly real.

Second, voice messes with audience control: it is the anti-Snapchat. One of the most important aspects of modern text communication is being able to control the distribution of your expression. I can be in the back of a cab or in an open bullpen at work and hold private conversations with limited audiences over text. Now imagine a world where all expression were spoken. There is much I would want to express that I couldn't say with other people around. Because voice is public to everyone around you, it limits what you can say.

Third, gestures and typing are more efficient than speaking, for several reasons. To start, voice expression tends to be concurrent with thought: there is generally no editing layer to a conversation; you say things as you think of them. This means that what is said is usually less precise than what is written, and therefore harder for the recipient to consume efficiently. You could theoretically edit voice memos before sending them, but if you have ever tried to edit by voice, describing edit commands turns out to be far more time-consuming than editing by pointing.

Typing and gestures can also simply be faster than speech. I can type as fast as I can speak, if not faster, and it is generally easier to point at options than to describe them verbally. Context and application switching is much easier with gestures than with speech: think how hard it would be to drive Excel through a voice interface, or how much slower it would be to switch applications by verbal command than with Command-Tab.

Finally, voice is less efficient than typing or gestures because speech is fundamentally single-threaded: you can only hold one conversation at a time, whereas with gestures it is easy to point at different things simultaneously. This single-threadedness might be fine if you think of your interface to the world as a single “HAL,” but the reality of how we interact with the things around us will, I hope, be more nuanced than that.

A lot of this was evident while using the Motorola smartwatch. I found myself unable to issue commands from the back of a cab with music playing, and I was reluctant to send my wife a note while eating in a noisy cafe. When I needed directions, I consistently felt it would have been faster to take out my phone and type than to search by speech.

The upshot is that voice recognition is a classic example of a technology that excites engineers because it is challenging. Certain companies like it because it represents a theoretically simple model of interaction in which you deal with one monolithic application (which they want to be). While I'm glad that companies and engineers find it intriguing, and there are some useful applications, its importance is frequently overstated from a product and interface perspective, resulting in over-investment in products that rely on voice as an input.