Friday, September 7, 2018

Voice User Interface

A week ago I attended a meetup that featured an Amazon engineer (Echo Show focus... yes, it was fascinating...). Her presentation was about how they program the device for speech recognition.

It's WAY more involved (tedious?) than programming for visuals. In a typical visual program you can easily lay out all of the options you will allow a user, and they must choose one of them. The way each of us speaks varies a lot, and that variation makes programming for voice recognition particularly challenging.

The programming example she used was a "skill" for choosing a dog, in which Alexa asks the user, "What size dog do you want?"

With a visual program, you could handle this in a variety of ways that make getting a usable answer easy: show the options (small, medium, large) as buttons, or even use a sliding scale from "tiny" to "huge."

The engineer described the difficulty of programming for voice... say the user wants a big dog... how do you process all the possible words for "big"? And the better the skill is at "conversing," the more likely the user is to answer with something strange:

{big, large, a hundred pounds, colossal, enormous, gigantic, huge, immense, massive, substantial, vast, a whale of a, ginormous, hulking, humongous, jumbo, mammoth, the size of a baby elephant, super colossal}

... and if the system is to respond, each option must be programmed into the skill...
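
To make the tedium concrete, here's a minimal sketch of the idea in Python -- my own illustration, not actual Alexa Skills Kit code (the real platform handles this with custom slot types and synonym lists in the interaction model). The SIZE_SYNONYMS table and resolve_size function are hypothetical names; the point is that every phrasing a developer fails to anticipate falls through to "Sorry, I didn't get that."

from typing import Optional

# A toy sketch of the synonym problem: every phrasing of "big" the skill
# should understand has to be anticipated and mapped to one canonical value.
# (Hypothetical names -- an illustration, not real Alexa Skills Kit code.)
SIZE_SYNONYMS = {
    "large": {
        "big", "large", "a hundred pounds", "colossal", "enormous",
        "gigantic", "huge", "immense", "massive", "substantial", "vast",
        "a whale of a", "ginormous", "hulking", "humongous", "jumbo",
        "mammoth", "the size of a baby elephant", "super colossal",
    },
    "medium": {"medium", "mid-sized", "average", "normal"},
    "small": {"small", "little", "tiny", "teacup", "pocket-sized"},
}

def resolve_size(slot_phrase: str) -> Optional[str]:
    """Map the user's phrasing to a canonical size, if we anticipated it."""
    phrase = slot_phrase.strip().lower()
    for canonical, synonyms in SIZE_SYNONYMS.items():
        if phrase in synonyms:
            return canonical
    return None  # the user said something nobody programmed for

print(resolve_size("ginormous"))                    # large
print(resolve_size("the size of a baby elephant"))  # large
print(resolve_size("roughly Clifford-sized"))       # None -> reprompt

Compare that with the visual version: three buttons, zero synonym tables.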

All of which makes me quite sure that visuals need to be integrated into even the best voice-programmed devices -- trying to program for all of the nuances of voice communication would be extremely difficult.

We make requests, build relationships, and get some information through speech.

We get most of our information visually. I don't ask a smart speaker for the weekly weather report, because listening to it read aloud necessarily takes more time -- but I would use speech to request the weekly forecast if I could then SEE it. Making the request by voice saves me time, and seeing the result saves me time.

Voice plus display will save time and be more efficient...


From Medium.com:

VUI allows for hands free, efficient interactions that are more ‘human’ in nature than any other form of user interface. “Speech is the fundamental means of human communication,” writes Clifford Nass, Stanford researcher and co-author of Wired for Speech, “…all cultures persuade, inform and build relationships primarily through speech.” In order to create VUI systems that work, developers need to fully understand the intricacies of human communication. Consumers expect a certain level of fluency in human idiosyncrasies, as well as a more conversational tone from the bots and virtual assistants they’re interacting with on a near-daily basis.


Natural Language Processing — we’re not currently capable of developing a VUI with an inbuilt, natural and complex understanding of human communication, not yet. Regional accents, slang, conversational nuance, sarcasm… some humans struggle with these aspects of communication, so at this point can we really expect much more from a machine?

Visual feedback — including an element of visual feedback helps to reduce the level of frustration and confusion in users who aren’t sure whether or not the device is listening to or understanding what they’re saying. Alexa’s blue light ring, for example, visually communicates the device’s current status e.g. when the device is connecting to the WIFI network, whether or not ‘do not disturb mode’ has been activated, and when Alexa is getting ready to respond to a question…etc.

Thanks, Carmen O!
