Speech Recognition and Natural Dialogue
with Virtual Agents
By James Hammerton
May 2007
Today, many call centers employ speech applications where the customer speaks to
a computer rather than to an agent. The benefits of doing so are well known:
providing 24/7 coverage, reducing costs, and lessening the burden on call center
agents by handling simple tasks automatically. Rather than replacing human
agents, speech applications complement them, handling the more mundane tasks
while leaving humans to deal with more complicated tasks or with situations
where a customer insists on talking to an agent.
Typically, the speech applications employed today are rigid, menu-driven
applications, asking users for one piece of information at a time. However,
recent advances in speech-recognition technology and dialogue systems support
more natural forms of dialogue. This enables callers to conduct their business
more quickly, increases the range of calls that can be handled automatically,
and makes the speech applications easier to use.
Natural dialogue in speech applications:
Supporting natural dialogue poses a number of challenges. The grammars, which
define the utterances that can be understood by the system, have to be larger,
taking more effort to develop. Larger grammars also increase the risk of error
as the system has more options from which to choose, and there is a greater risk
that a caller's utterance will be ambiguous, requiring intelligent handling to
resolve the ambiguity. Once an application does more than ask for a single
piece of information at a time, there is also a greater risk that it can follow
a completely inappropriate path because of errors. These challenges have been
addressed by the development of a range of new technologies.
Statistical Language Models (SLMs):
Used both in conjunction with, and in place of, traditional grammars, SLMs
compute the probability of a word occurring, usually based on the two previous
words. This information is used to help decide what the caller has said. SLMs
are effective at improving speech-recognition performance but require large
numbers of transcribed utterances to provide the necessary statistics.
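As a rough illustration of the idea, the following minimal sketch estimates the probability of a word from the two words before it by counting over transcribed utterances. The function names and sample transcripts are invented for this example, and it uses plain relative frequencies with no smoothing, which a production SLM would normally add.

```python
from collections import defaultdict

def train_trigram_model(transcripts):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(lambda: defaultdict(int))
    for utterance in transcripts:
        # Pad with start markers so the first words also have two predecessors.
        words = ["<s>", "<s>"] + utterance.lower().split()
        for i in range(2, len(words)):
            context = (words[i - 2], words[i - 1])
            counts[context][words[i]] += 1
    return counts

def word_probability(counts, prev2, prev1, word):
    """Estimate P(word | prev2 prev1) by relative frequency (no smoothing)."""
    following = counts.get((prev2, prev1))
    if not following:
        return 0.0  # context never seen in the transcripts
    return following.get(word, 0) / sum(following.values())

# Hypothetical transcribed utterances, invented for illustration.
transcripts = [
    "i want to pay my bill",
    "i want to check my balance",
    "i want to pay by card",
]
model = train_trigram_model(transcripts)
p_pay = word_probability(model, "want", "to", "pay")      # seen 2 times out of 3
p_check = word_probability(model, "want", "to", "check")  # seen 1 time out of 3
```

With only three transcripts the estimates are crude, which is why large numbers of transcribed utterances are needed in practice.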
Auto-generation of grammars:
This reduces the burden on grammar writers. Although grammar writing cannot be
totally automated, it is possible, for example, to provide grammars that have
slots in them so that they merely need customization, rather than every single
phrase being written from scratch.
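One way to picture a grammar with slots is as a phrase template whose placeholders are filled from short word lists. The sketch below is only an assumption about how such customization might look; the template syntax, slot names, and values are hypothetical and do not reflect any particular grammar format.

```python
from itertools import product

def expand_template(template, slot_values):
    """Generate every phrase covered by a template with named slots."""
    # Only expand the slots that actually appear in this template.
    slots = [name for name in slot_values if "{" + name + "}" in template]
    phrases = []
    for combo in product(*(slot_values[name] for name in slots)):
        phrases.append(template.format(**dict(zip(slots, combo))))
    return phrases

# Hypothetical template and slot values: only these need customizing.
template = "i want to {action} my {item}"
slot_values = {
    "action": ["pay", "check"],
    "item": ["bill", "balance"],
}
phrases = expand_template(template, slot_values)
```

Here the grammar writer supplies two short lists instead of writing all four phrases by hand.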
Backing-off:
The recognition system first tries a grammar tailored to the current application
prompt, and if it doesn't get a match, it then backs off to a grammar covering a
wider context. This allows narrow-coverage grammars to be exploited for their
good recognition performance while enabling wide-coverage grammars to be
exploited for their flexibility.
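The back-off order can be sketched as a loop over grammars from narrowest to widest. In this simplified example, exact text matching stands in for real speech recognition, and the grammar names, phrases, and meanings are all hypothetical.

```python
def recognize_with_backoff(utterance, grammars):
    """Try each grammar in order (narrowest first) and return the first match.

    `grammars` is a list of (name, {phrase: meaning}) pairs. Text lookup is a
    stand-in for real recognition against a grammar.
    """
    text = utterance.lower().strip()
    for name, phrases in grammars:
        if text in phrases:
            return name, phrases[text]
    return None, None  # nothing matched, even in the widest grammar

# Hypothetical grammars: one tailored to a yes/no prompt, one covering more.
prompt_grammar = {"yes": {"confirm": True}, "no": {"confirm": False}}
wide_grammar = {
    "check my balance": {"intent": "balance"},
    "speak to an agent": {"intent": "agent"},
}
grammars = [("prompt", prompt_grammar), ("wide", wide_grammar)]

narrow_hit = recognize_with_backoff("Yes", grammars)
wide_hit = recognize_with_backoff("speak to an agent", grammars)
```

The wide grammar is consulted only when the prompt-specific grammar fails, so the common case keeps the accuracy of the narrow grammar.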
N-best recognition results:
A speech recognizer returns several best guesses of what was said, and the
application chooses the guesses that best match the current context.
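A minimal sketch of this selection step follows, assuming the recognizer returns (text, confidence) pairs and the application knows which values make sense at the current point in the dialogue. The function name and sample data are invented for illustration.

```python
def pick_from_n_best(n_best, expected_values):
    """Return the highest-confidence hypothesis that fits the current context."""
    for text, _confidence in sorted(n_best, key=lambda h: h[1], reverse=True):
        if text in expected_values:
            return text
    return None  # no hypothesis fits the context

# Hypothetical n-best list for a "Where are you flying to?" prompt.
n_best = [("austin", 0.48), ("boston", 0.45), ("lost in", 0.07)]
# Hypothetical context: destinations this airline actually serves.
served_destinations = {"boston", "denver"}

choice = pick_from_n_best(n_best, served_destinations)
```

The top guess, "austin", is skipped because it does not fit the context, and the second guess is used instead.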
Look-ahead capabilities:
This is where the dialogue engine is aware of the future paths an application
can take and uses this information to match user input to both current and
future fields as needed.
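The following sketch shows one way look-ahead might work for a simple form, assuming the recognizer has already mapped the caller's words to field names. The field names and helper function are hypothetical.

```python
def fill_fields(field_order, filled, recognized):
    """Store values for any known field, current or future, then return the
    next field that still needs a prompt (or None when the form is complete)."""
    for name, value in recognized.items():
        if name in field_order:
            filled[name] = value
    for name in field_order:
        if name not in filled:
            return name
    return None

# Hypothetical travel-booking form; the system has only asked for the origin.
field_order = ["origin", "destination", "date"]
filled = {}
# The caller answered "from Boston to Denver", covering a future field too.
recognized = {"origin": "boston", "destination": "denver"}

next_field = fill_fields(field_order, filled, recognized)
```

Because the destination was captured early, the application can skip that prompt and ask for the date next.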
Value confirmation:
The dialogue engine is capable of providing
implicit and/or explicit confirmation of what a caller has said. With implicit
confirmation, the system tells the caller what values it recognized, before
giving the next prompt. With explicit confirmation, the system asks the caller
whether the recognized values are correct before using them. Implicit
confirmation allows the caller to move on quickly if the values are correct or
to provide correction at the risk of having to undo some work. Explicit
confirmation slows things down, but the application won't need to undo actions
taken because of an error. Judicious use of both types of confirmation helps
ensure that what the caller thinks is happening matches what the application is
doing.
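The difference between the two styles can be sketched as two ways of building the system's next utterance. The wording of the prompts and the function name are assumptions made for this example.

```python
def confirm(values, next_prompt, explicit):
    """Build the next system utterance using implicit or explicit confirmation."""
    heard = " and ".join(values)
    if explicit:
        # Explicit: ask before acting on the values.
        return f"I heard {heard}. Is that correct?"
    # Implicit: echo the values, then carry straight on with the next prompt.
    return f"OK, {heard}. {next_prompt}"

# Hypothetical recognized values and follow-up prompt.
values = ["Boston", "Friday"]
implicit_prompt = confirm(values, "How many passengers?", explicit=False)
explicit_prompt = confirm(values, "How many passengers?", explicit=True)
```

An application might reserve the explicit form for steps that are costly to undo, such as a payment, and use the implicit form elsewhere.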
Coping with errors:
An important aspect of any speech system is its
error handling. Recognition errors will occur, because it's not possible
for speech recognition to cope with all possible sources of error, such as
strong accents, noisy lines, background noise, coughing, repeated words, and the
use of words that don't occur in the system's vocabulary.
With more natural dialogue, error handling is doubly important to minimize the
risk of the system taking the wrong path. For this reason, various strategies
are often employed for minimizing errors. Examples include acoustic
disambiguation where an application will ask the user, "Did you say X, Y or
Z?" when it can't find a clear winner in the n-best results; acoustic
verification where a dialogue system will ask the user, "Did you say X?"
when the recognition confidence is low; and semantic disambiguation where
the system asks, "Did you mean X or Y?" if the input can be interpreted in more
than one way.
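These strategies could be selected roughly as in the sketch below. The confidence threshold, the margin that defines a "clear winner", and the function names are illustrative assumptions, not values taken from any particular system.

```python
def join_options(options):
    """Format ['X', 'Y', 'Z'] as 'X, Y or Z'."""
    if len(options) == 1:
        return options[0]
    return ", ".join(options[:-1]) + " or " + options[-1]

def acoustic_check(n_best, low_confidence=0.5, clear_margin=0.15):
    """Return a clarifying question, or None if the top result can be trusted."""
    ranked = sorted(n_best, key=lambda h: h[1], reverse=True)
    if not ranked:
        # Assumed fallback: nothing was recognized, so simply re-prompt.
        return "Sorry, I didn't catch that. Could you repeat it?"
    top_confidence = ranked[0][1]
    # Hypotheses scoring close to the best one count as rival winners.
    contenders = [t for t, c in ranked if top_confidence - c <= clear_margin]
    if len(contenders) > 1:
        return f"Did you say {join_options(contenders)}?"  # acoustic disambiguation
    if top_confidence < low_confidence:
        return f"Did you say {contenders[0]}?"  # acoustic verification
    return None

def semantic_check(interpretations):
    """Ask which meaning was intended when the input has more than one."""
    if len(interpretations) > 1:
        return f"Did you mean {join_options(interpretations)}?"
    return None

# Hypothetical recognition results.
no_clear_winner = acoustic_check([("austin", 0.41), ("boston", 0.38), ("houston", 0.05)])
low_confidence = acoustic_check([("denver", 0.40), ("dover", 0.10)])
confident = acoustic_check([("denver", 0.90), ("dover", 0.05)])
ambiguous = semantic_check(["your current account", "your savings account"])
```

In a real application the thresholds would need tuning against recorded calls, since asking too often slows the dialogue and asking too rarely lets errors through.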
Handing over to a human is still sometimes necessary:
A common problem with
speech systems is that callers find themselves trapped when things go wrong and
are unable to get to a human without starting over and risking being stuck
again. By keeping track of variables such as recognition confidences, how often
a caller has been re-prompted, and whether (or how often) the caller has tried
to correct the system, a speech application can monitor how well the dialogue is
going. Should things go badly, it can transfer over to a human agent.
A good system should also enable callers to get to a person
quickly if they need to. Ideally, the human agent will also receive information
about what the caller was trying to do.
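The monitoring described above might be sketched as a small tracker that is updated after every turn. The class name, thresholds, and summary fields are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueMonitor:
    """Track signs that a dialogue is going badly (thresholds are assumptions)."""
    max_reprompts: int = 3
    max_corrections: int = 2
    min_average_confidence: float = 0.4
    confidences: list = field(default_factory=list)
    reprompts: int = 0
    corrections: int = 0
    agent_requested: bool = False

    def record_turn(self, confidence, reprompted=False, corrected=False,
                    asked_for_agent=False):
        self.confidences.append(confidence)
        self.reprompts += int(reprompted)
        self.corrections += int(corrected)
        self.agent_requested = self.agent_requested or asked_for_agent

    def should_transfer(self):
        if self.agent_requested:
            return True  # the caller can always get to a person
        if self.reprompts >= self.max_reprompts:
            return True
        if self.corrections >= self.max_corrections:
            return True
        # Only judge average confidence once there are a few turns to go on.
        if len(self.confidences) >= 3:
            average = sum(self.confidences) / len(self.confidences)
            return average < self.min_average_confidence
        return False

    def handover_summary(self, task, filled_values):
        """Information passed to the human agent along with the call."""
        return {
            "task": task,
            "values": dict(filled_values),
            "reprompts": self.reprompts,
            "corrections": self.corrections,
        }

# Hypothetical call: one good turn, then three re-prompts in a row.
monitor = DialogueMonitor()
monitor.record_turn(0.8)
for _ in range(3):
    monitor.record_turn(0.6, reprompted=True)
transfer_now = monitor.should_transfer()
summary = monitor.handover_summary("pay bill", {"account": "12345"})
```

Passing the summary along with the call means the agent does not have to ask the caller to start again from the beginning.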
Handing the call over to a person is still necessary at times, because no matter
how good the recognition or dialogue engines are, they may not be able to cope
with a strong accent, or a bad line, or a caller using words that are not in the
system's vocabulary, and then the dialogue will break down. This will leave
callers frustrated unless they can speak to a human who can then deal with the
situation.
Conclusion:
Menu-driven speech applications are well
established, and it is now possible to improve these applications by allowing
more flexible forms of dialogue. This can lead to a better caller experience,
make speech applications easier to use, and further reduce the burden on
human contact center agents. However, adequate error-handling mechanisms and
the ability to transfer to a human agent must be provided, or callers will
become frustrated.
Dr. James Hammerton is a natural language
processing and artificial intelligence specialist at Graham Technology (www.grahamtechnology.com)
and the principal researcher behind the dialogue engine in Graham Technology's
agent247 technology, an extension to the company's business-process modeling
system, GT-X7, which enables business processes to be driven via speech or
textual input from the user.