VoiceXML application structure
The VoiceXML specification provides ways to request and control prompts and speech recognition:
- Define the flow of a dialog (within a <form> element).
- Specify text to be spoken to the caller (using the <prompt> element).
- Specify needed speech grammars (using the <grammar> element).
- Configure recognition processing. For each recognition event, determine which grammars/models to use, set timers, define the bargein state (using the <property> element).
- Request recognition of the collected speech (using the <field> element).
- Receive recognition results (in the VoiceXML variable application.lastresult$) and process them for further action. See Getting recognition results.
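For instance, a filled handler can read the application.lastresult$ shadow variable after a successful recognition. A minimal sketch (the field name and grammar file here are hypothetical, not part of the examples below):

```xml
<field name="weather_city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>For which city?</prompt>
  <filled>
    <!-- application.lastresult$ holds the most recent recognition result. -->
    <log>Heard: <value expr="application.lastresult$.utterance"/>
         with confidence: <value expr="application.lastresult$.confidence"/></log>
  </filled>
</field>
```

The utterance, confidence, and interpretation properties are defined by VoiceXML 2.0 and are available regardless of which recognizer produced the result.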
Example: directed dialog using Nuance Recognizer
The following simplified example illustrates a common way in which VoiceXML implements these actions. The example provides the current weather for a requested city and state using Nuance Recognizer.
- The application sets the grammar to be used for this session (cityandstate.grxml).
- The application begins with an announcement and an advertisement for the service. The caller is prohibited from interrupting the ad.
- Following the welcome ad, the application asks for the city and state for which the caller wants to know the weather. If the caller does not respond within 5 seconds, the prompt is repeated twice.
- If there is still no answer, the application asks again, more specifically. If the caller provides the state, the application asks for the city. Note that the request for the city name includes the state that the application understood from the preceding question, as a way of confirming it.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.w3.org/TR/voicexml20/vxml.xsd">
<form id="weather_info">
<grammar src="cityandstate.grxml" type="application/srgs+xml"/>
<!-- Caller can't barge in on today's advertisement. -->
<block>
<prompt bargein="false">
Welcome to the weather information service.
<audio src="http://www.online-ads.example.com/wis.wav"/>
</prompt>
</block>
<initial name="start">
<property name="timeout" value="5s"/>
<prompt bargein="true">
For what city and state would you like the weather?
</prompt>
.
.
.
<!-- If user is silent, reprompt once, then try directed prompts. -->
<noinput count="1"> <reprompt/></noinput>
<noinput count="2"> <reprompt/> <assign name="start" expr="true"/>
</noinput>
</initial>
<field name="state">
<prompt>What state?</prompt>
</field>
<field name="city">
<prompt>Please say the city in <value expr="state"/>
for which you want the weather.
</prompt>
</field>
</form>
</vxml>

This MRCP excerpt corresponds to the beginning of the VoiceXML example.
An MRCPv2 SPEAK request initiates speech.
Client->Server:
MRCP/2.0 386 SPEAK 543257
Channel-Identifier:32AECB23433801@speechsynth
Kill-On-Barge-In:false
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>Welcome to the weather information service.</s>
<audio src="http://www.online-ads.example.com/wis.wav"/>
</p>
</speak>
Server->Client:
MRCP/2.0 49 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857205015059
The synthesizer finishes the SPEAK request.
S->C:
MRCP/2.0 48 SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857207685213
The client requests the next prompt.
C->S:
MRCP/2.0 386 SPEAK 543257
Channel-Identifier:32AECB23433801@speechsynth
Kill-On-Barge-In:true
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<!-- The 5-second no-input timeout is not part of SSML; it is conveyed
     in the No-Input-Timeout header of the RECOGNIZE request. -->
<s>For what city and state would you like the weather?</s>
</p>
</speak>
S->C:
MRCP/2.0 49 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857205015059
The synthesizer finishes the SPEAK request.
S->C:
MRCP/2.0 48 SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857207685213
The recognizer is issued a request to listen for the caller's response.
C->S:
MRCP/2.0 343 RECOGNIZE 543258
Channel-Identifier:32AECB23433801@speechrecog
No-Input-Timeout: 5000
Content-Type:application/srgs+xml
Content-Length:...
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="en-US" version="1.0" root="request">
</grammar>
S->C:
MRCP/2.0 49 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechrecog
.
.
.
Example: raw recognition with Krypton only
Note: Nuance Recognizer and Dragon Voice applications require different artifacts; they do not share artifacts. To create Dragon Voice artifacts, contact Nuance for access to the Nuance Command Line Interface, Nuance Experience Studio, or Nuance Mix Tools.
Note: The content in this topic is for Dragon Voice in on-premises deployments.
This rudimentary example begins with a prompt, collects the information provided by the caller, and ends.
- In preparation for recognizing collected speech, the VoiceXML document loads a domain language model (DLM) and two wordsets (which expand the vocabulary of the DLM) into the Krypton engine. For details, see Triggering the Dragon Voice recognizer.
- A greeting collects input from the caller. The prompt is open-ended: "This is a test. Please speak."
- If the caller says anything that is recognized, the VoiceXML document disconnects (exits).
- If the caller says nothing, or if the speech is not recognized, the VoiceXML document repeats the prompt.

<?xml version="1.0"?>
<!DOCTYPE vxml
PUBLIC "-//W3C//DTD VOICEXML 2.1//EN"
"http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.dtd">
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
<meta http-equiv="Cache-Control" content="no-cache"/>
<!-- Test reco-only feature-->
<form id="test">
<field name="dow">
<grammar src="http://base_path/dlm.zip?nlptype=krypton&amp;dlm_weight=0.7"/>
<grammar src="http://base_path/myWordset1.json?nlptype=wordset"/>
<grammar src="http://base_path/myWordset2.json?nlptype=wordset"/>
<prompt count="1">This is a test. Please speak.</prompt>
<catch event="nomatch noinput">
I'm sorry. I didn't get that.
<reprompt/>
</catch>
<filled>
<prompt>Done</prompt>
<goto next="#done"/>
</filled>
</field>
</form>
<form id="done">
<block>
<exit/>
</block>
</form>
</vxml>
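The wordset files referenced above (myWordset1.json, myWordset2.json) supply additional vocabulary to the DLM at runtime. The exact schema depends on your Nuance tooling and version; the sketch below assumes a Mix-style wordset, and the entity name, literals, and spoken forms are purely illustrative:

```json
{
  "CITY": [
    { "literal": "La Jolla", "spoken": ["la hoya"] },
    { "literal": "Worcester", "spoken": ["wooster"] }
  ]
}
```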

Here is a corresponding MRCP example of the DEFINE-GRAMMAR method for recognition. It points to the domain language model and two wordsets:
MRCP/2.0 337 DEFINE-GRAMMAR 5
Content-Length: 86
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7
http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset2.json?nlptype=wordset
For Krypton-only recognition, you must provide one grammar element in the VoiceXML document for the domain language model, plus one for each wordset. The voice browser sends one DEFINE-GRAMMAR message to the Speech Server for the model, plus one for each wordset file.
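For example, the DEFINE-GRAMMAR message for the first wordset has the same shape as the model message shown earlier; the request ID and length values below are illustrative placeholders, not taken from a real trace:

```
MRCP/2.0 330 DEFINE-GRAMMAR 6
Content-Length: 79
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
```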
Below is a sample MRCP excerpt for the RECOGNIZE method. It points to the model and two wordsets:
MRCP/2.0 1518 RECOGNIZE 9
Content-Length: 483
Cancel-If-Queue: true
Start-Input-Timers: false
Channel-Identifier: 1@speechrecog
Content-Type: text/grammar-ref-list
Content-Id: 1518125471954

<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000>
Example: open dialog using Dragon Voice
Note: Nuance Recognizer and Dragon Voice applications require different artifacts; they do not share artifacts. To create Dragon Voice artifacts, contact Nuance for access to the Nuance Command Line Interface, Nuance Experience Studio, or Nuance Mix Tools.
This example begins with an open-ended prompt, collects the information provided by the caller, and uses a directed dialog to prompt for each of the remaining information slots. The scenario for this VoiceXML page is a banking application where callers transition to this page after indicating the desire to make a payment. The page collects the processing information: amount, date, payee, and account. The page can collect multiple slots per dialog turn, and the caller can change slot values at any time.
- The VoiceXML document loads the models and dynamic content (in this case, two wordsets) into the Dragon Voice engines in preparation for recognizing and interpreting the collected speech. For details, see Triggering the Dragon Voice recognizer.
- A greeting collects input from the caller. The prompt is open-ended: "Thank you for calling, how can I help you?"
- If the caller provides all of the information needed to satisfy the dialog (all fields filled), confirmation follows.
For example, to the initial prompt a caller might say "Pay thirty dollars to my Visa from checking on February first 2018." This utterance fills all slots (provides all entity values). Confirmation follows: "Thanks, we'll pay Visa thirty dollars from account 123412341234 on February first, two thousand eighteen."
If the caller does not provide all of the information needed, he or she is prompted to provide the missing information. If the utterance fills two of the slots—the caller says, for example, "Pay fifty dollars to Visa" (thus filling the AMOUNT and PAYEE slots)—he or she is prompted to provide the remaining pieces of information (the date as per the DATE field, and the account from which to make the payment as per the FROM_ACCOUNT field).
Similarly, if the caller says simply "Pay Visa", he or she will be prompted for the amount, date, and account.
- When collection is complete, the VoiceXML document confirms the information collected (AMOUNT, PAYEE, DATE, and FROM_ACCOUNT slots) and disconnects (exits).

<?xml version="1.0"?>
<!DOCTYPE vxml
PUBLIC "-//W3C//DTD VOICEXML 2.1//EN"
"http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.dtd">
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
<var name="concepts" expr="new Object()"/>
<var name="intent"/>
<form id="openDialog">
<grammar src="http://base_path/nle_obfuscated.zip?nlptype=nle"/>
<grammar src="http://base_path/dlm.zip?nlptype=krypton&amp;dlm_weight=0.7"/>
<grammar src="http://base_path/PAYEE_wordset.json?nlptype=wordset"/>
<grammar src="http://base_path/FROM_ACCOUNT_wordset.json?nlptype=wordset"/>
<initial cond="intent==undefined" name="start_od">
<prompt count="1">Thank you for calling, how can I help you?</prompt>
<prompt count="2" cond="INTENT!=undefined">How much would you like to pay?</prompt>
<prompt count="2">Please say something like "I want to pay five hundred dollars to visa".</prompt>
<nomatch count="1">
I'm sorry, I didn't understand, how can I help you?
<reprompt/>
</nomatch>
<noinput count="1">
I'm sorry. I didn't get that, how can I help you?
<reprompt/>
</noinput>
</initial>
<field name="INTENT" cond="false">
</field>
<field name="AMOUNT">
<prompt>How much would you like to pay?</prompt>
<filled>
<assign name="concepts.AMOUNT" expr="AMOUNT"/>
</filled>
</field>
<field name="DATE">
<prompt>When would you like to make the payment?</prompt>
<filled>
<if cond="DATE.indexOf('?')!=-1 || DATE.length!=8">
<clear namelist="DATE"/>
<prompt>Please provide a full date, such as March 5th, 2019</prompt>
<else/>
<assign name="concepts.DATE" expr="DATE"/>
</if>
</filled>
</field>
<field name="PAYEE">
<prompt>Who would you like to pay?</prompt>
<filled>
<assign name="concepts.PAYEE" expr="PAYEE"/>
</filled>
</field>
<field name="FROM_ACCOUNT">
<prompt>Which account would you like to use to make the payment?</prompt>
<filled>
<assign name="concepts.FROM_ACCOUNT" expr="FROM_ACCOUNT"/>
</filled>
</field>
<filled mode="all" namelist="AMOUNT DATE PAYEE FROM_ACCOUNT">
<prompt>
Thanks, we'll pay <value expr="concepts.PAYEE"/>
$<value expr="concepts.AMOUNT"/>
from account <say-as interpret-as="digits"><value expr="concepts.FROM_ACCOUNT"/></say-as>
on <say-as interpret-as="date"><value expr="concepts.DATE"/></say-as>
</prompt>
<goto next="#done"/>
</filled>
</form>
<form id="done">
<block>
<exit/>
</block>
</form>
</vxml>

Here is a corresponding MRCP example for the DEFINE-GRAMMAR method for interpretation (NLE semantic model):
MRCP/2.0 321 DEFINE-GRAMMAR 4
Content-Length: 78
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle
Here is the DEFINE-GRAMMAR MRCP message for recognition. It points to the domain language model:
MRCP/2.0 337 DEFINE-GRAMMAR 5
Content-Length: 86
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7
For semantic recognition, you must provide two grammar elements in the VoiceXML document, one for NLE and one for Krypton. Therefore, the voice browser will send at least two DEFINE-GRAMMAR messages to the Speech Server, and one additional message for each wordset file that you include.
Below is a sample MRCP excerpt for the RECOGNIZE method. Notice that it points to the models and the wordsets.
MRCP/2.0 1518 RECOGNIZE 9
Content-Length: 483
Cancel-If-Queue: true
Start-Input-Timers: false
Channel-Identifier: 1@speechrecog
Content-Type: text/grammar-ref-list
Content-Id: 1518125471954

<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle -1 -1 10000>;weight="329"
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000>;weight="329"
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/PAYEE_wordset.json?nlptype=wordset>
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/FROM_ACCOUNT_wordset.json?nlptype=wordset>