VoiceXML application structure
The VoiceXML specification provides ways to request and control prompts and speech recognition:
- Define the flow of a dialog (within a <form> element).
- Specify text to be spoken to the caller (using the <prompt> element).
- Specify needed speech grammars (using the <grammar> element).
- Configure recognition processing. For each recognition event, determine which grammars/models to use, set timers, define the bargein state (using the <property> element).
- Request recognition of the collected speech (using the <field> element).
- Receive recognition results (in the VoiceXML variable application.lastresult$) and process them for further action. See Getting recognition results.
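For instance, a filled handler can read the application.lastresult$ shadow variable after a successful recognition. A minimal sketch (the field name and grammar file here are hypothetical, not part of the examples below):

```xml
<field name="weather_city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>For which city?</prompt>
  <filled>
    <!-- application.lastresult$ holds the most recent recognition result. -->
    <log>Heard: <value expr="application.lastresult$.utterance"/>
         with confidence: <value expr="application.lastresult$.confidence"/></log>
  </filled>
</field>
```

The utterance, confidence, and interpretation properties are defined by VoiceXML 2.0 and are available regardless of which recognizer produced the result.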
Example: directed dialog using Nuance Recognizer
The following simplified example illustrates a common way in which VoiceXML implements these actions. The example provides the current weather for a requested city and state using Nuance Recognizer.
- The application sets the grammar to be used for this session (cityandstate.grxml).
- The application begins with an announcement and an advertisement for the service. The caller is prohibited from interrupting the ad.
- Following the welcome ad, the application asks for the city and state for which the caller wants to know the weather. If the caller does not respond within 5 seconds, the prompt is repeated twice.
- If there is still no answer, the application asks again, more specifically. If the caller provides the state, the application asks for the city. Note that the request for the city name includes the state that the application understood from the preceding question, as a way of confirming it.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.w3.org/TR/voicexml20/vxml.xsd">
<form id="weather_info">
<grammar src="cityandstate.grxml" type="application/srgs+xml"/>
<!-- Caller can't barge in on today's advertisement. -->
<block>
<prompt bargein="false">
Welcome to the weather information service.
<audio src="http://www.online-ads.example.com/wis.wav"/>
</prompt>
</block>
<initial name="start">
<property name="timeout" value="5s"/>
<prompt bargein="true">
For what city and state would you like the weather?
</prompt>
.
.
.
<!-- If user is silent, reprompt once, then try directed prompts. -->
<noinput count="1"> <reprompt/></noinput>
<noinput count="2"> <reprompt/> <assign name="start" expr="true"/>
</noinput>
</initial>
<field name="state">
<prompt>What state?</prompt>
</field>
<field name="city">
<prompt>Please say the city in <value expr="state"/>
for which you want the weather.
</prompt>
</field>
</form>
</vxml>

This MRCP excerpt corresponds to the beginning of the VoiceXML example.
An MRCPv2 SPEAK request initiates speech.
Client->Server:
MRCP/2.0 386 SPEAK 543257
Channel-Identifier:32AECB23433801@speechsynth
Kill-On-Barge-In:false
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>Welcome to the weather information service.</s>
<audio src="http://www.online-ads.example.com/wis.wav"/>
</p>
</speak>
Server->Client:
MRCP/2.0 49 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857205015059
The synthesizer finishes the SPEAK request.
S->C:
MRCP/2.0 48 SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857207685213
The client requests the next prompt.
C->S:
MRCP/2.0 386 SPEAK 543257
Channel-Identifier:32AECB23433801@speechsynth
Kill-On-Barge-In:true
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<!-- The 5-second no-input timeout is not part of SSML; it is conveyed
     in the No-Input-Timeout header of the RECOGNIZE request. -->
<s>For what city and state would you like the weather?</s>
</p>
</speak>
S->C:
MRCP/2.0 49 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857205015059
The synthesizer finishes the SPEAK request.
S->C:
MRCP/2.0 48 SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857207685213
The recognizer is issued a request to listen for the caller's response.
C->S:
MRCP/2.0 343 RECOGNIZE 543258
Channel-Identifier:32AECB23433801@speechrecog
No-Input-Timeout: 5000
Content-Type:application/srgs+xml
Content-Length:...
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="en-US" version="1.0" root="request">
</grammar>
S->C:
MRCP/2.0 49 543258 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechrecog
.
.
.
Example: raw recognition with Krypton only
Note: Nuance Recognizer and Dragon Voice applications require different artifacts; they do not share artifacts. To create Dragon Voice artifacts, contact Nuance for access to the Nuance Command Line Interface, Nuance Experience Studio, or Nuance Mix Tools.
Note: The content in this topic is for Dragon Voice in on-premises deployments.
This rudimentary example begins with a prompt, collects the information provided by the caller, and ends.
- In preparation for recognizing collected speech, the VoiceXML document loads a domain language model (DLM) and two wordsets (which expand the vocabulary of the DLM) into the Krypton engine. For details, see Triggering the Dragon Voice recognizer.
- A greeting collects input from the caller. The prompt is open-ended: "This is a test. Please speak."
- If the caller says anything that is recognized, the VoiceXML document disconnects (exits).
- If the caller says nothing, or if the speech is not recognized, the VoiceXML document repeats the prompt.

<?xml version="1.0"?>
<!DOCTYPE vxml
PUBLIC "-//W3C//DTD VOICEXML 2.1//EN"
"http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.dtd">
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
<meta http-equiv="Cache-Control" content="no-cache"/>
<!-- Test reco-only feature-->
<form id="test">
<field name="dow">
<grammar src="http://base_path/dlm.zip?nlptype=krypton&amp;dlm_weight=0.7"/>
<grammar src="http://base_path/myWordset1.json?nlptype=wordset"/>
<grammar src="http://base_path/myWordset2.json?nlptype=wordset"/>
<prompt count="1">This is a test. Please speak.</prompt>
<catch event="nomatch noinput">
I'm sorry. I didn't get that.
<reprompt/>
</catch>
<filled>
<prompt>Done</prompt>
<goto next="#done"/>
</filled>
</field>
</form>
<form id="done">
<block>
<exit/>
</block>
</form>
</vxml>
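The wordset files referenced above (myWordset1.json, myWordset2.json) supply additional vocabulary to the DLM at runtime. The exact schema depends on your Nuance tooling and version; the sketch below assumes a Mix-style wordset, and the entity name, literals, and spoken forms are purely illustrative:

```json
{
  "CITY": [
    { "literal": "La Jolla", "spoken": ["la hoya"] },
    { "literal": "Worcester", "spoken": ["wooster"] }
  ]
}
```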

Here is a corresponding MRCP example of the DEFINE-GRAMMAR method for recognition. It points to the domain language model and two wordsets:
MRCP/2.0 337 DEFINE-GRAMMAR 5
Content-Length: 86
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7
http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset2.json?nlptype=wordset
For Krypton-only recognition, you must provide one grammar element in the VoiceXML document for the domain language model, plus one for each wordset. The voice browser sends one DEFINE-GRAMMAR message to the Speech Server for the model, plus one for each wordset file.
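For example, the DEFINE-GRAMMAR message for the first wordset has the same shape as the model message shown earlier; the request ID and length values below are illustrative placeholders, not taken from a real trace:

```
MRCP/2.0 330 DEFINE-GRAMMAR 6
Content-Length: 79
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/myWordset1.json?nlptype=wordset
```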
Below is a sample MRCP excerpt for the RECOGNIZE method. It points to the model and two wordsets:
MRCP/2.0 1518 RECOGNIZE 9
Content-Length: 483
Cancel-If-Queue: true
Start-Input-Timers: false
Channel-Identifier: 1@speechrecog
Content-Type: text/grammar-ref-list
Content-Id: 1518125471954

<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000>
Example: open dialog using Dragon Voice
Note: Nuance Recognizer and Dragon Voice applications require different artifacts; they do not share artifacts. To create Dragon Voice artifacts, contact Nuance for access to the Nuance Command Line Interface, Nuance Experience Studio, or Nuance Mix Tools.
This example begins with an open-ended prompt, collects the information provided by the caller, and uses a directed dialog to prompt for each of the remaining information slots. The scenario for this VoiceXML page is a banking application where callers transition to this page after indicating the desire to make a payment. The page collects the processing information: amount, date, payee, and account. The page can collect multiple slots per dialog turn, and the caller can change slot values at any time.
- The VoiceXML document loads the models and dynamic content (in this case, two wordsets) into the Dragon Voice engines in preparation for recognizing and interpreting the collected speech. For details, see Triggering the Dragon Voice recognizer.
- A greeting collects input from the caller. The prompt is open-ended: "Thank you for calling, how can I help you?"
- If the caller provides all of the information needed to satisfy the dialog (all fields filled), confirmation follows.
For example, to the initial prompt a caller might say "Pay thirty dollars to my Visa from checking on February first 2018." This utterance fills all slots (provides all entity values). Confirmation follows: "Thanks, we'll pay Visa thirty dollars from account 123412341234 on February first, two thousand eighteen."
If the caller does not provide all of the information needed, he or she is prompted to provide the missing information. If the utterance fills two of the slots—the caller says, for example, "Pay fifty dollars to Visa" (thus filling the AMOUNT and PAYEE slots)—he or she is prompted to provide the remaining pieces of information (the date as per the DATE field, and the account from which to make the payment as per the FROM_ACCOUNT field).
Similarly, if the caller says simply "Pay Visa", he or she will be prompted for the amount, date, and account.
- When collection is complete, the VoiceXML document confirms the information collected (AMOUNT, PAYEE, DATE, and FROM_ACCOUNT slots) and disconnects (exits).

<?xml version="1.0"?>
<!DOCTYPE vxml
PUBLIC "-//W3C//DTD VOICEXML 2.1//EN"
"http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.dtd">
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
<var name="concepts" expr="new Object()"/>
<var name="intent"/>
<form id="openDialog">
<grammar src="http://base_path/nle_obfuscated.zip?nlptype=nle"/>
<grammar src="http://base_path/dlm.zip?nlptype=krypton&amp;dlm_weight=0.7"/>
<grammar src="http://base_path/PAYEE_wordset.json?nlptype=wordset"/>
<grammar src="http://base_path/FROM_ACCOUNT_wordset.json?nlptype=wordset"/>
<initial cond="intent==undefined" name="start_od">
<prompt count="1">Thank you for calling, how can I help you?</prompt>
<prompt count="2" cond="INTENT!=undefined">How much would you like to pay?</prompt>
<prompt count="2">Please say something like "I want to pay five hundred dollars to visa".</prompt>
<nomatch count="1">
I'm sorry, I didn't understand, how can I help you?
<reprompt/>
</nomatch>
<noinput count="1">
I'm sorry. I didn't get that, how can I help you?
<reprompt/>
</noinput>
</initial>
<field name="INTENT" cond="false">
</field>
<field name="AMOUNT">
<prompt>How much would you like to pay?</prompt>
<filled>
<assign name="concepts.AMOUNT" expr="AMOUNT"/>
</filled>
</field>
<field name="DATE">
<prompt>When would you like to make the payment?</prompt>
<filled>
<if cond="DATE.indexOf('?')!=-1 || DATE.length!=8">
<clear namelist="DATE"/>
<prompt>Please provide a full date, such as March 5th, 2019</prompt>
<else/>
<assign name="concepts.DATE" expr="DATE"/>
</if>
</filled>
</field>
<field name="PAYEE">
<prompt>Who would you like to pay?</prompt>
<filled>
<assign name="concepts.PAYEE" expr="PAYEE"/>
</filled>
</field>
<field name="FROM_ACCOUNT">
<prompt>Which account would you like to use to make the payment?</prompt>
<filled>
<assign name="concepts.FROM_ACCOUNT" expr="FROM_ACCOUNT"/>
</filled>
</field>
<filled mode="all" namelist="AMOUNT DATE PAYEE FROM_ACCOUNT">
<prompt>
Thanks, we'll pay <value expr="concepts.PAYEE"/>
$<value expr="concepts.AMOUNT"/>
from account <say-as interpret-as="digits"><value expr="concepts.FROM_ACCOUNT"/></say-as>
on <say-as interpret-as="date"><value expr="concepts.DATE"/></say-as>
</prompt>
<goto next="#done"/>
</filled>
</form>
<form id="done">
<block>
<exit/>
</block>
</form>
</vxml>

Here is a corresponding MRCP example for the DEFINE-GRAMMAR method for interpretation (NLE semantic model):
MRCP/2.0 321 DEFINE-GRAMMAR 4
Content-Length: 78
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle
Here is the DEFINE-GRAMMAR MRCP message for recognition. It points to the domain language model:
MRCP/2.0 337 DEFINE-GRAMMAR 5
Content-Length: 86
Channel-Identifier: 1@speechrecog
Content-Type: text/uri-list
Content-Id: http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000
Fetch-Timeout: 10000

http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7
For semantic recognition, you must provide two grammar elements in the VoiceXML document, one for NLE and one for Krypton. Therefore, the voice browser will send at least two DEFINE-GRAMMAR messages to the Speech Server, and one additional message for each wordset file that you include.
Below is a sample MRCP excerpt for the RECOGNIZE method. Notice that it points to the models and the wordsets.
MRCP/2.0 1518 RECOGNIZE 9
Content-Length: 483
Cancel-If-Queue: true
Start-Input-Timers: false
Channel-Identifier: 1@speechrecog
Content-Type: text/grammar-ref-list
Content-Id: 1518125471954

<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/nle_obfuscated.zip?nlptype=nle -1 -1 10000>;weight="329"
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/dlm.zip?nlptype=krypton&dlm_weight=0.7 -1 -1 10000>;weight="329"
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/PAYEE_wordset.json?nlptype=wordset>
<session:http://mt-nr11-myplatform-c01:8090/vxml_sample/FROM_ACCOUNT_wordset.json?nlptype=wordset>