N-gram grammars
Recognizer supports grammar syntax for n-gram (Markovian) stochastic grammars in VoiceXML.
Normally, you set the n-gram order of an SLM with the ngram_order parameter in the SLM training file header; this is the recommended method. However, Recognizer also supports the Stochastic Language Models (N-Gram) Specification W3C draft proposal (3 January 2001). There are important format differences between the two supported forms:
- SWIlanguageModel n-gram specified in an SRGS grammar: Provided for backwards compatibility with old releases of the OpenSpeech Recognizer (OSR). This form of the n-gram grammar is used to specify a language model for an SRGS grammar. See SWIlanguageModel n-gram.
- Standalone n-gram grammar as used by Recognizer:
N-gram language models can be used to predict the likelihood that sequences of words, such as word pairs (bigrams) or word triples (trigrams), will be spoken as part of a user utterance. Like other natural language models, an n-gram language model is constructed using a large sampling of training sentences that display the characteristics expected in regular user input.
The grammar compiler can write an n-gram grammar when compiling Statistical Language Models (SLMs). See Compiling n-gram grammars.
Syntax and standards
Recognizer supports the vocabulary (<vocab>) and count tree (<tree>) elements of the W3C n-gram draft proposal; because the proposal allows a broad interpretation of these elements, Recognizer also extends the specification to include meta tags (<meta>) and pronunciation dictionaries (<lexicon>).
- The W3C draft allows for arbitrary n-grams; Recognizer is limited to bigrams and trigrams.
- Recognizer does not allow direct input of weights. Instead, count structures are used as input (ideally from a training corpus), and these are computed into the probabilities used for recognition.
- For details, see the W3C proposal.
Media types and import guidelines
The media type for n-gram source grammars is:
application/x-swi-ngram+xml
The media type for a compiled n-gram grammar is the same as for SRGS:
application/x-swi-grammar
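For example, a VoiceXML application could reference an n-gram source grammar with the standard <grammar> element and this media type. This is a minimal sketch; the URI and filename are placeholders:
<grammar src="http://www.example.com/mytrigram.ngxml"
         type="application/x-swi-ngram+xml"/>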
Below are guidelines for importing:
- An n-gram grammar has an implicit root rule, so an SRGS grammar can import it directly (see the sketch after this list).
- An n-gram grammar may import SRGS rules, which are then modeled as lexical tokens with unigram, bigram, and trigram counts and weights.
- Recursive importing between the same SRGS and n-gram grammars is not allowed (and results in an error).
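As an illustration of the first guideline, the sketch below shows a hypothetical SRGS grammar that references a standalone n-gram grammar; because the n-gram grammar has an implicit root rule, no fragment identifier is needed. The filenames are placeholders:
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-us" root="main">
  <rule id="main" scope="public">
    <!-- The n-gram grammar's implicit root rule is referenced directly -->
    <ruleref uri="mytrigram.ngxml"/>
  </rule>
</grammar>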
Compiling n-gram grammars
You can compile, load, and activate n-gram grammars like any other speech grammar. For example, the following command compiles a grammar and produces a file mytrigram.gram:
sgc mytrigram.ngxml
Note: Precompile the parent SRGS grammar for SWIlanguageModel n-grams.
The following command produces a binary grammar for the example in SWIlanguageModel n-gram:
sgc grammar.grxml
When you train a Statistical Language Model (SLM), the sgc compiler creates an intermediate form of the grammar with n-gram count information. Using the -dump_ngram_grammar switch, you can save this information as an n-gram XML file.
The following command compiles an SLM training file and writes the n-gram XML grammar (sfgram.ngxml):
sgc -dump_ngram_grammar -train sfgram.xml
Note: There is no n-gram XML form of Statistical Semantic Models (SSMs).
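Putting these steps together, one possible workflow (the filename is illustrative) is to train the SLM and dump the intermediate n-gram XML, and then, if you want to inspect or adjust the counts, compile that file into a binary grammar:
sgc -dump_ngram_grammar -train sfgram.xml
sgc sfgram.ngxml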
Elements and attributes
An n-gram grammar contains the following elements and attributes:
Element | Attributes |
---|---|
<N-Gram> | xml:lang (optional) |
<meta> | name, content |
<lexicon> | uri, xml:lang (optional) |
<vocab> | (none) |
<token> | index, xml:lang (optional) |
<ruleref> | uri, type (optional), xml:lang (optional) |
<tree> | (none) |
<node> | (none) |

The example below shows a small but syntactically correct n-gram grammar:
<N-Gram>
<vocab>
<token index="1">-pau-</token>
<token index="2">A</token>
<token index="3">-pau2-</token>
</vocab>
<tree>
<node>3 500</node> <!-- root -->
<node>1 100</node>
<node>2 300</node>
<node>3 100</node>
</tree>
</N-Gram>
Other than the sentence begin and end tokens (-pau- and -pau2-, respectively), this grammar has only one word, "A". The first node element says there are three words with a total count of 500. There were 100 occurrences of the sentence begin token, 300 occurrences of "A", and 100 occurrences of the sentence end token. This implies that there were 100 training sentences, each containing three occurrences of the word "A" on average.
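To suggest how bigram counts fit into this layout, the hypothetical fragment below extends the example, assuming the depth-first node layout described under N-gram format and Count tree (<tree>), where an optional second field on a node gives the number of successor nodes that immediately follow it. The bigram counts are invented for illustration:
<tree>
<node>3 500</node> <!-- root: 3 unigrams, total count 500 -->
<node>1 2 100</node> <!-- -pau-: count 100, followed by 2 bigram nodes -->
<node>2 80</node> <!-- -pau- followed by "A": count 80 -->
<node>3 20</node> <!-- -pau- followed by -pau2-: count 20 -->
<node>2 300</node> <!-- "A": count 300 -->
<node>3 100</node> <!-- -pau2-: count 100 -->
</tree>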
N-gram format
A simple, complete n-gram grammar appears below:
<N-Gram xml:lang="en-us">
[<meta name="name" content="content"/> …]
[<lexicon uri="[protocol:[//host/]][path/]file[?query]"/> …]
<vocab>
<token index="#" [xml:lang="en-us"]>
CDATA |
<ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
[xml:lang="en-us"] [type="media-type"] />
</token>
…
</vocab>
<tree>
<node> #num-uni #total-count </node> <!-- root -->
<node> #tokenindex [#succ-entities] #entitycount </node>
…
</tree>
</N-Gram>
N-gram header
Like any other XML document, an n-gram document begins with a header that specifies important global information about the document.

The first item in the n-gram document is the <N-Gram> element, which specifies the language with the xml:lang attribute.

The <meta> element is allowed in n-gram grammars, except in SWIlanguageModel n-gram grammars. Its implementation is taken directly from the SRGS specification. All <meta> configuration parameters allowed by Recognizer in SRGS grammars are also allowed in n-gram grammars unless otherwise noted (see below).
All <meta> elements must occur before the first <vocab> tag.
The following SRGS meta element names are ignored in n-gram grammars:
- swirec_compile_parser
- swirec_fsm_grammar
- swirec_fsm_wordlist
If a disallowed meta element appears, a warning ("Element <meta> incorrectly nested in n-Gram XML file <filename>") is written to the diagnostic log.
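For example, extending the small grammar shown earlier, a <meta> element would be placed before the <vocab> section as sketched below; the parameter name and value are placeholders rather than actual Recognizer parameters:
<N-Gram xml:lang="en-us">
<meta name="example-parameter" content="example-value"/> <!-- placeholder name/value -->
<vocab>
<token index="1">-pau-</token>
<token index="2">A</token>
<token index="3">-pau2-</token>
</vocab>
<tree>
<node>3 500</node>
<node>1 100</node>
<node>2 300</node>
<node>3 100</node>
</tree>
</N-Gram>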

Like the <meta> element, the <lexicon> element is allowed in n-gram grammars other than SWIlanguageModel n-gram grammars.
The <lexicon> element specifies a user pronunciation dictionary for an n-gram in the same way as it does for SRGS grammars. You can supply multiple dictionaries with multiple <lexicon> elements. All <lexicon> elements must occur before any <vocab> elements. They take the format:
<lexicon uri="[
protocol:[//
host/]][
path/]
file[?
query]"/>
For example:
<lexicon uri="http://foobar.com/userdict.xml?SWI.type=backup
"/>
N-gram document main body
The main body of an n-gram document consists of two sections:
- A vocabulary section, delimited by the <vocab> element, which defines all the words and imported rules that will appear in the n-gram tree.
- A tree section, delimited by the <tree> element, which represents the n-gram counts with <node> elements as described below.

The <vocab> section defines the words and imported rules represented in the tree. Each token is given a unique index number, which is used in the count tree. Sequential indexes are recommended for clarity, as they make it easier to follow the tree structure. No index or token may occur twice in the vocabulary. Format:
<vocab [xml:lang="en-us"]>
<token index="#" [xml:lang="en-us"]>
CDATA |
<ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
[xml:lang="en-us"] [type="media-type"] />
</token>
…
</vocab>
Notes:
- The token indexes must be integers greater than 0.
- The token indexes do not need to be sequential, but sequential indexes are recommended.
- No two vocabulary tokens can have the same index.
- No two indexes can have the same lexical token.
- Words with embedded spaces are allowed. Leading and trailing whitespace is removed, but whitespace inside the CDATA is preserved (for example, '<token index="30"> New York </token>' represents "New York").
- The '-pau-' and '-pau2-' tokens represent sentence beginnings (-pau-) and endings (-pau2-). While these tokens are not required to be modeled in the n-gram, it is highly recommended that they be listed. If they are not modeled, the sgc compiler automatically adds them with a unigram count of 0, and back-off weights are then calculated for those arcs. This automation is done only to produce a working language model; it is much better to model the beginning and ending probabilities of the true utterance.

<vocab>
<token index='1'>word1</token>
<token index='3'>
Word2
</token>
<token index='0'>-pau-</token> <!--ERROR: index less than 1. -->
<token index='6'>-pau2-</token>
<token index='20'>New York</token>
<token index='3'>Word3</token> <!--ERROR: repeat of index 3.-->
<token index='4B'>Word4</token> <!--ERROR: non-numeric index -->
<token index='2'> word1</token> <!--ERROR: repeat of 'word1' -->
<token index='5'> <ruleref uri="alphabet.grxml"/> </token>
<token index='12'> <ruleref uri="date.grxml#Months"/> </token>
<token index='9'> <ruleref uri=
"http://example.com/cities.cgi?Ohio"/> </token>
<token index='11'> <ruleref uri="bigram.ngxml"/> </token>
</vocab>

The event count tree (<tree>) is a depth-first representation of the raw unigram, bigram, and trigram counts. The tree syntax diverges from standard XML practice: it contains numbers (separated by whitespace) within the body of each <node> element. Taken as a whole, the nodes represent a flattened tree in which the structure is determined by specific number fields in predecessor nodes. This is done to avoid the significant overhead of a full XML tree structure. See Count tree (<tree>) for a full discussion.