N-gram grammars
Recognizer supports grammar syntax for n-gram (Markovian) stochastic grammars in VoiceXML.
Normally, you set the n-gram order of an SLM with the ngram_order parameter in the SLM training file header; this is the recommended method. However, Recognizer also supports the Stochastic Language Models (N-Gram) Specification W3C draft proposal (3 January 2001). There are important format differences between the two supported forms:
- SWIlanguageModel n-gram specified in an SRGS grammar: Provided for backwards compatibility with old releases of the OpenSpeech Recognizer (OSR). This form of the n-gram grammar is used to specify a language model for an SRGS grammar. See SWIlanguageModel n-gram.
- Standalone n-gram grammar as used by Recognizer:
N-gram language models can be used to predict the likelihood that sequences of words, such as word pairs (bigrams) or word triples (trigrams), will be spoken as part of a user utterance. Like other natural language models, an n-gram language model is constructed using a large sampling of training sentences that display the characteristics expected in regular user input.
The grammar compiler can write an n-gram grammar when compiling Statistical Language Models (SLMs). See Compiling n-gram grammars.
Syntax and standards
Recognizer supports the vocabulary (<vocab>) and count tree (<tree>) elements of the W3C n-gram draft proposal; because the proposal allows a broad interpretation of these elements, Recognizer also extends the specification to include meta tags (<meta>) and pronunciation dictionaries (<lexicon>).
- The W3C draft allows for arbitrary n-grams; Recognizer is limited to bigrams and trigrams.
- Recognizer does not allow direct input of weights. Instead, count structures are used as input (ideally from a training corpus), and these are computed into the probabilities used for recognition.
- For details, see the W3C proposal.
Media types and import guidelines
The media type for n-gram source grammars is:
application/x-swi-ngram+xml
The media type for a compiled n-gram grammar is the same as for SRGS:
application/x-swi-grammar
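For example, a VoiceXML application could reference an n-gram source grammar with the standard <grammar> element and this media type. This is a minimal sketch; the URI and filename are placeholders:
<grammar src="http://www.example.com/mytrigram.ngxml"
         type="application/x-swi-ngram+xml"/>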
Below are guidelines for importing:
- An n-gram grammar has an implicit root rule, so an SRGS grammar can import it directly (see the sketch after this list).
- An n-gram grammar may import SRGS rules, which are then modeled as lexical tokens with unigram, bigram, and trigram counts and weights.
- Recursive importing between the same SRGS and n-gram grammars is not allowed (and results in an error).
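As an illustration of the first guideline, the sketch below shows a hypothetical SRGS grammar that references a standalone n-gram grammar; because the n-gram grammar has an implicit root rule, no fragment identifier is needed. The filenames are placeholders:
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-us" root="main">
  <rule id="main" scope="public">
    <!-- The n-gram grammar's implicit root rule is referenced directly -->
    <ruleref uri="mytrigram.ngxml"/>
  </rule>
</grammar>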
Compiling n-gram grammars
You can compile, load, and activate n-gram grammars like any other speech grammar. For example, the following command compiles a grammar and produces a file mytrigram.gram:
sgc mytrigram.ngxml
Note: Precompile the parent SRGS grammar for SWIlanguageModel n-grams.
The following command produces a binary grammar for the example in SWIlanguageModel n-gram:
sgc grammar.grxml
When you train a Statistical Language Model (SLM), the sgc compiler creates an intermediate form of the grammar with n-gram count information. Using the -dump_ngram_grammar switch, you can save this information as an n-gram XML file.
The following command compiles an SLM training file and writes the n-gram XML grammar (sfgram.ngxml):
sgc -dump_ngram_grammar -train sfgram.xml
Note: There is no n-gram XML form of Statistical Semantic Models (SSMs).
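Putting these steps together, one possible workflow (the filename is illustrative) is to train the SLM and dump the intermediate n-gram XML, and then, if you want to inspect or adjust the counts, compile that file into a binary grammar:
sgc -dump_ngram_grammar -train sfgram.xml
sgc sfgram.ngxml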
Elements and attributes
An n-gram grammar contains the following elements and attributes:
Element | Attributes |
---|---|
<N-Gram> | xml:lang (optional) |
<meta> | name, content |
<lexicon> | uri, xml:lang (optional) |
<vocab> | (none) |
<token> | index, xml:lang (optional) |
<ruleref> | uri, type (optional), xml:lang (optional) |
<tree> | (none) |
<node> | (none) |

The example below shows a small but syntactically correct n-gram grammar:
<N-Gram>
<vocab>
<token index="1">-pau-</token>
<token index="2">A</token>
<token index="3">-pau2-</token>
</vocab>
<tree>
<node>3 500</node> <!-- root -->
<node>1 100</node>
<node>2 300</node>
<node>3 100</node>
</tree>
</N-Gram>
Other than the sentence begin and end tokens (-pau- and -pau2-, respectively), this grammar has only one word, "A". The first node element says there are three words with a total count of 500. There were 100 occurrences of the sentence begin token, 300 occurrences of "A", and 100 occurrences of the sentence end token. This implies that there were 100 training sentences, each containing three occurrences of the word "A" on average.
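To suggest how bigram counts fit into this layout, the hypothetical fragment below extends the example, assuming the depth-first node layout described under N-gram format and Count tree (<tree>), where an optional second field on a node gives the number of successor nodes that immediately follow it. The bigram counts are invented for illustration:
<tree>
<node>3 500</node> <!-- root: 3 unigrams, total count 500 -->
<node>1 2 100</node> <!-- -pau-: count 100, followed by 2 bigram nodes -->
<node>2 80</node> <!-- -pau- followed by "A": count 80 -->
<node>3 20</node> <!-- -pau- followed by -pau2-: count 20 -->
<node>2 300</node> <!-- "A": count 300 -->
<node>3 100</node> <!-- -pau2-: count 100 -->
</tree>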
N-gram format
A simple, complete n-gram grammar appears below:
<N-Gram xml:lang="en-us">
[<meta name="name" content="content"/> …]
[<lexicon uri="[protocol:[//host/]][path/]file[?query]"/> …]
<vocab>
<token index="#" [xml:lang="en-us"]>
CDATA |
<ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
[xml:lang="en-us"] [type="media-type"] />
</token>
…
</vocab>
<tree>
<node> #num-uni #total-count </node> <!-- root -->
<node> #tokenindex [#succ-entities] #entitycount </node>
…
</tree>
</N-Gram>
N-gram header
Like any other XML document, an n-gram document begins with a header that specifies important global information about the document.

The first item in the n-gram document is the <N-Gram> element, which specifies the language with the xml:lang attribute.

The <meta> element is allowed in n-gram grammars, except in SWIlanguageModel n-gram grammars. Its implementation is taken directly from the SRGS specification. All <meta> configuration parameters allowed by Recognizer in SRGS grammars are also allowed in n-gram grammars unless otherwise noted (see below).
All <meta> elements must occur before the first <vocab> tag.
The following SRGS meta element names are ignored in n-gram grammars:
- swirec_compile_parser
- swirec_fsm_grammar
- swirec_fsm_wordlist
If a disallowed meta element appears, a warning ("Element <meta> incorrectly nested in n-Gram XML file <filename>") is written to the diagnostic log.
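For example, extending the small grammar shown earlier, a <meta> element would be placed before the <vocab> section as sketched below; the parameter name and value are placeholders rather than actual Recognizer parameters:
<N-Gram xml:lang="en-us">
<meta name="example-parameter" content="example-value"/> <!-- placeholder name/value -->
<vocab>
<token index="1">-pau-</token>
<token index="2">A</token>
<token index="3">-pau2-</token>
</vocab>
<tree>
<node>3 500</node>
<node>1 100</node>
<node>2 300</node>
<node>3 100</node>
</tree>
</N-Gram>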

Like the <meta> element, the <lexicon> element is allowed in n-gram grammars other than SWIlanguageModel n-gram grammars.
The <lexicon> element specifies a user pronunciation dictionary for an n-gram in the same way as it does for SRGS grammars. You can supply multiple dictionaries with multiple <lexicon> elements. All <lexicon> elements must occur before any <vocab> elements. They take the format:
<lexicon uri="[
protocol:[//
host/]][
path/]
file[?
query]"/>
For example:
<lexicon uri="http://foobar.com/userdict.xml?SWI.type=backup
"/>
N-gram document main body
The main body of an n-gram document consists of two sections:
- A vocabulary section, delimited by the <vocab> element, which defines all the words and imported rules that will appear in the n-gram tree.
- A tree section, delimited by the <tree> element, which represents the n-gram counts with <node> elements as described below.

The <vocab> section defines the words and imported rules represented in the tree. Each token is given a unique index number, which is used in the count tree. Sequential indexes are recommended for clarity, as they make it easier to follow the tree structure. No index or token may occur twice in the vocabulary. Format:
<vocab [xml:lang="en-us"]>
<token index="#" [xml:lang="en-us"]>
CDATA |
<ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
[xml:lang="en-us"] [type="media-type"] />
</token>
…
</vocab>
Notes:
- The token indexes must be integers greater than 0.
- The token indexes do not need to be sequential, but sequential indexes are recommended.
- No two vocabulary tokens can have the same index.
- No two indexes can have the same lexical token.
- Words with embedded spaces are allowed. Leading and trailing whitespace is removed, but whitespace inside the CDATA is preserved (for example, '<token index="30"> New York </token>' represents "New York").
- The '-pau-' and '-pau2-' tokens represent sentence beginnings (-pau-) and endings (-pau2-). While these tokens are not required to be modeled in the n-gram, it is highly recommended that they be listed. If they are not modeled, the sgc compiler automatically adds them with a unigram count of 0, and back-off weights are then calculated for those arcs. This automation is done only to produce a working language model; it is much better to model the beginning and ending probabilities of the true utterance.

<vocab>
<token index='1'>word1</token>
<token index='3'>
Word2
</token>
<token index='0'>-pau-</token> <!--ERROR: index less than 1. -->
<token index='6'>-pau2-</token>
<token index='20'>New York</token>
<token index='3'>Word3</token> <!--ERROR: repeat of index 3.-->
<token index='4B'>Word4</token> <!--ERROR: non-numeric index -->
<token index='2'> word1</token> <!--ERROR: repeat of 'word1' -->
<token index='5'> <ruleref uri="alphabet.grxml"/> </token>
<token index='12'> <ruleref uri="date.grxml#Months"/> </token>
<token index='9'> <ruleref uri=
"http://example.com/cities.cgi?Ohio"/> </token>
<token index='11'> <ruleref uri="bigram.ngxml"/> </token>
</vocab>

The event count tree (<tree>) is a depth-first representation of the raw unigram, bigram, and trigram counts. The tree syntax diverges from standard XML practice: it contains numbers (separated by whitespace) within the body of each <node> element. Taken as a whole, the nodes represent a flattened tree in which the structure is determined by specific number fields in predecessor nodes. This is done to avoid the significant overhead of a full XML tree structure. See Count tree (<tree>) for a full discussion.