SLM training file main body
The main part of a training file consists of a vocabulary section, which defines the words allowed in the SLM, and a training section, which lists example sentences that use those words.
The vocabulary section
The vocabulary section of the training file defines all words allowed in the training section. Words that appear in the training sentences but not in the vocabulary section will be ignored by the compiler.
The vocabulary section is defined by the <vocab> element. Within the <vocab> elements, words and classes are defined with <item> and <ruleref> elements.

In a training file vocabulary, each <item> declares a single vocabulary word or a phrase joined by underscores. Recognizer treats the entire item as a single word. It does not consider each word of a phrase joined by underscores individually.
Optionally, you can replace any <item> with a reference to an external file. See External training files (and compression).

The <ruleref> element (grammar classes)
The optional <ruleref> element imports a rule from an external grammar into the training file:
- The <ruleref> element appears in the vocabulary and sentences of training files.
- When used inside of <vocab>, the <ruleref> acts as a declaration for the grammar, which is used later inside of sentences. Without this declaration, the <ruleref> cannot appear in training sentences.
The words are not imported individually from the grammar and they cannot be used individually in sentences. Instead, they are imported as a class, and they can only be specified in sentences as a <ruleref> class.
- When used inside of <sentence>, the <ruleref> acts as a placeholder for the words in the named grammar. Recognizer treats those words as a class, and for the purpose of statistical modeling, the class is treated as a single word.
Classes help generalize results from the training sentences.
For example, if an application accepts restaurant reservations, it would be useful to have a restaurant <ruleref> instead of writing individual restaurant names in training sentences. Doing this has several advantages: it automates the creation of additional sentences with all names in the restaurant class; it ensures that no restaurant name is accidentally omitted from the sentences; and when the list of available restaurant names changes in the restaurant grammar, no change is needed for the training sentences.
See Extending SLMs with grammar classes.
The <ruleref> element allows the following attributes:
- uri: Required. This attribute declares the URI of the external grammar containing the rule. Paths are relative to the current document. For example:
<vocab>
<item>sample</item>
...
<ruleref uri="http://myserver/grammars/currency.xml#amount"/>
...
<item>vocabulary</item>
</vocab>
- words: Optional. This attribute provides a way to remember the original phrase or sentence that has been converted into a ruleref. This is useful to the creator of the training file, but is not currently used by the compiler.
From an archival point of view, it is an important best practice, when available, to keep the words from the original transcription.
For example, if the application recognizes dates, you might convert actual dates that are spoken and transcribed during data collection into a generalized date ruleref. This attribute provides a way to record the original words spoken.
The following example implies that the phrase "two pesos" was seen during data collection and has been generalized to a number followed by a currency. The words attribute is currently used strictly for annotation and has no effect on the resulting SLM.
<sentence>
I would like
<ruleref uri="http://myserver/grammars/number.xml"
words="two" />
<ruleref uri="http://myserver/grammars/currencies.xml"
words="pesos" />
</sentence>
The training section
The training section of the file lists example sentences that use the words from the vocabulary section. These sentences are used by the compiler to determine the probabilities to be used in recognizing user utterances.
The training section is defined by the <training> element. Within the <training> elements, sentences are defined with <sentence> element pairs.
The order of the sentences has no effect on the trained results.

In an SLM training file, each <sentence> defines a valid utterance. Individual words within a sentence are separated by spaces.
If phrases in sentences are best modeled as a class, you can replace them with a <ruleref> element. See The <ruleref> element (grammar classes) and Extending SLMs with grammar classes.
Define every word in every sentence in the vocabulary section of the training file:
- If a word or <ruleref> appears in a <sentence> but not in the <vocab>, it is out-of-vocabulary (OOV), and no n-gram will be formed with that word or rule reference. Essentially, the word is ignored.
- If a word or <ruleref> appears in the <vocab> but not in a <sentence>, it will be recognizable, but will have a low probability.
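The two rules above can be checked before compiling. The following Python sketch is not part of the product: the function name vocabulary_report and the sample file are invented for illustration, and the sketch assumes that a rule reference is identified by its uri attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature training file, invented for this sketch.
TRAINING_XML = """<SLMTraining version="1.0.0" xml:lang="en-us">
<vocab>
<item>this</item>
<item>is</item>
<item>a</item>
<item>sample</item>
<item>unused</item>
<ruleref uri="animals.grxml"/>
</vocab>
<training>
<sentence>this is a sample</sentence>
<sentence>this is another sample</sentence>
<sentence>this is a <ruleref uri="animals.grxml"/></sentence>
</training>
</SLMTraining>"""


def vocabulary_report(xml_text):
    """Return (out_of_vocabulary, never_used) as two sorted lists."""
    root = ET.fromstring(xml_text)
    vocab = root.find("vocab")
    # Words and rule references declared in the vocabulary section.
    words = {i.text.strip() for i in vocab.iter("item") if i.text}
    classes = {r.get("uri") for r in vocab.iter("ruleref")}

    used_words, used_classes = set(), set()
    for sentence in root.find("training").iter("sentence"):
        # itertext() yields the text around any <ruleref> children.
        used_words.update(" ".join(sentence.itertext()).split())
        used_classes.update(r.get("uri") for r in sentence.iter("ruleref"))

    oov = (used_words - words) | (used_classes - classes)
    never_used = (words - used_words) | (classes - used_classes)
    return sorted(oov), sorted(never_used)


OOV, NEVER_USED = vocabulary_report(TRAINING_XML)
print("Out of vocabulary:", OOV)
print("Declared but never used:", NEVER_USED)
```

In the sample, "another" is reported as out-of-vocabulary and "unused" as declared but never used.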
Optionally, you can replace any <sentence> with a reference to an external file. See External training files (and compression).
For estimates of the number of sentences needed, see Data collection for training files.

The count attribute multiplies the occurrence of a sentence. It is a short-cut that repeats the same sentence multiple times.
In a training sentence, this attribute increases the sentence’s weight in the resulting models. The following example implies that the sentence is 100 times more common than sentences with a count of 1 (the default):
<sentence count="100">this is a sample</sentence>
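To illustrate the shortcut, the following Python sketch tallies how often each sentence effectively occurs and shows that a count of 3 gives the same tally as writing the sentence three times. The function name sentence_counts and both sample fragments are invented for illustration; the sketch says nothing about how the compiler itself processes counts.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def sentence_counts(training_xml):
    """Tally effective occurrences per sentence, honoring the count attribute."""
    root = ET.fromstring(training_xml)
    totals = Counter()
    for sentence in root.iter("sentence"):
        # Normalize whitespace so identical sentences compare equal.
        text = " ".join("".join(sentence.itertext()).split())
        totals[text] += int(sentence.get("count", "1"))  # default count is 1
    return totals


# Hypothetical fragments: one uses count, the other repeats the sentence.
WITH_COUNT = (
    "<training>"
    '<sentence count="3">this is a sample</sentence>'
    "<sentence>this is a test</sentence>"
    "</training>"
)
REPEATED = (
    "<training>"
    + "<sentence>this is a sample</sentence>" * 3
    + "<sentence>this is a test</sentence>"
    + "</training>"
)

print(sentence_counts(WITH_COUNT) == sentence_counts(REPEATED))
```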
The test section
Optionally, the training file may include a test section defined by a pair of <test> elements. This section lists sentences that could be used to test the SLM.
The test section is not used during SLM training. However, it comes into play when the SLM is used to support an SSM (see SLMs).
Very large training files
Although large training files can improve the quality of an SLM, those large files can be difficult to manage. They can become unwieldy to edit, and some third-party software may not accept files over a certain size.
Two techniques for dividing large training files into collections of smaller files are described below.

You can avoid problems handling very large training files by using XML “entities,” which enable you to break large XML files into more manageable subsets and import them into the training file by reference (see the XML 1.0 recommendation). The following skeletal example illustrates importing two vocabulary files and two training files into the main training file:
<?xml version="1.0"?>
<!DOCTYPE slm_train [
<!ENTITY vocab1 SYSTEM "vocab1.xml">
<!ENTITY vocab2 SYSTEM "vocab2.xml">
<!ENTITY training1 SYSTEM "training1.xml">
<!ENTITY training2 SYSTEM "training2.xml">
]>
<SLMTraining version="1.0.0" xml:lang="en-us">
<vocab>
&vocab1;
&vocab2;
</vocab>
<training>
&training1;
&training2;
</training>
</SLMTraining>
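When the number of subset files grows, the main file can be generated rather than edited by hand. The following Python sketch reproduces the skeleton above from two lists of file names. The function name build_main_file is invented for illustration, and the sketch assumes that each file name without its extension is a valid XML entity name.

```python
from pathlib import PurePosixPath


def build_main_file(vocab_files, training_files, lang="en-us"):
    """Return the text of a main training file that imports the given files."""

    def name(path):
        # Entity name = file name without extension (assumed to be a valid XML name).
        return PurePosixPath(path).stem

    vocab_files, training_files = list(vocab_files), list(training_files)
    lines = ['<?xml version="1.0"?>', "<!DOCTYPE slm_train ["]
    lines += [f'<!ENTITY {name(p)} SYSTEM "{p}">' for p in vocab_files + training_files]
    lines += ["]>", f'<SLMTraining version="1.0.0" xml:lang="{lang}">']
    lines += ["<vocab>"] + [f"&{name(p)};" for p in vocab_files] + ["</vocab>"]
    lines += ["<training>"] + [f"&{name(p)};" for p in training_files] + ["</training>"]
    lines.append("</SLMTraining>")
    return "\n".join(lines) + "\n"


MAIN = build_main_file(
    ["vocab1.xml", "vocab2.xml"],
    ["training1.xml", "training2.xml"],
)
print(MAIN)
```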

External training files (and compression)
Optionally, you can replace any <item> or <sentence> in a <vocab>, <training>, or <test> section with an <external> element that points to an external training file. This is especially useful for large test sets because the external file requires no XML elements and can be in a compressed format.
The <external> element has one attribute, uri, which is set to the URI of an external file. The URI must be a local path. The external file must use UTF-8 encoding. Format:
<external uri="myLocalPath\myFilename"/>
The URI can specify a file compressed with GNU Zip (gzip). For example:
<external uri="vocab.txt.gz"/>
External training files have different headers depending on where they appear. The header is the first line of the file, and is one of the following:
Header | Description |
---|---|
::VOCAB | Header for the <vocab> section. |
::SLMDATA | Header for the <training> or <test> sections. |
After the header, each line contains training data:
- For <vocab> each line defines one vocabulary word.
- For <training> and <test> each line defines one sentence.
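As an illustration of the layout described above, the following Python sketch writes an external file with the required header, one entry per line, and UTF-8 encoding, and compresses it with gzip when the file name ends in .gz. The function name write_external and the demo file location are invented for illustration.

```python
import gzip
import os
import tempfile

VALID_HEADERS = ("::VOCAB", "::SLMDATA")


def write_external(path, header, entries):
    """Write an external training file; gzip it if the name ends in .gz."""
    if header not in VALID_HEADERS:
        raise ValueError(f"header must be one of {VALID_HEADERS}")
    opener = gzip.open if path.endswith(".gz") else open
    # Text mode with explicit UTF-8, as the external file format requires.
    with opener(path, "wt", encoding="utf-8", newline="\n") as f:
        f.write(header + "\n")      # the header is the first line
        for entry in entries:       # one word or one sentence per line
            f.write(entry + "\n")


# Demo: write a compressed vocabulary file to a temporary directory, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "vocab.txt.gz")
write_external(demo_path, "::VOCAB", ["this", "is", "a", "vocabulary"])
with gzip.open(demo_path, "rt", encoding="utf-8") as f:
    round_trip = f.read().splitlines()
print(round_trip)
```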
The system assumes the default language unless words have language identifiers. A language identifier consists of an exclamation mark "!" followed by a language code, for example, !en-us. (Blank spaces around the ! are allowed.) Here are vocabulary words with mixed languages:
::VOCAB
this
is!en-us
a !en-us
vocabulary! en-us
and ! en-us
esto !es-us
es !es-us
un !es-us
vocabulario !es-us
For a <training> or <test> section, each word of a sentence can have a language identifier. The identifier applies to a single word (not a phrase). Any words with no identifier use the default language. Here are test sentences with mixed languages:
::SLMDATA
this!en-us is a vocabulary !en-us
esto !es-us es!es-us un!es-us vocabulario!es-us
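The language identifier rules can be sketched in Python as follows. This is one possible reading of the format, not the compiler's own parser: the function name parse_tokens and the default_lang parameter are invented for illustration, and the sketch assumes that every "!" in a line introduces a language code.

```python
import re


def parse_tokens(line, default_lang="en-us"):
    """Split a line into (word, language) pairs."""
    # Blanks around "!" are allowed, so glue "word ! lang" into "word!lang" first.
    glued = re.sub(r"\s*!\s*", "!", line.strip())
    tokens = []
    for token in glued.split():
        word, _, lang = token.partition("!")
        # Words without an identifier fall back to the default language.
        tokens.append((word, lang or default_lang))
    return tokens


print(parse_tokens("this!en-us is a vocabulary !en-us", default_lang="es-us"))
print(parse_tokens("and ! en-us"))
```

Each identifier attaches only to the word immediately before it, which matches the rule that a language code refers to a single word.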
An external training file can indicate the weight of each sentence with an optional "count" and "prior" prefix:
- Count multiplies the occurrence of a sentence in the training data. It is a short-cut that repeats the same sentence multiple times. The value is an integer, and the default is 1.
- Prior is a log probability. The value is a floating-point number, and the default is 1.0.
The default weight is 1.0. Specify the count and prior at the beginning of a sentence, separated by whitespace and followed by a comma. The following sentences are valid, and have the same meaning:
this!en-us is a vocabulary !en-us
, this!en-us is a vocabulary !en-us
1 1.0, this!en-us is a vocabulary !en-us
The three sentences above all use the default values. In the following example, the first sentence has a count of 10 and a prior of 2.0, the second sentence has a count of 10 and a default prior of 1.0, and the last two sentences have default weights:
10 2.0, sentence number one
10, sentence number two
,sentence number three
sentence number four
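The prefix forms shown in the example can be parsed with a short Python sketch. The function name parse_weighted is invented for illustration. The sketch covers only the four forms above and assumes that a line whose text before the first comma is not a valid prefix has no prefix at all.

```python
def parse_weighted(line):
    """Return (count, prior, sentence) for one line of a ::SLMDATA file."""
    head, comma, tail = line.partition(",")
    if comma:
        fields = head.split()  # count and prior are separated by whitespace
        if len(fields) <= 2:
            try:
                count = int(fields[0]) if fields else 1            # default count
                prior = float(fields[1]) if len(fields) == 2 else 1.0  # default prior
                return count, prior, tail.strip()
            except ValueError:
                pass  # not a numeric prefix; treat the line as a plain sentence
    return 1, 1.0, line.strip()


for example in (
    "10 2.0, sentence number one",
    "10, sentence number two",
    ",sentence number three",
    "sentence number four",
):
    print(parse_weighted(example))
```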
You can use rule references anywhere in an SLM external file where a word is allowed. The purpose is the same as using the <ruleref> element in an XML file, but the syntax is different. Format:
$$<URI>
Examples:
$$grammar.grxml
$$http://grammarServer.com/grammar.grxml
Here is a complete external SLM vocabulary file:
::VOCAB
are
animals
$$animals.grxml
Here is a complete external SLM training file:
::SLMDATA
2.5,$$animals.grxml are animals
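As a final illustration, the following Python sketch checks the header of an external file and separates plain words from rule references. The names classify and parse_external are invented for illustration, and the sketch deliberately ignores count and prior prefixes and language identifiers.

```python
VALID_HEADERS = ("::VOCAB", "::SLMDATA")


def classify(line):
    """Tag each whitespace-separated token as a word or a rule reference."""
    tagged = []
    for token in line.split():
        if token.startswith("$$"):
            tagged.append(("rule", token[2:]))  # the URI follows the $$ marker
        else:
            tagged.append(("word", token))
    return tagged


def parse_external(text):
    """Return (header, entries) for the text of an external file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] not in VALID_HEADERS:
        raise ValueError("first line must be ::VOCAB or ::SLMDATA")
    return lines[0], [classify(ln) for ln in lines[1:]]


HEADER, ENTRIES = parse_external("::VOCAB\nare\nanimals\n$$animals.grxml\n")
print(HEADER, ENTRIES)
```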