SLM training file header
The initial header lines define the content and structure of the XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
XML declaration
The first line in the header is always the XML declaration. This declaration specifies the version of XML used in the document (1.0 or 1.1). It also specifies the character encoding that applies for the document, which determines the languages that can be represented. Both version and encoding are required attributes.
See XML declaration and encoding type for details.
Document type and system
Optionally, you can use the !DOCTYPE declaration to define the document type. For a training file, this type is "SLMTraining", as shown in the example above.
The SYSTEM attribute specifies a document type definition (DTD), which must be described in a .dtd file. Specifying a DTD is optional, but recommended to catch XML formatting errors. The installation includes an SLMTraining.dtd file, located in the %SWISRSDK%\config directory.
The example above assumes that the training file is located in the same directory as the DTD file. If the training file is located elsewhere, you must include the relative path to the DTD file in the training file header.
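For instance, a training file stored outside the config directory might reference the installed DTD with a relative path like this (the path shown is illustrative; adjust it to your own directory layout):
<!DOCTYPE SLMTraining SYSTEM "../config/SLMTraining.dtd">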
<SLMTraining> and language declaration
The <SLMTraining> element opens the main section of an SLM training file. It has two required attributes: the version (1.0.0), and the xml:lang attribute that specifies the main language for the training file.
You can create SLMs for any language installed for Recognizer. Use the xml:lang attribute to specify the target language. The value is a string indicating the language code, for example, en-us. See Setting the language in the grammar header.
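For example, a hypothetical training file targeting French might open as follows (the language code is illustrative):
<SLMTraining version="1.0.0" xml:lang="fr-fr">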
Configuration parameters
There are several configuration parameters that can be used in training files. You can specify these parameters and set their values by using the <param> and <value> elements in your training file header:
<param name="ngram_order"><value> 2 </value></param>
<param name="fsm_out"><value>sample.fsm</value></param>
<param name="wordlist_out"><value>sample.wordlist</value></param>
The default values for these parameters are typically acceptable for your initial training iterations. In later iterations, you can test parameter values during tuning. See Tuning SLMs for additional details.
The exception to using default values for the first iteration is smooth_weights; a non-default setting is recommended when interpolating models.
Many of the training file parameters tune n-grams. For an overview of n-grams, see SLMs. For a more detailed discussion, see N-gram grammars.
Available SLM configuration parameters:
SLM parameter | Description
---|---
cutoffs | Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant.
discounts_in | Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.
discounts_out | Optional. Specifies an output filename for computed discounts (see discounts_in).
fsm_out | Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM.
ngram_order | Specifies whether to create a bigram or trigram language model.
print_arpa | Optional. Specifies an output file for writing the SLM in the ARPA format.
smooth_alg | Optional. Applies an industry-standard algorithm while training the language model.
smooth_weights | Optional. Defines the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities.
wordlist_out | Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out.

cutoffs
Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant.
For example, a value of 2 removes any bigram that occurs two times or fewer. Each value specified for the cutoffs parameter must be an integer greater than or equal to zero (the default is 0).
Typically, the default value of this parameter is not changed. When you are training a language model on a large quantity of data, however, the model can grow very large. In these instances, the cutoffs parameter dramatically reduces the size of the language model, while having only a minimal effect on accuracy. With large training sets, a setting of 1 can reduce grammar size by 50% while having almost no detrimental effect on the grammar's accuracy.
For higher-order n-grams, you can specify more than one value for the cutoffs parameter. The ngram_order parameter (see ngram_order) determines the number of values required:
- For a bigram, the cutoffs parameter requires one value. For example:
<param name="cutoffs">
<value>2</value>
</param>
- For a trigram, the cutoffs parameter requires two values separated by a space. The first value applies to bigrams and the second to trigrams (a trigram model contains internal bigrams).
In the following example, 2 is applied to bigrams and 3 is applied to trigrams (the bigram cutoff must be less than or equal to the trigram cutoff):
<param name="cutoffs">
<value>2 3</value>
</param>
If you specify two values for the cutoffs parameter, but then use ngram_order to create a bigram, the second value is not needed and is ignored if present.
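Putting the two parameters together, a hypothetical trigram configuration might look like this (the cutoff values are illustrative):
<param name="ngram_order">
<value>3</value>
</param>
<param name="cutoffs">
<value>2 3</value>
</param>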

discounts_in
Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.
For example:
<param name="discounts_in">
<value> file.dcnt </value> </param>
In the context of SLM training, a discount is a multiplier which the compiler uses to account for word permutations that are not explicitly included in the training file as sentences. Discounts are calculated using an industry-standard algorithm, as determined by the smooth_alg parameter.
Without discounting, any n-gram not seen in the training set has zero probability, which makes it unlikely to be recognized in the test data. With discounting, part of the probability mass is reserved to accommodate such n-grams, by multiplying the counts of the training n-grams by a constant smaller than 1.
If you omit this parameter, Recognizer computes Good-Turing discounts from the input training set. This is suitable if you have small amounts of training data, originating from different applications. Use discounts_in when the smooth_alg parameter specifies a Good-Turing value and you do not want to re-compute the Good-Turing discounts.
A sample discounts file appears below:
Sample discounts file:
Good-Turing discounts
6 0.39789599 0.68462503 0.71244669 0.87349498 0.90344799 0.79262167
8 0.21760599 0.56865501 0.72032666 0.79289001 0.82419997 0.81001669 0.81023431 0.95779902
7 0.0000000 0.46160099 0.67955333 0.67672497 0.74140197 0.75773168 0.83446717
The first line of a discounts file is a comment (here, "Good-Turing discounts").
The remaining lines correspond to the n-gram orders (1, 2, and 3) and consist of floating-point discount multipliers. The first number on each line indicates how many multipliers follow on that line. Each multiplier corresponds to an n-gram count in the training data.
In the example, there are 6 multipliers for 1-grams, 8 for 2-grams, and 7 for 3-grams. An n-gram with an original count equal to i will be modified by multiplying i by the i-th float on that line. If i is greater than the number of multipliers, the last multiplier is used.
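For instance, using the file above, a 2-gram that occurs twice in the training data has its count multiplied by 0.56865501 (the second multiplier on the 2-gram line), while a 2-gram that occurs 20 times uses the last multiplier on that line, 0.95779902.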

discounts_out
Optional. Specifies an output filename for computed discounts (see discounts_in).
The file receives the computed discounts, which you can reuse as input in future training sessions.
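By analogy with the other filename parameters, a hypothetical example (the filename is illustrative):
<param name="discounts_out">
<value>file.dcnt</value>
</param>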

fsm_out
Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM.
For example:
<param name="fsm_out">
<value>filename.fsm</value>
</param>
You can import the FSM file, together with the word list, into an SRGS grammar by using <meta> elements.
When using fsm_out, also use wordlist_out to specify the filename of the accompanying vocabulary list. It's a good idea to give the .fsm and .wordlist files similar names to help manage them as a pair.

ngram_order
Specifies whether to create a bigram or trigram language model.
<param name="ngram_order">
<value>2</value>
</param>
The value for ngram_order must be an integer, either 2 (bigram) or 3 (trigram).
A trigram is usually preferable, because it provides the best accuracy for typical scenarios. However, you can build a bigram if your training set is small (not enough sentences), if you need to speed up training (a bigram requires less processing), or if you need to reduce the size of the final grammar.

print_arpa
Optional. Specifies an output file for writing the SLM in the ARPA format.
For example:
<param name="print_arpa">
<value> file.arpa </value>
</param>
ARPA stands for Advanced Research Projects Agency. Because the format is ASCII text, you can inspect language model probabilities directly and exchange language models with users of other language modeling tools.
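For reference, here is a minimal excerpt in the standard ARPA format (the vocabulary and log probabilities are invented for illustration, not output from a real training run):
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 hello -0.3010
-0.4771 world -0.3010
-0.4771 </s>

\2-grams:
-0.3010 hello world
-0.3010 world </s>

\end\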

smooth_alg
Optional. Applies an industry-standard algorithm while training the language model.
For an overview of how discounts are used, see discounts_in.
The value is one of the following strings:
Value | Description
---|---
GT-disc | Good-Turing discounting, no interpolation.
GT-disc-int | Good-Turing discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
GT-discw-int | Good-Turing discounting using sentence weights, interpolation with 1-gram probabilities (using the smooth_weights parameter).
INT | No discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
WB-disc | Discounted but non-interpolated Witten-Bell. This is the default.
WB-int | Regularly-interpolated Witten-Bell with controlling smooth_weights parameter.
For example:
<param name="smooth_alg">
<value> GT-disc </value>
</param>

smooth_weights
Optional. Defines the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities.
The value is a list of smoothing weights (the number of values is the n-gram order plus one, as in the example below):
<param name="smooth_weights">
<value> 0.1 0.9 0.9 0.4 </value>
</param>
The last weight is only used by the regular Witten-Bell method.
One use of this parameter is when the training set has fewer than 2 million words. See the example in Determining the order of the model.
See also discounts_in and smooth_alg.

wordlist_out
Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out (see fsm_out).
The value is the output filename of the word list. For example:
<param name="wordlist_out">
<value>filename.txt</value></param>
Specify this parameter whenever you use fsm_out. It is recommended that the fsm and wordlist files have similar names.
User dictionaries
You can use the <lexicon> element to specify a user dictionary in the training file. See Pronunciation dictionaries.
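A minimal sketch, assuming an SRGS-style uri attribute and a hypothetical dictionary file named names.dct (check Pronunciation dictionaries for the exact syntax your installation expects):
<lexicon uri="names.dct"/>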
The <meta> element
The <meta> element defines a configuration parameter inside the resulting grammar. These parameters are applied to the grammar during compilation. In general, the values are local to the grammar even if the grammar imports (or is imported by) another grammar.
- swirec_compile_parser
- swirec_enable_robust_compile
- swirec_first_pass_grammar
- swirec_fsm_grammar
- swirec_fsm_wordlist
- swirec_max_dict_prons
- swirec_multiword_replace
- swirec_normalize_to_probabilities
- swirec_optimization
- swirec_training_grammar
Note: Metas are not saved in the finite state machine and wordlist files. Put them in the wrapper grammar.
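For example, a hypothetical wrapper grammar might reference the generated files with <meta> elements such as these (the filenames are illustrative, and the exact set of metas depends on your application):
<meta name="swirec_fsm_grammar" content="sample.fsm"/>
<meta name="swirec_fsm_wordlist" content="sample.wordlist"/>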