SLM training file header
The initial header lines define the content and structure of the XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
XML declaration
The first line in the header is always the XML declaration. This declaration specifies the version of XML used in the document (1.0 or 1.1). It also specifies the character encoding that applies for the document, which determines the languages that can be represented. Both version and encoding are required attributes.
See XML declaration and encoding type for details.
Document type and system
Optionally, you can use the !DOCTYPE declaration to define the document type. For a training file, this type is "SLMTraining", as shown in the example above.
The SYSTEM attribute specifies a document type definition (DTD), which must be described in a .dtd file. Specifying a DTD is optional, but recommended to catch XML formatting errors. The installation includes an SLMTraining.dtd file, located in the %SWISRSDK%\config directory.
The example above assumes that the training file is located in the same directory as the DTD file. If the training file is located elsewhere, you must include the relative path to the DTD file in the training file header.
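For instance, a training file stored outside the config directory might reference the installed DTD with a relative path like this (the path shown is illustrative; adjust it to your own directory layout):
<!DOCTYPE SLMTraining SYSTEM "../config/SLMTraining.dtd">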
<SLMTraining> and language declaration
The <SLMTraining> element opens the main section of an SLM training file. It has two required attributes: the version (1.0.0), and the xml:lang attribute that specifies the main language for the training file.
You can create SLMs for any language installed for Recognizer. Use the xml:lang attribute to specify the target language. The value is a string indicating the language code, for example, en-us. See Setting the language in the grammar header.
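For example, a hypothetical training file targeting French might open as follows (the language code is illustrative):
<SLMTraining version="1.0.0" xml:lang="fr-fr">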
Configuration parameters
There are several configuration parameters that can be used in training files. You can specify these parameters and set their values by using the <param> and <value> elements in your training file header:
<param name="ngram_order"><value> 2 </value></param>
<param name="fsm_out"><value>sample.fsm</value></param>
<param name="wordlist_out"><value>sample.wordlist</value></param>
The default values for these parameters are typically acceptable for your initial training iterations. In later iterations, you can test parameter values during tuning. See Tuning SLMs for additional details.
The exception to using default values for the first iteration is smooth_weights; a non-default setting is recommended when interpolating models.
Many of the training file parameters tune n-grams. For an overview of n-grams, see SLMs. For a more detailed discussion, see N-gram grammars.
Available SLM configuration parameters:
SLM parameter | Description
---|---
cutoffs | Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant.
discounts_in | Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.
discounts_out | Optional. Specifies an output filename for computed discounts (see discounts_in).
fsm_out | Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM.
ngram_order | Specifies whether to create a bigram or trigram language model.
print_arpa | Optional. Specifies an output file for writing the SLM in the ARPA format.
smooth_alg | Optional. Applies an industry-standard algorithm while training the language model.
smooth_weights | Optional. Defines the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities.
wordlist_out | Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out.

cutoffs
Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant.
For example, a value of 2 removes any bigram that occurs two times or fewer. Each value specified for the cutoffs parameter must be an integer greater than or equal to zero (the default is 0).
Typically, the default value of this parameter is not changed. When you are training a language model on a large quantity of data, however, the model can grow very large. In these instances, the cutoffs parameter dramatically reduces the size of the language model, while having only a minimal effect on accuracy. With large training sets, a setting of 1 can reduce grammar size by 50% while having almost no detrimental effect on the grammar's accuracy.
For higher-order n-grams, you can specify more than one value for the cutoffs parameter. The ngram_order parameter (see ngram_order) determines the number of values required:
- For a bigram, the cutoffs parameter requires one value. For example:
<param name="cutoffs">
<value>2</value>
</param>
- For a trigram, the cutoffs parameter requires two values separated by a space. The first value applies to bigrams and the second to trigrams (a trigram model contains internal bigrams).
In the following example, 2 is applied to bigrams and 3 is applied to trigrams (the bigram cutoff must be less than or equal to the trigram cutoff):
<param name="cutoffs">
<value>2 3</value>
</param>
If you specify two values for the cutoffs parameter, but then use ngram_order to create a bigram, the second value is not needed and is ignored if present.
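Putting the two parameters together, a hypothetical trigram configuration might look like this (the cutoff values are illustrative):
<param name="ngram_order">
<value>3</value>
</param>
<param name="cutoffs">
<value>2 3</value>
</param>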

discounts_in
Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.
For example:
<param name="discounts_in">
<value> file.dcnt </value> </param>
In the context of SLM training, a discount is a multiplier which the compiler uses to account for word permutations that are not explicitly included in the training file as sentences. Discounts are calculated using an industry-standard algorithm, as determined by the smooth_alg parameter.
Without discounting, any n-gram not seen in the training set has zero probability, which makes it unlikely to be recognized in the test data. With discounting, part of the probability mass is reserved to accommodate such n-grams, by multiplying the counts of the training n-grams by a constant smaller than 1.
If you omit this parameter, Recognizer computes Good-Turing discounts from the input training set. This is suitable if you have small amounts of training data, originating from different applications. Use discounts_in when the smooth_alg parameter specifies a Good-Turing value and you do not want to re-compute the Good-Turing discounts.
A sample discounts file appears below:
Sample discounts file:
Good-Turing discounts
6 0.39789599 0.68462503 0.71244669 0.87349498 0.90344799 0.79262167
8 0.21760599 0.56865501 0.72032666 0.79289001 0.82419997 0.81001669 0.81023431 0.95779902
7 0.0000000 0.46160099 0.67955333 0.67672497 0.74140197 0.75773168 0.83446717
The first line of a discounts file is a comment (here, "Good-Turing discounts").
The remaining lines correspond to the n-gram orders (1, 2, and 3) and consist of floating-point discount multipliers. The first number on each line indicates how many multipliers follow on that line. Each multiplier corresponds to an n-gram count in the training data.
In the example, there are 6 multipliers for 1-grams, 8 for 2-grams, and 7 for 3-grams. An n-gram with an original count equal to i will be modified by multiplying i by the i-th float on that line. If i is greater than the number of multipliers, the last multiplier is used.
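For instance, using the file above, a 2-gram that occurs twice in the training data has its count multiplied by 0.56865501 (the second multiplier on the 2-gram line), while a 2-gram that occurs 20 times uses the last multiplier on that line, 0.95779902.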

discounts_out
Optional. Specifies an output filename for computed discounts (see discounts_in).
The file receives the computed discounts, which you can reuse as input in future training sessions.
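By analogy with the other filename parameters, a hypothetical example (the filename is illustrative):
<param name="discounts_out">
<value>file.dcnt</value>
</param>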

fsm_out
Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM.
For example:
<param name="fsm_out">
<value>filename.fsm</value>
</param>
You can import the FSM file, together with the word list, into an SRGS grammar by using <meta> elements.
When using fsm_out, also use wordlist_out to specify the filename of the accompanying vocabulary list. It's a good idea to give the .fsm and .wordlist files similar names to help manage them as a pair.

ngram_order
Specifies whether to create a bigram or trigram language model.
<param name="ngram_order">
<value>2</value>
</param>
The value for ngram_order must be an integer, either 2 (bigram) or 3 (trigram).
A trigram is usually preferable, because it provides the best accuracy for typical scenarios. However, you can build a bigram if your training set is small (not enough sentences), if you need to speed up training (a bigram requires less processing), or if you need to reduce the size of the final grammar.

print_arpa
Optional. Specifies an output file for writing the SLM in the ARPA format.
For example:
<param name="print_arpa">
<value> file.arpa </value>
</param>
ARPA stands for Advanced Research Projects Agency. Because the format is ASCII text, you can inspect language model probabilities directly and exchange language models with users of other language modeling tools.
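For reference, here is a minimal excerpt in the standard ARPA format (the vocabulary and log probabilities are invented for illustration, not output from a real training run):
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 hello -0.3010
-0.4771 world -0.3010
-0.4771 </s>

\2-grams:
-0.3010 hello world
-0.3010 world </s>

\end\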

smooth_alg
Optional. Applies an industry-standard algorithm while training the language model.
For an overview of how discounts are used, see discounts_in.
The value is one of the following strings:
Value | Description
---|---
GT-disc | Good-Turing discounting, no interpolation.
GT-disc-int | Good-Turing discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
GT-discw-int | Good-Turing discounting using sentence weights, interpolation with 1-gram probabilities (using the smooth_weights parameter).
INT | No discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
WB-disc | Discounted but non-interpolated Witten-Bell. This is the default.
WB-int | Regularly-interpolated Witten-Bell with controlling smooth_weights parameter.
For example:
<param name="smooth_alg">
<value> GT-disc </value>
</param>

smooth_weights
Optional. Defines the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities.
The value is a list of smoothing weights (the number of values is the n-gram order plus one, as in the example below):
<param name="smooth_weights">
<value> 0.1 0.9 0.9 0.4 </value>
</param>
The last weight is only used by the regular Witten-Bell method.
One use of this parameter is when the training set has fewer than 2 million words. See the example in Determining the order of the model.
See also discounts_in and smooth_alg.

wordlist_out
Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out (see fsm_out).
The value is the output filename of the word list. For example:
<param name="wordlist_out">
<value>filename.txt</value></param>
Specify this parameter whenever you use fsm_out. It is recommended that the fsm and wordlist files have similar names.
User dictionaries
You can use the <lexicon> element to specify a user dictionary in the training file. See Pronunciation dictionaries.
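A minimal sketch, assuming an SRGS-style uri attribute and a hypothetical dictionary file named names.dct (check Pronunciation dictionaries for the exact syntax your installation expects):
<lexicon uri="names.dct"/>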
The <meta> element
The <meta> element defines a configuration parameter inside the resulting grammar. These parameters are applied to the grammar during compilation. In general, the values are local to the grammar even if the grammar imports (or is imported by) another grammar.
- swirec_compile_parser
- swirec_enable_robust_compile
- swirec_first_pass_grammar
- swirec_fsm_grammar
- swirec_fsm_wordlist
- swirec_max_dict_prons
- swirec_multiword_replace
- swirec_normalize_to_probabilities
- swirec_optimization
- swirec_training_grammar
Note: Metas are not saved in the finite state machine and wordlist files. Put them in the wrapper grammar.
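For example, a hypothetical wrapper grammar might reference the generated files with <meta> elements such as these (the filenames are illustrative, and the exact set of metas depends on your application):
<meta name="swirec_fsm_grammar" content="sample.fsm"/>
<meta name="swirec_fsm_wordlist" content="sample.wordlist"/>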