SLM training file main body

The <item> element

In a training file vocabulary, each <item> declares a single vocabulary word or a phrase joined by underscores. Recognizer treats the entire item as a single word; it does not consider the individual words of an underscore-joined phrase separately.
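As a small illustration, the vocabulary fragment below declares two single words and one underscore-joined phrase. The words themselves are made up for this sketch; only the <vocab> and <item> elements come from the description above.

```xml
<vocab>
    <!-- Single vocabulary words (illustrative) -->
    <item>flights</item>
    <item>to</item>
    <!-- An underscore-joined phrase: modeled as one word, not as "new" and "york" -->
    <item>new_york</item>
</vocab>
```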

Optionally, you can replace any <item> with an external file. See External training files (and compression).

The <ruleref> element (grammar classes)

The optional <ruleref> element imports a rule from an external grammar into the training file:

The <ruleref> element appears in the vocabulary and sentences of training files.
When used inside of <vocab>, the <ruleref> acts as a declaration for the grammar, which is used later inside of sentences. Without this declaration, the <ruleref> cannot appear in training sentences.
The words are not imported individually from the grammar and they cannot be used individually in sentences. Instead, they are imported as a class, and they can only be specified in sentences as a <ruleref> class.
When used inside of <sentence>, the <ruleref> declares a placeholder in the named grammar. Recognizer treats those words as a class, and for the purpose of statistical modeling, they are treated as a word.

Classes help generalize results from the training sentences.

For example, if an application accepts restaurant reservations, it would be useful to have a restaurant <ruleref> instead of writing individual restaurant names in training sentences. Doing this has several advantages: it automates the creation of additional sentences with all names in the restaurant class; it ensures that no restaurant name is accidentally omitted from the sentences; and when the list of available restaurant names changes in the restaurant grammar, no change is needed for the training sentences.
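The restaurant scenario might look like the following sketch. The grammar file name restaurants.grxml, the rule name #name, and the sentence wording are hypothetical; the structure follows the rules above (the <ruleref> is declared once in <vocab>, then used as a class inside <sentence>).

```xml
<SLMTraining version="1.0.0" xml:lang="en-us">
    <vocab>
        <item>book</item>
        <item>a</item>
        <item>table</item>
        <item>at</item>
        <!-- Declares the restaurant class (hypothetical grammar and rule name) -->
        <ruleref uri="restaurants.grxml#name"/>
    </vocab>
    <training>
        <!-- The ruleref stands in for any restaurant name in the grammar -->
        <sentence>book a table at <ruleref uri="restaurants.grxml#name"/></sentence>
    </training>
</SLMTraining>
```

Because the sentence refers to the class rather than to a specific name, adding a restaurant to the grammar would not require touching the training sentences.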

See Extending SLMs with grammar classes.

The <ruleref> element allows the following attributes:

uri: Required. This attribute declares the URI of the external grammar containing the rule. Paths are relative to the current document. For example:

<vocab>

    <item>sample</item>

...

    <ruleref uri="http://myserver/grammars/currency.xml#amount"/>

...

    <item>vocabulary</item>

</vocab>

words: Optional. This attribute provides a way to remember the original phrase or sentence that has been converted into a ruleref. This is useful to the creator of the training file, but is not currently used by the compiler.
From an archival point of view, it is an important best practice, when available, to keep the words from the original transcription.
For example, if the application recognizes dates, you might convert actual dates that are spoken and transcribed during data collection into a generalized date ruleref. This attribute provides a way to record the original words spoken.
The following example implies that the phrase "two pesos" was seen during data collection and has been generalized to a number followed by a currency. The words attribute is currently used strictly for annotation and has no effect on the resulting SLM.
```
<sentence>
    I would like
    <ruleref uri="http://myserver/grammars/number.xml"
        words="two" />
    <ruleref uri="http://myserver/grammars/currencies.xml"
        words="pesos" />
</sentence>
```

The <sentence> element

In an SLM training file, each <sentence> defines a valid utterance. Individual words within a sentence are separated by spaces.

If phrases in sentences are best modeled as a class, you can replace them with a <ruleref> element. See The <ruleref> element (grammar classes) and Extending SLMs with grammar classes.

Every word used in a sentence must be defined in the vocabulary section of the training file:

If a word or <ruleref> appears in a <sentence> but not in the <vocab>, it is out-of-vocabulary (OOV), and no n-gram will be formed with that word or rule reference. Essentially, the word is ignored.
If a word or <ruleref> appears in the <vocab> but not in a <sentence>, it will be recognizable, but will have a low probability.
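Both cases can be seen in one small sketch. The words are invented for illustration; the comments restate the two rules above.

```xml
<SLMTraining version="1.0.0" xml:lang="en-us">
    <vocab>
        <item>i</item>
        <item>want</item>
        <item>pizza</item>
        <!-- In the vocabulary but in no sentence: recognizable, with a low probability -->
        <item>pasta</item>
    </vocab>
    <training>
        <!-- "today" is not in the vocabulary: it is OOV and forms no n-grams -->
        <sentence>i want pizza today</sentence>
    </training>
</SLMTraining>
```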

Optionally, you can replace any <sentence> with an external file. See External training files (and compression).

For estimates of the number of sentences needed, see Data collection for training files.

Sentence counts

The count attribute multiplies the occurrence of a sentence. It is a short-cut that repeats the same sentence multiple times.

In a training sentence, this attribute increases the sentence’s weight in the resulting models. The following example implies that the sentence is 100 times more common than sentences with a count of 1 (the default):

<sentence count="100">this is a sample</sentence>
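Since count is described as a shortcut for repetition, the two forms in the sketch below should carry the same weight. The sentence text is illustrative.

```xml
<training>
    <!-- Shorthand: one sentence with a count of 3 -->
    <sentence count="3">check my balance</sentence>

    <!-- Long form with the same effect: the sentence written three times -->
    <sentence>check my balance</sentence>
    <sentence>check my balance</sentence>
    <sentence>check my balance</sentence>
</training>
```

In practice you would use one form or the other, not both; they are shown together here only for comparison.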

XML entity references

You can avoid problems handling very large training files by using XML “entities,” which enable you to break large XML files into more manageable subsets, and import them into the training file by reference. The following skeletal example illustrates importing two vocabulary files and two training files into the main training file. See the XML 1.0 recommendation.

<?xml version="1.0"?>

<!DOCTYPE slm_train [

  <!ENTITY vocab1 SYSTEM "vocab1.xml">

  <!ENTITY vocab2 SYSTEM "vocab2.xml">

  <!ENTITY training1 SYSTEM "training1.xml">

  <!ENTITY training2 SYSTEM "training2.xml">

]>

<SLMTraining version="1.0.0" xml:lang="en-us">

    <vocab>

      &vocab1;

      &vocab2;

    </vocab>

    <training>

      &training1;

      &training2;

    </training>

</SLMTraining>
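The skeleton above does not show what the imported files contain. One plausible layout, assuming each file is a plain XML fragment with no root element and no DOCTYPE of its own (as external parsed entities allow), is sketched below. The file contents are hypothetical.

```xml
<!-- Hypothetical contents of vocab1.xml: items only, expanded inside <vocab> -->
<item>checking</item>
<item>savings</item>
<item>balance</item>

<!-- Hypothetical contents of training1.xml: sentences only, expanded inside <training> -->
<sentence>checking balance</sentence>
<sentence count="5">savings balance</sentence>
```

When the parser expands &vocab1; and &training1;, the result is as if these elements had been written directly in the main training file.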

External training files (and compression)

Optionally, you can replace any <item> or <sentence> in a <vocab>, <training>, or <test> section with an external training file. This is especially useful for large test sets because the external file requires no XML elements and can be in a compressed format.

The <external> element has one attribute, uri, which is set to the URI of an external file. The URI must be a local path. The external file must use UTF-8 encoding. Format:

<external uri="myLocalPath\myFilename"/>

The URI can specify a file compressed with GNU zip (gzip). For example:

<external uri="vocab.txt.gz"/>
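As a sketch of how these references might sit in a training file, the skeleton below uses one <external> element in each section. The file names are hypothetical, and placing <test> as a sibling of <training> is an assumption; the source names the section but does not show its position.

```xml
<SLMTraining version="1.0.0" xml:lang="en-us">
    <vocab>
        <!-- Replaces a list of <item> elements -->
        <external uri="vocab.txt.gz"/>
    </vocab>
    <training>
        <!-- Replaces a list of <sentence> elements -->
        <external uri="training.txt.gz"/>
    </training>
    <test>
        <!-- Large test sets benefit most from compressed external files -->
        <external uri="test.txt.gz"/>
    </test>
</SLMTraining>
```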

External training files have different headers depending on where they appear. The header is the first line of the file, and is one of the following:

Header	Description
::VOCAB	Header for the <vocab> section.
::SLMDATA	Header for the <training> or <test> sections.

After the header, each line contains training data:

For <vocab> each line defines one vocabulary word.
For <training> and <test> each line defines one sentence.

The system assumes the default language for any word that has no language identifier. A language identifier consists of an exclamation mark "!" followed by a language code, for example !en-us. Blank spaces around the ! are allowed. Here are vocabulary words with mixed languages:

::VOCAB

this

is!en-us

a !en-us

vocabulary! en-us

and ! en-us

esto !es-us

es !es-us

un !es-us

vocabulario !es-us

For a <training> or <test> section, each word of a sentence can have a language identifier. The identifier applies to a single word (not a phrase). Words with no identifier use the default language. Here are test sentences with mixed languages:

::SLMDATA

this!en-us is a vocabulary !en-us

esto !es-us es!es-us un!es-us vocabulario!es-us

A training file can indicate the weight of each sentence by adding "count" and "prior" prefixes:

Count multiplies the occurrence of a sentence in the training data. It is a short-cut that repeats the same sentence multiple times. The value is an integer, and the default is 1.
Prior is a log probability. The value is a floating-point number, and the default is 1.0.

The default weight is 1.0. Specify the count and prior at the beginning of a sentence, separated by whitespace and followed by a comma. The following sentences are valid, and have the same meaning:

this!en-us is a vocabulary !en-us

, this!en-us is a vocabulary !en-us

1 1.0, this!en-us is a vocabulary !en-us

The example above simply restates the default behavior. In the following example, the first sentence has a count of 10 and a prior of 2.0, the second sentence has a count of 10 and a default prior of 1.0, and the last two sentences have default weights:

10 2.0, sentence number one

10, sentence number two

,sentence number three

sentence number four

You can use rule references anywhere in an SLM external file where a word is allowed. The purpose is the same as using the <ruleref> element in an XML file, but the syntax is different. Format:

$$<URI>

Examples:

$$grammar.grxml

$$http://grammarServer.com/grammar.grxml

Here is a complete external SLM vocabulary file:

::VOCAB

are

animals

$$animals.grxml

Here is a complete external SLM training file:

::SLMDATA

2.5,$$animals.grxml are animals
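For comparison, the same vocabulary and sentence could be written inline with the XML elements described earlier. This is a sketch of the correspondence only: the weight prefix is left out because its mapping to XML attributes is not spelled out here.

```xml
<SLMTraining version="1.0.0" xml:lang="en-us">
    <vocab>
        <item>are</item>
        <item>animals</item>
        <!-- Inline equivalent of the $$animals.grxml line in the ::VOCAB file -->
        <ruleref uri="animals.grxml"/>
    </vocab>
    <training>
        <!-- Inline equivalent of "$$animals.grxml are animals" in the ::SLMDATA file -->
        <sentence><ruleref uri="animals.grxml"/> are animals</sentence>
    </training>
</SLMTraining>
```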
