Count tree (<tree>)
The event count tree (<tree>) is a depth-first representation of the raw unigram, bigram, and trigram counts (Recognizer does not support n-grams beyond a trigram). The notation is highly compressed and hard-to-read, as the syntax was designed to be generated by a computer program rather than human beings.
The tree syntax diverges from standard XML practices. It contains numbers (separated by whitespace) within the body of the <node> tag. As a whole, the nodes represent a flattened tree form where the structure is determined by specific number fields in predecessor nodes. This is done to remove the significant overhead of an XML tree structure.
Format:
<tree>
<node> #num-uni #total-count </node> <!-- root -->
<node> #tokenindex [#succ-entities] #entitycount </node>
…
</tree>
#num-uni |
The number of unigrams represented in the tree. Used for the first node (the 0-gram root) of the tree. |
#total-count |
The total number of tokens found in the training corpus (which is usually the sum of all unigram counts; see below). Used for the first node (the 0-gram root) of the tree. |
#tokenindex |
The vocabulary token index for the node. |
#succ-entities |
The number of successor nodes at the next level of the tree (if 0 this number is omitted). |
#entitycount |
The event count. |
The sgc grammar compiler has an option to convert a SLM training file into an XML n-gram file. Even if you don't have a training file, it is easiest to understand the count structure as originating from one.
To explain the depth-first n-gram event count format, suppose we have the pseudo-corpus of one sentence: "-pau- A B A B C -pau2-". The unigram, bigram, and trigram information can be represented as follows:
unigram |
bigram |
trigram |
successors |
event count |
n-gram |
---|---|---|---|---|---|
-pau- |
bigrams=1 |
unigram-count=1 |
( "-pau-") |
||
A |
trigrams=1 |
bigram-count=1 |
( "-pau- A") |
||
B |
4-grams=0 |
trigram-count=1 |
( "-pau- A B") |
||
A |
bigrams=1 |
unigram-count=2 |
("A") |
||
B |
trigrams=2 |
bigram-count=2 |
( "A B") |
||
A |
4-grams=0 |
trigram-count=1 |
( "A B A") |
||
C |
4-grams=0 |
trigram-count=1 |
( "A B C") |
||
B |
bigrams=2 |
unigram-count=2 |
( "B") |
||
A |
trigrams=1 |
bigram-count=1 |
( "B A") |
||
B |
4-grams=0 |
trigram-count=1 |
( "B A B") |
||
C |
trigrams=1 |
bigram-count=1 |
( "B C") |
||
-pau2- |
4-grams=0 |
trigram-count=1 |
( "B C -pau2-") |
||
C |
bigrams=1 |
unigram-count=1 |
( "C") |
||
-pau2- |
trigrams=0 |
bigram-count=1 |
( "C -pau2-") |
||
-pau2- |
bigrams=0 |
unigram-count=1 |
( "-pau2-") |
The event count tree refers to vocabulary index numbers, removing branching count values of 0 (which are not allowed). The first node is a special 0-gram root node, detailing the number of unigrams and total corpus count.
In a ten thousand word vocabulary with a training corpus of five hundred thousand words, many vocabulary words occur so infrequently that you will prefer to rely on backoff weights instead of explicit models. These words must appear in the vocabulary section of the n-gram, but are not required in the count tree. When words are omitted from the tree, Recognizer automatically adds them to the grammar with a unigram count of 0 and calculates a backoff weight. The same behavior happens for modeling bigrams and especially trigrams (it is inefficient to model all trigrams in a corpus). Do not use backoff probabilities for utterance beginning and ending tokens ('-pau-' and '-pau2-'); if you do not model them in the tree, Recognizer writes a warning to the diagnostic log.

In this simple example, the bigram counts for all bigrams starting with "B" equals the unigram counts for "B". This is unlikely to happen in a real n-gram grammar for a production system that omits very infrequent bigrams from the tree (the unigram counts for those words remain). This and the inclusion of rules (see Importing rules) means that unigram counts for a given token are likely to be greater than bigram counts for the same token, and the bigram counts are likely to be greater than the trigram counts. (The unigram counts are never less than the bigram counts, and bigram counts are never less than trigrams.) For example, if you have a bigram count of 8 for "A B C", the unigram count for "A" is at least 8. This is the only constraint on the count.
<?xml version='1.0' encoding="UTF-8"?>
<N-Gram>
<vocab>
<token index="1"> -pau- </token>
<token index="2"> A </token>
<token index="3"> B </token>
<token index="4"> C </token>
<token index="5"> -pau2- </token>
</vocab>
<tree>
<!-- Spacing of the numbers in <tree> is for clarity only -->
<!-- I S C (I=token-index S=children C=entity-count) -->
<node> 5 7 </node> <!-- root node, 5 unigrams, 7 symbols -->
<node> 1 1 1 </node> <!-- "-pau-", one bigram, appears once -->
<node> 2 1 1 </node> <!-- "-pau- A", one trigram, used once -->
<node> 3 1 </node> <!-- "-pau- A B", appears once -->
<node> 2 1 2 </node> <!-- "A", one bigram, appears once -->
<node> 3 2 2 </node> <!-- "A B", two trigrams, appears twice-->
<node> 2 1 </node> <!-- "A B A", appears once -->
<node> 4 1 </node> <!-- "A B C", appears once-->
<node> 3 2 2 </node> <!-- "B", two bigrams, appears twice -->
<node> 2 1 1 </node> <!-- "B A" one trigram, appears once -->
<node> 3 1 </node> <!-- "B A B", appears once -->
<node> 4 1 1 </node> <!-- "B C one trigram, appears once -->
<node> 3 1 </node> <!-- "B C -pau2-", appears once -->
<node> 4 1 1 </node> <!-- "C", one bigram, appears once -->
<node> 5 1 </node> <!-- "C -pau2-", no trigram, used once -->
<node> 5 1 </node> <!-- "-pau2-", no bigrams, appears once-->
</tree>
</N-Gram>

The following example adds information to the pseudo-corpus in example1.ngxml (above): "A B A B C" => "one plus one plus five". We also have an SRGS for representing numbers "numeric.grxml". This would be represented in the following n-gram:
<?xml version='1.0' encoding="UTF-8"?>
<N-Gram>
<vocab>
<token index="1"> -pau- </token>
<token index="2">
<ruleref uri="numeric.grxml#NUMBER"/>
</token>
<token index="3"> plus </token>
<token index="4"> -pau2- </token>
</vocab>
<tree>
<!-- Spacing of the numbers in <tree> is for clarity only -->
<!-- I S C (I=token-index S=children C=entity-count) -->
<node>4 7 </node> <!-- start, 4 children -->
<node>1 1 1 </node> <!-- -pau-, 1 child -->
<node>2 1 1 </node> <!-- -pau- NUMBER, 1 child -->
<node>31 </node> <!-- -pau- NUMBER plus, no child -->
<node>2 2 3 </node> <!-- NUMBER, 2 children -->
<node>3 1 2 </node> <!-- NUMBER plus, 1 child -->
<node>22 </node> <!-- NUMBER plus NUMBER, no child-->
<node>41 </node> <!-- NUMBER -pau2- no child -->
<node>3 1 2 </node> <!-- plus", 1 child -->
<node>2 2 2 </node> <!-- plus NUMBER, 2 child -->
<node>31 </node> <!-- plus NUMBER plus, no child -->
<node>41 </node> <!-- plus NUMBER -pau2-, no child -->
<node>41 </node> <!-- -pau2-, no child -->
</tree>
</N-Gram>