Counting N-grams Over FSMs

Introduction

Computational linguists use n-gram language models in many natural language processing tasks. See any NLP or comp ling textbook for examples and explanations of the statistical estimation of such language models. This page and the included software focuses on building a language model, or simply calculating n-gram counts, from a corpus or sentence encoded as a finite state machine (FSM).

Regular N-gram Model Building

A normal text corpus from which we might wish to build a language model might look like this (corpus.txt):

Go to the deli and pick up some cheese.
Go to the deli and get some cheese.
Go to the deli and pick up some turkey.
Go to the deli and get some turkey.
Drive to the deli and pick up some cheese.
Drive to the deli and get some cheese.
Drive to the deli and pick up some turkey.
Drive to the deli and get some turkey.
See if the deli has any ham.

To count the individual n-grams from this corpus, we can use the SRILM toolkit as follows:

$> ngram-count -text corpus.txt -write corpus.counts

The output in corpus.counts is a list of unigrams, bigrams, and trigrams and their counts in the corpus.
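For reference, SRILM's counts format is plain text, with one n-gram and its count per line, separated by a tab. An illustrative excerpt (line order may differ):

<s>	9
Go	4
Go to	4
to the	8
the deli	9
<s> Go to	4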

From this list of counts we can again use SRILM to build a statistical language model:

$> ngram-count -read corpus.counts -lm corpus.lm

That is the basic procedure for building an LM from a regular text corpus. Of course, there are other programs for doing the counting and LM estimation, and there are many parameters that can be set, such as the n-gram order and the smoothing technique.

Moving to FSMs

What if, instead, you wanted to compress the corpus into a more compact form, for instance, a finite state machine? The example corpus lends itself to this data structure quite nicely:

[Figure: the example corpus represented as an FSM]

(Depending on your purposes, you might also connect 'the deli' in the lower part to the same links in the upper part.) Now, how do we count n-grams across this FSM version of the same corpus? The correct way is to count how many times each n-gram occurs in each path through the FSM, and multiply by the weight of that path. The motivation, proof, and an example of this algorithm are given in Allauzen et al. 2003. An expansion of the proof on the 3rd page is given in the Appendix of my dissertation. Below I share C++ code that implements the algorithm described in Allauzen et al. 2003.
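Before getting to that code, here is a naive C++ sketch (using OpenFst) that illustrates the definition by enumerating every path of an acyclic FSA explicitly. This is exponential in general and is only meant to make the idea concrete; the counting-transducer algorithm of Allauzen et al. 2003, implemented in ngramcount.cpp below, computes the same quantities efficiently. The toy acceptor in main() is hypothetical.

// naivecount.cpp -- a naive illustration of n-gram counting over an acyclic FSA.
#include <fst/fstlib.h>
#include <cmath>
#include <iostream>
#include <map>
#include <vector>

typedef std::map<std::vector<int>, double> CountMap;

// Walk every path from state s, carrying the labels seen so far and the
// product of arc probabilities.  At each final state, add the path weight
// to the count of every n-gram of the requested order on that path.
// Assumes the FSA is acyclic.
void CountPaths(const fst::StdVectorFst &fsa, int s, std::vector<int> *labels,
                double prob, int order, CountMap *counts) {
  if (fsa.Final(s) != fst::StdArc::Weight::Zero()) {
    // Tropical weights are negative log probabilities, so exp(-w)
    // converts a weight back to a probability (weight 0 => prob 1).
    double path_weight = prob * exp(-fsa.Final(s).Value());
    for (size_t i = 0; i + order <= labels->size(); ++i)
      (*counts)[std::vector<int>(labels->begin() + i,
                                 labels->begin() + i + order)] += path_weight;
  }
  for (fst::ArcIterator<fst::StdVectorFst> aiter(fsa, s); !aiter.Done();
       aiter.Next()) {
    const fst::StdArc &arc = aiter.Value();
    labels->push_back(arc.ilabel);
    CountPaths(fsa, arc.nextstate, labels, prob * exp(-arc.weight.Value()),
               order, counts);
    labels->pop_back();
  }
}

int main() {
  // Toy two-path acceptor; integer labels stand in for words
  // (say 1 = "Go", 2 = "Drive", 3 = "to").
  fst::StdVectorFst fsa;
  for (int i = 0; i < 3; ++i) fsa.AddState();
  fsa.SetStart(0);
  fsa.SetFinal(2, fst::StdArc::Weight::One());
  fsa.AddArc(0, fst::StdArc(1, 1, fst::StdArc::Weight::One(), 1));
  fsa.AddArc(0, fst::StdArc(2, 2, fst::StdArc::Weight::One(), 1));
  fsa.AddArc(1, fst::StdArc(3, 3, fst::StdArc::Weight::One(), 2));

  CountMap counts;
  std::vector<int> labels;
  CountPaths(fsa, fsa.Start(), &labels, 1.0, 1, &counts);
  // Label 3 ("to") occurs once in each of the two paths, so its count is 2.
  for (CountMap::const_iterator it = counts.begin(); it != counts.end(); ++it)
    std::cout << it->first[0] << "\t" << it->second << std::endl;
  return 0;
}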

Code for Counting N-grams over FSMs

The code uses the OpenFst library as its basis; therefore, OpenFst must be installed on your system in order for this software to work.

The crucial bits are ngramcount.cpp and ngramcount.h. An example of how this class can be used is given in doNgramCount.cpp. The code can be compiled as follows, assuming you have installed OpenFst in your home directory:
$> g++ -I ~/openfst-1.1/include -c ngramcount.cpp
$> g++ -I ~/openfst-1.1/include -o doNgramCount doNgramCount.cpp ngramcount.o ~/openfst-1.1/lib/libfst.so -ldl

The program doNgramCount takes in a single FSM and gives out the number of times each n-gram occurs in that FSM. The default n-gram order is 1 (unigrams). We can use the FSM above as an example. First we compile the FSM:

$> fstcompile --keep_isymbols --keep_osymbols --isymbols=corpus.alpha --osymbols=corpus.alpha --acceptor corpus.fst.txt corpus.fst
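For reference, corpus.fst.txt is in OpenFst's text format: one arc per line as "source destination label" (an optional weight would be a fourth column), with each final state on a line of its own. The state numbering below is illustrative:

0 1 Go
0 1 Drive
1 2 to
2 3 the
3 4 deli
...
9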

It is important to make sure that the text symbols are included in the FSM rather than only the integer labels (by using --keep_isymbols and --keep_osymbols), so that the output is correct. Now you can run the program:
$> ./doNgramCount corpus.fst

which shows the following:
</s> 9
<s> 9
Drive 4
Go 4
See 1
and 8
any 1
cheese. 4
deli 9
get 4
ham. 1
has 1
if 1
pick 4
some 8
the 9
to 8
turkey. 4
up 4

Each path in the FSM above has a weight of 1. Therefore, each unigram accumulates a count of one for each path in which it appears. The begin- and end-of-sentence symbols and the word "the" appear in all 9 paths (all of which appear in corpus.txt). You can similarly check the counts of the other unigrams by counting their appearances in corpus.txt.

To count bigrams:
$> ./doNgramCount corpus.fst 2

And similarly for trigrams, 4-grams, or as high an order as you please. These counts, if created using order 3, are the same as those in corpus.counts created using SRILM. If you direct the output to a file, that counts file can be used to create a language model, just as before (see the example below). The only change is in how the n-gram counts are calculated, not in how the language model is estimated.
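For example (corpus.fsm.counts and corpus.fsm.lm are illustrative filenames):

$> ./doNgramCount corpus.fst 3 > corpus.fsm.counts
$> ngram-count -read corpus.fsm.counts -lm corpus.fsm.lm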

Incorporating Probabilistic Information

One way to incorporate probabilistic information into the FSM on which we are counting n-grams is to compose that FSM with an appropriate language model. In this case our language model will come from corpus.txt, which we made into a language model above using SRILM. Now we need to turn that language model into an FSM itself, a process also described in Allauzen et al. 2003. Here is corpus.lm as an FSM in text format, corpus.lm.fst.txt, which should be compiled as follows:

$> fstcompile --keep_isymbols --keep_osymbols --isymbols=corpus.alpha --osymbols=corpus.alpha --acceptor --arc_type=log corpus.lm.fst.txt corpus.lm.fst

Again it is important to compile with the symbol tables included, and with the arc type set to log (not the default tropical). The log semiring is needed because counting sums probabilities over all paths through the machine, whereas the tropical semiring would keep only the single best path.

By passing the language model FST as the third argument to doNgramCount, it will be composed with the sentence FST before counting is done. Now we can count n-grams over the corpus FSM, taking into account the n-gram probabilities as determined by the same corpus's language model:

$> ./doNgramCount corpus.fst 1 corpus.lm.fst

The weight of each path is no longer one, so the counts of the unigrams are no longer integers. The counts of 'Drive' and 'Go', which are more probable in the LM, are larger than the count of 'See'.
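For the curious, the composition that doNgramCount performs internally corresponds roughly to the following OpenFst sketch. This is a minimal illustration, assuming both machines were compiled in the log semiring with shared symbol tables (the filenames are the ones from the examples above):

// composelm.cpp -- sketch of composing a sentence FSA with an LM FSA.
#include <fst/fstlib.h>
#include <iostream>

int main() {
  typedef fst::VectorFst<fst::LogArc> LogFst;
  LogFst *sentence = LogFst::Read("corpus.fst");
  LogFst *lm = LogFst::Read("corpus.lm.fst");
  if (sentence == NULL || lm == NULL) {
    std::cerr << "could not read input FSTs" << std::endl;
    return 1;
  }
  // After composition, each path's weight combines the sentence weight
  // with the LM probability of that path (in the log semiring, weights
  // are -log probabilities and multiplication is addition).
  LogFst composed;
  fst::Compose(*sentence, *lm, &composed);
  composed.Write("composed.fst");
  delete sentence;
  delete lm;
  return 0;
}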

Sentences with Unknown Words

You can build other sentences over which to count n-grams, using any language model to influence the counts. For instance, s2.fst.txt has an unknown word in it ('store', which exists in the sentence but not in corpus.lm). If the language model is created using the -unk flag in SRILM (to create a language model corpus.unk.lm), then the unknown word in the sentence will compose with <unk> in the language model.
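Such an open-vocabulary model can be built directly from the corpus, for example:

$> ngram-count -text corpus.txt -unk -lm corpus.unk.lm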

$> fstcompile --keep_isymbols --keep_osymbols --isymbols=s2.alpha --osymbols=s2.alpha --acceptor s2.fst.txt s2.fst

$> fstcompile --keep_isymbols --keep_osymbols --isymbols=corpus.unk.alpha --osymbols=corpus.unk.alpha --acceptor --arc_type=log corpus.unk.lm.fst.txt corpus.unk.lm.fst

$> ./doNgramCount s2.fst

$> ./doNgramCount s2.fst 1 corpus.unk.lm.fst

The output includes "store+unk", which means that "store" was unknown to the language model and was given the <unk> probability. The +unk can be removed with sed if it is not needed.
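For example, piping the output through sed strips the marker:

$> ./doNgramCount s2.fst 1 corpus.unk.lm.fst | sed 's/+unk//g'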

Using doNgramCount

Above are specific examples of how to use the program doNgramCount to count n-grams over a single sentence at a time on the command line. In general, doNgramCount takes an input FST, an optional n-gram order (default 1), and an optional language model FST. The output of doNgramCount is an n-gram counts file (written to STDOUT) that can be used by SRILM's ngram-count, with the -read flag, to build a language model.
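In summary (sentence.fst, sentence.counts, and sentence.lm are illustrative filenames; brackets mark optional arguments):

$> ./doNgramCount sentence.fst [order] [lm.fst] > sentence.counts
$> ngram-count -read sentence.counts -lm sentence.lm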

Using ngramcount.cpp

The program doNgramCount is meant as an example use of ngramcount.cpp. You can incorporate the latter into your own program as you like. Places where problems are likely to occur are with unknown symbols, symbol tables, and begin/end of sentence symbols (make sure they are present both in the LM and on your sentence FSMs).

Download the Code

For counting n-grams over FSMs: ngramcount.cpp
ngramcount.h
doNgramCount.cpp

For converting an arpa-style LM into an FSM:
ngram2fsm.tar.gz
See my ngram2fsm explanation here.