Go to the deli and pick up some cheese. Go to the deli and get some cheese. Go to the deli and pick up some turkey. Go to the deli and get some turkey. Drive to the deli and pick up some cheese. Drive to the deli and get some cheese. Drive to the deli and pick up some turkey. Drive to the deli and get some turkey. See if the deli has any ham.
To count the individual n-grams from this corpus, we can use the SRILM toolkit as follows:
$> ngram -text corpus.txt -write corpus.counts
The output in corpus.counts is a list of unigrams, bigrams, and trigrams and their counts in the corpus.
From this list of counts we can again use SRILM to build a statistical language model:
$> ngram-count -read corpus.counts -lm corpus.lm
That is the basic algorithm for building an LM from a regular text corpus. Of course there other programs for doing the counting and LM estimation, and there are many parameters that can be set for the order of n-grams and smoothing techniques, etc.
(Depending on your purposes, you might also connect 'the deli' in the lower part to the same links in the upper part.) Now, how do we count n-grams across this FSM version of the same corpus? The correct way is to count how many times each n-gram occurs in each path through the FSM, and multiply by the weight of that path. The motivation, proof, and example of this algorithm is given in Allauzen et al. 2003. An expansion of the proof on the 3rd page is given in the Appendix of my dissertation. Below I share C++ code that implements the algorithm described in Allauzen et al. 2003.
The crucial bits are ngramcount.cpp and
ngramcount.h. An example of how this class can be used is given
in doNgramCount.cpp. They can be compiled this
way, assuming you have installed openfst in your home directory:
$> g++ -I ~/openfst-1.1/include -c ngramcount.cpp
$> g++ -I ~/openfst-1.1/include -o doNgramCount doNgramCount.cpp
ngramcount.o ~/openfst-1.1/lib/libfst.so -ldl
The program doNgramCount takes in a single FSM and gives out the number of times each n-gram occurs in that FSM. The default is n-gram order is 1, unigrams. We can use the FSM above as an example. First we compile the fsm.
$> fstcompile --keep_isymbols --keep_osymbols
--isymbols=corpus.alpha
--osymbols=corpus.alpha
--acceptor corpus.fst.txt corpus.fst
It is important to make sure that the text symbols are included in the
FSM rather than only the integer labels (by using --keep_isymbols and
--keep_osymbols), so that the output is correct. Now you can run the
program:
$> ./doNgramCount corpus.fst
which shows the following:
</s> 9
<s> 9
Drive 4
Go 4
See 1
and 8
any 1
cheese. 4
deli 9
get 4
ham. 1
has 1
if 1
pick 4
some 8
the 9
to 8
turkey. 4
up 4
Each path in the fsm above has a weight of 1. Therefore, each unigram accumulates a count of one for each path in which it appears. The begin and end of sentence symbols and the word "the" appear in all 9 paths (all of which appear in corpus.txt). You can similarly check the counts of the other unigrams by counting their appearances in corpus.txt.
To count bigrams:
$> ./doNgramCount corpus.fst 2
And similarly for trigrams, 4-grams, or as high an order as you please. These counts, if created using order 3, are the same as those in corpus.counts created using SRILM. If you direct the output to a file, that counts file can be used to create a language model, just as before. The only change is in how the n-gram counts are calculated , not in how the language model is estimated.
$> fstcompile --keep_isymbols --keep_osymbols --isymbols=corpus.alpha
--osymbols=corpus.alpha --acceptor --arc_type=log corpus.lm.fst.txt
corpus.lm.fst
Again it is important to compile with the symbol tables included, and that the arc type is log (not the default tropical).
By putting the language model as FST as the third input to doNgramCount, it will be composed with the sentence FST before counting is done. Now we can count n-grams over the corpus FSM, taking into account the n-gram probabilities as determined by the same corpus's language model:
$> ./doNgramCount corpus.fst 1 corpus.lm.fst
The weight of each path is no longer one, so the counts of the unigrams are no longer integers. The counts of 'Drive' and 'Go', which are more probable in the LM, are larger than the count of 'See'.
$> fstcompile --keep_isymbols --keep_osymbols --isymbols=s2.alpha
--osymbols=s2.alpha --acceptor
s2.fst.txt s2.fst
$> fstcompile --keep_isymbols --keep_osymbols
--isymbols=corpus.unk.alpha
--osymbols=corpus.unk.alpha --acceptor
--arc_type=log corpus.unk.lm.fst.txt
corpus.unk.lm.fst
$> ./doNgramCount s2.fst
$> ./doNgramCount s2.fst 1 corpus.unk.lm.fst
The output includes "store+unk", which means that "store" was unknown to the language model and was given the <unk> probability. The +unk can be removed with sed if it is not needed.
$> ./doNgramCount sentence.fst order
$> ./doNgramCount sentence.fst order lm.fst
For converting an arpa-style LM into an FSM:
ngram2fsm.tar.gz
See my ngram2fsm explanation here.