NOTE: You may work in pairs in this assignment!
IMPORTANT:
For the people working at home or somewhere else: You need the
modified HMM package at /home/compling/HMM which is based on the one
by Tapas Kanungo
and the files in /home/compling/HMM/data for this exercise (which is
based on a lab by Joerg
Tiedemann at http://www.let.rug.nl/~tiedeman/ml06/lab4.html)
. You may need to compile the HMM source .
You save a lot of trouble if you do this exercise on a linux machine especially on bardolph, which is where I compiled and tested it
In this exercise we use Markov Models for automatic part-of-speech tagging and word generation. We will use a modified HMM package (based on this one developed by Tapas Kanungo). We will also use some additional scripts and data provided by Philip Resnik for his exercise in a course on computational linguistics at the University of Maryland. Don't copy the files from there because we will use additional data and modified software!
Hand-in: A report with solutions and answers to the questions and assignments below.
The task of a part-of-speech tagger is to assign wordclass labels to words in running text. Here is a little example of "English-like" text:
the/DT plane/NN can/MD fly/VB ./. the/DT typical/JJ plane/NN can/MD see/VB the/DT plane/NN ./. a/DT typical/JJ fly/NN can/MD see/VB ./.
The tagset is a small subset of the WSJ tagset including the following basic wordclasses: DT (determiner), JJ (adjective), NN (noun), VB (verb), WP (Wh question word), . (punctuation) and MD (modal).
We will
use tagged training data for supervised learning of a "visible Markov model"
use unlabeled training data to train a Hidden Markov Model
combine both techniques
apply learned models for tagging of unseen data and for word generation
The HMM package and some additional scripts are installed on bardolph at
/home/compling/HMM
and the data we will use is in
/home/compling/HMM/data
. Make a copy of the data files into your own working directory.
Markov Models for part-of-speech tagging are used as follows: Words in the text are used as the observed data items in the sequence they appear in the text. They are generated from a sequence of states that correspond to the part-of-speech labels assigned to these words. Hence we have probabilistic transitions from a part-of-speech label to another in the model. The states in the model produce discrete output namely the words in the text. We will only consider the simplest model, the first-order Markov model in which each part-of-speech label is dependent on the previous one and conditionally independent of other labels.
First, we will build a Visible Markov Model for part-of-speech tagging using supervised learning from labeled training data. The training data is in /home/compling/HMM/data/example.train.tagged in the format as described above. For the model we have to estimate 3 kinds of probabilities (model parameters):
initial probabilities pi_i, the probability to start in a certain state
transition probabilities a_ij, the probability to go from state i to state j (wordclass i to wordclass j)
emission probabilities b_ik, the probability to generate word k in state i
We do simple maximum likelihood estimation (MLE) with and without smoothing to estimate our parameters (look at the end of this file for some additional information about MLE). Using MLE, probabilities can be estimated from frequency counts like this:
P(x)=count(x)/N,
where count(x) is the frequency of item x in the data collection D
and N is the total number of items in D
P(x|y)=count(x,y)/count(y), where count(x,y) is
the frequency of items x and y occurring together in D
Often it is necessary to "smooth" the probability estimations based on MLE in order to be able to handle unseen events. A simple smoothing technique is the "add-one discounting" estimation. Using this technique, probabilities are estimated as follows:
P(x)=(count(x)+1)/(N+k),
where k is the number of distinct elements x in our data
P(x|y)=(count(x,y)+1)/(count(y)+m), where m is
the number of distinct elements x in the data (not 'y' as it
was written earlier ...)
The HMM package works with numbers only. Therefore we have to convert our data into numeric values. There are keyfiles for converting words into digits (/home/compling/HMM/data/example.words) and for converting labels into state numbers (/home/compling/HMM/data/example.states). For convenience, I already converted the training data into numeric values (~tiedeman/ML06/example.train.seq).
Now the assignments:
Estimate the parameters of a Visible Markov Model from the data in /home/compling/HMM/data/example.train.seq and store them in the format used by the HMM package (look at /home/compling/HMM/README to see the requirements). Start with unsmoothed MLE estimates. The best way to do this is to write a little script in your favorite scripting language to do the counts for you and for generating the HMM file as required. Explain your solution (include code if necessary) and show the resulting HMM in your report.
You can view the HMM in a more readable form using the viewhmm.pl script:
/home/compling/HMM/viewhmm.pl supervised.hmm /home/compling/HMM/data/example.words 0 0 | less
where "supervised.hmm" is the file in which you saved your HMM. The last two parameters are thresholds for displaying the transition and the emission probabilities. Using the information you see, draw your model as a weighted finite state automaton (you can do that by hand) and comment on the model and its limitations.
Markov models can be used to predict the next item in a sequence. Hence, they can be used to generate text using the probabilistic transition and emission distributions in the model. Using your transition diagram from above and the output of viewhmm.pl, start at the the most likely start state, and write down the most likely symbol to be emitted there. (Break ties at random.) Then follow the most likely arc to the next state, and write down the most likely symbol to be emitted there. Continue in this fashion until you emit a punctuation mark, or until you get bored. Write down your sequence.
The HMM package includes a program for generating sequences at random according to the model. Run, e.g.
/home/compling/HMM/genseq -T 15 supervised.hmm
to generate a random sequence of 15 words from your model (supervised.hmm). genseq includes a random number generator to alternate the output. You can set the seed for this generator with the -S flag. In order to see the actual words coming out of the model you can pipe the output through a little script in the HMM package directory:
/home/compling/HMM/genseq -T 15 supervised.hmm | /home/compling/HMM/ints2words.pl /home/compling/HMM/data/example.words
You can play around with this a little bit if you like. Give at least one sequence generated by the model. Comment on the quality of the output (compared to the input in the training data). Do they look ok and do you obtain sequences not included in the training data?
The Viterbi algorithm is used to find the optimal state sequence for a given observation. This is used to tag a text and you can try it with your model with the following command:
/home/compling/HMM/testvit supervised.hmm /home/compling/HMM/data/example.test.wordseq
The input sequence above (example.test.wordseq) is
generated from an example text used for testing out models
(/home/compling/HMM/data/example.test). You can also
make your own input for tagging with the words2seq.pl script:
echo "a plane can fly ." | /home/compling/HMM/words2seq.pl /home/compling/HMM/data/example.words > my_text
There is a tagged version of the example text used above in
/home/compling/HMM/data/example.test.tagged that you
can use as gold standard. The transformation into state numbers is
in /home/compling/HMM/data/example.test.tagseq. Tag the
example text with your model and compute the accuracy of the tagger
for this test data. Look at the tags produced by your model. Can you
see a reason for the tagging errors?
Build another Visible Markov Model similar to the one above but with smoothed probability distributions. Use the simple "add-one discounting" approach as described earlier. Describe your solution and include the HMM with its parameters in your report. Tag the test data again with your new model and compute the accuracy. Comment on the results and the effect of smoothing.
In this exercise we train a Hidden Markov Model for part-of-speech
tagging on unlabeled data. For this we use the well-known Baum-Welch
algorithm for EM training of HMMs. The training data is in the
following file: /home/compling/HMM/data/example.train.
The transformation of this file into numeric values is in
/home/compling/HMM/data/example.seq using the same
transformation table as before
(/home/compling/HMM/data/example.words).
Train the HMM with the following command
/home/compling/HMM/esthmm -M nr_of_symbols -N nr_of_states /home/compling/HMM/data/example.seq > unsupervised.hmm
You have to replace nr_of_symbols with the number of
distinct output symbols in your data (number of distinct words) and
nr_of_states with the number of states you wish to have
in your model. We know that we want to represent part-of-speech
labels with the states in the model and therefore we need one state
for each label!
Look at the model with
/home/compling/HMM/viewhmm.pl unsupervised.hmm /home/compling/HMM/data/example.words 0 0 | less
Try to label each state with a distinct part-of-speech label
according to the probability distributions in the model (look
especially at the emission probabilities for the labeling). How good
is the match between your intuitions and the decisions the model
made automatically? Tag the test data as in exercise 5 on visible
Markov models but now with your HMM trained on unlabeled data. What
is the accuracy of your model? Note that states in your current
model do not correspond to the states listed in
/home/compling/HMM/data/example.states! You have to map
the resulting state sequence to a part-of-speech sequence using the
labels assigned to each state in the previous assignment. Comment on
the tagging results!
The training procedure as implemented in esthmm starts with a random initial model. For this, it uses a random number generator. You can see that the initial model changes each time you run esthmm (add the flag -v when calling esthmm to see the initial model). What influence has the initial model on the training procedure and the final model found? (Run the training several times with the "-v" flag set and look at the number of re-estimation iterations and the probabilities of the observed training data given the models)
You can start with a specific initial model by setting a seed value for the random number generator using the '-S' argument for esthmm. Select a seed and run the training procedure once more. If you repeat you will see that you always start with the same initial model and you always obtain the same final model. Now you can give another argument to control the number of iterations to be run. For example, to run 3 iterations run the following command (with seed 10):
/home/compling/HMM/esthmm -v -S 10 -i 3 -M nr_of_symbols -N nr_of_states /home/compling/HMM/data/example.seq
Run the steps from 1 to 10 iterations and record the probability of the data given the model reported by esthmm. You may plot this probability during training if you like. Do you see a general tendency?
In the previous exercise we could see that the final model learned from unlabeled data depends on the initial model the training procedure starts with. Supervised learning is usually more exact than unsupervised learning. However, there is often too little training data available and labeling additional data is expensive. A common strategy is therefore to combine supervised with unsupervised learning. We can easily do that with HMMs using the 2 strategies tested earlier. We start with supervised learning from labeled training data to estimate the initial model and continue with unsupervised training on unlabeled data.
Use the Visible Markov Model with
smoothed probability estimations produced in the first assignment as
initial model to train an HMM on the unlabeled training data given
in /home/compling/HMM/data/example.seq. Use the '-I'
argument to specify the file with the initial model to start with.
Leave out the -M and -N arguments! These parameters will be taken
from the initial model. You can still use "-v" and "-i"
if you like. Compare the model to the ones produced earlier (look
for example at the probability of the data given the model).
Why is it necessary to start the Baum-Welch training with smoothed probability estimates?
Tag the test data
/home/compling/HMM/data/example.test.wordseq again and
compute accuracy. Compare the results with the ones from applying
the visible Markov models and the hidden Markov model.
Look at the current model with viewhmm.pl in the same way as above and comment on the model parameters. Do you see any striking differences to the earlier models?
Visualize parameter changes during unsupervised training with animated plots. Decide yourself how to represent parameters and their values. The nicest solution would be to have a little script that generates the visual representation from given data and a training procedure.