Data Intensive Linguistics


HMM Lab : Markov Models and Part-of-Speech Tagging

NOTE: You may work in pairs in this assignment!

IMPORTANT: For the people working at home or somewhere else: You need the modified HMM package at /home/compling/HMM which is based on the one by Tapas Kanungo and the files in /home/compling/HMM/data for this exercise (which is based on a lab by Joerg Tiedemann at http://www.let.rug.nl/~tiedeman/ml06/lab4.html) . You may need to compile the HMM source .

You save a lot of trouble if you do this exercise on a linux machine especially on bardolph, which is where I compiled and tested it

Introduction

In this exercise we use Markov Models for automatic part-of-speech tagging and word generation. We will use a modified HMM package (based on this one developed by Tapas Kanungo). We will also use some additional scripts and data provided by Philip Resnik for his exercise in a course on computational linguistics at the University of Maryland. Don't copy the files from there because we will use additional data and modified software!

Hand-in: A report with solutions and answers to the questions and assignments below.

The task of a part-of-speech tagger is to assign wordclass labels to words in running text. Here is a little example of "English-like" text:

the/DT plane/NN can/MD fly/VB ./. 
the/DT typical/JJ plane/NN can/MD see/VB the/DT plane/NN ./. 
a/DT typical/JJ fly/NN can/MD see/VB ./.

The tagset is a small subset of the WSJ tagset including the following basic wordclasses: DT (determiner), JJ (adjective), NN (noun), VB (verb), WP (Wh question word), . (punctuation) and MD (modal).

We will

The HMM package and some additional scripts are installed on bardolph at

/home/compling/HMM

and the data we will use is in

/home/compling/HMM/data

. Make a copy of the data files into your own working directory.

Markov Models for part-of-speech tagging are used as follows: Words in the text are used as the observed data items in the sequence they appear in the text. They are generated from a sequence of states that correspond to the part-of-speech labels assigned to these words. Hence we have probabilistic transitions from a part-of-speech label to another in the model. The states in the model produce discrete output namely the words in the text. We will only consider the simplest model, the first-order Markov model in which each part-of-speech label is dependent on the previous one and conditionally independent of other labels.

Training a Visible Markov Model

First, we will build a Visible Markov Model for part-of-speech tagging using supervised learning from labeled training data. The training data is in /home/compling/HMM/data/example.train.tagged in the format as described above. For the model we have to estimate 3 kinds of probabilities (model parameters):

We do simple maximum likelihood estimation (MLE) with and without smoothing to estimate our parameters (look at the end of this file for some additional information about MLE). Using MLE, probabilities can be estimated from frequency counts like this:

Often it is necessary to "smooth" the probability estimations based on MLE in order to be able to handle unseen events. A simple smoothing technique is the "add-one discounting" estimation. Using this technique, probabilities are estimated as follows:

The HMM package works with numbers only. Therefore we have to convert our data into numeric values. There are keyfiles for converting words into digits (/home/compling/HMM/data/example.words) and for converting labels into state numbers (/home/compling/HMM/data/example.states). For convenience, I already converted the training data into numeric values (~tiedeman/ML06/example.train.seq).

Now the assignments:

  1. Estimate the parameters of a Visible Markov Model from the data in /home/compling/HMM/data/example.train.seq and store them in the format used by the HMM package (look at /home/compling/HMM/README to see the requirements). Start with unsmoothed MLE estimates. The best way to do this is to write a little script in your favorite scripting language to do the counts for you and for generating the HMM file as required. Explain your solution (include code if necessary) and show the resulting HMM in your report.

  2. You can view the HMM in a more readable form using the viewhmm.pl script:

    /home/compling/HMM/viewhmm.pl supervised.hmm /home/compling/HMM/data/example.words 0 0 | less

    where "supervised.hmm" is the file in which you saved your HMM. The last two parameters are thresholds for displaying the transition and the emission probabilities. Using the information you see, draw your model as a weighted finite state automaton (you can do that by hand) and comment on the model and its limitations.

  3. Markov models can be used to predict the next item in a sequence. Hence, they can be used to generate text using the probabilistic transition and emission distributions in the model. Using your transition diagram from above and the output of viewhmm.pl, start at the the most likely start state, and write down the most likely symbol to be emitted there. (Break ties at random.) Then follow the most likely arc to the next state, and write down the most likely symbol to be emitted there. Continue in this fashion until you emit a punctuation mark, or until you get bored. Write down your sequence.

  4. The HMM package includes a program for generating sequences at random according to the model. Run, e.g.

    /home/compling/HMM/genseq -T 15 supervised.hmm 

    to generate a random sequence of 15 words from your model (supervised.hmm). genseq includes a random number generator to alternate the output. You can set the seed for this generator with the -S flag. In order to see the actual words coming out of the model you can pipe the output through a little script in the HMM package directory:

    /home/compling/HMM/genseq -T 15 supervised.hmm | /home/compling/HMM/ints2words.pl /home/compling/HMM/data/example.words

    You can play around with this a little bit if you like. Give at least one sequence generated by the model. Comment on the quality of the output (compared to the input in the training data). Do they look ok and do you obtain sequences not included in the training data?

  5. The Viterbi algorithm is used to find the optimal state sequence for a given observation. This is used to tag a text and you can try it with your model with the following command:

    /home/compling/HMM/testvit supervised.hmm /home/compling/HMM/data/example.test.wordseq

    The input sequence above (example.test.wordseq) is generated from an example text used for testing out models (/home/compling/HMM/data/example.test). You can also make your own input for tagging with the words2seq.pl script:

    echo "a plane can fly ." | /home/compling/HMM/words2seq.pl /home/compling/HMM/data/example.words > my_text

    There is a tagged version of the example text used above in /home/compling/HMM/data/example.test.tagged that you can use as gold standard. The transformation into state numbers is in /home/compling/HMM/data/example.test.tagseq. Tag the example text with your model and compute the accuracy of the tagger for this test data. Look at the tags produced by your model. Can you see a reason for the tagging errors?

  6. Build another Visible Markov Model similar to the one above but with smoothed probability distributions. Use the simple "add-one discounting" approach as described earlier. Describe your solution and include the HMM with its parameters in your report. Tag the test data again with your new model and compute the accuracy. Comment on the results and the effect of smoothing.

Training a Hidden Markov Model

In this exercise we train a Hidden Markov Model for part-of-speech tagging on unlabeled data. For this we use the well-known Baum-Welch algorithm for EM training of HMMs. The training data is in the following file: /home/compling/HMM/data/example.train. The transformation of this file into numeric values is in /home/compling/HMM/data/example.seq using the same transformation table as before (/home/compling/HMM/data/example.words).

  1. Train the HMM with the following command

    /home/compling/HMM/esthmm -M nr_of_symbols -N nr_of_states /home/compling/HMM/data/example.seq > unsupervised.hmm 

    You have to replace nr_of_symbols with the number of distinct output symbols in your data (number of distinct words) and nr_of_states with the number of states you wish to have in your model. We know that we want to represent part-of-speech labels with the states in the model and therefore we need one state for each label!

  2. Look at the model with

    /home/compling/HMM/viewhmm.pl unsupervised.hmm /home/compling/HMM/data/example.words 0 0 | less

    Try to label each state with a distinct part-of-speech label according to the probability distributions in the model (look especially at the emission probabilities for the labeling). How good is the match between your intuitions and the decisions the model made automatically? Tag the test data as in exercise 5 on visible Markov models but now with your HMM trained on unlabeled data. What is the accuracy of your model? Note that states in your current model do not correspond to the states listed in /home/compling/HMM/data/example.states! You have to map the resulting state sequence to a part-of-speech sequence using the labels assigned to each state in the previous assignment. Comment on the tagging results!

  3. The training procedure as implemented in esthmm starts with a random initial model. For this, it uses a random number generator. You can see that the initial model changes each time you run esthmm (add the flag -v when calling esthmm to see the initial model). What influence has the initial model on the training procedure and the final model found? (Run the training several times with the "-v" flag set and look at the number of re-estimation iterations and the probabilities of the observed training data given the models)

  4. You can start with a specific initial model by setting a seed value for the random number generator using the '-S' argument for esthmm. Select a seed and run the training procedure once more. If you repeat you will see that you always start with the same initial model and you always obtain the same final model. Now you can give another argument to control the number of iterations to be run. For example, to run 3 iterations run the following command (with seed 10):

    /home/compling/HMM/esthmm -v -S 10 -i 3 -M nr_of_symbols -N nr_of_states /home/compling/HMM/data/example.seq

    Run the steps from 1 to 10 iterations and record the probability of the data given the model reported by esthmm. You may plot this probability during training if you like. Do you see a general tendency?

Combining supervised and unsupervised learning

In the previous exercise we could see that the final model learned from unlabeled data depends on the initial model the training procedure starts with. Supervised learning is usually more exact than unsupervised learning. However, there is often too little training data available and labeling additional data is expensive. A common strategy is therefore to combine supervised with unsupervised learning. We can easily do that with HMMs using the 2 strategies tested earlier. We start with supervised learning from labeled training data to estimate the initial model and continue with unsupervised training on unlabeled data.

  1. Use the Visible Markov Model with smoothed probability estimations produced in the first assignment as initial model to train an HMM on the unlabeled training data given in /home/compling/HMM/data/example.seq. Use the '-I' argument to specify the file with the initial model to start with. Leave out the -M and -N arguments! These parameters will be taken from the initial model. You can still use "-v" and "-i" if you like. Compare the model to the ones produced earlier (look for example at the probability of the data given the model).

  2. Why is it necessary to start the Baum-Welch training with smoothed probability estimates?

  3. Tag the test data /home/compling/HMM/data/example.test.wordseq again and compute accuracy. Compare the results with the ones from applying the visible Markov models and the hidden Markov model.

  4. Look at the current model with viewhmm.pl in the same way as above and comment on the model parameters. Do you see any striking differences to the earlier models?

Bonus Assignment

Visualize parameter changes during unsupervised training with animated plots. Decide yourself how to represent parameters and their values. The nicest solution would be to have a little script that generates the visual representation from given data and a training procedure.