This is the page for Chris Brew's Ling. 684.02 class, WI 2002, with links, assignments, due dates, etc.



This course has three main aims: familiarity with tools and techniques for handling text corpora, knowledge of the characteristics of some of the available corpora, and a secure grasp of the fundamentals of statistical natural language processing. Specific objectives include:

  1. understanding of probability and information theory as they have been applied to computational linguistics.
  2. knowledge of fundamental techniques of probabilistic language modelling.
  3. experience of working with corpora.
  4. knowledge of some applications of statistical NLP.
The formal syllabus is available in PostScript or PDF.
Some course notes, mainly by me, including material on Unix tools, are available at Edinburgh.


Class Time and Location:

Tuesday and Thursday 12:30-2:18pm
228 Cockins Hall

Class mailing list: snlp

to subscribe/unsubscribe, email

Chris's office hours: 2-5:00 Wednesday (or by arrangement)


Foundations of Statistical Natural Language Processing,
by Christopher Manning and Hinrich Schütze


Assessment will be by means of weekly assignments. Each assignment contributes 12.5% of the grade, so 8 high-quality submissions will suffice. Assignments will be set on Tuesdays and will be due the following Tuesday.

There are no extensions and no incompletes in this course. All deadlines are hard-and-fast. But recall that only 8 assignments are required.

If you wish, you may substitute a mini-project for the final three assignments. In that case you must do the following:

  1. Let me know that you are going this route. At this stage you should have some idea of an inspiring topic. If not, it is probably better to do the regular assignments. You always have the option of doing more of the regular assignments instead. (Deadline 2/19)
  2. Submit an initial proposal. This is a proposal for the research that you aim to conduct over the next two weeks. At this stage your job is to convince me that your proposed project is of an appropriate size and difficulty. It has to be hard enough to be worth three assignments, but small enough to be achievable in the time available. I will work with you to design a suitable project. (Deadline 2/26 -- counts as one assignment)
  3. Submit an approved version of the project proposal. This may require no further work if the first version was approved. (Deadline 3/3 -- counts as half an assignment)
  4. Submit a final report on the project, with an appendix including any code that you wrote. (Deadline 3/21 -- counts as one and one half assignments)
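As a quick check on the arithmetic above, here is a minimal Python sketch of how the weights add up. It assumes that each assignment unit is worth 12.5% of the grade and that the mini-project pieces are simply added in at the stated unit values. The names ASSIGNMENT_WEIGHT, MINI_PROJECT_UNITS and grade_share are made up for this illustration and are not part of any course software.

```python
from fractions import Fraction

# Each regular assignment is worth 12.5% of the grade (from the text above).
ASSIGNMENT_WEIGHT = Fraction(125, 1000)

# Mini-project pieces, in units of "one assignment" (from the list above).
MINI_PROJECT_UNITS = {
    "initial proposal": Fraction(1),
    "approved proposal": Fraction(1, 2),
    "final report": Fraction(3, 2),
}


def grade_share(units):
    """Return the fraction of the total grade covered by `units` assignments.

    Assumption: weights simply add, one unit = 12.5% of the grade.
    """
    return float(Fraction(units) * ASSIGNMENT_WEIGHT)


if __name__ == "__main__":
    project_units = sum(MINI_PROJECT_UNITS.values())
    print("Mini-project counts as", project_units, "assignments")
    print("Share of grade from the mini-project:", grade_share(project_units))
    print("Share of grade from 8 assignments:", grade_share(8))
```

So the three mini-project pieces together stand in for exactly three assignments, and five regular assignments plus the mini-project cover the same share of the grade as eight regular assignments.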

Class schedule

  1. Tuesday 1/8 Introduction. Background knowledge survey. To be made up at time tba.
  2. Thursday 1/10 Statistical approaches to NLP Slides (HTML for Explorer only -- sorry)
  3. Tuesday 1/15 Probability review.
  4. Thursday 1/17 Information theory review.
  5. Tuesday 1/22 Language Identification.
  6. Thursday 1/24 Smoothing techniques slides
  7. Tuesday 1/29 More information theory slides
  8. Thursday 1/31 Collocations slides
  9. Tuesday 2/5 HMM decoding and training. slides, Explorer only
  10. Thursday 2/7 Implementing CQP in Python
  11. Tuesday 2/12 Part-of-Speech Tagging
  12. Thursday 2/14 Probabilistic Context-Free Grammars
  13. Tuesday 2/19 Probabilistic Parsing 1. CKY Algorithm.
  14. Thursday 2/21 Probabilistic Parsing 2. Lexical dependency parsers.
  15. Tuesday 2/26 Word sense disambiguation
  16. Thursday 2/28 Clustering Review
  17. Tuesday 3/5 Computational Lexicography
  18. Thursday 3/7 Statistical Machine Translation.
  19. Tuesday 3/12 Information Retrieval
  20. Thursday 3/14 Information Extraction
  21. Thursday 3/21 -- No class, final assignment due.

Homework Assignments:

You may do programming and data preparation assignments any way that you want, but I recommend using NLTK, which I will demonstrate in the first lecture.


The readings are designed to be useful after the corresponding lectures. You can read them before, obviously, but it's not recommended.


We will be using Steven Bird's NLTK, which is by far the quickest way of getting going on realistic natural language processing tasks.

One of the great things about Python is that it comes with very good tools and documentation. There is a choice of nice introductory material at the website. My taste is for Richard Baldwin's non-programmer intro, and Guido van Rossum and Fred Drake's programmer intro.

What I showed Emacs doing today was done with Emacs Python mode. If you need to set up your own machine the same way that our Suns are, or to work on Windows or Mac, the information that you need to do this is a few clicks away. Python is cross-platform and popular, but by no means the only show in town.

Additional software for the course is being collected in /opt/compling. The easiest way to use this is to add the line

to the .subscriptions file in your Unix home directory. NB. If it is the last line, you will need to make sure that there is a newline at the end of the file, or it will silently fail. This adds the necessary paths to your environment variables. If I've done this right, and you do it right, you should, when you log in again, have access to the CMU tools, CQP and several other useful things.
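If you want to guard against the missing-newline problem mentioned above, a small Python sketch like the following can check the file and repair it. This is only an illustration: the function name ensure_trailing_newline is made up here, and the sketch knows nothing about what the .subscriptions file should contain. It only makes sure that the last line ends with a newline.

```python
import os
import tempfile


def ensure_trailing_newline(path):
    """Append a newline to the file at `path` if its last byte is not one.

    Returns True if the file was modified, False if it was already fine
    (an empty file is left alone).
    """
    with open(path, "rb") as f:
        data = f.read()
    if data and not data.endswith(b"\n"):
        with open(path, "ab") as f:
            f.write(b"\n")
        return True
    return False


if __name__ == "__main__":
    # Demonstration on a throwaway file rather than a real home directory.
    fd, demo_path = tempfile.mkstemp()
    os.write(fd, b"a line with no newline at the end")
    os.close(fd)
    print("modified:", ensure_trailing_newline(demo_path))
    print("modified again:", ensure_trailing_newline(demo_path))
    os.remove(demo_path)
```

To apply it to your own file, you could call ensure_trailing_newline(os.path.expanduser("~/.subscriptions")) after editing the file.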

Important Links

Last modified: Fri Mar 3 12:15:57 EST 2000