This course has three main aims: familiarity with tools and techniques for handling text corpora, knowledge of the characteristics of some of the available corpora, and a secure grasp of the fundamentals of statistical natural language processing. Specific objectives include:

- understanding of probability and information theory as they have been applied to computational linguistics.
- knowledge of fundamental techniques of probabilistic language modelling.
- experience of working with corpora.
- knowledge of some applications of statistical NLP.

http://www.ling.ohio-state.edu/~cbrew/2002/684.02/winter/syllabus.{ps|pdf} in PostScript or PDF

Some course notes, mainly by me, including material on Unix tools, are available at Edinburgh.

Tuesday and Thursday 12:30-2:18pm

228 Cockins Hall

Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze

Assessment will be by means of weekly assignments. Each assignment contributes 12.5% of the grade, so 8 high-quality submissions will suffice. Assignments will be set on Tuesdays and will be due the following Tuesday.

There are no extensions and no incompletes in this course. All deadlines are hard-and-fast. But recall that only 8 assignments are required.

If you wish, you may substitute a mini-project for the final three assignments. In that case you must do the following:

- Let me know that you are going this route. At this stage you should have some idea of an inspiring topic; if not, it is probably better to do the regular assignments, which always remain an option. (Deadline: 2/19)
- Submit an initial proposal. This is a proposal for the research that you aim to conduct over the next two weeks. At this stage your job is to convince me that your proposed project is of an appropriate size and difficulty: hard enough to be worth three assignments, but small enough to be achievable in the time available. I will work with you to design a suitable project. (Deadline: 2/26 -- counts as one assignment)
- Submit an approved version of the project proposal. This may involve no work whatsoever if the first version was approved. (Deadline: 3/3 -- counts as half an assignment)
- Submit a final report on the project, with an appendix including any code that you wrote. (Deadline: 3/21 -- counts as one and one half assignments)

- Tuesday 1/8 Introduction. Background knowledge survey (to be made up at a time TBA).
- Thursday 1/10 Statistical approaches to NLP. Slides (HTML, Explorer only -- sorry)
- Tuesday 1/15 Probability review.
- Thursday 1/17 Information theory review.
- Tuesday 1/22 Language Identification.
- Thursday 1/24 Smoothing techniques slides
- Tuesday 1/29 More information theory slides
- Thursday 1/31 Collocations slides
- Tuesday 2/5 HMM decoding and training. slides, Explorer only
- Thursday 2/7 Implementing CQP in Python
- Tuesday 2/12 Part-of-Speech Tagging
- Thursday 2/14 Probabilistic Context-Free Grammars
- Tuesday 2/19 Probabilistic Parsing 1. CKY Algorithm.
- Thursday 2/21 Probabilistic Parsing 2. Lexical dependency parsers.
- Tuesday 2/26 Word sense disambiguation
- Thursday 2/28 Clustering Review
- Tuesday 3/5 Computational Lexicography
- Thursday 3/7 Statistical Machine Translation.
- Tuesday 3/12 Information Retrieval
- Thursday 3/14 Information Extraction
- Thursday 3/21 -- No class, final assignment due.

- Due start of week 2:
- Work through NLTK Tutorial 1.
- M&S exercise 1.4. (Hint: there are two problems here: making a random corpus and creating the appropriate table, similar to that on p. 24 of the book. It is easier to tackle the second part first. Build a table of words and frequencies for a regular corpus, use the result of that to calculate the ranks, then calculate f·r. You will be able to re-use a lot of material from the tutorial. Then replace the string that you made by reading the corpus with the one that the problem calls for.)
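The rank-frequency part of the hint can be sketched in Python along these lines (the toy corpus string here is a stand-in of my own; the exercise calls first for a real corpus, then for a randomly generated one):

```python
from collections import Counter

def rank_frequency_table(text):
    """Build a table of (word, frequency, rank, f*r) rows.

    Rank 1 goes to the most frequent word; Zipf's law predicts that
    the product f*r stays roughly constant across ranks.
    """
    counts = Counter(text.split())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    table = []
    for rank, (word, freq) in enumerate(ordered, start=1):
        table.append((word, freq, rank, freq * rank))
    return table

# Toy stand-in corpus; replace with text read from a real corpus,
# and later with the random string the exercise asks for.
corpus = "the cat sat on the mat the cat sat the"
for word, freq, rank, fr in rank_frequency_table(corpus):
    print(word, freq, rank, fr)
```

The same function works unchanged once the input string is swapped for the random corpus, which is the point of doing the table-building part first.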

- Due 24 Jan 2002, 5:00 PM. Language identification. See the description in PostScript or PDF.
- Due 31 Jan 2002, 5:00 PM. (Short but hard)
This question is about "identical" twins. It isn't always possible to tell by inspection whether twins are monozygotic or dizygotic (well, actually, you could do a gene sequence test, but suppose that you couldn't). But monozygotic twins are always of the same sex, while dizygotic twins can be of different sexes. You can observe the distribution of the sexes in twins:

P(BB) = P(GG) and P(GB) = 1 - P(BB) - P(GG) = 1 - 2 P(GG)

Your task is to find P(Monozygotic) in terms of P(GG). You'll need to make a few reasonable assumptions in order to get an answer. If you're really stuck, ask a friend doing the course. If you're all stuck, I'd be surprised; let me know and I'll give a clue in class. (Borrowed from "Bayesian Statistics" by Peter M. Lee.) There's an even better version of this question, involving Elvis's stillborn twin, but I couldn't find the details.

- Corpus search assignment: Due 8 Feb 2002

see the instructions at `http://www.ling.ohio-state.edu/~cbrew/2002/winter/684.02/cqp.html`

See the following documentation for the tagset used: description of tagset and tagging guidelines.

- Report on part-of-speech tagging: see instructions.
- Assignment 9: short questions (PDF, HTML). The formulae in the HTML come out mangled for me under Netscape, with a big Swedish Å where I wanted a big sigma, but they look OK under Explorer. Due: 20 March 2002

- Week 1: M&S, Chapter 1; Abney 1996 paper
- Week 2: M&S, Chapter 2; M&S, Chapter 4
- Week 3: M&S, Chapter 5
- Week 4: M&S, Chapter 6
- Week 5: M&S, Chapter 7; M&S, Chapter 8
- Week 6: M&S, Chapter 9; M&S, Chapter 10
- Week 7: M&S, Chapter 11; M&S, Chapter 12
- Week 8: M&S, Chapter 14
- Week 9: M&S, Chapter 13
- Week 10: M&S, Chapter 15

We will be using Steven Bird's NLTK, which is by far the quickest way of getting going on realistic natural language processing tasks.

One of the great things about Python is that it comes with very good tools and documentation. There is a choice of nice introductory material at the python.org website. My taste is for Richard Baldwin's non-programmer intro, and Guido van Rossum and Fred Drake's programmer intro.

The Emacs demonstration in class used Emacs Python mode. If you need to set up your own machine the same way as our Suns, or to work on Windows or Mac, the information you need is a few clicks away. Python is cross-platform and popular, but by no means the only show in town.

Additional software for the course is being collected in `/opt/compling`. The easiest way to use this is to add the line

`COMPLING`

to the `.subscriptions` file in your Unix home directory.

- Chris' course outline from a CogSci course in the past (this is _not_ the syllabus for Ling. 684.02, but may give you an idea of what to expect)
- Chris' Data-Intensive Linguistics page
- Shravan showed me a very short and simple primer for Information Theory. It looks pretty good to me.
- Lucent distributes Shannon's original paper on information theory. If you want, you can go to the original source (and it's fairly readable).
- I added a description of the algorithm for calculating edit distance. This is similar to but simpler than the Viterbi algorithm for calculating the best path through an HMM.
- Impending: A description of the CKY algorithm
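For reference, the edit-distance algorithm mentioned above can be sketched as a small dynamic program in Python; the names and the classic test pair below are my own choices, not necessarily those in the linked description:

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and substitutions
    (all at unit cost) needed to turn source into target."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[m][n]

print(edit_distance("intention", "execution"))  # prints 5
```

The table fill is the same min-over-predecessors pattern as the Viterbi algorithm, except that Viterbi maximizes a path probability over HMM states rather than minimizing a cost over edit operations.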

Last modified: Fri Mar 3 12:15:57 EST 2000