This is the page for Chris Brew's Ling. 684.02 class,
Winter 2002, with a few links, assignments, due dates, etc.
This course has three main aims: familiarity with tools and
techniques for handling text corpora, knowledge of the characteristics
of some of the available corpora, and a secure grasp of the fundamentals
of statistical natural language processing.
Specific objectives include:
- understanding of probability and information theory as they have
been applied to computational linguistics.
- knowledge of fundamental techniques of probabilistic language
modelling.
- experience of working with corpora.
- knowledge of some applications of statistical NLP.
The formal syllabus is available in Postscript or PDF at
http://www.ling.ohio-state.edu/~cbrew/2002/684.02/winter/syllabus.{ps|pdf}
Some course notes, mainly by me, including material on Unix tools, are
available at
Edinburgh.
Class Time and Location:
Tuesday and Thursday 12:30-2:18pm
228 Cockins Hall
Class mailing list: snlp
to subscribe/unsubscribe, email cbrew@ling.ohio-state.edu
Chris's office hours: 2-5:00 Wednesday (or by arrangement)
Text:
Foundations of Statistical Natural Language Processing,
by Christopher Manning and Hinrich Schütze
Assessment will be by means of weekly assignments. Each
assignment contributes 12.5% of the grade, so 8 high-quality
submissions will suffice. Assignments will be set on Tuesdays
and will be due the following Tuesday.
There are no extensions and no incompletes in this course. All
deadlines are hard-and-fast. But recall that only 8 assignments
are required.
If you wish, you may substitute a mini-project for the
final three assignments. In that case you must do the following:
- Let me know that you are going this route. At this stage you
should have some idea of an inspiring topic.
If not, it is probably a better option to do the regular
assignments. You always have the option of doing more of
the regular assignments instead.
(Deadline 2/19)
- Submit an initial proposal. This is a proposal
for the research that you aim to conduct over the next two weeks.
At this stage your job is to convince me that your proposed
project is of an appropriate size and difficulty. It has to
be hard enough to be worth three assignments, but small enough to
be achievable in the time available. I will work with you to design
a suitable project. (Deadline 2/26 -- counts as one assignment).
- Submit an approved version of the project proposal (Deadline:
3/3 -- counts as half an assignment). This may require no work
whatsoever if the first version was approved.
- Submit a final report on the project, with an appendix
including any code that you wrote (Deadline 3/21 -- counts as
one and one half assignments).
Class schedule
- Tuesday 1/8 Introduction.
Background knowledge survey. To be made up at time tba.
- Thursday 1/10 Statistical approaches to NLP
Slides (HTML for Explorer only -- sorry)
- Tuesday 1/15 Probability review.
- Thursday 1/17 Information theory review.
- Tuesday 1/22 Language Identification.
- Thursday 1/24 Smoothing techniques
slides
- Tuesday 1/29 More information theory
slides
- Thursday 1/31 Collocations
slides
- Tuesday 2/5 HMM decoding and training.
slides, Explorer only
- Thursday 2/7 Implementing CQP in Python
- Tuesday 2/12 Part-of-Speech Tagging
- Thursday 2/14 Probabilistic Context-Free Grammars
- Tuesday 2/19 Probabilistic Parsing 1. CKY Algorithm.
- Thursday 2/21 Probabilistic Parsing 2. Lexical dependency parsers.
- Tuesday 2/26 Word sense disambiguation
- Thursday 2/28 Clustering
Review
- Tuesday 3/5 Computational Lexicography
- Thursday 3/7 Statistical Machine Translation.
- Tuesday 3/12 Information Retrieval
- Thursday 3/14 Information Extraction
- Thursday 3/21 -- No class, final assignment due.
Homework Assignments:
You may do programming and data preparation assignments any way that
you want, but I recommend using NLTK, which I will show in the first
lecture.
- Due start of week 2:
- Work through NLTK Tutorial 1.
- M&S exercise 1.4. (Hint: There are two problems here: making
a random corpus and creating the appropriate table, similar to that
on p24 of the book. It's easier if you
tackle the second part first. Build a table of words and
frequencies for a regular corpus, then use the result of that to
calculate the ranks, then calculate f·r (frequency times rank).
You'll be able to re-use a lot of stuff from the tutorial. Then
replace the string that you made by reading the corpus with the one
that the problem calls for.)
- Due 24 Jan 2002. 5:00 PM
Language identification. See description
in Postscript
or PDF.
- Due 31 Jan 2002. 5:00 PM (Short but hard)
This question is about "identical" twins. It isn't always
possible to tell by inspection whether twins are monozygotic or
dizygotic (Well actually, you could do a gene sequence test, but
suppose that you couldn't). But monozygotic twins are always of the
same sex, while dizygotic twins can be of different sexes. You
can observe the distribution of the sexes in twins:
P(BB) = P(GG) and P(GB) = 1 - P(BB) - P(GG) = 1 - 2 P(GG)
Your task is to find P(Monozygotic)
in terms of P(GG). You'll need to make a few reasonable
assumptions in order to get an answer.
If you're really stuck, ask a friend doing the course. If you're all
stuck, I'd be surprised. Let me know that you are stuck and I'll give a
clue in class.
(Borrowed from "Bayesian Statistics" by Peter M. Lee.)
There's an even better version of this question, involving Elvis's
stillborn twin, but I couldn't find the details.
- Corpus search assignment: Due 8 Feb 2002
See the instructions at http://www.ling.ohio-state.edu/~cbrew/2002/winter/684.02/cqp.html.
For the tagset used, see the description of the tagset and the
tagging guidelines.
- Report on part-of-speech tagging: see instructions.
- Assignment 9: short questions, in PDF or
HTML. The formulae in the HTML
come out mangled for me using Netscape, with a big Swedish
a where I wanted a big sigma, but look OK under Explorer.
Due: 20th March 2002
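For the hint to M&S exercise 1.4 above, the table-building step might look something like the following sketch. It uses only the Python standard library rather than NLTK, and the function name zipf_table, the crude regular-expression tokeniser and the sample sentence are all assumptions made for illustration; they are not taken from the tutorial or the book. Tied frequencies simply get consecutive ranks here, which may or may not be what you want.

```python
import re
from collections import Counter


def zipf_table(text, top=None):
    """Return rows of (word, frequency, rank, frequency * rank).

    Rows are ordered from most to least frequent. `top` limits the
    number of rows; None means keep every word type.
    """
    # Crude tokenisation (an assumption): lower-case the text and
    # keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    # Rank 1 goes to the most frequent word; ties get consecutive ranks.
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((word, freq, rank, freq * rank))
    return rows


if __name__ == "__main__":
    # Hypothetical toy corpus; substitute the text of a real corpus.
    sample = "the cat sat on the mat and the dog sat on the log"
    print(f"{'word':<8}{'f':>4}{'r':>4}{'f*r':>6}")
    for word, freq, rank, fr in zipf_table(sample):
        print(f"{word:<8}{freq:>4}{rank:>4}{fr:>6}")
```

Once this works on a regular corpus, only the string passed to zipf_table needs to change for the second half of the exercise.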
The readings are designed to be useful after the
corresponding lectures. You can read them before, obviously,
but it's not recommended.
- Week 1: M&S Chapter 1; Abney 1996 paper
- Week 2: M&S Chapter 2; M&S Chapter 4
- Week 3: M&S Chapter 5
- Week 4: M&S Chapter 6
- Week 5: M&S Chapter 7; M&S Chapter 8
- Week 6: M&S Chapter 9; M&S Chapter 10
- Week 7: M&S Chapter 11; M&S Chapter 12
- Week 8: M&S Chapter 14
- Week 9: M&S Chapter 13
- Week 10: M&S Chapter 15
We will be using Steven Bird's NLTK, which is by far the
quickest way of getting going on
realistic natural language processing tasks.
One of the great things about Python is that it comes with very good
tools and documentation. There is a choice of nice introductory
material at the
python.org website. My taste is for Richard Baldwin's
non-programmer intro, and Guido van Rossum and Fred Drake's
programmer intro.
What I showed Emacs doing today was done with Emacs Python mode.
If you need to set up your own machine the same way that our Suns
are, or to work on Windows or Mac, the information that you need
to do this is a few clicks away. Python is cross-platform and
popular, but by no means the only show in town.
Additional software for the course is being collected in /opt/compling.
The easiest way to use this is to add the line
COMPLING
to the .subscriptions file in your Unix home directory. NB. If it is the
last line, you will need to make sure that there is a newline at the end of the
file, or it will silently fail. This adds the necessary paths to your
environment variables. If I've done this right, and you do it right, you should,
when you log in again, have access to the CMU tools, CQP and several other
useful things.
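If you would rather not edit the file by hand, a small script along the following lines could add the line and take care of the trailing newline. This is only a sketch: it assumes that .subscriptions is a plain text file with one name per line, and the helper name add_subscription is made up for this example.

```python
from pathlib import Path


def add_subscription(path, name="COMPLING"):
    """Append `name` as its own newline-terminated line to `path`.

    Returns True if the file was changed, False if `name` was
    already present as a line.
    """
    path = Path(path)
    text = path.read_text() if path.exists() else ""
    if name in text.splitlines():
        return False  # already subscribed; leave the file alone
    if text and not text.endswith("\n"):
        text += "\n"  # repair a missing final newline before appending
    # Always finish with a newline so the last line is not silently ignored.
    path.write_text(text + name + "\n")
    return True


# To apply it to your own account (assumed location of the file):
# add_subscription(Path.home() / ".subscriptions")
```

Running it a second time changes nothing, so it is safe to repeat if you are unsure whether the line is already there.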
Last modified: Fri Mar 3 12:15:57 EST 2000