Data Intensive Computational Linguistics
Ling 684.02
Spring 2007
TTh 9:30-11:18
Location: 291 Journalism
Name: Chris Brew, Associate Professor
Office: Oxley 200
Phone: 292-5420
Office Hours: TWR 4-5 by appointment
This course has two main aims: familiarity with tools and techniques for handling text corpora, and a secure grasp of the fundamentals of statistical natural language processing. It is designed primarily for those who might wish to become specialists, but also for other linguists who wish to understand what is involved in using corpora.
The best available textbook is Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze.
This is a big book, and could be intimidating. I find that I use it mainly as a reference.
Why do statistical linguistics at all? (1 lecture)
Counting and probability (1 lecture) Notes
Programming (1 lecture)
Information theory (1 lecture)
Collocations (2 lectures)
Part-of-speech tagging (2 lectures)
Probabilistic parsing (2 lectures)
PP-attachment (1 lecture)
Word sense disambiguation (1 lecture)
Statistical Machine Translation (2 lectures)
Unix tools (2 laboratories)
Keyword in context (1 laboratory)
Text encoding (1 lecture)
Linguistic annotation (2 laboratories)
Dealing with huge corpora (1 lecture)
Assessment will be by means of weekly assignments. Each assignment contributes 12.5% of the grade, so 8 high-quality submissions will suffice.
Some assignments (such as problem sets) will have hard and fast answers. Others (corpus search and writing) will be graded more subjectively.
There are no incompletes in this course. But recall that only 8 assignments are required.
An assignment is considered late if it is not submitted at the beginning of the class period on the day it is due. Late assignments submitted within one week of the due date will be slightly penalized. Assignments submitted more than one week after the due date will be more heavily penalized.
All class members are responsible for:
Keeping up with the assignments and reading.
Spending time at the computer working on your skills.
Monitoring your own progress and understanding of the lecture material. If there is something you don't understand, please do ask, preferably in class.
Contributing to class discussion.
Helping to form a ``course community''. This includes responding appropriately and helpfully to other class members. Discussion of homework and class material is expected and encouraged.
Ohio State is committed to extending access and opportunity to those who are disabled. Any student who feels s/he may need an accommodation based on the impact of a disability should contact me privately to discuss his or her specific needs. You may also contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall.
A direct link to the notes directory is here.
The following course calendar is a tentative schedule of when various topics will be covered. Please understand that this schedule is intended to give an overview of the quarter but will probably not be useful in knowing what material will be covered on a given day.
| Date | Topic / Activity | Date | Topic / Activity |
|---|---|---|---|
| Tue 28 Mar | Syllabus / Intro | Tue 2 May | Statistical parsing 2 |
| Thu 30 Mar | Text Tools | Thu 4 May | Lexicalized parsing |
| Tue 4 Apr | What and how to count | Tue 9 May | Word senses |
| Thu 6 Apr | Collocations 1 | Thu 11 May | Prepositional Phrase Attachment |
| Tue 11 Apr | Collocations 2 | Tue 16 May | Text classification |
| Thu 13 Apr | Information theory | Thu 18 May | Text encoding |
| Tue 18 Apr | Spelling Correction | Tue 23 May | guest lecture |
| Thu 20 Apr | POS-Tagging | Thu 25 May | guest lecture |
| Tue 25 Apr | HMMs | Tue 30 May | Statistical MT 2 |
| Thu 27 Apr | Statistical parsing I | Thu 1 June | Wrap up |
Week 1: M&S Chapter 1; Abney 1996 paper, http://citeseer.ist.psu.edu/abney96statistical.html
Week 2: M&S Chapter 2; M&S Chapter 4
Week 3: M&S Chapter 5
Week 4: M&S Chapter 6
Week 5: M&S Chapter 7; M&S Chapter 8
Week 6: M&S Chapter 9; M&S Chapter 10
Week 7: M&S Chapter 11; M&S Chapter 12
Week 8: M&S Chapter 14
Week 9: M&S Chapter 13
Week 10: M&S Chapter 15
This preview is provided for anyone who wants to get a jump on the assignments. Instructions and links are probably mostly wrong, but will become right over the next day or two.
Random words from wordlist (Due April 4): assignments/week1
Due April 18: M&S exercises 2.1 through 2.5
Due Thursday April 20: Corpus search assignment. See the instructions at http://ling.ohio-state.edu/~cbrew/2000/795M/cqp.html. For the tagset used, see the following documentation: description of tagset and tagging guidelines.
Due April 25:
This question is about ``identical'' twins. It isn't always possible to tell by inspection whether twins are monozygotic or dizygotic (well, actually you could do a gene sequence test, but suppose that you couldn't). But monozygotic twins are always of the same sex, while dizygotic twins can be of different sexes. You can observe the distribution of the sexes in twins:
P(BB) = P(GG) and
P(GB) = 1 - P(BB) - P(GG) = 1 - 2 P(GG)
Your task is to find P(Monozygotic) in terms of P(GG). You'll need to make a few reasonable assumptions about the relationships between different probabilities. If you're really stuck, ask a friend doing the course. If you're all stuck, I'd be surprised. (Borrowed from ``Bayesian Statistics'' by Peter M. Lee.) There's an even better version of this question, involving Elvis's stillborn twin, but I couldn't find the details.
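If you want to test a candidate answer numerically, a small simulation can help. The sketch below is only an illustration under assumptions that are not part of the question: each child is equally likely to be a girl or a boy, monozygotic twins always share a sex, and the sexes of dizygotic twins are independent. The function name `simulate_twins` and its parameters are made up for this example.

```python
import random

def simulate_twins(p_mono, n_pairs=200_000, seed=0):
    """Estimate P(GG), P(BB) and P(GB) for a chosen P(Monozygotic).

    Assumptions (illustrative, not given in the question):
    - each child is a girl or a boy with probability 0.5;
    - monozygotic twins are always of the same sex;
    - the sexes of dizygotic twins are independent of each other.
    """
    rng = random.Random(seed)
    counts = {"GG": 0, "BB": 0, "GB": 0}
    for _ in range(n_pairs):
        first = rng.choice("GB")
        if rng.random() < p_mono:
            second = first               # monozygotic: same sex
        else:
            second = rng.choice("GB")    # dizygotic: independent sex
        # Mixed pairs are recorded as "GB" regardless of birth order.
        key = first + second if first == second else "GB"
        counts[key] += 1
    return {k: v / n_pairs for k, v in counts.items()}

probs = simulate_twins(p_mono=0.3)
print(probs)
```

Pick a value for `p_mono`, read off the simulated P(GG), and see whether your formula recovers the value you started from. If your own assumptions differ from the ones above, change the simulation to match them.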
Due May 2 if you can! (Links updated May 1.)
M&S exercise 6.6. The CMU software you'll need is available in ~/cbrew/bin. You'll also need the companion website for chapter 6, especially the recipes, and, for completeness, the documentation for the CMU toolkit.
An optional assignment is M&S 6.9. If you want to look at 6.9 but don't wish to write a program, work in a group with someone who does.
Due Tuesday May 9th:
M&S Exercise 11.1 (demanding, I suggest team working for this) and 11.2 (easy, do it on your own).
For 11.1: you'll need to find a way to process the treebank data in /home/corpora/EN/penn_treebank_3. The structure is quite simple.
The trees are plain(ish) text, like this:

    ( (S
        (PP-LOC (IN In)
          (NP (DT a) (JJ cross-border) (NN transaction) ))
        (, ,)
        (NP-SBJ (DT the) (NN buyer) )
        (VP (VBZ is)
          (PP-LOC-PRD (IN in)
            (NP
              (NP (DT a) (JJ different) (NN region) )
              (PP (IN of)
                (NP (DT the) (NN globe) ))
              (PP (IN from)
                (NP (DT the) (NN target) )))))
        (. .) ))

In the worst case you can just look at them, but whatever you do, you'll need to invent a way of getting subtrees into a form which allows them to be counted.
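One possible way of getting subtrees into a countable form is sketched below in Python. It is a starting point under assumptions, not the required solution: it reads bracketed trees from a string, turns every labelled node into a one-line string of the form `PARENT -> CHILD CHILD ...`, and counts those strings. The function names (`tokenize`, `parse`, `read_trees`, `local_trees`) are invented for this example, and the sketch counts only depth-one subtrees, including the tag-to-word ones.

```python
from collections import Counter

def tokenize(text):
    """Split bracketed text into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens, pos=0):
    """Parse one bracketed node starting at tokens[pos] == '('.

    Returns (tree, next_pos). A tree is (label, children); each child
    is either another tree or a word string. The outer wrapper of a
    treebank tree has no label, so its label is the empty string.
    """
    assert tokens[pos] == "("
    pos += 1
    label = ""
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1

def read_trees(text):
    """Yield every top-level tree found in the text."""
    tokens = tokenize(text)
    pos = 0
    while pos < len(tokens):
        tree, pos = parse(tokens, pos)
        yield tree

def local_trees(tree):
    """Yield 'PARENT -> CHILD ...' for every labelled node."""
    label, children = tree
    kids = [c[0] if isinstance(c, tuple) else c for c in children]
    if label:  # skip the unlabelled outer wrapper
        yield label + " -> " + " ".join(kids)
    for child in children:
        if isinstance(child, tuple):
            yield from local_trees(child)

example = """
( (S
    (PP-LOC (IN In)
      (NP (DT a) (JJ cross-border) (NN transaction) ))
    (, ,)
    (NP-SBJ (DT the) (NN buyer) )
    (VP (VBZ is)
      (PP-LOC-PRD (IN in)
        (NP
          (NP (DT a) (JJ different) (NN region) )
          (PP (IN of)
            (NP (DT the) (NN globe) ))
          (PP (IN from)
            (NP (DT the) (NN target) )))))
    (. .) ))
"""

counts = Counter(rule for tree in read_trees(example)
                 for rule in local_trees(tree))
for rule, n in counts.most_common(5):
    print(n, rule)
```

Strings are used as the countable form because they can go straight into a `Counter` or be written to a file and piped through `sort | uniq -c`. To run this over the real treebank you would read each file's contents into `read_trees`; larger subtrees would need a different string encoding.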
The following is the final assignment for the statistical NLP course. It is the most important piece of work you will do for this course.
You are called as a neuropsychological expert witness in a court case. An unsigned typewritten confession has been found at the scene of the crime. You have examined the confession and concluded that it was written by someone suffering from a minor linguistic impairment called Herzog's aphasia (the name is imaginary: you shouldn't look it up!). About 1% of the population suffer from this syndrome. Medical records reveal that the defendant has the disease.
The prosecution argue that ``There is a 1% chance that the defendant would have the disease if he was innocent, so there is a 99% chance that he committed the crime''. On the other hand the defence argue that ``There are 100,000 people in this town, so about 1,000 of them have the disease, so there is only a 1 in 1000 chance that the defendant is guilty.''
Both prosecution and defence are (perhaps deliberately) making mistakes about conditional probabilities. Briefly explain to the judge why neither defence nor prosecution should be trusted in this matter.
(25% of total mark)
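When reasoning about questions like this one, it can help to experiment with Bayes' rule directly. The sketch below is a generic calculator, not a model answer: the function name `posterior_guilt` is invented, the three prior values are arbitrary illustrations, and the likelihood of 1.0 assumes that the writer of the confession certainly has the syndrome. Only the 1% prevalence comes from the question.

```python
def posterior_guilt(prior, p_evidence_if_guilty, p_evidence_if_innocent):
    """Bayes' rule for P(guilty | evidence).

    prior                  -- P(guilty) before seeing the evidence
    p_evidence_if_guilty   -- P(evidence | guilty)
    p_evidence_if_innocent -- P(evidence | innocent)
    """
    numerator = prior * p_evidence_if_guilty
    denominator = numerator + (1.0 - prior) * p_evidence_if_innocent
    return numerator / denominator

# Illustrative priors only: how strongly guilt was suspected beforehand.
for prior in (1 / 100_000, 0.01, 0.5):
    # 1.0: assumed chance the defendant has the syndrome if guilty.
    # 0.01: population prevalence stated in the question.
    print(prior, posterior_guilt(prior, 1.0, 0.01))
```

Varying the prior shows how much the conclusion depends on what is assumed before the medical evidence is considered.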
Using the knowledge of statistical NLP which you have acquired in this course, prepare a one page research proposal for a project addressing a linguistic problem of your choice. The length limit is to be taken seriously, but you will not be penalised (or rewarded) for violations. Your proposal should be as specific as you can make it: precise enough that it is clear what you propose to do and why. The connection to material covered in the course should also be made clear.
If you want to, and you have space, you can include a less precisely specified idea as further work, but you should concentrate primarily on one well grounded proposal.
(75% of total mark)
Statistical NLP builds on ideas from many fields. For most linguists the least familiar concepts will be those from probability theory, information theory, programming and computer science. I'll give one-lecture introductions to all of these, but obviously this will only skim the surface. So some private study will be needed.
Here are some resources to use in your own time. I'll add to these as I get a better sense of where this class is.
Programming and computer science:
Getting started with Emacs (will be distributed)
Non-Programmers Tutorial For Python:
How to think like a computer scientist:
Statistics and probability: various textbooks: DeGroot, Hinton ``Statistics explained''
Information theory: various textbooks: Cover and Thomas ``Elements of Information Theory'', Jelinek: Statistical Methods for Speech Recognition
Unix tools: Ken Church ``Unix for Poets'', chapter draft by Moens and Brew (will be distributed)