Data Intensive Computational Linguistics
Ling 684.02
Spring 2007
TTh 9:30-11:18
Location: 291 Journalism
Name: Chris Brew, Associate Professor
Office: Oxley 200
Phone: 292-5420
Office Hours: TWR 4-5 or by appointment
This course has two main aims: familiarity with tools and techniques for handling text corpora, and a secure grasp of the fundamentals of statistical natural language processing. It is designed primarily for those who might wish to become specialists, but also for other linguists who wish to understand what is involved in using corpora.
The best available textbook is Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze (abbreviated M&S in the reading list).
This is a big book, and could be intimidating. I find that I use it mainly as a reference.
Why do statistical linguistics at all? (1 lecture)
Counting and probability (1 lecture) Notes
Programming (1 lecture)
Information theory (1 lecture)
Collocations (2 lectures)
Part-of-speech tagging (2 lectures)
Probabilistic parsing (2 lectures)
PP-attachment (1 lecture)
Word sense disambiguation (1 lecture)
Statistical Machine Translation (2 lectures)
Unix tools (2 laboratories)
Keyword in context (1 laboratory)
Text encoding (1 lecture)
Linguistic annotation (2 laboratories)
Dealing with huge corpora (1 lecture)
Assessment will be primarily by means of weekly
assignments, with a mid-term (on May 17th) and a final
project.
Some assignments (such as problem sets) will have hard
and fast answers. Others (corpus search and writing) will be graded
more subjectively.
There are no incompletes in this course.
An assignment is considered late if it is not submitted at the beginning of the class period on the day it is due. Late assignments submitted within one week of the due date will be slightly penalized. Assignments submitted more than one week after the due date will be more heavily penalized.
All class members are responsible for:
Keeping up with the assignments and reading.
Spending time at the computer working on your skills.
Monitoring your own progress and understanding of the lecture material. If there is something you don't understand, please do ask, preferably in class.
Contributing to class discussion.
Helping to form a "course community". This includes responding appropriately and helpfully to other class members. Discussion of homework and class material is expected and encouraged.
Ohio State is committed to extending access and opportunity to those who are disabled. Any student who feels s/he may need an accommodation based on the impact of a disability should contact me privately to discuss his or her specific needs. You may also contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall.
A direct link to the notes directory is here.
The following course calendar is a tentative schedule of when various topics will be covered. Please understand that this schedule is intended to give an overview of the quarter but will probably not be useful in knowing what material will be covered on a given day.
| Date | Topic / Activity | Date | Topic / Activity |
|---|---|---|---|
| Tue 27 Mar | Syllabus / Intro | Tue 1 May | Statistical parsing 2 |
| Thu 29 Mar | Text Tools | Thu 3 May | Lexicalized parsing |
| Tue 3 Apr | What and how to count | Tue 8 May | Word senses |
| Thu 5 Apr | Collocations 1 | Thu 10 May | Prepositional Phrase Attachment |
| Tue 10 Apr | Collocations 2 | Tue 15 May | Text classification |
| Thu 12 Apr | Information theory | Thu 17 May | Mid-term |
| Tue 17 Apr | Spelling Correction | Tue 22 May | tbd |
| Thu 19 Apr | POS-Tagging | Thu 24 May | tbd |
| Tue 24 Apr | HMMs | Tue 29 May | Statistical MT |
| Thu 26 Apr | Statistical parsing I | Thu 31 May | Wrap up |
Week 1: M&S, Chapter 1; Abney 1996 paper. http://citeseer.ist.psu.edu/abney96statistical.html
Week 2: M&S, Chapter 2; M&S, Chapter 4
Week 3: M&S, Chapter 5
Week 4: M&S, Chapter 6
Week 5: M&S, Chapter 7; M&S, Chapter 8
Week 6: M&S, Chapter 9; M&S, Chapter 10
Week 7: M&S, Chapter 11; M&S, Chapter 12
Week 8: M&S, Chapter 14
Week 9: M&S, Chapter 13
Week 10: M&S, Chapter 15
This list will grow
Write a program that generates 1000 random characters. Various extensions will be discussed in class, including making the number of characters user-adjustable. [To be done collaboratively in groups of 2.] Due 29 Mar 2007.
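As a rough starting point only, the task could be sketched in Python as below. The function name `random_chars`, the default alphabet (lower-case letters plus space), and the `seed` parameter are illustrative assumptions, not part of the assignment; the extensions discussed in class may call for a different design.

```python
import random
import string


def random_chars(n=1000, alphabet=string.ascii_lowercase + " ", seed=None):
    """Return n characters drawn uniformly at random from alphabet.

    The alphabet and the seed argument are assumptions made for this sketch.
    """
    rng = random.Random(seed)  # a private generator, so runs can be repeated
    return "".join(rng.choice(alphabet) for _ in range(n))


# The length is an ordinary parameter, so it is easy to make user-adjustable.
print(random_chars(60, seed=0))
```

Drawing uniformly is the simplest choice; drawing characters according to their frequencies in real text would be one natural extension.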
Write two programs that together constitute a language identification system.
The first program reads one or more text files, assumed to be alike in some way (e.g. from the same language), and trains some kind of model that is able to assign a score to a text file.
The second program takes as input one or more text files and two or more of the models produced by the first program. It then decides which of the models is the best match for each of the text files.
Write up your results. Test the program to ensure that it works according to specification, and give a reasoned argument for how well it is performing. You will need to give thought to how to design a training and testing regime to support this conclusion.
For extra kudos, include in the writeup a commentary on the design constraint which I imposed on you by insisting that each language be trained separately. Why does this matter? What extra opportunities for improved performance would you have if you relaxed this constraint? For extra extra kudos, try this out.
Due in class April 17
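To make the train-then-compare structure concrete, here is a minimal sketch of one possible design, assuming a character-bigram model with add-one smoothing. The function names (`train`, `score`, `identify`), the `vocab_size` of 256, and the toy training strings are all assumptions for illustration. The assignment asks only for "some kind of model", so this is not the required approach, and reading files and saving models to disk are left out.

```python
import math
from collections import Counter


def train(text):
    """Count character bigrams in the training text; the counts are the model."""
    return Counter(zip(text, text[1:]))


def score(model, text, vocab_size=256):
    """Sum of log relative frequencies of the text's bigrams, add-one smoothed.

    vocab_size is an assumed alphabet size; vocab_size ** 2 is then the
    number of possible bigrams used in the smoothing denominator.
    """
    denom = sum(model.values()) + vocab_size ** 2
    return sum(math.log((model[bg] + 1) / denom) for bg in zip(text, text[1:]))


def identify(models, text):
    """Return the name of the model that gives the text the highest score."""
    return max(models, key=lambda name: score(models[name], text))


# Toy training data, invented for this sketch. Each model is trained separately.
models = {
    "en": train("the cat sat on the mat and the dog sat on the log " * 5),
    "de": train("der hund und die katze sitzen auf dem tisch und der stuhl " * 5),
}
print(identify(models, "the dog and the cat"))
print(identify(models, "der hund und die katze"))
```

Because each model sees only its own language's data, nothing in training can exploit the contrast between languages, which is the design constraint the extra-kudos question asks about.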
M&S exercises 2.1 - 2.6 and 2.9 - 2.10. Due in class Thursday April 26.
HMMs: see description. Due Tuesday May 15.
Ungraded but important assignments, one per week. Expect to spend a couple of hours nosing around and getting used to how things work, unless you are already familiar with the tools.
Counting words and bigrams with tr, paste, sort and wc
Regular expressions with grep and egrep
Managing data with awk
Using split for corpus processing
How to organize a corpus using the Unix filesystem
Command-line statistics tools
Python
R. An environment for statistics.
Searching huge files with mmap
Reference: Ken Church, "Unix for Poets".
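For comparison with the Unix pipeline in the first lab, the same word and bigram counts can be sketched in Python, which is also on the list above. The tokenization rule (lower-case, maximal runs of the letters a-z) and the function names are assumptions made for this example; real corpora will need a more careful treatment of punctuation, digits, and non-ASCII letters.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case the text and split it into maximal runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def word_and_bigram_counts(text):
    """Return (word counts, bigram counts) for the text."""
    words = tokens(text)
    # Pairing the word list with itself shifted by one yields adjacent pairs,
    # much as pasting a file next to a copy offset by one line does.
    return Counter(words), Counter(zip(words, words[1:]))


words, bigrams = word_and_bigram_counts("The cat sat on the mat. The cat slept.")
print(words.most_common(3))
print(bigrams.most_common(1))
```

`Counter.most_common` plays the role of sorting the counts in descending order and taking the top of the list.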
Statistical NLP builds on ideas from many fields.
For most linguists the least familiar concepts will be those from
probability theory, information theory, programming and computer
science. I'll give one-lecture introductions to all of these, but
obviously this will only skim the surface. So some private study
will be needed.
Here are some resources to use in your own
time. I'll add to these as I get a better sense of where this class
is.
Programming and computer science:
Non-Programmers Tutorial For Python:
How to think like a computer scientist:
Statistics and probability: various textbooks: DeGroot; Hinton, "Statistics Explained"
Information theory: various textbooks: Cover and Thomas, "Elements of Information Theory"; Jelinek, "Statistical Methods for Speech Recognition"
Unix tools: chapter draft by Moens and Brew