Data Intensive Computational Linguistics
Ling 684.02
Spring 2007
TTh 9:30-11:18
Location: 291 Journalism



Instructor Information

Name: Chris Brew, Associate Professor
E-Mail: cbrew at ling.ohio-state.edu
Office: Oxley 200
Phone: 292-5420
Web Site: http://www.purl.org/NET/cbrew.html
Office Hours: TWR 4-5 by appointment



Catalog Description

This course has two main aims: familiarity with tools and techniques for handling text corpora, and a secure grasp of the fundamentals of statistical natural language processing. It is designed primarily for those who might wish to become specialists, but also for other linguists who wish to understand what is involved in using corpora.

Textbooks

The best available textbook is Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze.

This is a big book, and could be intimidating. I find that I use it mainly as a reference.

Topics

Orientation

Key algorithms

Corpora and tools

Assessment

Assessment will be by means of weekly assignments. Each assignment contributes 12.5% of the grade, so 8 high-quality submissions will suffice.
Some assignments (such as problem sets) will have hard and fast answers. Others (corpus search and writing) will be graded more subjectively.
There are no incompletes in this course. But recall that only 8 assignments are required.

Late Penalties

An assignment is considered late if it is not submitted at the beginning of the class period on the day it is due. Late assignments submitted within one week of the due date will be slightly penalized. Assignments submitted more than one week after the due date will be more heavily penalized.

Your responsibilities

All class members are responsible for

Students with Disabilities

Ohio State is committed to extending access and opportunity to those who are disabled. Any student who feels s/he may need an accommodation based on the impact of a disability should contact me privately to discuss his or her specific needs. You may also contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall.

Notes

A direct link to the notes directory is here.

Class schedule

The following course calendar is a tentative schedule of when various topics will be covered. It is intended to give an overview of the quarter, and will probably not be a reliable guide to what material will be covered on a given day.

Date        Topic / Activity         Date        Topic / Activity
Tue 28 Mar  Syllabus / Intro         Tue 2 May   Statistical parsing 2
Thu 30 Mar  Text Tools               Thu 4 May   Lexicalized parsing
Tue 4 Apr   What and how to count    Tue 9 May   Word senses
Thu 6 Apr   Collocations 1           Thu 11 May  Prepositional Phrase Attachment
Tue 11 Apr  Collocations 2           Tue 16 May  Text classification
Thu 13 Apr  Information theory       Thu 18 May  Text encoding
Tue 18 Apr  Spelling Correction      Tue 23 May  guest lecture
Thu 20 Apr  POS-Tagging              Thu 25 May  guest lecture
Tue 25 Apr  HMMs                     Tue 30 May  Statistical MT 2
Thu 27 Apr  Statistical parsing I    Thu 1 June  Wrap up



Readings (provisional)

Assignments

Preview provided for anyone who wants to get a jump on the assignments. Instructions and links are probably mostly wrong, but will become right over the next day or two.

  1. Random words from wordlist (Due April 4): assignments/week1

  2. Due April 18: M&S exercises 2.1 through 2.5

  3. Due Thursday April 20: Corpus search assignment. See the instructions at http://ling.ohio-state.edu/~cbrew/2000/795M/cqp.html. For the tagset used, see the following documentation: description of tagset and tagging guidelines.

  4. Due April 25:

One short but hard question

    This question is about "identical" twins. It isn't always possible to tell by inspection whether twins are monozygotic or dizygotic (well, actually you could do a gene sequence test, but suppose that you couldn't). But monozygotic twins are always of the same sex, while dizygotic twins can be of different sexes. You can observe the distribution of the sexes in twins:
    P(BB) = P(GG) and P(GB) = 1 - P(BB) - P(GG) = 1 - 2 P(GG)
    Your task is to find P(Monozygotic) in terms of P(GG). You'll need to make a few reasonable assumptions about the relationships between different probabilities. If you're really stuck, ask a friend doing the course. If you're all stuck, I'd be surprised. (Borrowed from "Bayesian Statistics" by Peter M. Lee.) There's an even better version of this question, involving Elvis's stillborn twin, but I couldn't find the details.
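    One way to check a candidate answer is by simulation. The sketch below is not a solution: it only generates twin pairs for a hypothetical value of P(Monozygotic) and reports the observed sex distribution. The function name simulate_twins, the parameter p_mono, and the assumption that dizygotic twins' sexes are independent with boys and girls equally likely are illustrative choices, not part of the assignment.

```python
import random


def simulate_twins(p_mono, n_pairs=200_000, seed=0):
    """Estimate P(GG), P(BB), P(GB) for a hypothetical P(Monozygotic).

    Assumptions (not given in the assignment): each sex is equally
    likely, and the sexes of dizygotic twins are independent.
    """
    rng = random.Random(seed)
    counts = {"GG": 0, "BB": 0, "GB": 0}
    for _ in range(n_pairs):
        first = rng.choice("GB")
        if rng.random() < p_mono:
            second = first             # monozygotic: always the same sex
        else:
            second = rng.choice("GB")  # dizygotic: independent draw
        if first == second:
            counts[first * 2] += 1     # "GG" or "BB"
        else:
            counts["GB"] += 1          # mixed pair, order ignored
    return {pair: count / n_pairs for pair, count in counts.items()}


# p_mono=0.3 is an arbitrary test value, not an empirical figure.
probs = simulate_twins(p_mono=0.3)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 3))
```

    Plugging the simulated P(GG) into your own formula should recover, approximately, the p_mono value you started with.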

  5. Due May 2 if you can! Links updated May 1.
    M&S exercise 6.6 -- the CMU software you'll need is available in ~/cbrew/bin.
    You'll need the companion website for chapter 6, especially the recipes.
    And, for completeness, the documentation for the CMU toolkit. An optional assignment is M&S 6.9. If you want to look at 6.9 but don't wish to write a program, work in a group with someone who does.

  6. Due Tuesday May 9th
    M&S Exercise 11.1 (demanding; I suggest working in a team for this) and 11.2 (easy; do it on your own).

  7. The following is the final assignment for the statistical NLP course. It is the most important piece of work you will do for this course.


    You are called as a neuropsychological expert witness in a court case. An unsigned typewritten confession has been found at the scene of the crime. You have examined the confession and concluded that it was written by someone suffering from a minor linguistic impairment called Herzog's aphasia. (The name is imaginary: you shouldn't look it up!) About 1% of the population suffer from this syndrome. Medical records reveal that the defendant has the disease.

Resources for private study

    Statistical NLP builds on ideas from many fields. For most linguists the least familiar concepts will be those from probability theory, information theory, programming and computer science. I'll give one-lecture introductions to all of these, but obviously this will only skim the surface. So some private study will be needed.

    Here are some resources to use in your own time. I'll add to these as I get a better sense of where this class is