Data Intensive Computational Linguistics
Ling 684.02
Spring 2007
TTh 9:30-11:18
Location: 291 Journalism



Instructor Information

Name: Chris Brew, Associate Professor
E-Mail: cbrew at ling.ohio-state.edu
Office: Oxley 200
Phone: 292-5420
Web Site: http://www.purl.org/NET/cbrew.html
Office Hours: TWR 4-5 or by appointment



Catalog Description

This course has two main aims: familiarity with tools and techniques for handling text corpora, and a secure grasp of the fundamentals of statistical natural language processing. It is designed primarily for those who might wish to become specialists, but also for other linguists who wish to understand what is involved in using corpora.

Textbooks

The best available textbook is Foundations of Statistical Natural Language Processing, by Christopher Manning and Hinrich Schütze.

This is a big book, and could be intimidating. I find that I use it mainly as a reference.

Topics

Orientation

Key algorithms

Corpora and tools

Assessment

Assessment will be primarily by means of weekly assignments, with a mid-term (on May 17th) and a final project.
Some assignments (such as problem sets) will have hard and fast answers. Others (corpus search and writing) will be graded more subjectively.
There are no incompletes in this course.

Late Penalties

An assignment is considered late if it is not submitted at the beginning of the class period on the day it is due. Late assignments submitted within one week of the due date will be slightly penalized. Assignments submitted after the one-week deadline will be penalized more heavily.

Your responsibilities

All class members are responsible for

Students with Disabilities

Ohio State is committed to extending access and opportunity to those who are disabled. Any student who feels he or she may need an accommodation based on the impact of a disability should contact me privately to discuss his or her specific needs. You may also contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall.

Notes

A direct link to the notes directory is here.

Class schedule

The following course calendar is a tentative schedule of when various topics will be covered. Please understand that this schedule is intended to give an overview of the quarter but will probably not be useful in knowing what material will be covered on a given day.

 

 

 

 

 

 

 

Date         Topic / Activity           Date         Topic / Activity

Tue 27 Mar   Syllabus / Intro           Tue 1 May    Statistical parsing 2
Thu 29 Mar   Text Tools                 Thu 3 May    Lexicalized parsing

Tue 3 Apr    What and how to count      Tue 8 May    Word senses
Thu 5 Apr    Collocations 1             Thu 10 May   Prepositional Phrase Attachment

Tue 10 Apr   Collocations 2             Tue 15 May   Text classification
Thu 12 Apr   Information theory         Thu 17 May   Mid-term

Tue 17 Apr   Spelling Correction        Tue 22 May   tbd
Thu 19 Apr   POS-Tagging                Thu 24 May   tbd

Tue 24 Apr   HMMs                       Tue 29 May   Statistical MT
Thu 26 Apr   Statistical parsing I      Thu 31 May   Wrap up



Readings (provisional)

Assignments

Graded assignments

This list will grow

  1. Write a program that generates 1000 random characters. Various extensions will be discussed in class, including making the number of characters user-adjustable. To be done collaboratively in groups of two. Due 29 Mar 2007.

  2. Write two programs that together constitute a language identification system.

  3. The second program takes as input one or more text files and two or more of the models produced by the first program. It then decides which of the models is the best match for each of the text files.

  4. Write up your results. Test the program to ensure that it works according to specification, and give a reasoned argument for how well it is performing. You will need to give thought to how to design a training and testing regime to support this conclusion.

  5. For extra kudos, include in the writeup a commentary on the design constraint which I imposed on you by insisting that each language be trained separately. Why does this matter? What extra opportunities for improved performance would you have if you relaxed this constraint? For extra extra kudos, try this out.

    Due in class April 17

  6. M&S exercises 2.1 - 2.6 and 2.9 - 2.10. Due in class Thursday, April 26.

  7. HMMs: see description. (Due Tuesday, May 15)
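As a rough illustration of the shape of the language identification assignment, here is a minimal Python sketch. It rests on assumptions that the assignment text above does not state: that the first program builds one character-bigram count model per language from that language's text alone, and that the second program picks the model giving the highest smoothed log-probability. The function names, the add-alpha smoothing, and the vocab_size value are all hypothetical choices made for this example, and other designs are equally valid.

```python
import math
from collections import Counter


def train_model(text, n=2):
    """Count character n-grams in text from a single language (assumed model form)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def log_score(text, model, n=2, alpha=1.0, vocab_size=10000):
    """Log-probability of text under a model, with add-alpha smoothing.

    vocab_size is an assumed bound on the number of distinct n-grams,
    not a value measured from data.
    """
    denom = sum(model.values()) + alpha * vocab_size
    score = 0.0
    for i in range(len(text) - n + 1):
        # A Counter returns 0 for unseen n-grams, so smoothing keeps the log finite.
        score += math.log((model[text[i:i + n]] + alpha) / denom)
    return score


def identify(text, models):
    """Return the name of the model that scores the text highest."""
    return max(models, key=lambda name: log_score(text, models[name]))


# Toy training data, far too small for a real evaluation.
models = {
    "english": train_model("the cat sat on the mat and the dog ate the bone"),
    "german": train_model("der hund und die katze sind in dem haus mit der maus"),
}
print(identify("the dog and the cat", models))
```

Because each model is built from its own language only, a new language can be added without retraining the others. The write-up would still need a proper training and testing regime with held-out text.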

Unix tools club

Ungraded but important assignments. One per week. Expect to spend a couple of hours nosing around and getting used to how things work, unless you are already familiar with the tools.

  1. Editors

  2. Counting words and bigrams with tr, paste, sort and wc

  3. Regular expressions with grep and egrep

  4. Managing data with awk

  5. Using split for corpus processing

  6. How to organize a corpus using the Unix filesystem

  7. Command-line statistics tools

  8. Python

  9. R. An environment for statistics.

  10. Searching huge files with mmap

Reference: Ken Church, "Unix for Poets".
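For comparison with the shell-based word and bigram counting exercise in the list above, the same idea can be sketched in Python. This is only an illustration. The tokenizer (lowercased runs of the letters a-z) and the function names are assumptions made for this example, not part of the course materials. Pairing the word list with a copy of itself shifted by one position is analogous to lining up two offset word columns side by side.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of the letters a-z (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_words_and_bigrams(text):
    """Return (word counts, bigram counts) for the given text."""
    words = tokenize(text)
    word_counts = Counter(words)
    # Pair each word with its successor: n words give n - 1 bigrams.
    bigram_counts = Counter(zip(words, words[1:]))
    return word_counts, bigram_counts


word_counts, bigram_counts = count_words_and_bigrams("The cat sat. The cat ran.")
print(word_counts.most_common(2))
print(bigram_counts.most_common(1))
```

Note that this crude tokenizer ignores sentence boundaries, so a bigram such as ("sat", "the") is counted across the full stop. Whether that is desirable is part of deciding what and how to count.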

Resources for private study

    Statistical NLP builds on ideas from many fields. For most linguists the least familiar concepts will be those from probability theory, information theory, programming and computer science. I'll give one-lecture introductions to all of these, but obviously this will only skim the surface. So some private study will be needed.

    Here are some resources to use in your own time. I'll add to these as I get a better sense of where this class is.

    1. Programming and computer science:

    2. Statistics and probability: various textbooks: DeGroot; Hinton, "Statistics Explained"

    3. Information theory: various textbooks: Cover and Thomas, "Elements of Information Theory"; Jelinek, "Statistical Methods for Speech Recognition"

    4. Unix tools: chapter draft by Moens and Brew