Using Corpora to Study Verbs
Ling 795V
Winter 2003
TTh 10:30-12:18
Journalism Building 295

Instructor Information

Name : Chris Brew, Assistant Professor
E-Mail : cbrew@ling.ohio-state.edu
Office : Oxley 200
Phone : 292-5420
Web Site : http://www.ling.ohio-state.edu/~cbrew
Office Hours : W 1-5 or by appointment

Catalog Description

Linguists and lexicographers agree that verbs are at the core of linguistics and language technology. Corpus-based techniques have been hugely successful in facilitating new technology, but have not yet produced as much new linguistic insight as might perhaps be expected. Because verbs are central to so many theories, and because large-scale lexical resources are difficult and expensive to create by hand, many recent studies have explored the idea of using corpora to study and classify verbs. Aside from the intrinsic interest, this seems to me an area in which linguists have a big competitive advantage over engineers, simply because an awareness of the theory really pays off. The goal of the seminar is to read, discuss and understand the most important of these studies, and to develop our ability to do cutting-edge research in this area.

Main topics covered


Verbs, frames and classes: introduction
Corpus-based studies of verbs: why and how?
Shallow text processing: corpus processing for verbs and frames
Subcategorization: measuring the affinity between verbs and arguments
Clustering: unsupervised verb clustering and classification
Applications: FrameNet and PropBank

Details

Corpus-based methods have obvious potential for showing us facts about the language that did not immediately occur to us, and so can complement introspection. For example, the Duden dictionary of German informs us that the verb "dämmern" ("to dawn" or "to grow dusky") can be grouped with other weather verbs like "regnen" ("to rain"). One reason for accepting this grouping is that both verbs can take an expletive subject, as in "Es regnet" ("it is raining") or "Es dämmert" (either "day dawns" or "night falls"). But corpus evidence shows that "dämmern" also frequently occurs in contexts like the English "It dawned on him that ...", and therefore has an affinity to verbs of cognition like "denken" ("to think"). Duden does not note this fact. This is an accidental rather than a principled omission, and might be remedied if the lexicographer had the proper computational support. Similarly, "laufen" ("to run") is listed as a verb of motion, but there are also examples like "Die Ausstellung läuft bis November" ("The exhibition runs until November"). No motion is involved here, but rather a kind of existence.
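To make the "dämmern" example concrete, here is a minimal sketch of the kind of evidence a corpus study might tabulate: the relative frequency of each frame a verb occurs with. The observations, frame labels and the function name `frame_distributions` are all invented for illustration; a real study would obtain the (verb, frame) pairs from a parsed corpus rather than from a hand-written list.

```python
from collections import Counter, defaultdict

# Hypothetical (verb, frame) observations, standing in for the output of a
# shallow parser. Frame labels and counts are invented for illustration only.
observations = [
    ("regnen", "expl-subj"),
    ("regnen", "expl-subj"),
    ("dämmern", "expl-subj"),
    ("dämmern", "dat-obj+dass-clause"),
    ("dämmern", "dat-obj+dass-clause"),
    ("denken", "subj+dass-clause"),
    ("denken", "subj+pp-an"),
]


def frame_distributions(pairs):
    """Return, for each verb, the relative frequency of each observed frame."""
    counts = defaultdict(Counter)
    for verb, frame in pairs:
        counts[verb][frame] += 1
    return {
        verb: {frame: n / sum(c.values()) for frame, n in c.items()}
        for verb, c in counts.items()
    }


dist = frame_distributions(observations)
for verb, frames in sorted(dist.items()):
    print(verb, frames)
```

In this toy data "dämmern" shares one frame with the weather verb "regnen" and takes a clausal complement like "denken", which is the sort of pattern a lexicographer could be shown automatically.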

Projects like Berkeley's FrameNet and Penn's PropBank are using large-scale corpus processing to build up comprehensive descriptions of the way in which words are used in context. Just as the Penn Treebank facilitated the creation and evaluation of robust statistical parsers (such as the systems built by Collins, Charniak and others), so the hope is that the new databases will move statistical NLP closer to the goal of efficiently assigning predicate-argument structure to naturally occurring sentences.

As background, we'll be reading a representative sample of the computational literature on verb-frames (a reading list is given below). Then we'll be using it as a basis for our own explorations.

Prerequisites

Ling 684.02 or permission of instructor
Some exposure to programming.

Grading

Your grade in the course will be assessed as follows:
paper   45%
presentation   20%
reviewing   15%
programs   10%
participation   10%
       
A   85 -- 100
A-   75 -- 84
B+   65 -- 74
B   60 -- 64
B-   40 -- 59
C   30 -- 39
F   0 -- 29
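
As a worked illustration of how the two tables combine, the sketch below computes a weighted total and looks up the letter grade. The component scores are made up, and treating the lower end of each band as a cutoff for fractional totals is an assumption, not a stated policy.

```python
# Component weights from the grading table above (fractions of the final grade).
WEIGHTS = {
    "paper": 0.45,
    "presentation": 0.20,
    "reviewing": 0.15,
    "programs": 0.10,
    "participation": 0.10,
}

# Lower bound of each letter-grade band, highest band first.
BANDS = [(85, "A"), (75, "A-"), (65, "B+"), (60, "B"), (40, "B-"), (30, "C"), (0, "F")]


def weighted_score(scores):
    """Combine per-component scores (each on a 0-100 scale) into a 0-100 total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(total):
    """Map a 0-100 total to a letter, using each band's lower bound as a cutoff (assumption)."""
    for lower, letter in BANDS:
        if total >= lower:
            return letter
    return "F"


# Hypothetical component scores, for illustration only.
example = {"paper": 80, "presentation": 70, "reviewing": 90, "programs": 60, "participation": 100}
total = weighted_score(example)
print(round(total, 1), letter_grade(total))
```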

Programming assignments

There will be regular programming assignments throughout the course, including the two that will be formally assessed, which are listed in the course calendar. In general it is acceptable to discuss assignments with anyone you want, and to submit joint work, provided this is clearly marked as such. Joint work will be held to a higher standard than individual work. All participants in a piece of joint work will normally receive the same grade. If circumstances (such as non-participation) arise where a group is clearly not working, please let me know. I reserve the right to adjust grades as needed.

Writing workshop

Practice in scientific writing and reviewing, under time pressure, is an essential part of your training. This seminar therefore includes practice in paper-writing and reviewing. In essence, we will be running a small internal workshop, along the lines of those run at ACL and similar conferences. This will be spread over the final weeks of the course. I will be the program chair; you will act as reviewers, presenters and participants. Note that we will be aspiring to, but not requiring, work of a standard that could be published externally.

Each student will write a short paper (3-4 pages) on a topic selected in consultation with me. The results will be presented to the class according to the schedule specified in the course calendar. Paper topics must be negotiated with me in good time. To see why, read the next paragraph.

I will allocate a discussant to each presentation.

Presenters are responsible for:

Discussants are responsible for:

All class members are responsible for:

Late Penalties

An assignment is considered late if it is not submitted at the beginning of the class period on the day it is due. Late assignments submitted within one week of the due date will be slightly penalized. Assignments submitted more than one week after the due date will be more heavily penalized.

Students with Disabilities

Ohio State is committed to extending access and opportunity to those who are disabled. If you feel you may need an accommodation based on the impact of a disability, please contact me privately to discuss your specific needs. You may also contact the Office for Disability Services at 614-292-3307 in room 150 Pomerene Hall.

Course Calendar

The following course calendar is a tentative schedule of when various topics will be covered. Please understand that this schedule is intended to give an overview of the quarter but will probably not be useful in knowing what material will be covered on a given day.

             
Date         Topic / Activity

Tue 7 Jan    Syllabus / Intro
Thu 9 Jan    Early work

Tue 14 Jan   Verb detection
Thu 16 Jan   Generalizing case frames [LA98]

Tue 21 Jan   Grammars for Subcat extraction [BC97]
Thu 23 Jan   Bayesian grammars [eisner]

Tue 28 Jan   CCGBank [HS02a, HS02b]
Thu 30 Jan   Huge corpora

Tue 4 Feb    Program 1 Due
Thu 6 Feb    Corpus variation

Tue 11 Feb   tbd.
Thu 13 Feb   tbd.

Tue 18 Feb   tbd.
Thu 20 Feb   Program 2 Due

Tue 25 Feb   FrameNet, PropBank
Thu 27 Feb   General feature spaces

Tue 4 Mar    Presentations
Thu 6 Mar    Presentations

Tue 11 Mar   Presentations
Thu 13 Mar   Review and wrapup
             

Resources

I will ensure that tools and corpora are installed as necessary. Most of the reading material is online, some of it at the ACL Anthology, an invaluable resource that you should all know about. I also have paper copies of much of the material.

0.1  Assignments

0.2  Online notes

Reading list

For hypertext links to the following references, see the separate page.

References

[AM96]
Chinatsu Aone and Douglas McKee. Acquiring Predicate Mapping Information from Multilingual Texts, chapter 10, pages 191--204. In Boguraev and Pustejovsky [BP96], 1996.

[BC97]
Ted Briscoe and John Carroll. Automatic Extraction of Subcategorization from Corpora. In Proceedings of the 5th ACL Conference on Applied Natural Language Processing, pages 356--363, Washington, DC, 1997.

[BP96]
Branimir Boguraev and James Pustejovsky, editors. Corpus Processing for Lexical Acquisition. Bradford Books, 1996.

[BPV96]
Roberto Basili, Maria-Teresa Pazienza, and Paola Velardi. A Context Driven Conceptual Clustering Method for Verb Classification, chapter 7, pages 117--142. In Boguraev and Pustejovsky [BP96], 1996.

[Bre]
M. Brent. Surface cues and robust inference as a basis for the early acquisition of subcategorization frames.

[Bre91]
Michael R. Brent. Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991.

[Bre94]
Michael R. Brent. Acquisition of subcategorization frames using aggregated evidence from local syntactic cues. Lingua, 92:433--470, 1994.

[Buc98]
Sabine Buchholz. Unsupervised learning of subcategorisation information and its application in a parsing subtask. In H. La Poutre and H.J. van den Herik, editors, Proceedings of the Tenth Netherlands/Belgium Conference on Artificial Intelligence (NAIC'98), pages 7--16, Amsterdam, 1998. ILK-9811.

[COL98]
Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montréal, Canada, 1998.

[CR98]
Glenn Carroll and Mats Rooth. Valence induction with a head-lexicalized PCFG. In Nancy Ide and Atro Voutilainen, editors, Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, pages 36--45, Granada, Spain, 1998.

[DJ96]
Bonnie J. Dorr and Doug Jones. Role of word sense disambiguation in lexical acquisition: Predicting semantics from syntactic cues. In Proceedings of the 16th International Conference on Computational Linguistics, pages 322--327, Copenhagen, Denmark, 1996.

[DKPR98]
Hoa Trang Dang, Karin Kipper, Martha Palmer, and Joseph Rosenzweig. Investigating regular sense extensions based on intersective Levin classes. In COLING/ACL [COL98], pages 293--299.

[HS02a]
Julia Hockenmaier and Mark Steedman. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, 2002.

[HS02b]
Julia Hockenmaier and Mark Steedman. Generative models for statistical parsing with combinatory categorial grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, 2002.

[LA98]
Hang Li and Naoki Abe. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2), 1998.

[LB99]
Maria Lapata and Chris Brew. Using subcategorization to resolve verb class ambiguity. In Pascal Fung and Joe Zhou, editors, Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, College Park, Maryland, 1999.

[Lev93]
Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, 1993.

[Man93]
Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, OH, 1993.

[McC00]
Diana McCarthy. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of the 1st North American Annual Meeting of the Association for Computational Linguistics, pages 256--263, Seattle, WA, 2000.

[MS01]
Paola Merlo and Suzanne Stevenson. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373--408, 2001.

[PS96]
Victor Poznanski and Antonio Sanfilippo. Detecting dependencies between semantic verb subclasses and subcategorization frames in text corpora, chapter 9, pages 175--190. In Boguraev and Pustejovsky [BP96], 1996.

[RJ98]
Douglas Roland and Dan Jurafsky. How verb subcategorization frequencies are affected by corpus choice. In COLING/ACL [COL98].

[SB02]
Sabine Schulte im Walde and Chris Brew. Inducing German Semantic Verb Classes from Purely Syntactic Subcategorisation Information. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 223--230, Philadelphia, PA, 2002.

[Sch02]
Sabine Schulte im Walde. A subcategorisation lexicon for German verbs induced from a lexicalised PCFG. In Proceedings of the 3rd Conference on Language Resources and Evaluation, Las Palmas, Spain, 2002.

[WM89]
Mort Webster and Mitch Marcus. Automatic acquisition of the lexical semantics of verbs from sentence frames. In Annual Meeting of the Association for Computational Linguistics, 1989.
