Seminar on Corpus-Based Grammar Engineering

Ling 884 — Seminar in Computational Linguistics
Spring '07, TR 1:30–3:18, 141 Bio Sci Bldg
Instructor: Michael White
http://www.ling.ohio-state.edu/~mwhite/

Description

In recent years, grammars for a variety of formalisms (CFGs, CCGs, TAGs, LFGs) have been extracted from the Penn Treebank, for statistical parsing and realization. Compared to grammars that have been engineered by hand, these extracted grammars typically have broader coverage, but are lacking in depth of linguistic analysis. A question that has been largely unexplored is the extent to which one can successfully improve such extracted grammars by further grammar engineering.

In this seminar, we will explore methods for corpus-based grammar engineering, through readings and individual or group projects. At the beginning of the quarter, project teams and tasks will be arranged. During the quarter, project teams will present their ongoing work, starting with their task definitions and aims, continuing through intermediate milestones, and finishing with their empirical results, which they will then write up in a final project report and present in a poster session. Each person will also be expected to lead the discussion of one or two papers.

Projects are anticipated to involve one of the Penn Treebank (PTB), the English CCGbank (derived from the PTB), the German Tiger/Negra or Tüba-D/Z corpora, the Redwoods Treebank (for HPSG), or other treebanks. Possible topics include: making a CCGbank-extracted grammar more precise; methods for transforming the CCGbank to reflect more precise analyses; improving lexical coverage through lexical rules; evaluating the impact of more precise grammars on parsing or realization; comparing different evaluation measures; extracting a CCG from the Redwoods Treebank; and so forth. Students will also be welcome to propose possible projects, especially ones that would be synergistic with their own ongoing research. Projects using the OpenCCG library for parsing or realization are particularly encouraged.

Prerequisites

The introductory computational linguistics courses (684.1 and 684.2) or permission of the instructor.

Requirements

Paper presentations (20%): A fair amount of class time will be dedicated to student presentations of papers from the reading list. Presenters will give a summary of the paper, highlighting important results, and moderate discussion of the paper.
Class discussion (10%): To facilitate discussion, each student should come to class with at least one good question in mind per paper.
Term project (70%): The goal of the term project will be to examine a corpus-based grammar engineering task in depth. Projects may be conducted individually or in groups. The instructor will propose several projects involving OpenCCG. Students may choose to work on one of these projects or propose their own. The tentative schedule for the projects is as follows:

Week 2:	Choose Project
Weeks 3-4:	Present Project Plan
Week 5:	Present Evaluation Plan
Week 8:	Review Design / Code
Week 10:	Present Results
Finals Week:	Project Report due June 7

Carmen

We'll use Carmen to schedule presentations and post advance questions on the readings. Carmen will also be used to provide local access to PDFs that are not readily available.

Reading List

Note that the reading list represents a starting point for the papers we will read during the quarter, with the exact set of papers to be covered depending on student interest. More CCG papers can be found on the CCG site.

CCG Intro

Mark Steedman. 2000. The Syntactic Process, MIT Press.

Mark Steedman and Jason Baldridge. 2003. Combinatory Categorial Grammar. Unpublished Tutorial Paper.

Jason Baldridge and Geert-Jan Kruijff. 2003. Multi-Modal Combinatory Categorial Grammar. In Proceedings of EACL-03.

Creating CCG Lexicons and Corpora

Julia Hockenmaier. 2006. Creating a CCGbank and a wide-coverage CCG lexicon for German. In Proceedings of COLING-ACL 2006.

Julia Hockenmaier and Mark Steedman. 2005. CCGbank: User's Manual. Technical Report MS-CIS-05-09, Department of Computer and Information Science, University of Pennsylvania.

Julia Hockenmaier and Mark Steedman. 2002. Acquiring Compact Lexicalized Grammars from a Cleaner Treebank. In Proceedings of the Third International Conference on Language Resources and Evaluation.

Julia Hockenmaier, Gann Bierner and Jason Baldridge. 2004. Extending the coverage of a CCG System. Research on Language and Computation, 2:165-208.

Christine Doran. 1998. Incorporating Punctuation Into The Sentence Grammar: A Lexicalized Tree Adjoining Grammar Perspective. Ph.D. Dissertation, University of Pennsylvania, Technical Report IRCS-98-24.

CCG Supertagging and Parsing

Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Unpublished version of article to appear in Computational Linguistics.

Stephen Clark and James R. Curran. 2004. The Importance of Supertagging for Wide-Coverage CCG Parsing. In Proceedings of COLING-04.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and Log-Linear Models. In Proceedings of ACL-04.

Semantics and Questions

Johan Bos. 2005. Towards Wide-Coverage Semantic Interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6), pages 42-53.

Stephen Clark, Mark Steedman and James R. Curran. 2004. Object-Extraction and Question-Parsing using CCG. In Proceedings of EMNLP-04.

Daniel Gildea and Julia Hockenmaier. 2003. Identifying Semantic Roles using Combinatory Categorial Grammar. In Proceedings of EMNLP-03.

Surface Realization with CCG

Michael White. 2005. Designing an Extensible API for Integrating Language Modeling and Realization. In Proc. ACL-05 Workshop on Software.

Michael White. 2006. Efficient Realization of Coordinate Structures in Combinatory Categorial Grammar. Research on Language and Computation, 4(1):39–75. (prefinal version, accepted 2004)

Related Work on Surface Realization

Irene Langkilde and Kevin Knight. 1998. Generation that Exploits Corpus-based Statistical Knowledge, Proc. of COLING-ACL ('98).

Irene Langkilde. 2000. Forest-Based Statistical Sentence Generation, Proc. of the Association for Computational Linguistics Conference, North American chapter (NAACL-2000).

Srinivas Bangalore and Owen Rambow. 2000. Exploiting a Probabilistic Hierarchical Model for Generation, Proc. of the International Conference on Computational Linguistics (COLING-2000).

John Carroll and Stephan Oepen. 2005. High efficiency realization for a wide-coverage unification grammar. In Proc. of the Second International Joint Conference on Natural Language Processing (IJCNLP-05).

Evaluating Surface Realizers

Srinivas Bangalore, Owen Rambow, Steven Whittaker. 2000. Evaluation Metrics for Generation. In Proc. of the International Conference on Natural Language Generation (INLG-2000).

Irene Langkilde-Geary. 2002. An Empirical Verification of Coverage and Correctness for a General-Purpose Sentence Generator. In Proc. of the International Natural Language Generation Conference (INLG-02).

Charles Callaway. 2003. Evaluating Coverage for Large Symbolic NLG Grammars. In Proc. of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03).

Policy on Academic Misconduct

As with any class at this university, students are required to follow the Ohio State Code of Student Conduct. In particular, note that students are not allowed to, among other things, submit plagiarized (copied but unacknowledged) work for credit. If any violation occurs, the instructor is required to report the violation to the Council on Academic Misconduct.