I am a Ph.D. candidate in linguistics and a presidential fellow at OSU.
My general academic interests include computational, formal, and mathematical linguistics, specifically the syntax-semantics
interface (dynamic semantics, paraphrase alignment, generation) and morphosyntax (French clitics, Tagalog daw).
My work in linguistic theory is mostly on developing a logic-based framework for describing the
syntax-semantics interface, with Carl Pollard, Craige Roberts, and others.
This work, which is the subject of my dissertation, focuses on developing a natural
language discourse semantics that can model both foreground and background information.
I also work with Michael White on OpenCCG (an open-source parser and realizer for CCG). Our main goals
are to improve automatic paraphrase alignment and generation. A secondary goal is to improve MT evaluation by generating more high-quality reference sentences.
Here's my CV.
Google Scholar has an author
profile for me, and you can also find photos, my software engineering portfolio, and other stuff about me over at coffeeblack.
On Categorial Grammar and Dynamic Semantics
- A multistratal
account of the projective Tagalog evidential ‘daw’. In Proceedings of SALT 22, 2012. (With Greg Kierstead.)
- Given at the conference (hosted by the University of Chicago Linguistics Department), in Chicago, Illinois, May 19, 2012.
- Weak Familiarity and Anaphoric Accessibility in Dynamic Semantics. In Formal Grammar, number 7395 in Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-32024-8.
- Given at the conference (a satellite event of ESSLLI 2011)
in Ljubljana, Slovenia, August 6, 2011.
- A Higher-Order Theory of Presupposition. Studia Logica 100(4):727–751, 2012. doi:10.1007/s11225-012-9427-6. (With Carl Pollard.)
- Given at SWAMP 2010 (hosted by the
University of Michigan Linguistics
Department) in Ann Arbor, Michigan, November 13, 2010.
- Hyperintensional Dynamic
Semantics: Analyzing Definiteness with Enriched Contexts. In Formal Grammar, number 7395 in Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-32024-8. (With Carl Pollard.)
- Dynamic Semantics in Direct Style. Presented
in Commies, April 29–May 13, 2010.
- Enriching Contexts for Type-Theoretic
Dynamics. Invited talk given at the CAuLD workshop on Logical Methods for Discourse
(hosted by INRIA),
Nancy, France, December 14, 2009. (With Carl Pollard.)
- A Proof-theoretic Approach to
French Pronominal Clitics. In Proceedings of the 13th ESSLLI Student Session, 2008.
- Presented at ESSLLI, Hamburg, Germany, August 7, 2008.
On Natural Language Generation and Paraphrasing
- A Joint Phrasal and Dependency Model for Paraphrase Alignment. In Proceedings of COLING 24, 2012. (With Kapil Thadani and Michael White.)
- Creating Disjunctive Logical Forms
from Aligned Sentences for Grammar-Based Paraphrase Generation. In Proceedings of the Workshop on Monolingual Text-to-Text Generation, 2011.
(With Michael White.)
- Given at the workshop (co-located with ACL 2011) in Portland, Oregon, June 24, 2011.
- Using Semantic Dependencies to Improve
Paraphrase Alignment. Presented in Clippers, November 6, 2009.
- Grammar Engineering for CCG using Ant and XSLT. In Proceedings of SETQA-NLP, 2009. (With Rajakrishnan Rajkumar and Michael White.)
- Presented at the workshop (co-located with NAACL-HLT 2009) in
Boulder, Colorado, June 5, 2009.
- Developing an Annotation
Scheme for ELL Spelling Errors. In Proceedings of
MCLC 5, 2008. (With D.J. Hovermale.)
- Towards Broad
Coverage Surface Realization with CCG. In Proceedings of UCNLG+MT, 2007. (With Michael White and Rajakrishnan Rajkumar.)
- 680: Formal Foundations of Linguistic Theory
- (Assistant to Carl Pollard.) Foundational
course on the mathematical tools used in formal linguistics.
- 602.01: Syntax 1
- (Assistant to Bob Levine.) Overview of
syntactic theory and description based on HPSG.
- 384: Language and Computers
- Broad-based overview of topics in computational linguistics.
- 280: Language and Formal Reasoning
- Truth-conditional meaning in natural language and its interaction with deductive reasoning.
- 201: Introduction to Language
- Survey course in general linguistics.
This corpus is an enhanced version of the
Edinburgh paraphrase corpus,
with both machine- and hand-corrected tokenization, hand-corrected alignments based on the retokenization,
and parses from both the OpenCCG parser and the
Stanford dependency parser.
It also includes named entity annotations generated by the Stanford parser and
Meteor alignments for use as a baseline.
- (03/22/2013 release)
The corpus is encoded in JSON format, but comes
with a handy Python script that outputs just the alignments. The training and test partitions are based on the
partitioning scheme in
my COLING 2012 paper.
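As a rough illustration of working with a JSON-encoded alignment corpus, here is a minimal Python sketch. The record layout and field names below are invented for the example and do not reflect the corpus's actual schema.

```python
import json

# NOTE: this record layout (field names and all) is hypothetical;
# the real corpus schema may look quite different.
record = json.loads("""
{
  "id": "pair-001",
  "sentences": ["the cat sat", "a cat was sitting"],
  "alignments": [[0, 1], [1, 1], [2, 3]]
}
""")

def extract_alignments(rec):
    """Return the aligned token-index pairs for one sentence pair."""
    return [tuple(pair) for pair in rec["alignments"]]

print(extract_alignments(record))  # → [(0, 1), (1, 1), (2, 3)]
```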
The name PEP stands for "PEP is an Earley Parser",
and is itself an example of direct left recursion. PEP is an implementation of
Earley's chart-parsing algorithm
in Java. It includes a thin command-line interface, but is intended to
be used as a library. PEP is free software released under the
GNU Lesser General Public License.
PEP source and binaries
- Version 0.4
- Signature (verify using my public key)
- API Documentation
- generated by JavaDoc
The tar bundle above contains PEP's binaries, full source code,
generated documentation, and an Ant
build file. It also includes several sample grammars for
testing and automated JUnit tests.
PEP can parse strings licensed by any
CFG (including those
that contain recursive rules). PEP's charts
use backpointers so that if a grammar allows ambiguity, PEP keeps track
of all of the possible parses in a set of traversable parse trees.
Version 0.4 is generalized to allow rules
with right-hand sides that include a mix of terminals and nonterminals.
As an example, if the file
duck.xml specifies the following CFG,
S → NP VP
VP → VT NP
VP → VS S
VS → saw
VT → saw
NP → Mary
NP → Det N
Det → her
NP → her
N → duck
VP → duck
then PEP can be invoked to parse the string Mary saw her duck:
$ pep -g duck.xml -s S "Mary saw her duck"
ACCEPT: S -> [Mary, saw, her, duck] (2)
The -g argument names the grammar file, and the -s argument tells PEP to parse for category S. The output says that the string is accepted,
then gives the number of parse trees (two) licensed by the ambiguous grammar.
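To illustrate the kind of chart parsing PEP performs, here is a minimal Earley recognizer in Python over the duck grammar above. This is only a sketch of the algorithm, not PEP's actual implementation; PEP additionally stores backpointers in its chart so that the parse trees themselves can be recovered.

```python
# The "duck" grammar from the example above; terminals are plain words.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "VP":  [["VT", "NP"], ["VS", "S"], ["duck"]],
    "VS":  [["saw"]],
    "VT":  [["saw"]],
    "NP":  [["Mary"], ["Det", "N"], ["her"]],
    "Det": [["her"]],
    "N":   [["duck"]],
}

def is_nonterminal(sym):
    return sym in GRAMMAR

def earley_recognize(words, start="S"):
    """Return True iff `words` is licensed by GRAMMAR with start symbol `start`."""
    # chart[i] holds states (lhs, rhs, dot, origin): a dotted rule that
    # began at position `origin` and has matched input up to position i.
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if is_nonterminal(sym):
                    # Predict: expand the nonterminal after the dot.
                    for prod in GRAMMAR[sym]:
                        state = (sym, tuple(prod), 0, i)
                        if state not in chart[i]:
                            chart[i].add(state)
                            agenda.append(state)
                elif i < len(words) and words[i] == sym:
                    # Scan: the next input word matches the terminal.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every state waiting on `lhs` at `origin`.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        state = (l2, r2, d2 + 1, o2)
                        if state not in chart[i]:
                            chart[i].add(state)
                            agenda.append(state)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(words)])

print(earley_recognize("Mary saw her duck".split()))  # → True
```

Both readings of the ambiguous string ("her duck" as Det + N, or "her" + the verb "duck") converge on a completed S state spanning the whole input, which is why the recognizer accepts.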
Google's recently released Web 1T
5-gram Corpus contains so much data that many machines with average amounts of memory are unable to even load it.
Funnel is a free tool (released under the GPL)
for filtering enormous LMs down to a more manageable size based on
user-definable criteria, such as a limited vocabulary.
Funnel source and binaries
- Version 0.1
- Signature (verify using my public key)
Custom filters can be specified by implementing a very simple interface with one method. Filters can also be chained in series, so the effects of one can be made to cascade to others. Funnel works with single-file count LMs as well as with the Google multiple-file format.
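The filter-chaining idea can be sketched as follows. The class and method names here are invented for illustration; Funnel's actual Java interface differs.

```python
# Hypothetical sketch of vocabulary- and count-based n-gram filtering
# with filters chained in series; not Funnel's actual interface.

class VocabFilter:
    """Keep an n-gram only if every token is in the allowed vocabulary."""
    def __init__(self, vocab):
        self.vocab = set(vocab)
    def accept(self, ngram, count):
        return all(tok in self.vocab for tok in ngram)

class MinCountFilter:
    """Keep an n-gram only if its count meets a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def accept(self, ngram, count):
        return count >= self.threshold

def funnel(entries, filters):
    """Cascade each (n-gram, count) entry through every filter in series."""
    return [(ng, c) for ng, c in entries
            if all(f.accept(ng, c) for f in filters)]

entries = [(("the", "cat"), 120), (("the", "zyzzyva"), 3), (("a", "cat"), 40)]
kept = funnel(entries, [VocabFilter({"the", "a", "cat"}), MinCountFilter(10)])
print(kept)  # → [(('the', 'cat'), 120), (('a', 'cat'), 40)]
```

Because each filter only sees entries that survived the ones before it, restrictive filters placed first (such as the vocabulary filter) cheaply shrink the work left for the rest of the chain.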