I finished my Ph.D. in linguistics at OSU in 2013. I'm currently a Research Scientist at Nuance's NLU and AI laboratory. Here's my CV. Google Scholar has an author profile for me, and you can also find photos, my software engineering portfolio, and other stuff about me over at coffeeblack.
My general academic interests include computational, formal, and mathematical linguistics, specifically the syntax-semantics interface (dynamic semantics, paraphrase alignment, generation) and morphosyntax (French clitics, Tagalog daw).
My work in linguistic theory is mostly on developing a logic-based framework for describing the syntax-semantics interface, with Carl Pollard, Craige Roberts, and others. This work, which was the subject of my dissertation, focuses on developing a natural language discourse semantics that can model both foreground and background information.
I also work with Michael White on automatic paraphrase alignment and generation. One of our secondary goals is to improve MT evaluation by generating more high-quality reference sentences. Along with Mike, I'm also one of the main authors of OpenCCG (an open-source parser and realizer for CCG).
My thesis fleshes out Grice's taxonomy of implicatures, categorizing anaphora and Potts's "CIs" as instances of a more general class, giving a more general notion of contextual felicity, and reclassifying many so-called presuppositions as mere entailments. Based on this new meaning taxonomy, I develop an explicit formal theory that accounts for both anaphora and Potts's "CIs", yielding more empirically adequate predictions than Potts's theory does.
This corpus is an enhanced version of the Edinburgh paraphrase corpus, with both machine- and hand-corrected tokenization, hand-corrected alignments based on the retokenization, and parses from both the OpenCCG parser and the Stanford dependency parser. It also includes named entity annotations generated by the Stanford parser and Meteor alignments for use as a baseline.
The corpus is encoded in JSON format, but comes with a handy Python script that outputs just the alignments. The training and test partitions are based on the partitioning scheme in my COLING 2012 paper.
The name PEP stands for "PEP is an Earley Parser"; the acronym is thus itself an example of direct left recursion. PEP is an implementation of Earley's chart-parsing algorithm in Java. It includes a thin command-line interface, but is intended to be used as a library. PEP is free software released under the GNU Lesser General Public License.
PEP can parse strings licensed by any CFG (including those that contain recursive rules). PEP's charts use backpointers so that if a grammar allows ambiguity, PEP keeps track of all of the possible parses in a set of traversable parse trees. Version 0.4 is generalized to allow rules with right-hand sides that include a mix of terminals and nonterminals.
As an example, if the file duck.xml specifies the following CFG,

S → NP VP
VP → VT NP
VP → VS S
VS → saw
VT → saw
NP → Mary
NP → Det N
Det → her
NP → her
N → duck
VP → duck

then PEP can be invoked to parse the string Mary saw her duck as follows:

$ pep -g duck.xml -s S "Mary saw her duck"
ACCEPT: S -> [Mary, saw, her, duck] (2)
1. [S[NP[Mary]][VP[VT[saw]][NP[Det[her]][N[duck]]]]]
2. [S[NP[Mary]][VP[VS[saw]][S[NP[her]][VP[duck]]]]]

The -s S argument tells PEP to parse for category S. The output says that the string is accepted, then gives the two parse trees licensed by the ambiguous grammar.
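For a sense of how the underlying chart algorithm works, here is a minimal, self-contained Earley recognizer in Java, run over the duck grammar. This is only a sketch: the class and method names are mine, not PEP's actual API, and unlike PEP it merely recognizes strings (it keeps no backpointers, so it cannot recover the parse trees).

```java
import java.util.*;

class EarleySketch {
    // A dotted rule: lhs -> rhs, with the dot at position `dot`,
    // for a constituent started at input position `origin`.
    record Item(String lhs, List<String> rhs, int dot, int origin) {}

    // Returns true iff `words` is licensed by grammar `g` with start category `start`.
    // A symbol is a nonterminal iff it has rules in `g`; no epsilon rules.
    static boolean recognize(Map<String, List<List<String>>> g, String start, List<String> words) {
        int n = words.size();
        List<LinkedHashSet<Item>> chart = new ArrayList<>();
        for (int i = 0; i <= n; i++) chart.add(new LinkedHashSet<>());
        for (List<String> rhs : g.get(start)) chart.get(0).add(new Item(start, rhs, 0, 0));

        for (int i = 0; i <= n; i++) {
            Deque<Item> agenda = new ArrayDeque<>(chart.get(i));
            while (!agenda.isEmpty()) {
                Item it = agenda.poll();
                if (it.dot() < it.rhs().size()) {
                    String next = it.rhs().get(it.dot());
                    if (g.containsKey(next)) {
                        // Predict: expand the nonterminal after the dot.
                        for (List<String> rhs : g.get(next)) {
                            Item p = new Item(next, rhs, 0, i);
                            if (chart.get(i).add(p)) agenda.add(p);
                        }
                    } else if (i < n && words.get(i).equals(next)) {
                        // Scan: consume a matching terminal.
                        chart.get(i + 1).add(new Item(it.lhs(), it.rhs(), it.dot() + 1, it.origin()));
                    }
                } else {
                    // Complete: advance items waiting on this finished constituent.
                    for (Item old : new ArrayList<>(chart.get(it.origin()))) {
                        if (old.dot() < old.rhs().size() && old.rhs().get(old.dot()).equals(it.lhs())) {
                            Item c = new Item(old.lhs(), old.rhs(), old.dot() + 1, old.origin());
                            if (chart.get(i).add(c)) agenda.add(c);
                        }
                    }
                }
            }
        }
        for (Item it : chart.get(n))
            if (it.lhs().equals(start) && it.dot() == it.rhs().size() && it.origin() == 0)
                return true;
        return false;
    }
}
```

Since the chart deduplicates items, recursive rules (including the left-recursive kind that defeat naive top-down parsers) terminate just as they do in PEP.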
Google's Web 1T 5-gram Corpus contains so much data that many machines with average amounts of memory are unable to even load it. Funnel is a free tool (released under the GPL) for filtering enormous language models (LMs) down to a more manageable size based on user-definable criteria, such as a limited vocabulary.
Custom filters can be specified by implementing a very simple interface with one method. Filters can also be chained in series, so the effects of one can be made to cascade to others. Funnel works with single-file count LMs as well as with the Google multiple-file format.
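To illustrate, here is a sketch of what such a one-method filter interface and chained filters might look like. The names and signatures below are hypothetical, chosen for illustration only; Funnel's actual interface differs in its details.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of a one-method n-gram filter interface and filter
// chaining, in the style described for Funnel; NOT Funnel's actual API.
class FilterSketch {
    // The single method a custom filter must implement:
    // decide whether to keep an n-gram/count entry.
    interface NgramFilter {
        boolean keep(String ngram, long count);
    }

    // A vocabulary filter: keep an n-gram only if every token is in-vocabulary.
    static NgramFilter vocabFilter(Set<String> vocab) {
        return (ngram, count) -> vocab.containsAll(Arrays.asList(ngram.split(" ")));
    }

    // A count-threshold filter: drop rare n-grams.
    static NgramFilter minCount(long threshold) {
        return (ngram, count) -> count >= threshold;
    }

    // Chain filters in series: an entry survives only if every filter keeps it,
    // so the effect of each filter cascades to the next.
    static NgramFilter chain(NgramFilter... filters) {
        return (ngram, count) -> Arrays.stream(filters).allMatch(f -> f.keep(ngram, count));
    }

    // Apply a filter to a count LM represented here as a map from n-gram to count.
    static Map<String, Long> filter(Map<String, Long> lm, NgramFilter f) {
        return lm.entrySet().stream()
                 .filter(e -> f.keep(e.getKey(), e.getValue()))
                 .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

Because the interface has a single abstract method, a custom filter is just a lambda, and chaining is ordinary function composition over the boolean results.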