currently on leave from the Linguistics department,
where I'm usually one of
three faculty members specializing in
computational linguistics (but stay tuned, we've just been joined by new CL
faculty member Micha
Elsner and will soon be joined by Marie-Catherine de Marneffe!).
I work closely with the other CL faculty —
William Schuler in Linguistics,
Eric Fosler-Lussier in CSE, and
Simon Dennis in Psychology —
and am affiliated with the
language technologies lab in computer science &
My research interests are in natural
language generation, spoken language and multimodal dialogue
systems, and the connection between natural language generation
and speech synthesis. Much of my research uses
Prior to joining the faculty here, I was a Research Fellow in the School of Informatics at the University of Edinburgh. Before crossing the pond to Scotland, I worked for many years at CoGenTex, Inc., a small company dedicated to developing commercial natural language generation software, as well as advancing research in NLG. Before joining the CGT crew, I obtained a Ph.D. in computer science from the University of Pennsylvania.
Some recent activities:
AFRL - Subcontract to BBN, 2012
Collaborators: Scott Martin, Dennis Mehay
Abstract. This project aims to (1) improve monolingual alignment with exact decoding, taking maximum advantage of consistent phrasal and dependency matches, and (2) use the resulting alignments in high-precision paraphrasing to automatically approximate the well-validated HTER method of human-in-the-loop MT evaluation. By improving the correlation between automatic and human judgments of translation quality, the project is expected to help drive progress in machine translation research.
NSF IIS - Robust Intelligence Grant, 2011–2013
Collaborators: Dominic Espinosa, David Howcroft, Rajakrishnan Rajkumar
Abstract. Natural Language Generation (NLG) systems aim to improve the accessibility and impact of information by turning data into coherent and fluent text or speech, automatically. Developing high-quality NLG systems, however, remains a difficult and costly undertaking, in large part because bridging the gap between content planning and surface realization—a task known as sentence planning—continues to require extensive knowledge engineering.
This Early Grant for Exploratory Research investigates ways of bridging this gap by employing machine learning together with Discourse Combinatory Categorial Grammar (DCCG). Using a restaurant recommendation application as a proof-of-concept, the project explores methods of (1) adapting previous work on acquiring lexicalized grammar entries for semantic parsing to learn mappings from domain-general semantic dependency representations to application-specific representations of messages; (2) extending the approach to learn rules for combining messages; (3) employing the acquired resources to map content plans to disjunctive logical forms (DLFs), which compactly specify the range of possible realizations of the selected content; and (4) improving the efficiency of realizing DLFs with OpenCCG through grammar specialization.
The project will evaluate the success of these novel methods and assess the portability of the approach. By demonstrating methods for radically simplifying the construction of NLG systems, the project promises to transform the way NLG systems are built, from today's knowledge-intensive approach to one that relies primarily on assembling a parallel corpus of input-output pairs. Ultimately, it will facilitate the development of generation components in data-to-text systems as well as dialogue systems, including ones for the visually impaired.
NSF IIS - Robust Intelligence Grant, 2008–2012
Collaborators: Steve Boxwell, Dominic Espinosa, Scott Martin, Dennis Mehay, Crystal Nakatsu, Rajakrishnan Rajkumar
Abstract. Research on automatic paraphrase generation has been gaining steam in recent years. Or in other words, research on generating paraphrases automatically has seen increasing progress lately. Automatic paraphrasing is considered vital to applications as diverse as machine translation (MT), question answering, summarization, and dialogue systems. Paraphrasing has also been shown recently to hold promise for automatic methods of evaluating MT, when the paraphrases are of sufficiently high quality.
This project investigates novel methods for acquiring and generating such high quality paraphrases in order to automatically approximate the human translation error rate (HTER) metric for MT evaluation, where human annotators post-edit MT outputs into acceptable paraphrases of the reference translations. The project emphasizes the use of a linguistically informed, grammar-based parser and realizer for acquiring and generating paraphrases using disjunctive logical forms (DLFs), in sharp contrast to most recent work that relies entirely on shallow methods. Specifically, the project investigates methods of (1) engineering a broad coverage English grammar from the CCGbank, with semantic roles integrated from Propbank; (2) scaling up OpenCCG for efficient parsing and realization with this grammar, adapting supertagging and parse ranking methods for generation; (3) adapting and extending previous methods of acquiring paraphrases to work on DLFs; (4) generating high quality n-best paraphrases of one or more reference sentences; and (5) experimentally evaluating whether the automatically generated paraphrases can be used with current MT metrics to yield improved correlations with human judgments of translation quality.
By providing a way to automatically approximate the HTER metric, the project will help drive future MT research. Additionally, by dramatically extending the realization capacity of OpenCCG, the project promises to benefit a wide range of NLP tasks where the breadth of target texts is of crucial importance.
OSU Arts & Humanities Innovation Grant, 2007–2009
Collaborators: Chris Brew, Dominic Espinosa, Eric Fosler-Lussier, Kiwako Ito, Rajakrishnan Rajkumar, Shari Speer
Abstract. The focus of the project is investigating methods of building synthetic voices for conversational systems that are capable of expressing natural and contextually appropriate intonation. While data-driven techniques for producing synthetic speech have made great strides in the past ten years, at present general purpose synthetic voices are only good at synthesizing declarative sentences with neutral intonation. Neutral intonation does not suffice, however, in conversational systems: instead it sounds disengaged or "dead", and is often misleading as to the intended meaning. To overcome this impasse, we will pursue recently developed techniques for custom building expressive synthetic voices that target the capabilities of particular conversational systems.
The specific objectives of the project are twofold. Firstly, we will investigate the extent to which custom synthetic voices can produce natural sounding intonation via a psycholinguistic experiment. To do so, we will use an expressive synthetic voice, rather than recorded human speech, to replicate recent eye-tracking experiments which investigated the role of pitch accents during online discourse comprehension. These experiments demonstrated a processing advantage for contextually appropriate as compared to inappropriate uses of pitch accents in instructions. Eye movement monitoring is an ideal method of evaluating speech synthesis, since it provides an objective, non-intrusive, implicit measure of processing difficulty; how people process synthetic speech is also an interesting question in its own right. Secondly, we will devise a new, utility-based algorithm for optimizing the selection of intonationally varied sentences to record when building a custom synthetic voice, and evaluate its effectiveness in a perception experiment. This algorithm will fill in a crucial missing piece of the expressive synthesis puzzle, as most existing text selection algorithms do not take prosody into account.
Kapil Thadani, Scott Martin and Michael White. 2012. A Joint Phrasal and Dependency Model for Paraphrase Alignment. In Proc. of COLING 2012.
Dennis N. Mehay and Michael White. 2012. Shallow and Deep Paraphrasing for Improved Machine Translation Parameter Optimization. In Proc. of the AMTA 2012 Workshop on Monolingual Machine Translation (MONOMT 2012).
Michael White and Rajakrishnan Rajkumar. 2012. Minimal Dependency Length in Realization Ranking. In Proc. EMNLP-12.
Michael White. 2012. Shared Task Proposal: Syntactic Paraphrase Ranking. In Proc. of the 7th International Conference on Natural Language Generation (INLG-12).
Michael White. 2011. Glue Rules for Robust Chart Realization. In Proc. of the 13th European Workshop on Natural Language Generation.
Anja Belz, Michael White, Dominic Espinosa, Eric Kow, Deirdre Hogan and Amanda Stent. 2011. The First Surface Realisation Shared Task: Overview and Evaluation Results. In Proc. of the 13th European Workshop on Natural Language Generation.
Rajakrishnan Rajkumar, Dominic Espinosa and Michael White. 2011. The OSU System for Surface Realization at Generation Challenges 2011. In Proc. of the 13th European Workshop on Natural Language Generation.
Rajakrishnan Rajkumar and Michael White. 2011. Linguistically Motivated Complementizer Choice in Surface Realization. In Proc. of the EMNLP-11 Workshop on Using Corpora in NLG.
Scott Martin and Michael White. 2011. Creating Disjunctive Logical Forms from Aligned Sentences for Grammar-Based Paraphrase Generation. In Proc. of the ACL-11 Workshop on Monolingual Text-to-Text Generation.
Dominic Espinosa, Rajakrishnan Rajkumar, Michael White and Shoshana Berleant. 2010. Further Meta-Evaluation of Broad Coverage Surface Realization. In Proc. EMNLP-10.
Rajakrishnan Rajkumar, Michael White, Shari R. Speer and Kiwako Ito. 2010. Evaluating Prosody in Synthetic Speech with Online (Eye-Tracking) and Offline (Rating) Methods. In Proc. 7th Speech Synthesis Workshop.
Dominic Espinosa, Michael White, Eric Fosler-Lussier and Chris Brew. 2010. Machine Learning for Text Selection with Expressive Unit-Selection Voices. In Proc. Interspeech-10.
Rajakrishnan Rajkumar and Michael White. 2010. Designing Agreement Features for Realization Ranking. In Proc. of COLING-10.
Crystal Nakatsu and Michael White. 2010. Generating with Discourse Combinatory Categorial Grammar. In Linguistic Issues in Language Technology, 4(1):1–62.
Michael White, Robert A. J. Clark and Johanna D. Moore. 2010. Generating tailored, comparative descriptions with contextually appropriate intonation. Computational Linguistics, 36(2):159–201.
Michael White, Rajakrishnan Rajkumar, Kiwako Ito and Shari Speer. 2009. Eye Tracking for the Online Evaluation of Prosody in Speech Synthesis: Not So Fast! In Proc. of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH-09).
Michael White and Rajakrishnan Rajkumar. 2009. Perceptron Reranking for CCG Realization. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).
Rajakrishnan Rajkumar, Michael White and Dominic Espinosa. 2009. Exploiting Named Entity Classes in CCG Surface Realization. In Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2009).
Scott Martin, Rajakrishnan Rajkumar and Michael White. 2009. Grammar Engineering for CCG using Ant and XSLT. In Proc. of the NAACL HLT 2009 Workshop on Software Engineering, Testing and Quality Assurance for Natural Language Processing (SETQA-NLP 2009).
Michael White and Rajakrishnan Rajkumar. 2008. A More Precise Analysis of Punctuation for Broad-Coverage Surface Realization with CCG. In Proc. of the Workshop on Grammar Engineering Across Frameworks (GEAF08).
Dominic Espinosa, Michael White and Dennis Mehay. 2008. Hypertagging: Supertagging for Surface Realization with CCG. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT).
Stephen A. Boxwell and Michael White. 2008. Projecting Propbank Roles onto the CCGbank. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-08).
Robert Dale and Michael White, editors. 2007. Report from the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
Vasile Rus, Arthur C. Graesser, Amanda Stent, Marilyn Walker and Michael White. 2007. Text-to-Text Generation. In Report from the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
Michael White, Rajakrishnan Rajkumar and Scott Martin. 2007. Towards Broad Coverage Surface Realization with CCG. In Proc. of the 2007 Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT).
Mary Ellen Foster and Michael White. 2007. Avoiding Repetition in Generated Text. In Proc. of the 11th European Workshop on Natural Language Generation.
Robert Dale and Michael White, editors. 2007. Position Papers of the Workshop on Shared Tasks and Comparative Evaluation in Natural Language Generation.
Michael White. 2006. CCG Chart Realization from Disjunctive Inputs. In Proc. of the 4th International Conference on Natural Language Generation (INLG-06).
Crystal Nakatsu and Michael White. 2006. Learning to Say It Well: Reranking Realizations by Predicted Synthesis Quality. In Proc. COLING-ACL-06.
Michael White. 2006. Efficient Realization of Coordinate Structures in Combinatory Categorial Grammar. Research on Language and Computation, 4(1):39–75.
Mary Ellen Foster and Michael White. 2005. Assessing the Impact of Adaptive Generation in the COMIC Multimodal Dialogue System. In Proc. of the IJCAI-05 Workshop on Knowledge and Reasoning in Practical Dialogue Systems.
Carsten Brockmann, Amy Isard, Jon Oberlander, and Michael White. 2005. Modelling alignment for affective dialogue. In Proc. of the UM-05 Workshop on Adapting the Interaction Style to Affective Factors.
Michael White, Mary Ellen Foster, Jon Oberlander, and Ash Brown. 2005. Using Facial Feedback to Enhance Turn-Taking in a Multimodal Dialogue System. In Proc. of the HCI International 2005 Thematic Session on Universal Access in Human-Computer Interaction.
Michael White. 2005. Designing an Extensible API for Integrating Language Modeling and Realization. In Proc. ACL-05 Workshop on Software.
Mary Ellen Foster, Michael White, Andrea Setzer, and Roberta Catizone. 2005. Generating Multimodal Output in the COMIC Dialogue System. ACL 2005 Demo Session.
Mary Ellen Foster and Michael White. 2004. Techniques for Text Planning with XSLT. In Proc. of the 4th NLPXML Workshop.
Michael White. 2004. Reining in CCG Chart Realization. In Proc. of the 3rd International Conference on Natural Language Generation (INLG-04).
Rachel Baker, Robert A. J. Clark, and Michael White. 2004. Synthesising Contextually Appropriate Intonation in Limited Domains. In Proc. of the 5th ISCA Speech Synthesis Workshop.
Johanna Moore, Mary Ellen Foster, Oliver Lemon, and Michael White. 2004. Generating Tailored, Comparative Descriptions in Spoken Dialogue. In Proc. of the 17th International FLAIRS Conference.
Michael White and Jason Baldridge. 2003. Adapting Chart Realization to CCG. In Proc. of the 9th European Workshop on Natural Language Generation.