Stone Soup Translation: The Linked Automata Model

Paul C. Davis

Ph.D. Dissertation in Linguistics. Ohio State University. 2002.


Abstract

The automated translation of one natural language to another, known as machine translation (MT), typically requires successful modeling of the grammars of the languages and the relationship between them. Rather than hand-coding these grammars and relationships, some machine translation efforts employ data-driven methods, where the goal is to learn from a large number of training examples of accurate translations. One such data-driven approach is statistical MT, where language and alignment models are automatically induced from parallel corpora. This work has also been extended to probabilistic finite-state approaches, most often via transducers.

This dissertation introduces and begins an investigation of an MT model consisting of a novel combination of finite-state devices. The proposed model is more flexible than transducer models, giving increased ability to handle word order differences between languages, as well as crossing and discontinuous alignments between words. The linked automata MT model consists of a source language automaton, a target language automaton, and an alignment table, a function which probabilistically links sequences of source and target language transitions. It is this augmentation to the finite-state base which gives the linked automata model its flexibility.
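To make the three components concrete, the following is a minimal illustrative sketch in Python, not the dissertation's implementation. Every name in it (Transition, Automaton, accept, order, translate), the toy three-word automata, the probabilities, the greedy path search, and the choice to multiply link probabilities are assumptions made only for this example. It shows how an alignment table over transitions might let the target automaton, rather than the source word order, decide the output order of a crossing alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One labeled arc of an automaton (hypothetical representation)."""
    src: int
    dst: int
    word: str


class Automaton:
    def __init__(self, start, finals, transitions):
        self.start = start
        self.finals = set(finals)
        self.transitions = list(transitions)

    def accept(self, words):
        """Return the transitions that read `words`, or None (deterministic toy)."""
        state, path = self.start, []
        for w in words:
            nxt = [t for t in self.transitions if t.src == state and t.word == w]
            if not nxt:
                return None
            path.append(nxt[0])
            state = nxt[0].dst
        return path if state in self.finals else None

    def order(self, chosen):
        """Greedily arrange the chosen transitions into a start-to-final path."""
        remaining, state, path = set(chosen), self.start, []
        while remaining:
            nxt = [t for t in remaining if t.src == state]
            if not nxt:
                return None
            path.append(nxt[0])
            remaining.remove(nxt[0])
            state = nxt[0].dst
        return path if state in self.finals else None


def translate(words, source, target, table):
    """Follow the source path, look up linked target transitions, then let
    the target automaton fix the word order. Multiplying link probabilities
    is an assumption of this sketch."""
    path = source.accept(words)
    if path is None:
        return None
    chosen, prob = [], 1.0
    for t in path:
        links = table.get((t,), {})  # keys are sequences of source transitions
        if not links:
            return None
        best, p = max(links.items(), key=lambda kv: kv[1])
        chosen.extend(best)
        prob *= p
    ordered = target.order(chosen)
    if ordered is None:
        return None
    return [t.word for t in ordered], prob


# Toy data: "the white house" -> "la casa blanca" (a crossing alignment).
en = [Transition(0, 1, "the"), Transition(1, 2, "white"), Transition(2, 3, "house")]
es = [Transition(0, 1, "la"), Transition(1, 2, "casa"), Transition(2, 3, "blanca")]
source = Automaton(0, {3}, en)
target = Automaton(0, {3}, es)
table = {
    (en[0],): {(es[0],): 1.0},  # the   -> la
    (en[1],): {(es[2],): 0.5},  # white -> blanca
    (en[2],): {(es[1],): 0.5},  # house -> casa
}
result = translate(["the", "white", "house"], source, target, table)
```

In this toy case the table links "white" to "blanca" and "house" to "casa", and the Spanish order comes from the path through the target automaton. The table keys are tuples so that a longer sequence of source transitions could be linked to a sequence of target transitions, though the sketch only looks up single transitions.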

The dissertation describes the linked automata model from the ground up, beginning with a description of some of the relevant MT history and empirical MT literature, and the preparatory steps for building the model, including a detailed discussion of word alignment and the introduction of a new technique for word alignment evaluation. Discussion then centers on the description of the model and its use of probabilities, including algorithms for its construction from word-aligned bitexts and for the translation process. The focus next moves to expanding the linked automata approach, first through generalization and techniques for extracting partial results, and then by increasing the coverage, both in terms of using additional linguistic information and using more complex alignments. The dissertation presents preliminary results for a test corpus of English to Spanish translations, and suggests ways in which the model can be further expanded as the foundation of a more powerful MT system.


Bibtex entry:

@phdthesis{Davis:diss,
  author   ={Davis, Paul C.},
  title    ={{Stone Soup Translation: The Linked Automata Model}},
  school   ={Ohio State University},
  year     ={2002},
}