|
Abstract:
This thesis has three interrelated goals:
The main goal is an analysis of Czech clitics, units of grammar on the borderline between morphology and syntax with rather peculiar ordering properties both relative to the whole clause and to each other. We examine the actual set of clitics, their rather rigid ordering properties, and finally the properties of so-called clitic climbing. The analysis evaluates previous research, but it also provides new insights, especially regarding the position of the clitic cluster and the constraints on clitic climbing. We show that many of the constraints on the position of the clitic cluster suggested in previous research do not hold. We also argue that cases where clitics do not follow the first constituent are in fact not exceptions in clitic placement but instead unusual frontings. The second goal is the development of a framework within Higher Order Grammar (HOG) supporting a transparent and modular treatment of word order. Unlike previous versions of HOG, we work with signs (containing phonological, syntactic and potentially other information) as actual objects of the grammar. Apart from that, we build on the simplicity and elegance of the pre-formal part of the linearization framework within Head-driven Phrase Structure Grammar. Finally, the third goal is to test the framework developed for the second goal by applying it to the results of the first. |
BibTeX:
@PHDTHESIS{hana:diss,
author = {Hana, Jiri},
title = {Czech Clitics in Higher Order Grammar},
school = {The Ohio State University},
year = {2007},
pdf = {http://ling.osu.edu/~hana/biblio/hana-diss.pdf}
}
|
| Abstract: We describe a knowledge- and resource-light system for automatic morphological analysis and tagging of Brazilian Portuguese. We avoid the use of labor-intensive resources, in particular large annotated corpora and lexicons. Instead, we use (i) an annotated corpus of Peninsular Spanish, a language related to Portuguese, (ii) an unannotated corpus of Portuguese, and (iii) a description of Portuguese morphology on the level of a basic grammar book. We extend our earlier, similar work (Hana et al., 2004; Feldman et al., 2006) by proposing an alternative algorithm for cognate transfer that effectively projects the Spanish emission probabilities into Portuguese. Our experiments use minimal new human effort and show a 21% error reduction over even emissions on a fine-grained tagset. |
BibTeX:
@INPROCEEDINGS{hana:etal:2006-eacl,
author = {Jirka Hana and Anna Feldman and Luiz Amaral and Chris Brew},
title = {Tagging Portuguese with a Spanish Tagger Using Cognates},
booktitle = {Proceedings of the Workshop on Cross-language Knowledge Induction,
11th Conference of the European Chapter of the Association for Computational
Linguistics (EACL-2006), Trento, Italy.},
year = {2006},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/hanaEtal2006-eacl.pdf}
}
|
|
Abstract:
This paper presents an analysis of certain aspects of Czech sentential clitics
in Higher Order Grammar. I focus on the relative order of clitics within the clitic
cluster. The overall aim of the paper is to show that constraints governing Czech
sentential clitics, though quite complex, can be captured relatively easily within a
higher order formalism such as Higher Order Grammar. |
BibTeX:
@INCOLLECTION{hana:2004,
author = {Jirka Hana},
title = {{Czech clitics in Higher Order Grammar}},
booktitle = {{Working Papers in Slavic Studies}},
publisher = {Department of Slavic and East European Languages and Literatures},
year = {2004},
address = {Columbus, Ohio},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/Hana2004-Clitics.pdf}
} |
| Abstract: In this paper, we describe a resource-light system for the automatic morphological analysis and tagging of Russian. We eschew the use of extensive resources (particularly, large annotated corpora and lexicons), exploiting instead (i) pre-existing annotated corpora of Czech; (ii) an unannotated corpus of Russian. We show that our approach has benefits, and present what we believe to be one of the first full evaluations of a Russian tagger in the openly available literature. |
BibTeX:
@INPROCEEDINGS{hana:etal:2004,
author = {Jiri Hana and Anna Feldman and Chris Brew},
title = {{A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources}},
booktitle = {{Proceedings of EMNLP 2004}},
year = {2004},
address = {Barcelona, Spain},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/HanaFeldmanBrew2004-RusMorphLite.pdf}
}
|
| Abstract: We show that the standard account of neutrality and coordination in type-logical grammar is untenable. However, when using as our framework a version of Lambek’s categorial grammar with a type theory based on Lambek and Scott’s higher order intuitionistic logic (the internal language of a topos) rather than the Lambek calculus, the account can largely be salvaged. Because of the difficulty of phonologically interpreting coordinated functors of differing directionality, we need to handle both phonology and syntax within a single polymorphically typed lambda calculus. |
BibTeX:
@INPROCEEDINGS{pollard:hana:2003,
author = {Carl Pollard and Jiri Hana},
title = {Ambiguity, neutrality, and coordination in higher order grammar},
booktitle = {Proceedings of Formal Grammar},
year = {2003},
editor = {Gerhard Jaeger and Paola Monachesi and Gerald Penn and Shuly Wintner},
pages = {125--136},
address = {Wien},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/pollard-hana2003-fg-vienna.pdf}
}
|
| Abstract: We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (noun homonymy is challenging for morphological analysis, and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high-precision annotated resource. |
BibTeX:
@INPROCEEDINGS{feldman:etal:2006-lrec,
author = {Anna Feldman and Jirka Hana and Chris Brew},
title = {A cross-language approach to rapid creation of new morpho-syntactically
annotated resources},
booktitle = {Proceedings of the fifth international conference on Language Resources
and Evaluation (LREC 2006). Genoa, Italy},
year = {2006},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/feldmanHanaBrew2006-lrec.pdf}
}
|
| Abstract: Annotated corpora are valuable resources for NLP which are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and that a simple textbook describing the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian. However, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger, a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found that even breathtakingly naive things (such as approximating the Russian transitions by Czech and/or Polish and approximating the Russian emissions by manually or automatically derived Czech cognates) can lead to a significant improvement of the tagger’s performance. |
BibTeX:
@INPROCEEDINGS{feldman:2006-cicling,
author = {Anna Feldman and Jirka Hana and Chris Brew},
title = {Experiments in Morphological Annotation Transfer},
booktitle = {Proceedings of Computational Linguistics and Intelligent Text Processing
(CICLing)},
year = {2006},
editor = {A. Gelbukh},
series = {Lecture Notes in Computer Science},
publisher = {Springer-Verlag},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/feldmanHanaBrew2006-cicling.pdf}
}
|
| Abstract: Coming soon. |
BibTeX:
Coming soon.
|
| Abstract: We report on morphological tagging of Russian using very limited Russian resources. We train the TnT tagger (Brants, 2000) on a modified Czech corpus to get the transition probabilities. We believe that the two languages are similar enough for the transitional information to be useful. The Russian emission symbols are obtained using a morphological analyzer that does not rely on a manually created lexicon. Finally, we report on several simple systematic modifications transforming a Czech text into a text with more Russian-like morphological properties. |
BibTeX:
@INPROCEEDINGS{hana:feldman:2004,
author = {Jiri Hana and Anna Feldman},
title = {{Portable Language Technology: Russian via Czech}},
booktitle = {{Proceedings from the Midwest Computational Linguistics Colloquium,
June 25-26, 2004}},
year = {2004},
address = {Bloomington, Indiana},
pdf = {http://www.ling.ohio-state.edu/~hana/biblio/HanaFeldman2004-RusViaCze.pdf}
}
|
| Abstract: This paper describes a multilingual text generation system in the domain of CAD/CAM software instructions for Bulgarian, Czech and Russian. Starting from a language-independent semantic representation, the system drafts natural, continuous text as typically found in software manuals. The core modules for strategic and tactical generation are implemented using the KPML platform for linguistic resource development and generation. Prominent characteristics of the approach implemented are a treatment of multilinguality that makes maximal use of the commonalities between languages while also accounting for their differences and a common representational strategy for both text planning and sentence generation. |
BibTeX:
Coming soon.
|
| Abstract: Coming soon. |
BibTeX:
Coming soon.
|
BibTeX:
@TECHREPORT{hanaEtAl:2005-morphManual,
author = {Jiri Hana and Daniel Zeman and Jan Haji{\v{c}} and Hana Hanov{\'{a}} and Barbora Hladk{\'{a}} and Emil Je{\v{r}}{\'{a}}bek},
title = {{Manual for Morphological Annotation, Revision for the Prague Dependency
Treebank 2.0}},
institution = {{\'{U}}FAL MFF UK},
year = {2005},
number = {TR-2005-27},
address = {Prague, Czech Rep.},
issn = {1214-5521},
language = {eng},
pageswhole = {55}
}
|
BibTeX:
@TECHREPORT{hana:etal:2002,
author = {Jiri Hana and Hana Hanov{\'a} and Jan Haji{\v{c}} and Barbora Vidov{\'a}-Hladk{\'a}
and Emil Je{\v{r}}{\'a}bek},
title = {Manual for Morphological Annotation},
institution = {CKL MFF UK},
year = {2002},
number = {TR-2002-14}
}
|
| Abstract: Coming soon. |
BibTeX:
Coming soon.
|
| Abstract: The thesis describes the morphology of Esperanto using a two-level morphology system. Esperanto is an agglutinating language; therefore, the two-level morphology approach is extremely suitable for it. The system is evaluated on a large corpus of Esperanto text. |
BibTeX:
Coming soon.
|
| Abstract: Coming soon. |
BibTeX:
Coming soon.
|