Table of Contents
List of Tables
We are pleased to publish the first version of the manual for morphological annotation of Czech sentences. We believe that such guidelines can be of use to the users of Prague Dependency Treebank 1.0 (PDT 1.0), as well as for preparation of new data.
Let us recall the most important steps we took to obtain about two million morphologically annotated words (PDT 1.0). At the very beginning, we put together a team of eight annotators. We introduced them to the system of morphological tags we designed to describe Czech morphological properties, and to the morphological analyzer for isolated words that we use as a preprocessing step. Last but not least, we relied on the knowledge of Czech morphology they had acquired at secondary school, i.e. we did not offer them any annotation guidelines.
One might object that this strategy is too hazardous: how can the discrepancies the annotators produce be handled so that the annotation stays consistent? First, each text file was annotated by two annotators. Then a "blind" automatic procedure (regardless of what word was processed, it just compared two strings) detected the words annotated differently. Subsequently, a single annotator (a member of a team of just two) handled these cases and also checked the morphological annotations against the syntactic-analytical annotations. In this way we compensated for the absence of annotation guidelines by successive elimination of discrepancies across both the morphological and syntactic-analytical levels of annotation.
Along the way we were writing this annotation manual. It is not intended as a comprehensive guide to the morphological annotation of Czech sentences (in contrast to the manual for syntactic-analytical annotations). The authors concentrate "only" on those cases which caused the most ambiguities and problems while annotating PDT 1.0. The ongoing effort is directed at treating the not-yet-solved problematic cases in accordance with the conventions of the automatic morphological analyzer.
The morphological annotation of PDT 1.0 was carried out in the framework of experimental verification of the definition of formal representation of the analysis of Czech sentences (the project GAČR 405/96/0198, "Formal representation of language structures"). The material obtained in this way (data) is used in many domains of research in computational linguistics, above all as basic (training) data in projects of the automatic language analysis, the MŠMT research project MSM113000006, the "Laboratory for Language Data Processing" (the MŠMT project VS961510) and the Center for Computational Linguistics (the MŠMT project LN00A063). These data have been also used as verification material for various partial projects within the complex program GAČR 405/96/K214 ("Czech Language in Computer Age"). The "Center for Computational Linguistics" project financially supported work on these morphological annotation guidelines.
We are grateful to Petr Pajas – this document “as it is” would not have appeared without his XML and LaTeX skills.
Typographical conventions.
| Vertical bar on the outer side of the page is used to highlight comments we make or suggestions we propose. |
| Gray is used to highlight something that should be checked. |
Sometimes the writer uses a word incorrectly, e.g. a woman's name as a man's name, a surname as a first name, etc. It is necessary to annotate the real usage, not the should-be usage.
Maybe it should be somehow marked, if we encounter it. |
To get an idea of what a foreign name, etc. means, it is useful to look it up using an internet portal, in an encyclopedia, on a map, etc. During annotation, we have found the following internet links useful:
Portals.
| http://www.seznam.cz – for Czech products, companies |
| http://search.seznam.cz/search.cgi?mod=f&hlp=y – for Czech companies |
| http://www.google.com |
| http://www.altavista.com (shop section for searching various products) |
Encyclopedias.
| http://www.britannica.com |
| http://www.encyclopedia.com |
| http://www.encarta.msn.com |
Dictionaries.
| http://dictionary.oed.com/entrance.dtl – Oxford English Dictionary |
| http://slovnik.seznam.cz – various dictionaries |
Maps.
| http://mapy.atlas.cz – Czechia |
| http://www.mapquest.com/maps – U.S.A and the world |
A lemma in PDT 1.0 has two parts. The first part, the lemma proper, has to be a unique identifier of the lexical item. Usually it is the base form of the word (e.g. the infinitive for a verb), possibly followed by a number distinguishing different lemmas with the same base form. The second part (optional) is not part of the identifier and contains additional information about the lemma, e.g. semantic or derivational information.
Note: There is a convention that if lemmas use numbers to distinguish lexical items with the same base form, they all have to use them, i.e. instead of the sets of lemmas {X, X-1, X-2} or {X, X-2, X-3}, there should be the set {X-1, X-2, X-3}.
Note: Lemmas having different semantic suffixes should have different numbers. In this manual we behave as the annotator does. We try to mark such improper numbers with a roman font (the rest of the lemma is in italics). For example, stop in akce Stop million will be marked as stop-1_;m and not stop-1_;m.
Table 2.1. Examples
| Whole lemma | Lemma proper | Second part |
|---|---|---|
| Chemik | chemik | |
| maso_^(jídlo_apod.) | maso | _^(jídlo_apod.) |
| Bonn_;G | Bonn | _;G |
| vazba-1_^(obviněného) | vazba-1 | _^(obviněného) |
| vazba-2_^(spojení) | vazba-2 | _^(spojení) |
| Martinův-1_;Y_^(*4-1) | Martinův-1 | _;Y_^(*4-1) |
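The split shown in Table 2.1 can be sketched in a few lines of Python. This is only an illustration: it assumes that the second part starts at the first underscore followed by one of the marker characters seen in this manual (`;`, `^`, `:`, `,`), and the function name `split_lemma` is hypothetical, not part of any PDT tool.

```python
import re

# Assumption: the second part begins at the first "_" that is directly
# followed by ";", "^", ":" or "," (the markers used in this manual).
_SECOND_PART_START = re.compile(r"_[;^:,]")


def split_lemma(whole_lemma: str) -> tuple[str, str]:
    """Return (lemma proper, second part); the second part may be empty."""
    match = _SECOND_PART_START.search(whole_lemma)
    if match is None:
        return whole_lemma, ""
    return whole_lemma[:match.start()], whole_lemma[match.start():]


if __name__ == "__main__":
    for lemma in ["chemik", "maso_^(jídlo_apod.)", "Bonn_;G",
                  "Martinův-1_;Y_^(*4-1)"]:
        print(split_lemma(lemma))
```

Underscores inside a comment such as `(jídlo_apod.)` are not followed by a marker character, so they do not trigger a split.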
The morphological component used in PDT 1.0 handles only inflection, not derivation, which means that lemmas are rather shallow. However, sometimes the lemma contains information about the lemma it is derived from. For example, lemmas of possessive adjectives contain information about the noun they are derived from (otcův ← otec). The information is encoded in the following way: how many characters you have to remove from the end, and what string you have to add, to get the deeper lemma. Only the lemmas proper are both the input and the output of this process.
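As a minimal sketch of this encoding, the following Python function applies a rule such as `*4-1` (taken from the lemma Martinův-1_;Y_^(*4-1) in Table 2.1) to a lemma proper. The function name is hypothetical, and the rule `*3ec` for otcův is our own illustration of the scheme, not a value quoted from the lexicon. The sketch reads the leading digits as the number of characters to remove, so it would misread a rule whose added string itself begins with a digit.

```python
import re

# "*<number of characters to remove><string to add>"
_RULE = re.compile(r"\*(\d+)(.*)")


def apply_derivation(lemma_proper: str, rule: str) -> str:
    """Strip the given number of final characters and append the new ending."""
    match = _RULE.fullmatch(rule)
    if match is None:
        raise ValueError(f"not a derivation rule: {rule!r}")
    remove, add = int(match.group(1)), match.group(2)
    stem = lemma_proper[:-remove] if remove else lemma_proper
    return stem + add


if __name__ == "__main__":
    print(apply_derivation("Martinův-1", "*4-1"))  # rule from Table 2.1
    print(apply_derivation("otcův", "*3ec"))       # assumed rule, illustration only
```

Note that the number suffix of the lemma proper (`-1`) counts among the removed characters, which is why the rule for Martinův-1 removes four characters and adds `-1` again.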
Some lemmas (esp. names) contain suffixes expressing semantic information about their use, etc.:
| G – geographical name: Praha, Ústí nad Labem |
| Y – given (first) name, formerly used as default: Petr, John |
| S – surname (last name): Dvořák, Zelený, Agassi, Bush |
| E – name of a nationality: Čech, Kolumbijec |
| R – name of a product: Tatra (the car) |
| K – name of a company: Tatra (the company) |
| m – default – names of mines, stadiums, guerilla bases, etc.; also used for functional words in names. |
A positional tag is a string of 15 characters. Every position encodes one morphological category using one character (mostly upper case letters or numbers).
| Position | Name | Description |
|---|---|---|
| 1 | POS | Part of speech |
| 2 | SubPOS | Detailed part of speech |
| 3 | Gender | Gender |
| 4 | Number | Number |
| 5 | Case | Case |
| 6 | PossGender | Possessor's gender |
| 7 | PossNumber | Possessor's number |
| 8 | Person | Person |
| 9 | Tense | Tense |
| 10 | Grade | Degree of comparison |
| 11 | Negation | Negation |
| 12 | Voice | Voice |
| 13 | Reserve1 | Reserve |
| 14 | Reserve2 | Reserve |
| 15 | Var | Variant, style |
Some of the characters encode a set of several atomic values, for example: 'X' means any value, 'Y' means masculine animate ('M') or inanimate ('I'). A dash ('-') means no value (e.g. tense for nouns).
Not all combinations of tag values are possible. There are about 4,000 tags[1].
Examples:
| hraniční: AAIS4----1A---- standard adjective, masc. inanimate, singular, accusative, positive |
| potok: NNIS4-----A---- noun, masc. inanimate, singular, accusative, positive |
| karikaturistou: NNMS7-----A---- noun, masc. animate, singular, instrumental, positive |
| ODS: NNFXX-----A---8 noun, feminine, any number, any case, positive, abbreviation |
| podle: RR--2---------- preposition (non vocalized), requiring genitive |
| volen: VsYS---XX-AP--- verb, passive participle, masculine, singular, any person, any tense, positive, passive |
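Reading such a tag amounts to pairing each character with the name of its position. The following Python sketch does exactly that, using the position names from the table above; the function names are hypothetical.

```python
POSITION_NAMES = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag: str) -> dict[str, str]:
    """Map every position name to the character found at that position."""
    if len(tag) != len(POSITION_NAMES):
        raise ValueError(f"a positional tag has 15 characters, got {len(tag)}")
    return dict(zip(POSITION_NAMES, tag))


def filled_positions(tag: str) -> dict[str, str]:
    """Like decode_tag, but without the positions holding '-' (no value)."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


if __name__ == "__main__":
    print(filled_positions("NNIS4-----A----"))  # potok
    print(filled_positions("RR--2----------"))  # podle
```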
| Value | Description |
|---|---|
| A | Adjective |
| C | Numeral |
| D | Adverb |
| I | Interjection |
| J | Conjunction |
| N | Noun |
| P | Pronoun |
| V | Verb |
| R | Preposition |
| T | Particle |
| X | Unknown, Not Determined, Unclassifiable |
| Z | Punctuation (also used for the Sentence Boundary token) |
SubPOS further subcategorizes POS. The POS value is uniquely determined by the SubPOS value.
Table 2.2. SUBPOS
| Value | Description | POS |
|---|---|---|
| # | Sentence boundary | Z – punctuation |
| * | Word krát (lit.: times) | C – numeral |
| , | Conjunction subordinate (incl. aby, kdyby in all forms) | J – conjunction |
| } | Numeral, written using Roman numerals (XIV) | C – numeral |
| : | Punctuation (except for the virtual sentence boundary word ###, which uses the SubPOS value #) | Z – punctuation |
| = | Number written using digits | C – numeral |
| ? | Numeral kolik (lit. how many/how much) | C – numeral |
| @ | Unrecognized word form | X – unknown |
| ^ | Conjunction (connecting main clauses, not subordinate) | J – conjunction |
| 4 | Relative/interrogative pronoun with adjectival declension of both types (soft and hard) (jaký, který, čí, ..., lit. what, which, whose, ...) | P – pronoun |
| 5 | The pronoun he in forms requested after any preposition (with prefix n-: něj, něho, ..., lit. him in various cases) | P – pronoun |
| 6 | Reflexive pronoun se in long forms (sebe, sobě, sebou, lit. myself / yourself / herself / himself in various cases; se is personless) | P – pronoun |
| 7 | Reflexive pronouns se (Table 2.4 = 4), si (Table 2.4 = 3), plus the same two forms with contracted -s: ses, sis (distinguished by Table 2.5 = 2; also number is singular only). This should be done somehow more consistently; virtually any word can have this contracted -s (cos, polívkus, ...) | P – pronoun |
| 8 | Possessive reflexive pronoun svůj (lit. my/your/her/his when the possessor is the subject of the sentence) | P – pronoun |
| 9 | Relative pronoun jenž, již, ... after a preposition (n-: něhož, niž, ..., lit. who) | P – pronoun |
| A | Adjective, general | A – adjective |
| B | Verb, present or future form | V – verb |
| C | Adjective, nominal (short, participial) form rád, schopen, ... | A – adjective |
| D | Pronoun, demonstrative (ten, onen, ..., lit. this, that, that ... over there, ... ) | P – pronoun |
| E | Relative pronoun což (corresponding to English which in subordinate clauses referring to a part of the preceding text) | P – pronoun |
| F | Preposition, part of; never appears isolated, always in a phrase (nehledě (na), vzhledem (k), ..., lit. regardless, because of) | R – preposition |
| G | Adjective derived from present transgressive form of a verb | A – adjective |
| H | Personal pronoun, clitical (short) form (mě, mi, ti, mu, ...); these forms are used in the second position in a clause (lit. me, you, her, him), even though some of them (mě) might be regularly used anywhere as well | P – pronoun |
| I | Interjections | I – interjection |
| J | Relative pronoun jenž, již, ... not after a preposition (lit. who, whom) | P – pronoun |
| K | Relative/interrogative pronoun kdo (lit. who), incl. forms with affixes -ž and -s (affixes are distinguished by the category Table 2.8 (for -ž) and Table 2.5 (for -s)) | P – pronoun |
| L | Pronoun, indefinite všechnen, sám (lit. all, alone) | P – pronoun |
| M | Adjective derived from verbal past transgressive form | A – adjective |
| N | Noun (general) | N – noun |
| O | Pronoun svůj, nesvůj, tentam alone (lit. own self, not-in-mood, gone) | P – pronoun |
| P | Personal pronoun já, ty, on (lit. I, you, he ) (incl. forms with the enclitic -s, e.g. tys, lit. you're); gender position is used for third person to distinguish on/ona/ono (lit. he/she/it), and number for all three persons | P – pronoun |
| Q | Pronoun relative/interrogative co, copak, cožpak (lit. what, isn't-it-true-that) | P – pronoun |
| R | Preposition (general, without vocalization) | R – preposition |
| S | Pronoun possessive můj, tvůj, jeho (lit. my, your, his); gender position used for third person to distinguish jeho, její, jeho (lit. his, her, its), and number for all three pronouns | P – pronoun |
| T | Particle | T – particle |
| U | Adjective possessive (with the masculine ending -ův as well as feminine -in) | A – adjective |
| V | Preposition (with vocalization -e or -u): (ve, pode, ku, ..., lit. in, under, to) | R – preposition |
| W | Pronoun negative (nic, nikdo, nijaký, žádný, ..., lit. nothing, nobody, not-worth-mentioning, no/none) | P – pronoun |
| X | (temporary) Word form recognized, but tag is missing in dictionary due to delays in (asynchronous) dictionary creation | |
| Y | Pronoun relative/interrogative co as an enclitic (after a preposition) (oč, nač, zač, lit. about what, on/onto what, after/for what) | P – pronoun |
| Z | Pronoun indefinite (nějaký, některý, číkoli, cosi, ..., lit. some, some, anybody's, something) | P – pronoun |
| a | Numeral, indefinite (mnoho, málo, tolik, několik, kdovíkolik, ..., lit. much/many, little/few, that much/many, some (number of), who-knows-how-much/many) | C – numeral |
| b | Adverb (without a possibility to form negation and degrees of comparison, e.g. pozadu, naplocho, ..., lit. behind, flatly); i.e. both the Table 2.7 as well as the Table 2.6 attributes in the same tag are marked by – (Not applicable) | D – adverb |
| c | Conditional (of the verb být (lit. to be) only) (by, bych, bys, bychom, byste, lit. would) | V – verb |
| d | Numeral, generic with adjectival declension (dvojí, desaterý, ..., lit. two-kinds/..., ten-...) | C – numeral |
| e | Verb, transgressive present (endings -e/-ě, -íc, -íce) | V – verb |
| f | Verb, infinitive | V – verb |
| g | Adverb, forming negation (Table 2.7 set to A/N) and degrees of comparison Table 2.6 set to 1/2/3 (comparative/superlative), e.g. velký, zajímavý, ..., lit. big, interesting | D – adverb |
| h | Numeral, generic; only jedny and nejedny (lit. one-kind/sort-of, not-only-one-kind/sort-of) | C – numeral |
| i | Verb, imperative form | V – verb |
| j | Numeral, generic greater than or equal to 4 used as a syntactic noun (čtvero, desatero, ..., lit. four-kinds/sorts-of, ten-...) | C – numeral |
| k | Numeral, generic greater than or equal to 4 used as a syntactic adjective, short form (čtvery, ..., lit. four-kinds/sorts-of) | C – numeral |
| l | Numeral, cardinal jeden, dva, tři, čtyři, půl, ... (lit. one, two, three, four); also sto and tisíc (lit. hundred, thousand) if noun declension is not used | C – numeral |
| m | Verb, past transgressive; also archaic present transgressive of perfective verbs (ex.: udělav, lit. (he-)having-done; arch. also udělaje (Table 2.8 = 4), lit. (he-)having-done) | V – verb |
| n | Numeral, cardinal greater than or equal to 5 | C – numeral |
| o | Numeral, multiplicative indefinite (-krát, lit. (times): mnohokrát, tolikrát, ..., lit. many times, that many times) | C – numeral |
| p | Verb, past participle, active (including forms with the enclitic -s, lit. 're (are)) | V – verb |
| q | Verb, past participle, active, with the enclitic -ť, lit. (perhaps) – could-you-imagine-that? or but-because- (both archaic) | V – verb |
| r | Numeral, ordinal (adjective declension without degrees of comparison) | C – numeral |
| s | Verb, past participle, passive (including forms with the enclitic -s, lit. 're (are)) | V – verb |
| t | Verb, present or future tense, with the enclitic -ť, lit. (perhaps) -could-you-imagine-that? or but-because- (both archaic) | V – verb |
| u | Numeral, interrogative kolikrát, lit. how many times? | C – numeral |
| v | Numeral, multiplicative, definite (-krát, lit. times: pětkrát, ..., lit. five times) | C – numeral |
| w | Numeral, indefinite, adjectival declension (nejeden, tolikátý, ..., lit. not-only-one, so-many-times-repeated) | C – numeral |
| y | Numeral, fraction ending at -ina; used as a noun (pětina, lit. one-fifth) | C – numeral |
| z | Numeral, interrogative kolikátý, lit. what (at-what-position-place-in-a-sequence) | C – numeral |
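Because the SubPOS value determines the POS value, the first two characters of a tag can be checked against each other. The Python sketch below does this with a deliberately partial mapping copied from a few rows of the SUBPOS table; the names are hypothetical and a full check would need every row.

```python
# Partial SubPOS -> POS mapping, copied from some rows of the SUBPOS table.
SUBPOS_TO_POS = {
    "N": "N",            # noun (general)
    "A": "A", "U": "A",  # adjective general, adjective possessive
    "R": "R", "V": "R",  # preposition without / with vocalization
    "f": "V", "B": "V", "s": "V",  # infinitive, present/future, passive participle
    "b": "D",            # adverb without negation and comparison
    "^": "J", ",": "J",  # coordinating and subordinate conjunction
    "=": "C", "l": "C",  # number in digits, cardinal numeral
    "T": "T", "I": "I", "D": "P",
}


def pos_matches_subpos(tag: str):
    """True/False if the SubPOS is in the partial mapping, None if it is not."""
    expected = SUBPOS_TO_POS.get(tag[1])
    if expected is None:
        return None
    return tag[0] == expected


if __name__ == "__main__":
    print(pos_matches_subpos("NNIS4-----A----"))
    print(pos_matches_subpos("ANIS4-----A----"))
```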
| Value | Description |
|---|---|
| F | Feminine |
| H | {F, N} – Feminine or Neuter |
| I | Masculine inanimate |
| M | Masculine animate |
| N | Neuter |
| Q | Feminine (with singular only) or Neuter (with plural only); used only with participles and nominal forms of adjectives |
| T | Masculine inanimate or Feminine (plural only); used only with participles and nominal forms of adjectives |
| X | Any |
| Y | {M, I} – Masculine (either animate or inanimate) |
| Z | {M, I, N} – Not feminine (i.e., Masculine animate/inanimate or Neuter); only for (some) pronoun forms and certain numerals |
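The aggregate gender values can be treated as sets of atomic genders. The following Python sketch tests whether an atomic gender is covered by a tag value; it is a simplification under stated assumptions: the names are hypothetical, and the number restrictions attached to Q and T (singular only, plural only) are ignored.

```python
# Each gender value of the tag expands to a set of atomic genders.
# Simplification: the number restrictions of Q and T are not modelled.
GENDER_SETS = {
    "F": {"F"}, "I": {"I"}, "M": {"M"}, "N": {"N"},
    "H": {"F", "N"},
    "Q": {"F", "N"},
    "T": {"I", "F"},
    "Y": {"M", "I"},
    "Z": {"M", "I", "N"},
    "X": {"F", "I", "M", "N"},
}


def gender_compatible(tag_value: str, atomic_gender: str) -> bool:
    """Is the atomic gender (F, I, M or N) covered by the tag's gender value?"""
    return atomic_gender in GENDER_SETS[tag_value]


if __name__ == "__main__":
    print(gender_compatible("Y", "I"))
    print(gender_compatible("Z", "F"))
```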
| Value | Description |
|---|---|
| D | Dual, e.g. nohama |
| P | Plural, e.g. nohami |
| S | Singular, e.g. noha |
| W | Singular for feminine gender, plural with neuter; can only appear in participle or nominal adjective form with gender value Q |
| X | Any |
| Value | Description |
|---|---|
| F | Feminine, e.g. matčin, její |
| M | Masculine animate (adjectives only), e.g. otců |
| X | Any |
| Z | {M, I, N} – Not feminine, e.g. jeho |
Table 2.8. VAR
| Value | Description |
|---|---|
| - | Basic variant, standard contemporary style; also used for standard forms allowed for use in writing by the Czech Standard Orthography Rules despite being marked there as colloquial |
| 1 | Variant, second most used (less frequent), still standard |
| 2 | Variant, rarely used, bookish, or archaic |
| 3 | Very archaic, also archaic + colloquial |
| 4 | Very archaic or bookish, but standard at the time |
| 5 | Colloquial, but (almost) tolerated even in public |
| 6 | Colloquial (standard in spoken Czech) |
| 7 | Colloquial (standard in spoken Czech), less frequent variant |
| 8 | Abbreviations |
| 9 | Special uses, e.g. personal pronouns after prepositions etc. |
In most (but not all) cases, simply omit the dashes from the positional tags. For more information, see http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/compact_tags.pdf
In certain cases (including some places in this manual), the following tag abbreviations are used. Most of them are self-evident (dashes and rarely used fields dropped), as you can see in the following list:
| Ngnc – noun; NFS1 = NNFS1-----A---- |
| Aagnc – adjective; AAXXX = AAXXX----1A---- |
| Db – adverb; Db = Db------------- |
| Dg – adverb; Dg = Dg-------1A---- |
| Dgd – adverb; Dga2 = Dg-------2A---- |
| J^ – conjunction; J^ = J^------------- |
| J, – conjunction; J, = J,------------- |
| Rc, RRc – preposition, RR7 = RR--7---------- |
| RVc – vocalized preposition, RV7 = RV--7---------- |
| TT – particle; TT = TT------------- |
| Ng-8, NNgXX-8 – noun abbreviation; NFXX-8 = NNFXX-----A---8 |
| AX-8, AAXXX-8 – adjective abbreviation; AAXXX-8 = AAXXX----1A---8 |
| Db-8 – adverb abbreviation; Db-8 = Db------------8 |
| Rc-8, RRc-8 – preposition abbreviation; RR7-8 = RR--7---------8 |
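The abbreviations in this list can be expanded back to full positional tags mechanically. The Python sketch below covers only the patterns listed above and is not the official conversion; the function name is hypothetical. It assumes that for nouns and adjectives the last three characters are gender, number and case (shorter forms are padded with X), and that a trailing digit after Dg is the degree of comparison.

```python
def expand_compact_tag(compact: str) -> str:
    """Expand an abbreviated tag from the list above to 15 positions."""
    var = "-"
    if compact.endswith("-8"):  # abbreviation marker goes to position 15
        compact, var = compact[:-2], "8"
    tag = ["-"] * 15
    pos = compact[0]
    if pos in "NA":
        # Assumption: the last three characters are gender, number, case;
        # a short form such as "NF" or "AX" is padded with X.
        gnc = compact[-3:] if len(compact) >= 4 else compact[1:].ljust(3, "X")
        tag[0:2] = [pos, pos]
        tag[2:5] = list(gnc)
        if pos == "A":
            tag[9] = "1"   # positive degree
        tag[10] = "A"      # not negated
    elif pos == "D":
        tag[0:2] = ["D", compact[1]]
        if compact[1] == "g":
            # Assumption: a trailing digit is the degree of comparison.
            tag[9] = compact[-1] if compact[-1].isdigit() else "1"
            tag[10] = "A"
    elif pos == "R":
        tag[0:2] = ["R", "V" if compact[1] == "V" else "R"]
        tag[4] = compact[-1]   # case required by the preposition
    else:                      # TT, J^ and J, keep only POS and SubPOS
        tag[0:2] = list(compact[:2])
    tag[14] = var
    return "".join(tag)


if __name__ == "__main__":
    for short in ["NFS1", "AAXXX", "Dg", "RV7", "NFXX-8", "Db-8"]:
        print(short, "=", expand_compact_tag(short))
```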
[1] See also: http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/docc0pos.pdf, for quick reference: http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html.
Proper names (either directly or the lemmas they consist of) have suffixes marking the category of that name:
| G – geographical name: Praha, Ústí nad Labem |
| Y – given (first) name, formerly used as default: Petr, John |
| S – surname (last name): Dvořák, Zelený, Agassi, Bush |
| E – name of a nationality: Čech, Kolumbijec |
| R – name of a product: Tatra (the car) |
| K – name of a company: Tatra (the company) |
| m – default – names of mines, stadiums, guerilla bases, etc.; also used for functional words in names. |
The lemma should start with an upper-case letter if the word is always capitalized in names (Tatra is always capitalized, but banka is not).
Keeping this categorization in the same level as lemmas is quite unsustainable and very unsuitable. |
|
|
Some names are sometimes declined, sometimes not (Bill – o Bill Clintonovi, o Billu Clintonovi, o Billovi). The tag for nondeclined form is NgXXA.
For some names (e.g. Ludwig van Beethoven), the van, etc. phrase is perceived as part of the surname; annotate it that way. For others it is still perceived as a geographical name (e.g. Kryštof Harant z Polžic a Bezdružic). Of course, the borderline is fuzzy.
Usage. The surname precedes the given name. In most cases, the whole name is used (not just the family name). Matters are complicated by the fact that many Chinese living abroad change the order of their names, use their given name as a surname, etc. The discussion below can help you determine which part of a name is the given name and which part is the surname. If you are in doubt, annotate them all as given names (Y).
That was the original recommendation, but probably annotating them as S would be better, because they are often used that way (You can say Clinton for Bill Clinton, but you cannot say Po for Po Li). |
Surnames. There are relatively few surnames in China (the 200 most common surnames account for >96% of all surnames). Most of them consist of one syllable (Wang, Li, Chen, etc.). Only a few surnames consist of two syllables (Ou-yang, Mo-qi, Si-ma, Pu-yang). Married women do not take their husband's surname.
Given names. Mostly two syllables, often connected with a dash (however sometimes separated by a space). Some can be widely used, some can be unique. Often it is impossible to say whether it is a name of a male or a female. The second syllable is usually used in informal addressing. The first syllable can be shared by all siblings. In traditional China a person had several given names during his/her life.
Most common Chinese surnames (in Pinyin): Cai, Ceng-Zeng, Chen, Chen-Shen, Deng, Gao, Guo, He, Hu, Huang, Li, Liang, Lin, Lü, Ma, She, Sun, Tang, Wang, Wu, Xie, Xu, Yang, Ye, Zhang, Zhao, Zheng, Zhu
Links.
| http://www.wlu.edu/~hhill/names.html – Chinese names explained |
| http://www.geocities.com/Tokyo/3919/atoz.html – Alphabetical Index of Chinese Surnames (incl. Pinyin, Anglicized and other versions) |
Korean names behave similarly to Chinese names. The surname precedes the given name. The given name of most Koreans consists of two parts, in the Latin alphabet often connected with a dash. The most common Korean surnames (45% of the population) are: Kim, Lee (often spelled as Rhee, Yi or Li), Park.
Sometimes you can encounter names that are Czech in origin but are somehow altered to fit other languages (diacritics are omitted, female and male surnames are the same – e.g. Judy Sedivy).
Use the following guidelines to decide the lemma and tag for such a name:
a name that does not distinguish female and male variants should have just one lemma and three different tags (gender M, F, X[2])
| Peter Janda – Janda_;S + NNMXX-----A---- or NNMS1-----A---- |
| Jane Janda – Janda_;S + NNFXX-----A---- |
| Jane a Peter Janda – Janda_;S + NNXXX-----A---- |
a name that has the same spelling as in Czech should use the Czech lemma
| Jane Janda – Janda_;S + NNFXX-----A---- |
a name with altered spelling has its own lemma (with the ,t suffix)
| Judy Sedivy – Sedivy_;S_,t + NNFXX-----A---- |
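The choice among the three tags for a surname that does not distinguish female and male variants can be sketched as follows. This is only an illustration of the guideline for the nondeclined case: the function name is hypothetical, the declined alternative (NNMS1-----A----) is not covered, and the possible future {M,F} gender value is not modelled.

```python
def undeclined_surname_tag(bearer_genders: list[str]) -> str:
    """Pick the tag for a nondeclined surname from the genders of its bearers.

    bearer_genders holds "M" or "F" for each person the surname refers to.
    """
    genders = set(bearer_genders)
    if genders == {"M"}:
        gender = "M"
    elif genders == {"F"}:
        gender = "F"
    else:
        gender = "X"  # mixed or unknown bearers
    return f"NN{gender}XX-----A----"


if __name__ == "__main__":
    print(undeclined_surname_tag(["M"]))       # Peter Janda
    print(undeclined_surname_tag(["F"]))       # Jane Janda
    print(undeclined_surname_tag(["F", "M"]))  # Jane a Peter Janda
```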
All lemmas of autosemantic words in compound names must have the category determined by the whole name (e.g. K, R). The lemmas of functional words contain default type category (m).
The problem is that a name of one type can occur as part of a name of a different type:
| New England – G |
| New England Association of Chemistry Teachers – K |
| New England Association of Chemistry Teachers Journal – R |
England is a G noun in the first name, a K adjective in the second, and an R adjective in the third.
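The rule that autosemantic words inherit the category of the whole name while functional words get the default category can be sketched in Python. The list of functional words here is a made-up illustration, not a list taken from the morphological lexicon, and the function name is hypothetical.

```python
# Hypothetical, incomplete list of functional words, for illustration only.
FUNCTIONAL_WORDS = {"of", "the", "and", "u", "v", "nad", "a"}


def categorize_compound_name(words: list[str], name_category: str) -> list[tuple[str, str]]:
    """Give each word the category of the whole name, or 'm' if it is functional."""
    return [
        (word, "m" if word.lower() in FUNCTIONAL_WORDS else name_category)
        for word in words
    ]


if __name__ == "__main__":
    name = "New England Association of Chemistry Teachers".split()
    print(categorize_compound_name(name, "K"))
```

The same word thus receives a different category depending on the name it occurs in, which is exactly what makes England G, K or R in the three names above.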
If the lemma of the category you need does not exist and you have to insert a new one, do not worry about the numbering of lemmas; somebody else will take care of it (it would be impossible to ensure that the numbers are unique across all annotators). That means that if there is another lemma differing only in category (e.g. England_;G is available, but you need England_;R), just change the category label.
|
Using the above-proposed separation[3] of morphology and name categorization, the New England example would be annotated quite easily (only England is marked by a category (G) by the morphological analyzer, the rest is done by some other kind of tool):
If the annotator did not recognize the components of the name (e.g. it is in Burmese), (s)he would annotate just the highest level. |
The categorization is sometimes quite tricky – you do not know, whether to consider a phrase a name or a name plus normal word:
| Nobelova nadace – Nobelův_;K nadace_;K[4] |
| Nobelův stůl (e.g. in a museum) – Nobelův_;S stůl |
| Nobelova cena – hard to say (m vs. normal), decided: Nobelův_;S cena. |
Examples:
| Brownův pohyb – Brownův_;S |
| Cena J. Debrau – Debrau_;S cena |
| Mérieuxův ústav – Mérieuxův_;K ústav (Should be ústav_;K but is not) |
| Divadlo J. Grossmana – divadlo_;K J-4_:B_;K Grossman_;K |
| příloha Kolumbus (in Lidové noviny) – Kolumbus_;m |
| v Dobrovského ulici nejezdí ... – Dobrovský_;G |
| v Dobrovského nejezdí ... – Dobrovský_;G |
| poliklinika Dobrovského (unofficial, it is located in D. Street) – Dobrovský_;G |
Using the separation of morphology and name categorization, this is quite easy:
| Nobelova nadace – (Nobelův_;S nadace)K |
| Nobelův stůl (e.g. in a museum) – Nobelův_;S stůl |
| Nobelova cena – easy to say: (Nobelův_;S cena)m. |
Examples:
| Brownův pohyb – Brownův_;S pohyb |
| Cena J. Debrau – (Debrau_;S cena)m |
| Mérieuxův ústav – (Mérieuxův_;S ústav)K |
| Divadlo J. Grossmana – (divadlo J-0_:B_;Y Grossman_;S)K |
| příloha Kolumbus (in Lidové noviny) – (příloha Columbus_;S)m |
| Dobrovského ulice – (Dobrovský_;S ulice)G |
| v Dobrovského – (Dobrovský_;S)G |
| poliklinika Dobrovského (unofficial, it is located in D. street) – (poliklinika (Dobrovský_;S)G)K |
Horses have all kinds of names (e.g. Vinná réva, Deprivace, He Shall Reign, La Paloma Monitor, Frýdlant, Gold End, Lučina, Green Peace, Areál, First, Bounty), and quite often you do not know whether the horse is female or male (sometimes even female-like names belong to a male horse). One clue is that in an Oak (a horse contest type), all horses are young mares, i.e. females.
In PDT 1.0 the names of horses were mostly not annotated correctly: simply any available lemma was selected. (Otherwise, a new lemma with category Y would have to be inserted in each case: e.g. Deprivace would be Deprivace_;Y, but was annotated as deprivace; He Shall Reign was annotated as a normal English phrase: he_,t shall_,t reign_,t.)
In our opinion, if the Y category were independent of the lemma, the horse name should be annotated correctly. |
A similar problem arises with the names of musical groups and DJs. For famous groups and DJs enter separate lemmas; for others use the normal available lemmas.
Name of the town in the club name: if only the town is mentioned, it is annotated as a geographical name (G); if the whole name of the club is given, it is annotated as an institution (K). This is analogous to countries (in Česko vs. Německo, both are annotated as G).
Examples:
| Cheb vs. Plzeň – Cheb_;G Plzeň_;G |
| SKP Union Cheb vs. Plzeň – SKP_:B_;K Union_;K[5] Cheb_;K Plzeň_;G |
Of course, with foreign clubs it can be hard to tell. If you do not know, annotate the name as an institution (K).
Examples:
| Chelsea – part of London, UK |
| Chelsea – Chelsea_;G |
| Chelsea FC – Chelsea_;K FC-1_:B_;K_;w_^(...) |
| Ferencvaros – part of Budapest, Hungary |
| Ferencvaros – Ferencvaros_;G |
| Ferencvaros TC – Ferencvaros_;K TC-6_:B_;K |
| Sparta – Sparta-2_;K |
| Sparta Praha – Sparta-2_;K Praha_;K |
| Viktorie Žižkov – Viktoria-2_;K_^(jméno_sport.klubu) Žižkov_;K |
Udinese is the adjective derived from Udine (a town in NE Italy); the official name of the football club is Udinese Calcio (calcio = football). However, in Czech the name is perceived as a noun and as the name of that club, therefore it is probably better to use it in that way:
| Udinese – Udinese_;K_,t + NNNXX-----A---- |
To determine whether something is the name of a town or of a club, you can try to find the name on a map (e.g. http://www.expedia.com/pub/agent.dll?qscr=mmfn) and also look for the club (e.g. http://www.soccerage.com).
|
Using the above-proposed[6] separation of morphology and name categorization, this looks much more consistent:
|
The name of a sport club often contains an abbreviation. Some are common and present in the analyzer's lexicon (e.g. FC, AC), some are quite unusual (e.g. EV, ERC, EC, EG, VS, AS). If they are not present in the lexicon, enter them, suffixing the lemma with _:B_;K_;w and using NNNXX-----A---8 as the tag.
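For an abbreviation that is missing from the lexicon, the new entry can be put together mechanically from the suffix and tag named above. A minimal Python sketch, with a hypothetical function name:

```python
CLUB_ABBREVIATION_SUFFIX = "_:B_;K_;w"
CLUB_ABBREVIATION_TAG = "NNNXX-----A---8"


def club_abbreviation_entry(abbreviation: str) -> tuple[str, str]:
    """Return (lemma, tag) for a club abbreviation missing from the lexicon."""
    return abbreviation + CLUB_ABBREVIATION_SUFFIX, CLUB_ABBREVIATION_TAG


if __name__ == "__main__":
    for abbreviation in ["EV", "ERC", "EG"]:
        print(club_abbreviation_entry(abbreviation))
```

Numbering of the new lemma is left out on purpose, since lemma numbers are assigned later by somebody else.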
Insisting on the inclusion of name categories (K, R, etc.) implies an explosion in the number of lemmas. We follow each examples section with analogous examples using the above-proposed separation of morphology and name categorization (see Section 3.2).
Streets. We suppose that the word ulice, etc. is always present, even if elided on the surface.
Examples:
| Dlouhá – dlouhý_;G + AAFS1----1A---- |
| Dlouhá ulice – dlouhý_;G + AAFS1----1A---- ulice + NNFS1-----A---- |
| Palackého, Dobrovského, etc. – Palacký_;G, Dobrovský_;G + NNMS2-----A---- |
Towns. One-word names that were originally adjectives are annotated as nouns.
Examples:
| Hluboká – Hluboká_;G + NFS1 |
| Dobrá Voda – dobrá_;G + AFS1 Voda_;G^(součást_názvu_Odolena_Voda) + NFS1 |
| Ohrada u Hluboké – Ohrada_;G + NFS1 u_;m + RR2 Hluboká_;G + NFS2 |
Examples:
|
A separate character for aggregate gender {M,F} would be good (for initials following a letter in newspaper, an initial before a foreign last name, foreign names, etc.).
This category contains, for example, companies, foundations, shops, clubs, sport clubs, restaurants, etc. All autosemantic words in names of restaurants have lemmas with K. The exceptions are functional words, which are annotated with the default type (m).
Restaurants.
Examples:
| Bar Viola – bar-2_;K, Viola-2_;K |
| U Medvídků – u-2_;m, medvídek-2_;K |
| La cambusa – Le-1_;m_,t_^(franc._člen_jako_souč._jmen_a_názvů)[8], cambusa_;K_,t |
| Restaurant HaPi – restaurant-2_;K HaPi_;K |
| Čínská restaurace Jin Jiang – čínský-2_;K, restaurace-2_;K, jin-2_;K, jiang-2_;K_,t |
| restaurace Jin Jiang – restaurace-1, jin-2_;K, jiang-2_;K_,t |
| Francouzská restaurace v Obecním domě – francouzský-2_;K, restaurace-2_;K, v-2_;m obecní-2_;K dům-2_;K |
| Hospůdka U vylitýho mrože – hospůdka-2_;K u-2_;m vylitý-2_;K mrož-2_;K |
All events should receive special lemmas with m. However, if an event is registered as a company and used in that meaning, then it should be K. If not certain, use m.
Examples:[9]
| Paris Indoor – Paris-2_;m_,t Indoor_;m_,t + NNNXX-----A---- |
| US Open – US-3_:B_;m_,t + AAXXX----1A---8 Open-1_,t_;m AAXXX----1A----[10] |
| akce Stop milión – stop-1_;m milión`1000000_;mm |
|
Generally, television stations are annotated as institutions (K). Only if a company runs several channels are the channels annotated as products (R); currently this is used only with the Czech(oslovak) public television (ČT1, ČT2 and F1).
All autosemantic words in names of newspapers or magazines have lemmas with R. Currently, some of the newspapers are in the lexicon as institutions (e.g. Sme); this is not correct. Foreign names are often used as plurals, even if in the original they are singular.
Names of songs, TV programs, etc. are annotated as normal words. The only reason is practical: treating them as names would cause an explosion of the lexicon. If the categories and morphology are separated (see beginning of Chapter 3), these items can be annotated as R or m.
[2] If {M,F} gender is introduced, the tag NN{M,F}XX-----A---- should be used.
[4] The lemmas have different numbers (e.g. Nobelův-1_;S, Nobelův-2_;K).
[5] In PDT 1.0, the lemma is Union, but it should be Union_;K.
[7] Frequent names of towns and names when POS changes, have separate entries. Therefore not (hluboká)G
[8] In the current morphological lexicon, the m is missing.
[9] Many of these entries are not in the lexicon; therefore the actual numbers can be different once they are added. See note in Section 2.1, e.g. mistrovství: mistrovství-1, mistrovství-2_;m, mistrovství-3_;R, etc.
[10] We think it is perceived as a noun, probably inanimate, in Czech.
For a discussion of inserting abbreviations not present in the morphological lexicon, see Chapter 10.
Abbreviations can be used with different genders (e.g. ODS – feminine (strana) or neuter). Any abbreviation can have neuter gender. If the gender cannot be disambiguated by the context, use the gender used elsewhere in the article. If the author mixes genders or there are no disambiguating contexts, use the inherent gender of the abbreviation. In Czech, it is usually easy to determine: it is the gender of the head of the unabbreviated equivalent (e.g. ODS – strana → f). With foreign abbreviations it is much more problematic, since different people use different genders (e.g. because of different translations). If you are not certain which gender is most widely used, use the default neuter.
Normal abbreviations sometimes have the abbreviation itself as their lemma, and sometimes the original unabbreviated word. The former method is usually used for abbreviations that are more common than the unabbreviated word (and for abbreviations of multi-word expressions), but this is not always true.
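The order of the fallbacks described above can be sketched in a few lines of Python. This is only an illustration, not part of the annotation tools: the function name, its parameters, and the one-letter gender codes passed in are assumptions made for the example.

```python
from typing import Optional, Sequence

DEFAULT_GENDER = "N"  # neuter: any abbreviation can have neuter gender


def abbreviation_gender(
    context_gender: Optional[str],
    article_genders: Sequence[str],
    inherent_gender: Optional[str],
) -> str:
    """Choose a gender code for an abbreviation (hypothetical helper).

    context_gender  -- gender fixed by the immediate context, or None
    article_genders -- genders the author uses for it elsewhere in the article
    inherent_gender -- gender of the head of the unabbreviated equivalent,
                       or None when the annotator is not certain of it
    """
    # 1. The context disambiguates the gender.
    if context_gender is not None:
        return context_gender
    # 2. Otherwise use the gender used elsewhere in the article,
    #    provided the author does not mix genders.
    distinct = set(article_genders)
    if len(distinct) == 1:
        return next(iter(distinct))
    # 3. Mixed genders or no disambiguating contexts: inherent gender.
    if inherent_gender is not None:
        return inherent_gender
    # 4. Not certain (typically a foreign abbreviation): default neuter.
    return DEFAULT_GENDER
```

For ODS with no disambiguating context anywhere in the article, `abbreviation_gender(None, [], "F")` yields the feminine of strana.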
For a discussion of determining the gender of an abbreviation, see Section 4.1.
Examples:
| např.: například_:B + Db------------8 |
| P.S.: |
| post-2_:B_,t_^(lat.,_po,_např._P.S.) + RR--X---------8 |
| scriptum_:B_,t_^(př._P.S.) + NNNXX-----A---8 |
| n.L.: nad-1_:B[11] + RR--7---------8, Labe_:B_;G + NNNS7-----A---8 |
| r. 1998: rok_:B + NNIXX-----A---8 |
| r.: režie_:B + NNFXX-----A---8 |
| rež.: režie_:B + NNFXX-----A---8 |
Note: The following is still not official.
Isolated letters (e.g. A-konto) are handled as abbreviations. The only exception is when they are not part of a name (zápas skupiny B). Many of the annotations suggested below are not yet offered by the morphological analyzer. Moreover, the morphological analyzer is sometimes constrained to offer the appropriate lemma and tag only if the letter is followed by a dot. This should be repaired.
You have to select (or insert) the lemma according to the semantic category:
| K-0_:B_;Y – first (and most middle) names |
| K-4_:B_;K – names of institutions |
| K-5_:B_;G – geographical names |
| K-6_:B_;R – names of products |
| K-7_:B_;m – other names (sporting events, etc) |
| K-9_:B_;S – last (and some middle) names |
| k-8_:B_^(ost._zkratka) – other abbreviations (not names) |
| k-3_^(označení_pomocí_písmene) – other letters (not abbreviations, not in names) |
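The table above can be read as a lookup from semantic category to lemma. The sketch below assumes that K and k in the table stand for an arbitrary letter, upper case in names and lower case otherwise; the category keys and the function name are invented for this illustration.

```python
# Lemma number and name-category code for letters used in names
# (taken from the table; the dictionary keys are hypothetical labels).
NAME_CATEGORIES = {
    "first_name": ("0", "Y"),
    "institution": ("4", "K"),
    "geographical": ("5", "G"),
    "product": ("6", "R"),
    "other_name": ("7", "m"),
    "last_name": ("9", "S"),
}


def letter_lemma(letter: str, category: str) -> str:
    """Build the lemma of an isolated letter for a semantic category."""
    if category in NAME_CATEGORIES:
        number, code = NAME_CATEGORIES[category]
        return f"{letter.upper()}-{number}_:B_;{code}"
    if category == "other_abbreviation":
        # Abbreviation, but not a name.
        return f"{letter.lower()}-8_:B_^(ost._zkratka)"
    if category == "other_letter":
        # Neither an abbreviation nor part of a name.
        return f"{letter.lower()}-3_^(označení_pomocí_písmene)"
    raise ValueError(f"unknown category: {category}")
```

Frequent abbreviations with their own lemmas (such as units) are outside this sketch and would have to be looked up first.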
Frequent abbreviations have their own lemmas, for example V – V-1`volt_:B or k: ABC k.s. – komanditní_:B_^(jen_komanditní_společnost).
Tag selection (or insertion):
noun: gender is known: NNgXX-----A---8 (g ∊ {MFIN})
noun: gender is unknown: NNXXX-----A---8
adjective: AAXXX----1A---8 or AAgXX----1A---8
others: X@------------1 (variant of X@------------- for one letter words)
Examples:
| A: A-mužstvo – a-3_^(označení_pomocí_písmene) + AAXXX----1A---- |
| d: odst. 1 písm. d) – d-3_^(označení_pomocí_písmene) + NNNXX-----A---- |
| A: 16 A – A-1`ampér_:B + NNIXX-----A---8 |
| A: A konto (or A-konto) – A-6_:B_;R + AAXXX----1A---- |
| a: ABC a.s. – akciový_:B_^(jen_akciová_společnost) + AAXXX----1A---8 |
| s: na s. 128 – strana-4_:B_^(v_knize,_rukopise,...) + NNFXX-----A---8 |
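The four tag templates listed under tag selection can be written as a small helper. This is a sketch that follows only those templates; the function name and the part-of-speech labels are assumptions, and it does not decide when the variant 8 should be left out.

```python
from typing import Optional

KNOWN_GENDERS = {"M", "F", "I", "N"}


def letter_tag(pos: str, gender: Optional[str] = None) -> str:
    """Return the 15-position tag for an isolated letter or abbreviation.

    pos    -- "noun", "adjective" or "other" (hypothetical labels)
    gender -- one of M, F, I, N, or None when the gender is unknown
    """
    if gender is not None and gender not in KNOWN_GENDERS:
        raise ValueError(f"unknown gender: {gender}")
    g = gender if gender is not None else "X"  # X = gender unknown
    if pos == "noun":
        return "NN" + g + "XX" + "-" * 5 + "A" + "-" * 3 + "8"
    if pos == "adjective":
        return "AA" + g + "XX" + "-" * 4 + "1A" + "-" * 3 + "8"
    if pos == "other":
        # Variant 1 of the unknown-word tag, used for one-letter words.
        return "X@" + "-" * 12 + "1"
    raise ValueError(f"unknown part of speech: {pos}")
```

For example, `letter_tag("noun", "F")` gives the tag used for s. (strana) in the examples above.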
An abbreviation preceding a noun is an adjective; an abbreviation following a noun is a noun. We would suggest annotating them all as nouns (see Section 6.1.1). Does this mean that HIV in HIV virus and in virus HIV has a different POS?
Units named after a male person (V – volt, A – ampér, etc.) have inanimate gender. However, units using degrees (°C, °F) have masculine animate gender, because the word stupeň is always present (even if omitted in the written text). Absolute temperature uses the unit called Kelvin (K), not degree of Kelvin; therefore the unit has inanimate masculine gender. However, if the author uses it erroneously as a degree, the tag has to be masculine animate.
Examples:
| C: Ráno byly 3 °C. – Celsius_:B – NNMXX-----A---8° |
| C: Ráno byly 3 C. (read as Ráno byly tři stupně Celsia) – Celsius_:B – NNMXX-----A---8 |
| K: Teplota 5000 K. – Celsius_:B – NNMXX-----A---8 |
| K: Teplota 5000 °K.- Celsius_:B – NNMXX-----A---8° |
If the C character is preceded by a character imitating the degree symbol ° (e.g. -C, o C, O C), you should mark it as an error: insert the degree[12] symbol ° as the lemma and X@------------1 as the tag. It should be converted into a punctuation mark.
The author's name abbreviations used in newspapers (e.g. Ber, mas, jst, ...) have as lemma the form followed by -99_:B_;S, and the tag NNXXX-----A---8. There is X for gender because we usually do not know it. If the {M,F} gender is introduced, it should be used here. These abbreviations are not present in the lexicon; therefore you have to insert them.
Titles distinguish genders: there has to be one lemma for men and one lemma for women (JUDr-1_:B_^(doktor_práv) vs. JUDr-2_:B_^(doktorka_práv)); to keep it consistent, the masculine has number 1 and the feminine has number 2. We think the titles should have the same lemma for women and men. Only the tag should be different, with the possibility of X if the gender is not known (e.g. a letter signed as Dr. A. B.).
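Since these entries have to be inserted by hand, the rule for author abbreviations is easy to state as code. The helper below is a minimal sketch; its name is an assumption, and it simply appends the suffix to the form as it appears in the text.

```python
from typing import Tuple

AUTHOR_LEMMA_SUFFIX = "-99_:B_;S"
AUTHOR_TAG = "NNXXX-----A---8"  # gender X: usually not known


def author_abbreviation(form: str) -> Tuple[str, str]:
    """Return (lemma, tag) for a newspaper author's name abbreviation."""
    return form + AUTHOR_LEMMA_SUFFIX, AUTHOR_TAG
```

For the form jst this gives the lemma jst-99_:B_;S.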
If an official alternative to the colloquial form exists, then the colloquial form has the same tag except for a different variant ('5', '6', '7', possibly '3' – see Section 2.2.1.13).
Examples:
| které: stavení, které – P4NP4---------5 |
| Novákovic: Novákovic pes – Novákův_;S_^(*2) – AUXXXM--------6[13] |
| takovejhlema: takovýhle – AAFP7----1A---6 |
| hovadinama: hovadina – NNFP7-----A---6 |
| naší: pro naší atletiku (officially short: naši) – můj_^(přivlast.) – PSFS4-P1------6 |
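Because a colloquial form differs from its official counterpart only in the variant, its tag can be derived mechanically. The sketch below assumes the variant is the last (15th) position of the tag, as in the examples above; the function name is hypothetical.

```python
TAG_LENGTH = 15
VARIANT_INDEX = 14  # zero-based index of the last tag position
COLLOQUIAL_VARIANTS = {"3", "5", "6", "7"}


def with_variant(official_tag: str, variant: str) -> str:
    """Copy the tag of the official form and replace only its variant."""
    if len(official_tag) != TAG_LENGTH:
        raise ValueError("a positional tag must have 15 positions")
    if variant not in COLLOQUIAL_VARIANTS:
        raise ValueError(f"not a colloquial variant: {variant}")
    return official_tag[:VARIANT_INDEX] + variant
```

Starting from the tag of an official feminine plural instrumental noun, `with_variant("NNFP7-----A----", "6")` produces the tag given for hovadinama.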
We tagged these words as if they were without -s and added -9 at the end.
In our opinion, it would be better to split such an expression into two words (e.g. cos → co + být, analogous to abych → aby + být) and tag them as two normal words, with a variant marking the split.
These should not be treated as misspellings, but annotated as (colloquial) variants of the official -á forms (variant '5').
General rule
For a longer phrase (or a citation) in a foreign language, use the morphology of that language (but distinguish genders M and I ??). (Hence citation use.)
For a single word or a shorter phrase, use Czech morphology. (Hence word use.) The borderline is fuzzy, of course.
Many foreign words used in a Czech sentence can have a different part of speech than in their original language. Usually the hint is how the word behaves in different contexts: whether it is declined as a noun, whether it agrees with its head, etc.
All foreign lemmas have the _,t suffix.
It would be good to somehow distinguish foreign words in word or citation use.
All nouns in attributive use are annotated as adjectives.
This is quite problematic. We think it should be annotated as two nouns.
| V kostele XY zpívala Musica Bohemica. |
| Bohemica annotated as a noun; in Latin it is an adjective. |
| Reason: When the phrase is declined, Bohemica is declined as a noun (žena): pozvali Musicu Bohemicu, *pozvali Musicu Bohemicou |
| Annotation: Musica_,t_;K + NFS1A, Bohemica_,t_;K NFS1A |
| To je trochu ad hoc. |
| hoc is annotated as a noun; in Latin it is an adverb. |
| Annotation: ad_,t RRX, hoc_,t NXXXA |
In the following, the section headers refer to the categories of the foreign language.
English an should be a form of a.
Articles merged with a preposition (fra du, ita della, deu im, aufs, zur) are treated as prepositions (?Split into two words?)
Arabic short words (##)(?articles, ?prepositions) are treated as articles.
Same as single words
Should distinguish gender, number, and/or case. Therefore: TTgnc or AAgnc ??
| Tag: TT------------- |
| Lemma: Usually the same as the form |
Originally, we wanted to treat articles as adjectives. Forms having different gender, number and/or case, would have the same lemma (der for forms der, die, das, des, dem, den). The problem is that Czech does not respect the original categories (nebezpečný La Manche – la i