1. Overview and some basics
1.1. The basic parts of ToBI
A ToBI transcription for an utterance consists minimally of a
recording of the speech, an associated electronic or paper record of
the fundamental frequency contour, and (the transcription proper)
symbolic labels for events arranged in four parallel tiers. (Other
tiers can be added for the needs of particular sites -- see Section
4.) The four tiers of labels, arranged in the order that they appear
in the default labels window for the examples and exercises programs,
are:
(1) a tone tier
(2) an orthographic tier
(3) a break-index tier
(4) a miscellaneous tier
The tone and break-index tiers represent the core prosodic analysis.
The tone tier is the part of the transcription that corresponds most
closely to a phonological analysis of the utterance's intonation
pattern. It consists of labels for distinctive pitch events,
transcribed as a sequence of high (H) and low (L) tones marked with
diacritics indicating their intonational function as parts of pitch
accents or as phrase tones marking the edges of two types of
intonationally marked prosodic units. The inventory of pitch events
and their definitions are based on autosegmental analyses, in
particular the analysis of Pierrehumbert and her colleagues (see
Pierrehumbert & Hirschberg, 1990, and the references cited in it) with
some modifications toward such alternative analyses as that of Ladd
(1983). In example utterance <<jam1>>, there is a production of the
question "Will you have marmalade, or jam?" with two pitch accents
(the L* tones), two phrase accents (the H- tones), and a H% boundary
tone.
EXAMPLE <<jam1>>: Will you have marmalade, or jam?
                                L*      H-    L* H-H%
The break-index tier marks the prosodic grouping of the words in an
utterance by labelling the end of each word for the subjective strength
of its association with the next word, on a scale from 0 (for the
strongest perceived conjoining) to 4 (for the most disjoint). These
categories of association strength, or `break indices' are based on
work by Mari Ostendorf, Patti Price, Stefanie Shattuck-Hufnagel, and
their associates (see, e.g., Price et al., 1991). We equate the two
highest break indices with prosodic groupings that are marked
intonationally. For example, break index 3 after the word "marmalade"
in utterance <<jam1>> corresponds to the end of the intermediate
phrase indicated by the H- phrase accent.
EXAMPLE <<jam1>>: Will you have marmalade, or jam?
                     1   1    1         3   1   4
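As a rough illustration of how the break-index tier encodes grouping,
the following Python sketch pairs each word of <<jam1>> with its break
index and splits the utterance at every break of a given strength or
higher. The list representation and the function name group_by_break
are assumptions made for this sketch; they are not part of the ToBI
conventions or of any ToBI tool.

```python
# Hypothetical in-memory form of two tiers for utterance <<jam1>>.
words = ["Will", "you", "have", "marmalade", "or", "jam"]
break_indices = [1, 1, 1, 3, 1, 4]


def group_by_break(words, break_indices, level):
    """Split words into groups that end at each break index >= level."""
    groups, current = [], []
    for word, index in zip(words, break_indices):
        current.append(word)
        if index >= level:
            groups.append(current)
            current = []
    if current:  # trailing words with no closing break at this level
        groups.append(current)
    return groups


# Break index 3 or higher closes an intermediate phrase;
# break index 4 closes an intonation phrase.
intermediate_phrases = group_by_break(words, break_indices, 3)
intonation_phrases = group_by_break(words, break_indices, 4)
```

With the indices above, the level-3 grouping separates "Will you have
marmalade" from "or jam", while the level-4 grouping keeps all six
words in one unit.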
The orthographic tier is arguably not part of any core prosodic
analysis, except inasmuch as the labels on this tier can be used to
interface the transcription to dictionary entries which do indicate
such things as which syllable is likely to be most stressed in each
word, prosodic information which is not otherwise included in the ToBI
system. The orthographic tier is a straightforward transcription of
all of the words in the utterance, in ordinary English orthography.
When using waves(tm) and a transcriber script, or any similar computer
labelling system, the convention is to align each orthographic label to
the end of the word.
The miscellaneous tier, like the orthographic tier, can include many
events that are arguably not part of prosody per se. However, many
events that are typically marked on this tier are important for
interpreting the analyses on the tone tier and break-index tier,
because they disrupt the smooth rhythm of the utterance or interrupt
the intonation contour. This tier is essentially a `comment' tier
that can be used to mark events such as the cough in example utterance
<<cough>>. With very few exceptions (most notably, the
label `disfl' often stands alone to flag the occurrence of a perceived
disfluency of some type), labels on this tier come in pairs, to mark
the beginning and end of each event interval. If it were not for the
disruption of the cough labelled on the miscellaneous tier here, the
tone transcription would have to be parsed as either unfinished or
ill-formed.
EXAMPLE <<cough>>: Will you have marmalade ...
L* L*
1 1 1 1p
cough< cough>
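The pairing convention for miscellaneous-tier labels lends itself to a
simple consistency check. The Python sketch below assumes, based only
on the `cough<' and `cough>' labels in the example, that an opening
label ends in `<' and its closing partner ends in `>'. The function
name, the set of labels allowed to stand alone, and the decision to
flag any other unpaired label are assumptions of this sketch, not
documented behaviour of any ToBI tool.

```python
def unpaired_misc_labels(labels, standalone=("disfl",)):
    """Return miscellaneous-tier labels that lack a matching partner.

    labels     -- labels in the order they appear on the tier
    standalone -- labels allowed to occur alone (assumed set)
    """
    open_events = []  # names of events opened but not yet closed
    problems = []
    for label in labels:
        if label in standalone:
            continue
        if label.endswith("<"):
            open_events.append(label[:-1])
        elif label.endswith(">"):
            name = label[:-1]
            if name in open_events:
                open_events.remove(name)
            else:
                problems.append(label)  # close without an open
        else:
            problems.append(label)  # neither paired nor known standalone
    # Anything still open at the end was never closed.
    problems.extend(name + "<" for name in open_events)
    return problems
```

For the tier in <<cough>>, unpaired_misc_labels(["cough<", "cough>"])
returns an empty list.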
1.2. Guiding principles
As should be obvious from the preceding examples, ToBI does not try to
transcribe all aspects of prosody, or even all aspects that are
amenable to symbolic transcription. In deciding what to include and
what to leave out, we were guided by three principles. First, we
wanted to be able to distinguish in our transcription all of the
categorically distinct intonation patterns and prosodic units of the
language (or rather of the three intonationally similar dialects that
we claim to cover -- see Section 0.4 above). Second, we felt we
should not transcribe aspects of prosody which are more amenable to
quantitative measures than to the categorical divisions of a symbolic
transcription. Finally, we did not want to squander the user's
energies in transcribing even categorical aspects of prosody which are
predictable from other parts of the transcription or from auxiliary
tools such as dictionaries.
The categorical aspects of prosody which we try to capture completely
(by the first principle) are of two types. The first is the prosodic
structure -- the rhythm of more and less stressed words alternating
with each other, and the grouping of words into prosodic constituents
of various sizes -- and the second is the intonation pattern -- the
sequence of contrastive pitch events that we call pitch accents,
phrase accents, and boundary tones.
An example of the noncategorical aspects of prosody which we leave out
(in accordance with the second principle) is the local tempo of each
word in the utterance, which we feel could be more accurately and
directly captured by some quantitative measure such as normalized
segment duration (e.g., Campbell, 1992) than by any symbolic
transcription such as an arbitrary division into, say, categories `1',
`2', and `3' (for `slow', `medium', and `fast' tempi). An exception
to this principle is the marking for each phrase of the point of
highest fundamental frequency associated with an accent (HiF0), which
we use as a measure of pitch range in order to facilitate research on
the relationship between pitch range and discourse structure (see,
e.g., Grosz & Hirschberg, 1992, and references therein). We
anticipate being able to do away with this marking when we have
developed automatic tools for detecting accent-related peaks directly
from the fundamental frequency contour in conjunction with the tone
tier transcription.
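To make the idea of an accent-related peak concrete, here is a minimal
Python sketch that picks the highest fundamental frequency sample lying
near any pitch accent in a phrase. It is not one of the automatic tools
anticipated above. The function name hi_f0, the (time, Hz) sample
format, and the 0.1 second search window are all assumptions chosen for
illustration.

```python
def hi_f0(f0_track, accent_times, window=0.1):
    """Return (time, hz) of the highest F0 sample near any accent.

    f0_track     -- list of (time_in_seconds, hz); hz is None when unvoiced
    accent_times -- times of the pitch accent labels in the phrase
    window       -- assumed search radius around each accent, in seconds
    """
    candidates = [
        (hz, time)
        for time, hz in f0_track
        if hz is not None
        and any(abs(time - accent) <= window for accent in accent_times)
    ]
    if not candidates:
        return None
    hz, time = max(candidates)
    return (time, hz)


# Made-up values: the 250 Hz sample is ignored because no accent is near it.
track = [(0.10, 180.0), (0.20, 210.0), (0.30, None), (0.60, 250.0), (0.90, 190.0)]
peak = hi_f0(track, accent_times=[0.15, 0.85])
```

Restricting the search to the neighbourhood of accents is what
distinguishes this from a plain maximum over the phrase.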
A categorical aspect of prosody which we leave out (in accordance with
the third principle) because it should be fairly predictable is the
marking of the stressed and unstressed syllables within each word. By
this level of stress we mean the word-internal alternation between
more and less stressed syllables where the relative prominence of any
pair of syllables is fairly fixed and can be thought of as inherent to
the word's dictionary entry. For example, if the first and third
syllables in the word "marmalade" are not pronounced with more
prominence than the second, native speakers will judge the vowels in
these two syllables to be mispronounced. (That is, the first and third
syllables should not have reduced vowels, whereas the second one
should.) Since such word-internal rhythms are thus a fixed part of
the word's pronunciation, we leave this specification out. That is,
for example, in the transcription of utterances <<jam1>> and
<<cough>>, we have not marked the first and third syllables as
relatively more stressed than the second syllable, since this aspect
of the prosodic structure would be marked in any dictionary entry for
the word, so that users of ToBI-transcribed databases could interface
the orthographic tier with an online dictionary to fill in this
information.
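The dictionary lookup described above might look something like the
following Python sketch. The one-entry dictionary, its strong/weak
notation, and the function name lexical_stress are invented for
illustration; a real application would substitute an actual online
pronouncing dictionary and that dictionary's own stress notation.

```python
# Hypothetical one-entry dictionary. The pattern for "marmalade" follows
# the description in the text: first and third syllables more prominent.
STRESS_DICTIONARY = {
    "marmalade": ("strong", "weak", "strong"),
}


def lexical_stress(orthographic_tier, dictionary=STRESS_DICTIONARY):
    """Pair each orthographic label with its dictionary stress pattern.

    Words missing from the dictionary are paired with None.
    """
    result = []
    for label in orthographic_tier:
        key = label.lower().strip(",.?!")  # ignore case and punctuation
        result.append((label, dictionary.get(key)))
    return result


pattern = lexical_stress(["have", "marmalade,"])
```

Because the word-internal pattern is recovered from the orthographic
label alone, nothing about it needs to be stored in the transcription.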
1.3. The marking of stress -- Pitch accents and prominence
If the stress patterns within words are largely predictable from the
dictionary entries for the word, what about other levels of stress?
It has been recognized for some time now (e.g. Bolinger, 1972) that
other aspects of the stress pattern cannot be predicted from the
grammar with anything like the confidence with which we can predict
the more stressed syllables within a word. Indeed, the question of
what predicts the prominence of a word relative to other words in the
same sentence is a matter of much current debate (see e.g.
Hirschberg, 1993), and is one of the issues which we hope
ToBI-transcribed databases will be most useful in helping to resolve.
Example utterance <<made1>> illustrates the unpredictability of
prominences above the word, with three different productions of the
same sentence -- "Marianna made the marmalade" -- each of which has a
different stress pattern. In the first production, there are two
syllables that are relatively more prominent than any other, the
accented syllables in the words "Marianna" and "marmalade". In the
second production of the sentence, on the other hand, there is only
the one relatively more prominent syllable in "Marianna", and
"marmalade" has been `deaccented'. This level of stress is marked in
the ToBI system by directly transcribing the pitch accent on the tone
tier. Thus, in the transcription of the first production in the
example, there are H* accents marked for both "Marianna" and
"marmalade", whereas in the second production there is only the L+H*
accent marked on "Marianna". (The third production, like the first,
also has accents on "Marianna" and "marmalade", but it has a different
stress pattern because both of these accents are nuclear stresses,
whereas in the first production only "marmalade" has a nuclear accent.
We will describe this higher level of stress in more detail in the
next subsection.)
EXAMPLE <<made1>>: Marianna made the marmalade.
in three productions 1) H* H* L-L%
2) L+H* L-L%
3) L+H*L-H% L* H* L-L%
Note that there is another difference between the first production and
the last two: the second and third productions begin at a much lower
fundamental frequency than the first. This is due to the distinction,
marked on the tone tier, between a single-tone H* pitch accent and a
bitonal L+H* pitch accent. This contrast is independent of the
difference in stress pattern, which depends on the pattern of pitch
accent PLACEMENT and not on the type of pitch accent. To see this,
compare the first two productions of the sentence in <<made2>>
with the second two productions. (These first two sentences are the
same as productions (1) and (2) in <<made1>>.)
EXAMPLE <<made2>>: Marianna made the marmalade.
in four productions 1) H* H* L-L%
2) L+H* L-L%
3) L+H* !H* L-L%
4) H* L-L%
The stress patterns are the same, but the choice of H* versus L+H*
pitch accent type is the opposite. (For the relationship between the
second pitch accent and the first in production (3) and the diacritic
`!' that marks this relationship, see Section 2.8 below. The somewhat
less low beginning in the third production is also discussed in Section
2.2.) The same stress patterns are illustrated again in the third and
fourth productions in <<made3>> with yet another pitch accent type,
this time a L* pitch accent (with a following rise into H- phrase
accent and H% boundary tone).
EXAMPLE <<made3>>: Marianna made the marmalade.
in four productions 1) L+H* !H* L-L%
2) H* L-L%
3) L* L* H-H%
4) L* H-H%
In transcriptions using waves(tm) label files (or any similar computer
labelling system), the stress that comes from associated pitch accents
can be parsed from reading the tone tier, since the waveform is used
to place the mark for a pitch accent somewhere in the syllable that is
phonologically associated to the accent. In the non-waves(tm)
transcription conventions, the stress is marked even more explicitly
in the symbolic string, by putting an asterisk in the orthographic
transcription just before the vowel of each accented syllable.
1.4. The marking of stress -- Intonational phrasing and prominence
Above the level of contrast between pitch-accented versus unaccented
words, native speakers of English can distinguish another level of
stress contrast, that between the last accented word of a phrase and
any preceding accent. In the first production in utterance
<<made1>>, for example, the word "marmalade" feels more prominent
than "Marianna". In the last production of the sentence, on the other
hand, "marmalade" does not feel necessarily more prominent than
"Marianna". The sentence has been divided into two intonational
phrases, so that each of these words is the last accented word in its
own phrase. (This level of prominence is often called the `nuclear
stress' or `nuclear accent' of the phrase.) Note that the level of
prominence need not be marked explicitly, since the word with nuclear
stress is defined positionally: it is the last accented word in the
phrase (or the only accented word, if there is just one). Thus the
prominence contrast between a nuclear accent and a mere (prenuclear)
accent can be read from the transcription of the accents on the tone
tier relative to the boundaries marked between the phrases.
EXAMPLE <<made1>>: Marianna made the marmalade.
in three productions 1) H* H* L-L%
1 1 1 4
2) L+H* L-L%
1 1 1 4
3) L+H*L-H% L* H* L-L%
4 1 1 4
There are two separate markings indicating the boundaries of an
intonation phrase; one is the sequence of phrase accent and boundary
tone on the tone tier, and the other is the 4 on the break-index
tier. The break indices are numbered from 0 (for least disjuncture)
to 4 (for most pronounced disjuncture). The numbering captures the
hierarchical nature of these prosodic groupings. At the highest level
of the break index hierarchy and at the next lower level, the sense of
disjuncture between adjacent words is connected closely to the
intonation pattern. The boundary after "Marianna" in the third
production of the sentence in <<made1>> is one at the highest level in
the break index hierarchy transcribed in ToBI. This level is marked
tonally by a boundary tone (H% or L%) at its end (and sometimes at its
beginning, too, in which case it is %H). The next lower level (break
index 3) is marked by a phrase accent (H- or L-) at its end.
An intonation phrase contains one or more intermediate phrases, and
the end of an intonation phrase is by definition also the end of an
intermediate phrase (break index 3). This fact is reflected on the
tone tier in the requirement that there be a sequence of phrase accent
(for the last intermediate phrase) followed by a boundary tone at the
end of every intonation phrase. The last production of the sentence
in <<made1>> illustrates this nicely with clear reflexes of the
tone string in the fundamental frequency contour. Note first the fall
from the peak for the L+H* nuclear pitch accent to the L- phrase
accent for the first intermediate phrase, followed by the small rise
in fundamental frequency to the H% boundary tone at the intonation
phrase boundary.
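The correspondence between break indices and phrase tones can be stated
as a small well-formedness check. The Python sketch below is only an
illustration of that requirement, not a description of any existing
checker. The (break index, tone string) input format and the function
name are assumptions, and the patterns cover only the H-, L-, H% and
L% labels mentioned in this section. Break indices carrying diacritics
are out of scope.

```python
import re

PHRASE_ACCENT = re.compile(r"[HL]-")
PHRASE_ACCENT_PLUS_BOUNDARY = re.compile(r"[HL]-[HL]%")


def boundary_errors(boundaries):
    """Check phrase tones against break indices, word by word.

    boundaries -- list of (break_index, tones) per word, where tones is
                  the phrase-tone string at that word's end, or None
    Returns a list of (position, break_index, expected) for each mismatch.
    """
    errors = []
    for position, (break_index, tones) in enumerate(boundaries):
        tones = tones or ""
        if break_index == 4:
            ok = PHRASE_ACCENT_PLUS_BOUNDARY.fullmatch(tones) is not None
            expected = "a phrase accent followed by a boundary tone"
        elif break_index == 3:
            ok = PHRASE_ACCENT.fullmatch(tones) is not None
            expected = "a phrase accent and no boundary tone"
        else:
            ok = tones == ""
            expected = "no phrase tones"
        if not ok:
            errors.append((position, break_index, expected))
    return errors


# Third production of <<made1>>: Marianna | made the marmalade
made1_third = [(4, "L-H%"), (1, None), (1, None), (4, "L-L%")]
problems = boundary_errors(made1_third)
```

The third production of <<made1>> passes this check, since both of its
break index 4 boundaries carry a phrase accent and a boundary tone.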
Utterance <<insert>> illustrates the next lower level of disjuncture,
that between two intermediate phrases that are grouped into one
intonation phrase. In the second production of the sentence "`I'
means insert", there is a fall from a H* nuclear accent into a L-
phrase accent, but there is no subsequent boundary tone, since this is
not an intonation phrase boundary.
EXAMPLE <<insert>> -- `I' means insert.
in two productions 1) H* H* L-L%
1 1 4
2) H* L- H* L-L%
3 1 4
Note that the first production of the sentence in <<insert>> contrasts
with this second production in its stress pattern in the same way as
the first and third productions of <<made1>>. The notion of nuclear
accent is defined relative to the intermediate phrase. The
contrasting productions in <<made4>> illustrate the same contrast in
one versus two nuclear accents with L* pitch accents and a H- phrase
accent at the boundary between the two intermediate phrases in the
production with two nuclear accents. (The *? on the "made" in the
first production illustrates a very common type of ambiguity about
accent placement that is discussed below in Section 2.9.)
EXAMPLE <<made4>>: Marianna made the marmalade.
in two productions 1) L* *? L* H-H%
1 1 1 4
2) L* H- L* H-H%
3 1 1 4
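Because nuclear accent is defined positionally within the intermediate
phrase, it can be derived from the pitch accents and break indices
rather than being labelled separately. The following Python sketch
shows one way to do so. The (word, accent, break index) triples and the
function name nuclear_accents are assumptions of this sketch, and any
break index of 3 or higher is treated as closing an intermediate
phrase.

```python
def nuclear_accents(words):
    """Return the nuclear-accented word of each intermediate phrase.

    words -- list of (word, pitch_accent_or_None, break_index)
    """
    nuclear = []
    last_accented = None
    for word, accent, break_index in words:
        if accent is not None:
            last_accented = word  # later accents override earlier ones
        if break_index >= 3:  # end of an intermediate phrase
            if last_accented is not None:
                nuclear.append(last_accented)
            last_accented = None
    return nuclear


# First production of <<made1>>: one phrase, accents on both nouns.
made1_first = [("Marianna", "H*", 1), ("made", None, 1),
               ("the", None, 1), ("marmalade", "H*", 4)]
# Second production of <<made4>>: two intermediate phrases.
made4_second = [("Marianna", "L*", 3), ("made", None, 1),
                ("the", None, 1), ("marmalade", "L*", 4)]

one_nucleus = nuclear_accents(made1_first)
two_nuclei = nuclear_accents(made4_second)
```

The first list yields only "marmalade" as nuclear, whereas the second
yields both "Marianna" and "marmalade", matching the contrast described
above.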
1.5. What lines up with what?
The conventions for placing labels when using the waves(tm) labelling
system are prescribed in the ToBI Annotation Conventions so that
labellers can use tools such as John Pitrelli's checker program to
check for inadvertent omissions and grammatical errors. To quickly
summarize, the break index label is placed at or just after the word
label. Phrase accent and boundary tone labels are placed on or just
before the corresponding 3 or 4 break index label. Pitch accents
are placed somewhere within the accented syllable, preferably within
the interval that can be identified with the syllable's vowel.
In the non-waves(tm) transcription conventions, the orthographic,
tone, and break index labels are ordered within each line so that such
a transcription could be generated fairly quickly by merging and
sorting a set of waves(tm)-format label files.
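The merge-and-sort step can be pictured with the short Python sketch
below. It does not read actual waves(tm) label files. Each tier is
assumed to have been loaded already as a list of (time, label) pairs,
and the times shown are made up. The order used to break ties between
labels with the same time (words, then tones, then break indices, then
miscellaneous) is also an assumption, chosen to be consistent with the
placement conventions summarized above.

```python
def merge_tiers(tiers, tier_order=("words", "tones", "breaks", "misc")):
    """Merge per-tier (time, label) lists into one time-sorted list.

    tiers      -- dict mapping tier name to a list of (time, label)
    tier_order -- assumed tie-breaking order for labels at the same time
    """
    rank = {name: position for position, name in enumerate(tier_order)}
    merged = [
        (time, rank.get(name, len(rank)), name, label)
        for name, labels in tiers.items()
        for time, label in labels
    ]
    merged.sort()  # by time, then by tier rank
    return [(time, name, label) for time, _, name, label in merged]


# Made-up times for the last two words of <<jam1>>.
tiers = {
    "words": [(0.30, "or"), (0.62, "jam")],
    "tones": [(0.50, "L*"), (0.62, "H-H%")],
    "breaks": [(0.30, "1"), (0.62, "4")],
}
merged_labels = [label for _, _, label in merge_tiers(tiers)]
```

Sorting on the pair of time and tier rank keeps labels that share a
time stamp in a fixed, predictable order within the merged line.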
labelling_guide_v2.ASCII (augmented by some HTML)
This page is maintained by M. Beckman (mbeckman@ling.osu.edu)