Data analysis problem on "Vowel differences and speaker differences"
(Data analysis assignment 4 for Ling H286, Autumn 2007, Ohio State University)
Copyright © 2007 Mary E. Beckman
0. Due date (and a reminder about collaboration).
Do the data analysis described below and turn in your report
at the beginning of class on Wednesday, November 14.
(You may work in groups to make the histograms and scatterplots
and so on. However, if you do so, you
must remember to acknowledge the contributions of others
in your report. Also, the writing of your report on the
data must be your own individual work.)
1. The primary data
The data that we will use for this assignment are from the study of
Detroit dialect vowels by James Hillenbrand and colleagues that
Peter Ladefoged cites in Chapter 5 of "Vowels and consonants" and uses
in his Figure 5.7 on p. 46. Please note that the
data are copyrighted by James Hillenbrand. The full data set is
available from:
http://homepages.wmich.edu/~hillenbr/voweldata.html
and whenever you use the data, you should cite the following paper,
in which the study was described:
James Hillenbrand, Laura A. Getty, Michael J. Clark, &
Kimberlee Wheeler (1995). Acoustic characteristics of American English vowels.
Journal of the Acoustical Society of America, 97, 3099-3111.
In this study, Hillenbrand et al. elicited 12 /h_d/ words from 139
speakers of a Northern Cities variety of American English. The words
and the codes that Hillenbrand et al. use for the target vowels
are: ae="had", ah="hod" (the vowel in "cot"), aw="hawed", eh="head",
er="heard", ey="hayed", ih="hid", iy="heed", oa="hoed" (/o/ as in "boat"),
oo="hood", uh="hud", uw="who'd". They then measured formant values at
several different places in the vowel, including near the vowel midpoint,
as we are doing for the words that we recorded for the term project.
They also played the audio files to a group of 20 listeners (also
natives of the Detroit area, as were the talkers who produced the
words), asking them to identify the word.
I have downloaded the file that contains the formant measurements
to a subdirectory of our course web page called
HillenbrandHighVowels
and renamed it as bigdata.txt so that you can look at it
on a Windows PC just by clicking on the icon.
There is also a "massaged" copy of the identification data in
the file iddataMinusAsterisk.txt in that directory.
Download these two files and the file of R code there. That is, download:
- bigdata.txt
- iddataMinusAsterisk.txt
- HillenbrandHighFrontVowels.R
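As a rough sketch of how the vowel codes listed above might be pulled apart in R, the example below splits token labels into talker group, talker, and vowel. It rests on an assumption that is not stated in this handout: that each token is labeled with a one-letter talker-group code, a two-digit talker number, and a two-letter vowel code (for example "w07iy"). The labels are invented for illustration. Check the actual layout of bigdata.txt and the comments in HillenbrandHighFrontVowels.R before relying on any of it.

```r
# Hypothetical token labels (invented for illustration, not read from bigdata.txt).
# Assumed layout: group letter + two-digit talker number + two-letter vowel code.
labels <- c("m01iy", "m01ih", "w07iy", "b12ih", "g03iy", "w07ae")

group  <- substr(labels, 1, 1)  # assumed codes: m = man, w = woman, b = boy, g = girl
talker <- substr(labels, 1, 3)  # group letter plus talker number
vowel  <- substr(labels, 4, 5)  # two-letter vowel code, e.g. "iy" = "heed"

# Collapse the four assumed talker groups into two sexes.
sex <- ifelse(group %in% c("m", "b"), "male", "female")

tokens <- data.frame(label = labels, talker = talker, vowel = vowel, sex = sex)

# Keep only the two high front vowels used in this assignment.
high_front <- tokens[tokens$vowel %in% c("iy", "ih"), ]
print(high_front)
```

Keeping the four-way group code alongside the two-way sex code makes it easy to return to the men/boys/women/girls split later.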
2. Analyze the effects of vowel type versus speaker type.
Read the comments and code in the file HillenbrandHighFrontVowels.R
and make the six histograms and two scatterplots that the code
describes. These are:
- A histogram of the first formant in all 139 tokens of the vowel
i in "heed" with a histogram of the F1 in the vowel I
in "hid" overlaid.
- Like the first figure, but for the second formant.
- Like the first figure, but for vowel duration.
- Like the first figure, but with the F1 values separated by
speaker sex instead of by vowel category.
- Like the second figure, but with the F2 values separated by
speaker sex.
- Like the third figure, but with the vowel durations separated by
speaker sex.
- Vowel space scatterplot with the two vowels plotted with
different plotting characters ("i" for i and "I" for I)
and with red circles drawn around the four tokens of the word "heed"
that one listener misidentified as "hid".
- Vowel space scatterplot with males and females plotted with
different plotting characters.
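The two kinds of figure listed above can be sketched in base R as follows. This is a minimal illustration, not the code in HillenbrandHighFrontVowels.R. All values are simulated with arbitrary means and spreads (they are not the Hillenbrand measurements), the circled indices are arbitrary, and the reversed axes follow the usual vowel-chart convention rather than anything this handout requires.

```r
# Simulated stand-in values (NOT the Hillenbrand data); the means and
# standard deviations are arbitrary, chosen only so the groups partly overlap.
set.seed(1)
n <- 139
f1_iy <- rnorm(n, mean = 350, sd = 50)
f1_ih <- rnorm(n, mean = 450, sd = 50)
f2_iy <- rnorm(n, mean = 2700, sd = 300)
f2_ih <- rnorm(n, mean = 2300, sd = 300)

# Overlaid histograms: use shared bin edges so the two are comparable.
all_f1 <- c(f1_iy, f1_ih)
breaks <- seq(floor(min(all_f1) / 25) * 25, ceiling(max(all_f1) / 25) * 25, by = 25)
h_iy <- hist(f1_iy, breaks = breaks, plot = FALSE)
h_ih <- hist(f1_ih, breaks = breaks, plot = FALSE)

blue <- rgb(0, 0, 1, 0.4)  # semi-transparent so the overlap stays visible
red  <- rgb(1, 0, 0, 0.4)
plot(h_iy, col = blue, ylim = c(0, max(h_iy$counts, h_ih$counts)),
     main = "F1 in heed vs. hid (simulated)", xlab = "F1 (Hz)")
plot(h_ih, col = red, add = TRUE)
legend("topright", legend = c("heed", "hid"), fill = c(blue, red))

# Vowel space scatterplot: letters as plotting characters, both axes
# reversed so that high front vowels appear at the top left.
f1 <- c(f1_iy, f1_ih)
f2 <- c(f2_iy, f2_ih)
sym <- rep(c("i", "I"), each = n)
plot(f2, f1, type = "n", xlim = rev(range(f2)), ylim = rev(range(f1)),
     xlab = "F2 (Hz)", ylab = "F1 (Hz)")
text(f2, f1, labels = sym)

# Circle a few tokens in red (indices here are arbitrary placeholders).
flagged <- c(3, 10, 25, 40)
points(f2[flagged], f1[flagged], col = "red", cex = 3)
```

Computing the bin edges once from the pooled values means the two histograms line up bar for bar. If each call to hist() chose its own bins, the overlap would be harder to judge by eye.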
Also do the six t-tests that the code describes, changing
the type of test (i.e., one-tailed or two-tailed) as
appropriate, if you do not agree with the expectations
that the code suggests.
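For reference, the choice between a one-tailed and a two-tailed test can be expressed in base R through the `alternative` argument of `t.test()`. The sketch below uses two simulated groups with neutral names and arbitrary values (they are not the assignment data), and it may not match how the supplied code sets up its tests.

```r
# Simulated values for two generic groups (NOT real measurements).
set.seed(2)
group_a <- rnorm(139, mean = 100, sd = 15)
group_b <- rnorm(139, mean = 110, sd = 15)

# Two-tailed: no prior expectation about the direction of the difference.
two_tailed <- t.test(group_a, group_b, alternative = "two.sided")

# One-tailed: the expectation, stated in advance, is mean(group_a) < mean(group_b).
one_tailed <- t.test(group_a, group_b, alternative = "less")

print(two_tailed$statistic)  # the t value is the same for both tests
print(two_tailed$p.value)
print(one_tailed$p.value)    # half the two-tailed p when t lies in the expected direction
```

With `alternative = "less"` the test asks whether the first sample's mean is lower than the second's, so the order of the two arguments matters. Use `alternative = "greater"` for the opposite expectation.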
3. Writing the report
Embed the two histograms for duration and the two
scatterplots into your report and then write four short
paragraphs addressing the following sets of issues and
questions.
- The mean values for F1, F2, and vowel duration
differ between i and I.
For each parameter, say which vowel has the lower formant
or shorter duration. Is the size of the difference
significant? Is the direction of the difference what
you would expect from what you know about these two vowels?
Why or why not?
- The mean values for F1, F2, and vowel duration
differ between male and female speakers.
For each parameter, say which sex has the lower formant
or shorter duration. Is the size of the difference
significant? Is the direction of the difference what
you would expect from what you know about males and females?
Why or why not?
- Looking at the histograms for the two formants for
the two vowel categories,
describe the degree of overlap between the two vowels.
If you knew the F1 value of a vowel token from this
set, could you say which vowel it is? What
about the F2 value?
Looking at the scatterplot, suppose you knew both the
F1 and F2 together, could you then say with some confidence
which vowel it is?
Bonus: Does the scatterplot suggest any explanation
for why the four tokens of "heed" were confused with "hid"?
- Looking at the histograms for the two formants for
the two speaker categories,
describe the degree of overlap between the two sexes.
If you knew the F1 value of a vowel token from this
set, could you say whether the speaker was a male or a female?
What about the F2 value?
Looking at the scatterplot, suppose you knew both the
F1 and F2 together, could you then say with some confidence
whether the speaker was a male or female?
Bonus: What if you separated out men from boys,
women, and girls, instead of just males from females?
Could you feel more confident about telling whether the
vowel was produced by a man than about whether the vowel
was produced by a male?
- Is having significantly different mean values
on a variable the same thing as having discriminative power?
One or two of your paragraphs should refer
to the relevant (set of) t-test(s) and you should
embed the results of the t-test(s) there, using the
following reporting style:
The red mice were on average 16 grams heavier than
the grey mice, a difference that was highly significant
by a one-tailed t-test (t=-9.657, p<0.001).
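If it helps, the parenthesized part of that reporting style can be generated from a `t.test()` result with a small helper. The function name `format_t_report` is hypothetical (it is not part of the course code), and the input below is a hand-built stand-in that reuses the t value from the mouse example with a placeholder p value.

```r
# Hypothetical helper: formats the t statistic and p value of a t.test()
# result in the style "(t=-9.657, p<0.001)".
format_t_report <- function(tt) {
  t_part <- sprintf("t=%.3f", unname(tt$statistic))
  # Very small p values are reported as a bound rather than an exact figure.
  p_part <- if (tt$p.value < 0.001) "p<0.001" else sprintf("p=%.3f", tt$p.value)
  paste0("(", t_part, ", ", p_part, ")")
}

# Stand-in for a t.test() result; the p value is a placeholder.
mouse_example <- list(statistic = c(t = -9.657), p.value = 1e-10)
print(format_t_report(mouse_example))
```

The helper only reads the `statistic` and `p.value` components, so the object returned by `t.test()` can be passed to it directly.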