Data analysis problem on "Vowel inventories and vowel spaces"
(Data analysis assignment 3 for Ling H286, Autumn 2007, Ohio State University)
Copyright © 2007 Mary E. Beckman
0. Due date (and a reminder about collaboration).
Do the data analysis described below and turn in your report
at the beginning of class on Wednesday, October 31.
(You may work in groups to analyze the vowel systems and to
make the histogram and so on. However, if you do so, you
must remember to acknowledge the contributions of others
in your report. Also, the writing of your report on the
data must be your own individual work.)
1. The primary data
The data that we will be using for this exercise are from Ian Maddieson's
UCLA Phonological Segment Inventory Database (UPSID), which lists the vowels
and consonants for 451 languages, as determined by looking at published
descriptions of the languages.
For example, here are the five vowels that UPSID
lists for Spanish and Japanese:
- Spanish: a, e, i, o, u
- Japanese: a, e, i, o, uu (a high back unrounded vowel)
Note that UPSID lists Spanish as having the triangular
vowel system that Peter Ladefoged shows in Figure 5.4. The
high back vowel of Japanese, however, is listed as an unrounded vowel,
so that its data point is displaced a bit toward the left,
away from the corner of the "nicely symmetrical triangular
vowel space" that Ladefoged says is "the most efficient
way" to distribute a set of five vowels to make them
as auditorily distinct as any five vowels can be
(Ladefoged, 2005, p. 37).
We have made a data frame with this kind of list for each of
the 80 UPSID languages that have five vowels, for each of the
25 UPSID languages that have four vowels, and for each of the 23
UPSID languages that have three vowels.
These data frames are in a sub-directory of our class web page
called dataFiles, which also contains a fourth file that lists all of the UPSID
languages and gives the consonant and vowel counts for each one.
Download all four of these files:
- vows5v.txt (the file for the 5-vowel languages)
- vows4v.txt (the file for the 4-vowel languages)
- vows3v.txt (the file for the 3-vowel languages)
- UPSIDlgs.txt (the file with the vowel and consonant counts for
all 451 languages)
You should also download the file of R code
in the dataFiles directory and look at the comments in Part 2b
to see how to interpret the symbols for the different vowel types
in vows5v.txt, vows4v.txt, and vows3v.txt.
2a. Analyze the vowel count data and make a histogram
Use the data in UPSIDlgs.txt to make a histogram showing
the distribution of vowel inventory sizes.
Calculate the median size and the mean size of the vowel inventories.
(You may find it useful to look at the file of R code
in the dataFiles directory
to see what the R commands are for calculating means and medians,
and to see how to make a histogram in R.)
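The course's file of R code is the place to find the actual R commands. Purely as an illustration of what the three summary numbers measure, here is a minimal Python sketch. The ten inventory sizes in it are made up for the example and are not UPSID data, and the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical inventory sizes for ten made-up languages (NOT UPSID data).
sizes = [3, 5, 5, 5, 6, 7, 7, 9, 12, 15]

# Tally how many languages have each inventory size.
counts = Counter(sizes)

# The most frequent size (the mode) and how many languages have it.
most_common_size, n_languages = counts.most_common(1)[0]

summary = {
    "mode": most_common_size,
    "median": median(sizes),
    "mean": mean(sizes),
}

# A text histogram: one '#' per language at each inventory size.
for size in sorted(counts):
    print(f"{size:2d} {'#' * counts[size]}")

print(summary)
```

In this toy sample the three measures all differ, because a few large inventories pull the mean upward. Whether the real UPSID counts behave the same way is for you to determine from your own histogram.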
2b. Analyze the vowel spaces of the 3-, 4-, and 5-vowel languages
Then look at the lists of vowels for the 5-vowel languages,
4-vowel languages, and 3-vowel languages in the other three files.
Use these files to determine the following numbers.
- The number of 3-vowel languages that
have a perfectly triangular system with the vowels {a, i, u}.
- The number of 4-vowel languages that have these three vowels
plus one other vowel.
- The number of 5-vowel languages that have a perfectly triangular
system with the vowels {a, e, i, o, u}.
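The counting in this part comes down to two kinds of set comparison: "has exactly these vowels" and "has at least these vowels". The following Python sketch illustrates the difference. It assumes each language's vowels are available as a simple list of plain letters; the language names and lists are invented for the example, and the real files use the symbol codes explained in the comments of the course's R file.

```python
# Hypothetical vowel lists for four made-up 3-vowel languages (NOT UPSID data).
inventories = {
    "LangA": ["a", "i", "u"],
    "LangB": ["i", "u", "a"],
    "LangC": ["a", "i", "o"],
    "LangD": ["a", "e", "u"],
}

TRIANGLE = {"a", "i", "u"}


def has_exactly(vowels, target):
    """True if the inventory is the target set, ignoring listing order."""
    return set(vowels) == set(target)


def includes_all(vowels, target):
    """True if the inventory contains every target vowel (extras allowed)."""
    return set(target) <= set(vowels)


# Count the made-up languages whose inventory is exactly {a, i, u}.
n_triangular = sum(has_exactly(v, TRIANGLE) for v in inventories.values())
print(n_triangular)
```

The first test fits the 3-vowel and 5-vowel questions, where the whole inventory must match. The second fits the 4-vowel question, where the three corner vowels must be present and the fourth vowel can be anything.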
3. Writing the report
Embed the histogram that you made in Part 2a into your report
and then write a short paragraph answering the following sets of
questions about the count data.
- What is the single most likely size for a vowel inventory
in the 451 UPSID languages? That is, which size of vowel
system is the most frequent?
- What is the median size for a vowel inventory?
Is the median size the same as the most likely size?
If yes, is this expected?
If no, is the median larger or smaller than the most likely size,
and is the direction of the difference what you would expect?
In either case, explain your answer.
- What is the mean size for a vowel inventory?
Is the mean size that you calculated a theoretically possible
inventory size? Why or why not?
Then, referring to the histogram again, write a second short
paragraph that addresses the following sets of questions:
- What is the probability that a language in the UPSID database has
exactly four vowels?
What is the probability that a language
in the UPSID database has exactly five vowels?
- What is the probability that a language in the UPSID database has
exactly six vowels?
What is the probability that a language
in the UPSID database has exactly seven vowels?
- Is Peter Ladefoged completely correct when he says (p. 37):
"Given that the auditory space for possible vowels is somewhat
triangular, the selection of the three most distant vowels
i, a, u is obviously beneficial.
It would be possible for languages to add just one vowel to
these basic three, and, indeed some languages do have only four vowels.
But it turns out that far more languages have five or seven vowels
than have four or six. With five or seven vowels it is possible
to have a nicely symmetrical triangular vowel space."
Then, using the numbers that you calculated in Part 2b of the
data analysis instructions, write a third paragraph that answers
the following sets of questions.
- What is the conditional probability that a language with three
vowels has the perfectly triangular vowel space made of
the set of vowels {i, a, u}?
- What is the conditional probability that a language with four
vowels includes these "basic three" plus just one more?
- What is the conditional probability that a language with five
vowels has the "nicely symmetrical triangular vowel space"
made of the set {i, e, a, o, u}?
- Do the actual distributions of the above types of
"efficient" vowel spaces for 3-vowel, 4-vowel, and
5-vowel languages support Peter Ladefoged's suggestion
in the quote above
about why five vowels might be better than four vowels?
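Each conditional probability above is a ratio: the number of languages in a group that have the property, divided by the size of that group. The Python sketch below shows the arithmetic. The group size of 23 comes from Part 1, but the count of 10 is a placeholder chosen only for illustration and is not the answer to any of the questions; the function name is hypothetical.

```python
from fractions import Fraction


def conditional_probability(n_with_property, n_in_group):
    """Estimate P(property | group) as a count within the group over the group size."""
    if n_in_group <= 0:
        raise ValueError("the conditioning group must be non-empty")
    if not 0 <= n_with_property <= n_in_group:
        raise ValueError("the count must lie between 0 and the group size")
    return Fraction(n_with_property, n_in_group)


# Group size stated in Part 1: 23 UPSID languages have three vowels.
N_THREE_VOWEL = 23

# Placeholder count for illustration only (NOT the real answer).
placeholder_count = 10

p = conditional_probability(placeholder_count, N_THREE_VOWEL)
print(p, round(float(p), 3))
```

The denominator is the size of the conditioning group (23, 25, or 80), not the 451 languages of the whole database. That is what distinguishes these probabilities from the ones in the second paragraph.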
4. Acknowledgments
The quotations from Peter Ladefoged are from his book:
Peter Ladefoged (2005) Vowels and Consonants, 2nd edition.
Malden, MA: Blackwell.
The data for this analysis problem are from:
UPSID-PC. The UCLA Phonological Segment Inventory Database.
Data on the phonological systems of 451 languages, with
programs to access it, by Ian Maddieson and Kristin Precoda.
The UPSID-PC program was downloaded from:
http://www.linguistics.ucla.edu/faciliti/sales/software.htm
It is an MS-DOS program for accessing the database of languages'
phoneme inventories that appeared in print in:
Ian Maddieson (1984). Patterns of Sounds. Cambridge
University Press.
See the references in the book or in the individual language
files in UPSID-PC for the sources of information on the specific
languages.