Data analysis problem on "Word lengths and segment inventories"

(Data analysis assignment 5 for Ling H286, Autumn 2007, Ohio State University)

Copyright © 2007 Mary E. Beckman

0. Due date (and a reminder about collaboration).

Interpret the data analyses described below and turn in your report at the beginning of class on Wednesday, November 28.

(You may work in groups to understand the scatterplots and so on. However, if you do so, you must remember to acknowledge the contributions of others in your report. Also, the actual writing of your report on the data must be your own individual work.)


1. The primary data

1.1. Nettle (1995)

The first set of data that we will use for this assignment is from a study by Daniel Nettle of the relationship between segmental inventory size (i.e., the number of consonants and vowels a language has) and the average length of words. The article describing the study is:

Daniel Nettle (1995). Segmental inventory size, word length, and communicative efficiency. Linguistics, 33, 359-367.

In this study, Nettle chose ten languages representing a wide range of segmental inventory sizes as well as a reasonably varied selection of language families. The languages are (from the smallest to the largest):

  1. Hawaiian (a Polynesian language with only 1000 native speakers left)
  2. !Xu (a Khoisan language with about 4000 native speakers)
  3. Nahuatl (an Aztecan language with about 60000 native speakers)
  4. Georgian (a Kartvelian language with more than 4 million native speakers)
  5. Thai (the largest Tai language, with more than 20 million native speakers)
  6. Turkish (an Altaic language, with more than 50 million native speakers)
  7. Italian (a Romance language, with more than 60 million native speakers)
  8. German (one of the Germanic languages, with more than 95 million native speakers)
  9. Hindi (an Indo-Iranian language, with more than 180 million native speakers)
  10. Mandarin Chinese (the largest Sinitic language, with more than 870 million native speakers)

Nettle determined the segmental inventory size by consulting sources such as:

George Campbell (1991). Compendium of the world's languages. London: Routledge.

He also looked at a dictionary for each language, selecting 50 words at random from evenly spaced pages of the dictionary, and calculating the average word length, in terms of the average number of vowels and consonants in the transcription of each word. These numbers are reported in Table 1 on p. 362 of Nettle's article, and also plotted in Figure 2 on p. 365 of the article, where he shows the relationship between mean word length (plotted on the y-axis in his figure) and the number of consonants and vowels (plotted on the x-axis in his figure).

We have typed the numbers from Table 1 of Nettle's paper into the data file correctedNettle1995.txt, which is stored in the subdirectory NettleData under our course web page. In this file, we "corrected" the number for Thai, to reflect the fact that the dictionary that Nettle consulted used standard Thai orthography, which represents tone by the combination of two "silent" consonants and various consonant diacritics at the end. We also added three more columns: the number of native speakers (as reported in the Ethnologue database at http://www.ethnologue.org), the segmental counts when lexical tone is taken into account, and the total number of vowels. The tone-adjusted column increases the counts for Thai and for Mandarin Chinese, because in these languages lexical tones effectively multiply the number of vowels by allowing contrasts in pitch pattern (tone) as well as in timbre pattern (first and second formant). If you want to follow along as we describe the scatterplots and regression analyses that we will give you below, download the data file and the associated file of R code from the NettleData directory.
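
The tone adjustment described above can be sketched as a one-line calculation. This is a minimal Python illustration, not the course's R code; the function name and the example inventory are made up for the sketch and are not taken from Nettle's table.

```python
def segments_with_tone(n_consonants: int, n_vowels: int, n_tones: int = 1) -> int:
    """Count segments, treating every vowel-tone pairing as a distinct vowel.

    With n_tones=1 (no lexical tone) this is just consonants plus vowels.
    """
    if n_consonants < 0 or n_vowels < 0 or n_tones < 1:
        raise ValueError("counts must be non-negative and n_tones at least 1")
    return n_consonants + n_vowels * n_tones


# Hypothetical inventory (NOT a real language's counts): 20 consonants, 9 vowels.
plain = segments_with_tone(20, 9)             # no tone: 20 + 9
toned = segments_with_tone(20, 9, n_tones=5)  # five tones: 20 + 9 * 5
print(plain, toned)
```

Only languages with lexical tone change under this adjustment, which is why only the Thai and Mandarin Chinese counts differ between the two segment columns.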

1.2. Moe, Hopkins, and Rush (1982)

The second set of primary data is the list of about 6300 words derived from an old study of children's vocabularies that used recordings of conversations with first graders, described in:

Alden J. Moe, Carol J. Hopkins, & R. Timothy Rush (1982). The vocabulary of first-grade children. Springfield, IL: Thomas.

This is the same data set that we used in the first data analysis set. We used it again this time, to evaluate type frequencies for vowels in different kinds of syllables in this set of words of English. If you want to follow along as we describe the barplots and Chi-square analyses that we will give you below, download the Moe, Hopkins, & Rush data file and the associated file of R code from the NettleData directory.


2. The analyses

2.1. Analyzing the relationship between inventory size and average word length

The motivating idea for these analyses is related to the idea that Peter Ladefoged discusses in the first two paragraphs of section 1.2 on p. 4 of his book Vowels and Consonants. Remember, that is where he says:

If a language had only one or two vowels and a couple of consonants it could still have half a dozen syllables, and make an infinite number of words in different orders. But many of the words would be very long and difficult to remember. If words are to be kept short and distinct so that they can be easily distinguished and remembered, then the language must have a sufficient number of vowels and consonants to make more than a handful of syllables.

Here is how Daniel Nettle describes this same idea in the introduction to his article, where he talks about quantitative models in linguistics and their relationship to functional explanations for language patterns:

The starting point of these [quantitative] models is the assumption that language is functionally adapted to the needs of efficient communication, which are taken to be the need for articulatory ease and the need for perceptual salience. However, if only these needs are considered, it is unclear why any language should have more than a bare minimum of contrastive segments, as having a larger segmental inventory seems likely to either increase articulatory cost, because more extreme articulatory gestures will be needed, or decrease the ease of encoding, as the perceptual space will be more crowded, or both. The number of segments actually used by natural languages varies a great deal, from 12 to at least 120 ....

From [a functional perspective], the aforementioned disadvantages of increasing the inventory size should have the compensatory advantage of allowing shorter linguistic units, and greater economy. It has often been hypothesized that languages with larger segmental inventories will have generally shorter words .... but the effect has not been demonstrated empirically for a sizeable set of languages.

So, the analyses that we give you below are Nettle's figure demonstrating the effect, plus a related figure using modified numbers to evaluate whether lexical tone contrasts should be taken into account.

Figure 1. Replicating Figure 2 in Nettle (1995)

The scatterplot above is a replication of Figure 2 from Nettle's paper, with two small modifications. First, the number of segments for Thai is different ("corrected" as described above). Second, we have added a straight regression line. Here are the results of the regression that we did to get the intercept and slope to draw that line:

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.6030 -0.5072  0.1900  0.4440  1.9032 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  7.55411    0.75716   9.977 8.64e-06 ***
# NoSegs      -0.03336    0.01553  -2.148    0.064 .  
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# 
# Residual standard error: 1.332 on 8 degrees of freedom
# Multiple R-Squared: 0.3657,     Adjusted R-squared: 0.2864 
# F-statistic: 4.613 on 1 and 8 DF,  p-value: 0.064 
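
As a reading aid for the output above, here is a small Python sketch (separate from the course's R code) that uses only the reported coefficients. It shows that each t value is the estimate divided by its standard error, and how the fitted line turns a segment count into a predicted mean word length. The function names and the inventory size of 40 are illustrative assumptions.

```python
# Coefficients copied from the regression output above.
INTERCEPT, INTERCEPT_SE = 7.55411, 0.75716
SLOPE, SLOPE_SE = -0.03336, 0.01553


def predicted_mean_length(n_segments: float) -> float:
    """Mean word length predicted by the fitted line for a given inventory size."""
    return INTERCEPT + SLOPE * n_segments


def t_value(estimate: float, std_error: float) -> float:
    """t statistic for a coefficient: the estimate divided by its standard error."""
    return estimate / std_error


print(round(t_value(INTERCEPT, INTERCEPT_SE), 3))  # compare with the t value column
print(round(t_value(SLOPE, SLOPE_SE), 3))
# Hypothetical inventory of 40 segments, chosen only for illustration.
print(round(predicted_mean_length(40), 2))
```

The negative slope means that each additional segment in the inventory lowers the predicted mean word length by about 0.03 segments.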

Figure 2. Factoring in tone

This second scatterplot is like the first, except that we have used the counts of segments in Thai and Mandarin where the number of vowels is multiplied by the number of tones. Here are the results of the regression that we did to get the intercept and slope to draw the regression line:

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -0.9509 -0.5571 -0.2236  0.4874  1.4448 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  8.170675   0.468530  17.439 1.19e-07 ***
# withTone    -0.040239   0.007985  -5.039  0.00100 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# 
# Residual standard error: 0.8189 on 8 degrees of freedom
# Multiple R-Squared: 0.7604,     Adjusted R-squared: 0.7305 
# F-statistic: 25.39 on 1 and 8 DF,  p-value: 0.001003 
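
When comparing the two regression summaries, it can help to see how the reported numbers hang together. In a regression with a single predictor, the F statistic equals the square of the slope's t value, and it can also be recovered from R-squared and the residual degrees of freedom. The Python sketch below (not part of the course's R file; the helper name is an assumption) checks this using only the figures printed in the two outputs.

```python
def f_from_r_squared(r_squared: float, df_residual: int, n_predictors: int = 1) -> float:
    """F statistic implied by R-squared: (R2 / p) / ((1 - R2) / df_residual)."""
    return (r_squared / n_predictors) / ((1.0 - r_squared) / df_residual)


# Values copied from the two regression outputs (both have 8 residual df).
models = {
    "NoSegs":   {"r2": 0.3657, "t": -2.148, "F": 4.613},
    "withTone": {"r2": 0.7604, "t": -5.039, "F": 25.39},
}

for name, m in models.items():
    f_from_r2 = f_from_r_squared(m["r2"], df_residual=8)
    f_from_t = m["t"] ** 2
    # All three numbers should agree up to rounding in the printed output.
    print(name, round(f_from_r2, 2), round(f_from_t, 2), m["F"])
```

Because the three quantities carry the same information here, the larger R-squared in the tone-adjusted model goes hand in hand with its larger F statistic and smaller p-value.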

The grey symbols and regression line are from a model of what the relationship should be if there were minimum redundancy in the system, after adjusting for Peter Ladefoged's implicit claim about the constraints on syllable structure on p. 4 of his book. That is, to get these numbers, we assumed a lexicon of 45,000 words, as in Nagy & Anderson's (1984) estimate of the number of words that an American high school senior knows. We calculated the average length of words one would get if each language were maximally efficient in using segments to get the shortest possible words, given the following constraints: (1) the only possible syllable types are V and CV and (2) V can only occur at the beginning of a word, so that there are no V.V sequences. In this way, the language has to "use up" the shortest words first, before going on to use the next shortest words, and so on. As you can see, the predicted lengths are shorter than the observed lengths, but they follow the same trend.
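
The minimum-redundancy calculation described above can be sketched as follows. This Python version is one reading of the stated constraints, not the code that produced the grey points: it counts every word shape of the form (V)(CV)(CV)..., fills the lexicon with the shortest shapes first, and averages the lengths. The function name and the example inventories are assumptions made for illustration.

```python
def minimal_mean_word_length(n_consonants: int, n_vowels: int,
                             lexicon_size: int = 45_000) -> float:
    """Mean length (in segments) of the shortest possible lexicon.

    Assumes the only syllables are V and CV, and V occurs only word-initially,
    so a word is an optional V followed by zero or more CV syllables.
    """
    if n_consonants < 1 or n_vowels < 1 or lexicon_size < 1:
        raise ValueError("all arguments must be at least 1")
    remaining = lexicon_size
    total_segments = 0
    length = 1
    while remaining > 0:
        k = length // 2  # number of CV syllables in a word of this length
        n_shapes = (n_consonants * n_vowels) ** k
        if length % 2 == 1:
            n_shapes *= n_vowels  # odd lengths start with a bare V
        used = min(n_shapes, remaining)  # "use up" the shortest words first
        total_segments += used * length
        remaining -= used
        length += 1
    return total_segments / lexicon_size


# Two hypothetical inventories (illustrative counts, not Nettle's data).
print(round(minimal_mean_word_length(8, 5), 2))    # small inventory
print(round(minimal_mean_word_length(30, 10), 2))  # larger inventory
```

Under these assumptions the larger inventory yields the shorter mean length, which is the downward trend that the grey regression line represents.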

2.2. Analyzing the distribution of vowels in different types of English syllables

Figure 3. Distribution of vowels in words produced by English-speaking first-graders (from Moe, Hopkins, and Rush, 1982).

This set of four bar plots shows the type counts for the 15 vowels of American English in all syllables (second plot), stressed syllables (third plot), and unstressed syllables (fourth plot) in the list of words derived from the conversations with first-graders in Moe, Hopkins, and Rush (1982). The vowels are ordered from lowest to highest overall token frequency in the Carterette and Jones corpus study, as depicted in the left-hand bar plot in Figure 9.5 in Peter Ladefoged's book, and the [r] vowel of words like bird is included. The vowel that is called "^" in the figure includes the mid-low back vowel [ʌ] in words such as bud as well as the schwa vowel [ə] in the first syllable of about.

As you can see, the vowels are not evenly distributed; some occur in many syllables in these words, some occur in only a few. The first bar plot shows how the second bar plot should look if the vowels were evenly distributed -- i.e., if the 15 vowels were all equally likely to occur in any syllable of English. The following Chi-squared test evaluates the probability that the uneven distribution in the second bar plot could have come about by chance if the vowels were actually all equally likely to occur.

#         Pearson's Chi-squared test
# 
# data:  V.model 
# X-squared = 2902.707, df = 14, p-value < 2.2e-16
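
The statistic reported above compares observed counts with the counts expected if every category were equally likely. Here is a minimal Python sketch of that calculation (independent of the course's R code). The function name is an assumption and the counts are invented toy numbers, not the Moe, Hopkins, and Rush data.

```python
def chi_squared_uniform(counts):
    """Pearson chi-squared statistic against an equal-probability model.

    Returns (statistic, degrees_of_freedom), where df = number of categories - 1.
    """
    k = len(counts)
    if k < 2:
        raise ValueError("need at least two categories")
    expected = sum(counts) / k  # same expected count in every category
    statistic = sum((obs - expected) ** 2 / expected for obs in counts)
    return statistic, k - 1


# Toy counts for three categories (made up for illustration).
stat, df = chi_squared_uniform([50, 30, 20])
print(stat, df)

# With 15 vowel categories, the degrees of freedom are 15 - 1 = 14,
# matching the df in the output above.
print(chi_squared_uniform([1] * 15)[1])
```

The p-value is then the probability of a statistic at least this large under a chi-squared distribution with the given degrees of freedom; this sketch stops at the statistic.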

Also note that the relative frequencies are different in different types of syllables; the vowel [ə] occurs very often in unstressed syllables, whereas vowels such as the diphthong [aj] of bide and the [æ] of bad almost never occur in an unstressed syllable. The following Chi-squared test shows the probability that these differences in the distribution across the two syllable types could have come about by chance.

#         Pearson's Chi-squared test
# 
# data:  V.table[, c("stressed", "unstressed")] 
# X-squared = 4419.355, df = 14, p-value < 2.2e-16
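
For a table with one row per vowel and one column per syllable type, the expected count in each cell comes from the row and column totals. The Python sketch below shows that calculation (it is not the course's R code). The function name is an assumption, the toy table is invented, and no continuity correction is applied.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic for a rows-by-columns table of counts.

    Expected count for a cell = row total * column total / grand total.
    Returns (statistic, degrees_of_freedom) with df = (rows - 1) * (cols - 1).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Toy 2 x 2 table (made-up counts): rows are categories,
# columns are "stressed" and "unstressed".
toy = [[30, 10],
       [10, 30]]
print(chi_squared_independence(toy))

# A 15 x 2 table has (15 - 1) * (2 - 1) = 14 degrees of freedom,
# as in the output above.
print(chi_squared_independence([[5, 5]] * 15)[1])
```

A large statistic relative to its degrees of freedom indicates that the row proportions differ between the two columns by more than chance would readily produce.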


3. Writing the report

Look at the scatterplots in Figures 1 and 2 and the associated regression analyses, and then write a short paragraph addressing the following sets of issues and questions.

Compare the black and gray data points and regression lines in Figure 2. Then look at the bar plots in Figure 3 and the associated Chi-square analyses comparing the model to the overall distribution and comparing the distributions in stressed and unstressed syllables. Then write another short paragraph addressing the following sets of issues and questions.