Data analysis problem on "Counting words of different lengths"

(Data analysis assignment 1 for Ling H286, Autumn 2007, Ohio State University)

Problem drafted by Julia Papke

Copyright © 2007 Grant McGuire and Mary E. Beckman

0. Due date (and a reminder about collaboration).

Do the data analysis described below and turn in your report at the beginning of class on Monday, October 1.

(You may work in groups to get the data and to figure out how to make the barplots. However, if you do so, you must acknowledge the contributions of others in your report. Also, the writing of your report on the data must be your own individual work.)

1. Data

1.1 Data set 1 -- Moe, Hopkins, and Rush (1982)

The following two figures are made from the type and token counts for words of different lengths in a list of about 6300 words derived from an old study of children's vocabularies that used recordings of conversations with first graders (Moe, Hopkins, & Rush, 1982).


1.2 Data set 2 -- Buckeye Corpus snippet

The following is a snippet from a transcript of one of the speakers in an interview that is part of the Buckeye Corpus. (The transcript is available as a plain ASCII text file here, in case you want to read it into Excel or some other data-management program to help you answer the questions and do your analysis. If you are a PC user, right-click on the link to download it.) Read it through, decide how many syllables there are in each word token, and record your counts in a form that lets you answer the two questions that follow.

Transcript:

uh whatever they call it thirty three that's on the ballot yes we did just last night well personally I voted that the thirty three should pass I think they should the way I un- understood I'm they have ways of wording those things that you had to read it very carefully to know whether you should say yes or no on the thing I said yes because I uh uh analyzed it to say that they're repealing something that was passed I think it one or two years ago that was allowings things to be done up around the polaris area like new roads and this and that and I want that repealed because I don't think we should be helping anybody else to do that and taking the money away from the other things oh northland yes uh I agree with everything he has said but uh I I'm a little more I don't know what the right word is to say but sometimes I think politics are a little bit dirty and I think that sometimes they don't even count all the votes or something and sometimes that they really want it to go through bad enough even though the majority voted not for it I mean even though the majority doesn't want that to happen that somehow or other they'll count it so it did or they'll well yknow right right oh yes oh yes I did vote absentee a number of years ago when we were traveling a lot and everything else I uh we did absentee voting and I've just kept it up now one year I forgot to send it in so I went to the polls absentee
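If you read the transcript into a program rather than counting by hand, the type/token distinction can be sketched as follows. This is a minimal illustration in Python (not the R sample code provided for the course), using only a short excerpt of the transcript. Lowercasing and splitting on whitespace are assumptions made for this sketch; how to treat forms such as "un-" or "that's" is a decision you will need to make yourself.

```python
from collections import Counter

# Short excerpt copied from the transcript above (not the whole snippet).
excerpt = "right right oh yes oh yes I did vote absentee a number of years ago"

# The transcript has no punctuation, so splitting on whitespace is enough here.
# Lowercasing is an assumption of this sketch.
tokens = excerpt.lower().split()

# Maps each word type to the number of tokens of that type.
token_freq = Counter(tokens)

n_tokens = len(tokens)      # every occurrence counts
n_types = len(token_freq)   # each distinct word counts once

print("tokens:", n_tokens)
print("types:", n_types)
```

Here "right", "oh", and "yes" each contribute two tokens but only one type, which is why the token count is larger than the type count.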

Questions

  1. How many word types are there of one syllable? two syllables? three syllables? four syllables? etc.
  2. How many word tokens are there of one syllable? two syllables? etc.
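The two questions above can be sketched as two tallies over the same data: one that counts every occurrence (tokens) and one that counts each distinct word once (types). The example below is a minimal Python sketch on a 13-token excerpt of the transcript. The syllable counts in the `syllables` dictionary are hand-assigned judgments made for this illustration only; your own decisions for the full snippet may differ.

```python
from collections import Counter

# A 13-token excerpt of the transcript, already lowercased.
tokens = "oh yes oh yes i did vote absentee a number of years ago".split()

# Hand-assigned syllable counts: an illustrative judgment, not given data.
syllables = {
    "oh": 1, "yes": 1, "i": 1, "did": 1, "vote": 1, "a": 1,
    "of": 1, "years": 1, "number": 2, "ago": 2, "absentee": 3,
}

# Tokens: look up the syllable count of every occurrence.
tokens_by_length = Counter(syllables[w] for w in tokens)

# Types: look up the syllable count of each distinct word once.
types_by_length = Counter(syllables[w] for w in set(tokens))

print("syllables  types  tokens")
for n in sorted(tokens_by_length):
    print(f"{n:9d}  {types_by_length[n]:5d}  {tokens_by_length[n]:6d}")
```

In this excerpt the repeated words ("oh", "yes") are all one syllable long, so the one-syllable row is the only one where the type and token counts differ.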

2. Picturing your analysis

Using the data you gathered in section 1.2, do the following:

  1. Create a histogram using your counts based on the word tokens.
  2. Create a histogram using your counts based on the word types.

There is sample code (which you can get by right-clicking here) illustrating how to make the histograms, if you decide to use R. You can also download this sample code, along with the sample data files that it uses, from the scripts directory on our course web page.
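If you are not using R, the same kind of picture can be produced in other ways. The following is a minimal sketch in Python that uses only the standard library and draws each bar as a row of "#" characters; for the figures in your report you would more likely use a plotting library or Excel. The function name `text_barplot` is made up for this illustration, and the counts come from a 13-token excerpt of the transcript ("oh yes oh yes I did vote absentee a number of years ago") with hand-assigned syllable counts, not from the full snippet.

```python
def text_barplot(counts, title):
    """Return a plain-text bar chart: one row per syllable count."""
    lines = [title]
    # Include every length from 1 up to the longest, so gaps show as empty bars.
    for n_syll in range(1, max(counts) + 1):
        n = counts.get(n_syll, 0)
        lines.append(f"{n_syll} syll | {'#' * n} {n}")
    return "\n".join(lines)


# Illustrative counts from the 13-token excerpt (syllables -> number of words).
token_counts = {1: 10, 2: 2, 3: 1}
type_counts = {1: 8, 2: 2, 3: 1}

print(text_barplot(token_counts, "Word tokens by length"))
print()
print(text_barplot(type_counts, "Word types by length"))
```

Plotting the two charts with the same rows makes it easier to compare the token distribution with the type distribution.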

3. Writing the report

Embed the two histograms you made in Part 2 into your report and write a short paragraph that answers the following questions about your two histograms.

  1. What do your histograms mean?
  2. How are the two histograms different?
  3. Why are they different?

Referring to the two histograms from Moe et al. (1982) shown above in section 1.1, write another very short paragraph answering the following questions.

  1. (Comparing individual histograms across word lists): What differences are there between each of your histograms and the analogous histogram in section 1.1?
  2. (Comparing the relationship within each pair of histograms): Is the pattern of differences between the two Moe et al. histograms the same as the pattern of differences between your two histograms?
  3. If yes, why is the pattern the same? If not, why is the pattern different?

4. References

The data for the two figures of type and token counts in the Moe et al. study are from this book:

Alden J. Moe, Carol J. Hopkins, & R. Timothy Rush (1982). The vocabulary of first-grade children. Springfield, IL: Thomas.

The Buckeye Corpus was created here at Ohio State University and can be obtained from:

http://vic.psy.ohio-state.edu/

The Buckeye Corpus was first described in print in the following reference:

Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, & William Raymond (2005). The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45: 89-95.