(Data analysis assignment 1 for Ling H286, Autumn 2007, Ohio State University)
Copyright © 2007 Grant McGuire and Mary E. Beckman
If you want to raise your grade by redoing Data Analysis Problem 1, also do the added analysis described below and answer the associated extra questions. Turn in this addendum by the beginning of class on Wednesday, October 10.
(As with the original data analysis, you may work in groups to figure out how to make the graph. However, if you do so, you must acknowledge the contributions of others in your report.)
Using the transcript of the Buckeye Corpus snippet that you analyzed, count the number of times each word occurred. Figure out which word type at each of the four word lengths occurs most often in the corpus. Make a bar plot for the four word lengths like the original word types histogram, but this time make the height of each bar represent the maximum number of occurrences for a word of that length, instead of the count of word types for that length. Next, calculate the mean number of occurrences for each word length, and make a second bar plot in which the height of each bar represents the mean number of occurrences for words of that length.
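If you prefer to script the counting rather than tally by hand, the steps above can be sketched roughly as follows. This is only an illustrative sketch under several assumptions: the function name `occurrence_stats` and the toy word list are hypothetical, the mean is taken to be the average count across the word types of a given length, and word length is measured by a `length_of` function that defaults to the number of letters. Substitute whatever measure of word length you used in the original analysis.

```python
from collections import Counter, defaultdict


def occurrence_stats(words, length_of=len):
    """Return {length: (most_frequent_word, max_count, mean_count)}.

    `length_of` is an assumed stand-in for however word length was
    measured in the original analysis; it defaults to letter count.
    """
    # Number of occurrences (tokens) of each word type, ignoring case.
    counts = Counter(w.lower() for w in words)

    # Group the (word, count) pairs by word length.
    by_length = defaultdict(list)
    for word, n in counts.items():
        by_length[length_of(word)].append((word, n))

    stats = {}
    for length, items in sorted(by_length.items()):
        # On a tie, max() keeps the word type that was seen first.
        top_word, top_n = max(items, key=lambda item: item[1])
        # Mean occurrences per word type of this length.
        mean_n = sum(n for _, n in items) / len(items)
        stats[length] = (top_word, top_n, mean_n)
    return stats


# Hypothetical toy transcript, NOT taken from the Buckeye Corpus.
sample = "um I think I mean the the the dog and the cat and a bird".split()
stats = occurrence_stats(sample)

# Text rendering of the "maximum occurrences" bars, one row per length.
for length, (word, top_n, mean_n) in stats.items():
    print(f"{length}: max={top_n} ({word})  mean={mean_n:.2f}  " + "#" * top_n)
```

The maximum and mean values in `stats` are the bar heights for the two plots; they can be typed into whichever program you used to draw the original histogram.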
Embed the new bar plot that you made in Part 1 into your report and write a short paragraph that does the following things.
The Buckeye Corpus was created here at Ohio State University and can be obtained from:
http://vic.psy.ohio-state.edu/
The Buckeye Corpus was first described in print in the following reference:
Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, & William Raymond (2005) The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication, 45: 89-95.