Extra questions for report problem on "Counting words of different lengths"

(Data analysis assignment 1 for Ling H286, Autumn 2007, Ohio State University)

Copyright © 2007 Grant McGuire and Mary E. Beckman

0. Due date.

If you want to raise your grade by redoing Data Analysis Problem 1, also do the added analysis described below and answer the associated extra questions. Turn in this addendum by the beginning of class on Wednesday, October 10.

(As with the original data analysis, you may work in groups to figure out how to make the graphs. However, if you do so, you must acknowledge the contributions of others in your report.)

1. Adding two new graphs

Using the transcript of the Buckeye Corpus snippet that you analyzed, count the number of times each word occurred. Figure out which word type at each of the four word lengths occurs most often in the corpus. Make a bar plot for the four word lengths like the original word types histogram, but this time make the height of each bar represent the maximum number of occurrences for a word of that length instead of the count of word types for that length. Next, calculate the mean number of occurrences for each word length, and make a second bar plot in which the height of each bar represents the mean number of occurrences for words of that length.
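The counting steps above can be sketched in a few lines of Python. This is only one possible approach, not a required method. The function name summarize_by_length, the length_of parameter, and the toy word list are all made up for illustration; the toy list is not Buckeye Corpus data. The sketch measures length in letters as a stand-in, so substitute whatever length measure you used in the original assignment.

```python
from collections import Counter, defaultdict


def summarize_by_length(tokens, length_of=len):
    """Group word types by length and summarize their token counts.

    length_of is a placeholder for the length measure; the default
    (number of letters) is an assumption, not the assignment's definition.
    """
    counts = Counter(tokens)  # word type -> number of occurrences
    by_length = defaultdict(list)
    for word, n in counts.items():
        by_length[length_of(word)].append((word, n))

    summary = {}
    for length, pairs in sorted(by_length.items()):
        # Ties for the maximum are broken by first appearance in the transcript.
        top_word, top_count = max(pairs, key=lambda pair: pair[1])
        summary[length] = {
            "types": len(pairs),       # height of the original word types bar
            "top_word": top_word,      # most frequent word type at this length
            "max_count": top_count,    # height of the first new bar
            "mean_count": sum(n for _, n in pairs) / len(pairs),  # second new bar
        }
    return summary


# Made-up toy transcript, for illustration only.
tokens = "the cat and the dog saw the bird and ran".split()
summary = summarize_by_length(tokens)
for length, row in summary.items():
    print(length, row)
```

The max_count and mean_count values for each length are the bar heights for the two new plots, which you can draw with the same tool you used for the original histogram.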

2. Writing the report

Embed the two new bar plots that you made in Part 1 into your report and write a short paragraph that does the following things.

  1. List the words that are the most frequent for each length (i.e., list the word types whose counts are represented by the heights of the bars in the first of the two new bar plots).
  2. List the words (or give example sets of words) that are the least frequent ones, and say what the number of word tokens is for this set of words.
  3. Say whether the mean frequencies are the same for words of different lengths and, if not, describe the pattern across this second bar plot. For example, say which word length has the largest mean number of occurrences and which has the smallest.
  4. Using these descriptive statements about the two bar plots, explain the pattern of relative heights for the four different bars in the two histograms that you turned in with your original first paragraph.
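
For item 2 in the list above, one way to find the least frequent words and the number of tokens they account for is sketched below. The function name least_frequent and the toy word list are made up for illustration, and the sketch assumes the transcript contains at least one word.

```python
from collections import Counter


def least_frequent(tokens):
    """Return the least frequent word types, their count, and their total tokens."""
    counts = Counter(tokens)
    lowest = min(counts.values())  # smallest number of occurrences of any type
    words = sorted(word for word, n in counts.items() if n == lowest)
    # Every word in this set occurs `lowest` times, so the set's token total
    # is that count multiplied by the number of word types in the set.
    return words, lowest, lowest * len(words)


# Made-up toy transcript, for illustration only.
tokens = "the cat and the dog saw the bird and ran".split()
words, lowest, total_tokens = least_frequent(tokens)
print(words, lowest, total_tokens)
```

If this set is long, the report can give a few example words from it rather than the full list.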

3. References

The Buckeye Corpus was created here at Ohio State University and can be obtained from:

http://vic.psy.ohio-state.edu/

The Buckeye Corpus was first described in print in the following reference:

Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, & William Raymond (2005) The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication, 45: 89-95.