Marie-Catherine de Marneffe & Micha Elsner

LING 5050 - Technical tools for linguists

May Session 2016

Homework 2

DUE: Friday May 27, 2016 (no late homework accepted!)

We will now work with the original Fisher data, rather than the data we pre-transformed for you ;-) We still did a bit of the work for you: only the Fisher files we will need are on Carmen (see the folder "OriginalFisher").

Let's look again at the file we examined in the first "Unix" class, to see how the original transcript is structured.

$ more 065/fe_03_06596.txt

# fe_03_06596.sph
# Transcribed by BBN/WordWave

0.59 1.92 A: hello

1.96 2.97 B: (( hello ))

2.95 3.98 A: hello

3.71 5.43 B: my name is kevin gonzales

5.49 7.30 A: hi this is carol

6.95 8.35 B: carol okay carol

8.42 9.44 A: [laughter]

9.56 11.27 A: well that was fast

12.48 13.79 B: uh hello

13.47 14.85 A: yeah i'm here

14.58 15.53 B: okay

16.81 25.87 A: um so do you think that public or private school have the right
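
Each utterance line in the sample above has the same shape: a start time, an end time, a speaker letter followed by a colon, and then the text. As a rough sketch of how one such line could be split apart (the names UTTERANCE_RE and parse_utterance are made up for this illustration, and the pattern assumes every utterance line looks like the ones shown here):

```python
import re

# Assumed line shape, based on the sample: "<start> <end> <speaker>: <text>"
UTTERANCE_RE = re.compile(r"^(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) ([AB]): (.*)$")


def parse_utterance(line):
    """Return (start, end, speaker, text), or None for non-utterance lines."""
    match = UTTERANCE_RE.match(line.strip())
    if match is None:
        return None  # header comment ("# ...") or blank line
    start, end, speaker, text = match.groups()
    return float(start), float(end), speaker, text


print(parse_utterance("3.71 5.43 B: my name is kevin gonzales"))
# (3.71, 5.43, 'B', 'my name is kevin gonzales')
```

Note that the text can contain things like "[laughter]" or "(( hello ))". Whether those count as words is a decision you will have to make (and explain in your comments).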

The transcripts do not contain gender information, so we will need to extract it from another file that keeps track of it: "fe_03_p2_calldata.tbl". Let's look at a few lines of that file (from the beginning and from the end):

CALL_ID,DATE_TIME,TOPICID,SIG_GRADE,CNV_GRADE,APIN,ASX.DL,APHNUM,APHSET,APHTYP,BPIN,BSX.DL,BPHNUM,BPHSET,BPHTYP
05851,20030514_18:49:18,ENG10,4,4,96498,f.o,650428gei,4,3,86972,f.a,no_BPHNUM,4,2
05852,20030514_18:58:48,ENG10,4,2.5,26375,f.a,480948qqo,,3,39550,f.a,917757yjm,,1
05853,20030514_19:06:10,ENG10,4,3.5,45776,f.a,818312pbg,4,1,72959,f.a,no_BPHNUM,4,2
05854,20030514_19:38:43,ENG10,4,3.5,12903,f.a,218879qim,4,3,82880,f.a,718444ekj,4,2
05855,20030514_19:42:07,ENG10,4,4,86322,f.a,931906iqu,4,3,55384,f.a,931526jhk,4,2
05856,20030514_19:43:56,ENG10,4,3.5,95020,f.a,814322ojp,4,3,44763,f.a,no_BPHNUM,2,3
...
11692,20031118_19:47:30,ENG32,4,4,16321,m.a,585615upb,4,3,62607,f.a,206241jlv,4,2
11693,20031118_19:57:18,ENG32,4,4,31775,m.a,765452yfm,4,3,52954,f.o,no_BPHNUM,,
11694,20031118_20:02:23,ENG32,4,4,74447,m.a,718491eip,4,3,27475,m.a,301770vqn,4,3
11695,20031118_20:21:14,ENG32,4,4,17087,m.a,617755gqs,4,1,50757,m.a,727461knu,4,2
11696,20031118_20:31:06,ENG32,4,4,50278,m.a,612599elj,4,1,99630,m.a,512791rev,4,1
11697,20031118_20:51:04,ENG32,4,4,46881,f.a,614276ico,4,2,65668,m.a,203265lbe,4,1
11698,20031118_21:02:23,ENG32,4,4,23441,m.a,210349gbf,4,3,93991,f.a,818762njk,4,3
11699,20031118_21:18:22,ENG32,4,4,18625,m.a,818752hfj,4,3,14313,m.a,215349ppw,4,3

We see that there are different fields, separated by commas, and that the gender information for participants A and B is in the 7th and 12th fields, respectively (ASX.DL and BSX.DL). This format is commonly called "csv" (comma-separated values). The first field contains the conversation ID. The names of the transcript files contain the conversation ID too: it is the run of digits just before the ".txt" extension (06596 in fe_03_06596.txt).
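
Here is a minimal sketch of how the two pieces of information could be linked, using Python's standard csv module. The function names read_genders and conversation_id are invented for this example, and it assumes that the part of ASX.DL / BSX.DL before the dot ("f" or "m") is the gender code. The sample string just stands in for the real file.

```python
import csv
import io
import os


def read_genders(table_file):
    """Map each CALL_ID to a (gender of A, gender of B) pair, e.g. ('f', 'm')."""
    genders = {}
    for row in csv.DictReader(table_file):
        # "f.a" -> "f": keep only the part before the dot (assumed gender code)
        genders[row["CALL_ID"]] = (row["ASX.DL"].split(".")[0],
                                   row["BSX.DL"].split(".")[0])
    return genders


def conversation_id(transcript_name):
    """'fe_03_06596.txt' -> '06596' (the digits after the last underscore)."""
    stem = os.path.splitext(os.path.basename(transcript_name))[0]
    return stem.rsplit("_", 1)[-1]


# Header and one data line copied from the excerpt above
sample = io.StringIO(
    "CALL_ID,DATE_TIME,TOPICID,SIG_GRADE,CNV_GRADE,APIN,ASX.DL,APHNUM,"
    "APHSET,APHTYP,BPIN,BSX.DL,BPHNUM,BPHSET,BPHTYP\n"
    "11697,20031118_20:51:04,ENG32,4,4,46881,f.a,614276ico,4,2,"
    "65668,m.a,203265lbe,4,1\n"
)
genders = read_genders(sample)
print(genders["11697"])                         # ('f', 'm')
print(conversation_id("065/fe_03_06596.txt"))   # 06596
```

With the real file you would pass an open file object instead of the sample, e.g. with open(path, newline="") as f: genders = read_genders(f). Keeping the IDs as strings preserves the leading zeros, so they match the file names directly.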

With that information, write a Python script that outputs:

  • the raw total number of words spoken by women
  • the raw total number of words spoken by men
  • the total number of utterances spoken by women
  • the total number of utterances spoken by men
  • the average number of words per utterance spoken by women and by men
  • the number of female speakers
  • the number of male speakers

  1. Your script should now read the data in the "OriginalFisher" directory, which you can download from Carmen. Note that for space and time reasons, I included only 10 of the original 60 data folders. But your script should work on any number of folders!
  2. The user should be able to specify the Fisher directory path at the command line (e.g., python3 processOriginalFisher.py OriginalFisher). The script should run wherever it is on the computer, provided that the user gives a correct path to the directory. This time I don't want to have to go into your script to modify that part!
  3. At the top of your script, explain in English, in a few commented sentences, how you are going to solve the problem. What kind of data structure do you need to use? What will it contain?
  4. Also, now that you have more data, is your answer to the question "do women talk more than men" different from your previous one? Explain why or why not. Write this as a comment at the bottom of your script.
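
For point 2, one possible way to take the directory from the command line and then visit every transcript below it is sketched here. This is only an outline under some assumptions: the helper names parse_args and find_transcripts are invented, and it treats every ".txt" file under the given directory as a transcript.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the Fisher directory path from the command line."""
    parser = argparse.ArgumentParser(description="Process Fisher transcripts.")
    parser.add_argument("fisher_dir", help="path to the Fisher directory")
    return parser.parse_args(argv)  # argv=None means: use sys.argv[1:]


def find_transcripts(fisher_dir):
    """Yield the path of every .txt file below fisher_dir, in sorted order."""
    for folder, subfolders, files in os.walk(fisher_dir):
        subfolders.sort()  # visit the data folders in a predictable order
        for name in sorted(files):
            if name.endswith(".txt"):  # assumption: transcripts end in .txt
                yield os.path.join(folder, name)


# Demonstration with an explicit argument list; in a real script,
# call parse_args() with no argument so it reads the actual command line.
args = parse_args(["OriginalFisher"])
print(args.fisher_dir)
```

Because os.walk goes through whatever subfolders it finds, the same loop works for 10 folders or for 60, and nothing in the script depends on where the directory lives.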

Submit your code on Carmen in the HW2 submissions folder. Make sure your code runs. Make sure to comment your code appropriately! Find the right balance in your comments: too few is unhelpful, and so is too many.