Assignment 1: Sample solution

STEP 1 - First step is to extract all verbs from Dorr's database, sort and uniqify.

Notes:
1) the egrep expression excludes lines that start with ;, because these are comments
2) the cut expression was determined by tinkering
3) I sort, even though the forms seem to be sorted, just to be sure.
4) Don't reduce to lower case yet

    egrep '^[^;].*DEF_WORD' ../dorr-verbs-English.lcs | cut -f 3 -d ' ' | sed 's/"//g' | sort -u > dorr.forms

Checking

    wc dorr.forms
    4269 4269 31322 dorr.forms

    head dorr.forms
    December
    abandon
    abase
    abash
    abate
    abbreviate
    abdicate
    abduct
    abhor
    abide

STEP 2 - Now do the same for the treebank

    gawk '{for(i=1; i < NF; i++) {if(i % 2 == 1 && $(i+1) ~ /V/) { print $i}}}' data/*.tags | sort -u > ptb.forms

    wc ptb.forms
    8188 8188 69221 ptb.forms

    head ptb.forms
    'S
    'd
    'm
    're
    's
    've
    15
    ADMITTED
    ADOPTED
    AM

STEP 3 - Now stem the forms, producing files of
I used the Python version of the Porter stemmer, with the following calling code.

    import fileinput
    import stemmer

    if __name__ == "__main__":
        p = stemmer.PorterStemmer()
        for line in fileinput.input():
            line = line.strip()
            low = line.lower()  # otherwise stemmer misfires on all caps
            st = p.stem(low,0,len(low)-1)
            print st, line

Notes: the Python stemmer differs from the C stemmer that Martin used in that it doesn't itself convert to lower case. So the Python calling code fixes that.

Commands

    python code/stem.py ptb.forms > ptb.stems

    wc ptb.stems
    8188 16376 123334 ptb.stems

    head ptb.stems
    's 'S
    'd 'd
    'm 'm
    're 're
    's 's
    've 've
    15 15
    admit ADMITTED
    adopt ADOPTED
    am AM

Notes: I checked compatibility with the C version by doing

    ~/foo/porter_stemmer ptb.forms | abut - ptb.forms > ! ptb.stems2

then searched for differences by doing

    abut ptb.stems ptb.stems2 | awk '$1 != $3' > differences

(an initial try using diff didn't help because the C stemmer produces a blank at the beginning of each line, so I used awk to find the relevant lines)

The Python version is in the first and second fields, the C version in the third and fourth fields. Differences are due to treatment of hyphenated words.

    code-nam CODE-NAMED code-name CODE-NAMED
    cross-br CROSS-BRED cross-bred CROSS-BRED
    inflation-adjust Inflation-adjusted inflat-adjust Inflation-adjusted
    belly-flop belly-flopped belli-flop belly-flopped
    buy-back buy-back bui-back buy-back
    capital-drain capital-draining capit-drain capital-draining
    color-cod color-coded color-code color-coded
    color-cod color-coding color-code color-coding
    contract-dril contract-drilling contract-drill contract-drilling
    double-cross double-crossed doubl-cross double-crossed
    fine-tun fine-tuning fine-tune fine-tuning
    gas-gath gas-gathering ga-gather gas-gathering
    government-set government-set govern-set government-set
    jury-rig jury-rigged juri-rig jury-rigged
    log-rol log-rolled log-roll log-rolled
    nose-div nose-dived nose-dive nose-dived
    out-trad out-trade out-trade out-trade
    policy-mak policy-making polici-make policy-making
    pre-tri pre-try pre-try pre-try
    re-ent re-enter re-enter re-enter
    re-ent re-entered re-enter re-entered
    re-ent re-entering re-enter re-entering
    still-rag still-raging still-rage still-raging
    test-driv test-drive test-drive test-drive
    theory-teach theory-teaching theori-teach theory-teaching
    well-stat well-stated well-state well-stated

(OK. go with what Python did)

    python code/stem.py dorr.forms > dorr.stems

    wc dorr.stems
    4269 8538 60044 dorr.stems

    head dorr.stems
    decemb December
    abandon abandon
    abas abase
    abash abash
    abat abate
    abbrevi abbreviate
    abdic abdicate
    abduct abduct
    abhor abhor
    abid abide

STEP 4 - Now that we have stems for both, we can use Unix's join command to merge the databases.
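
To make the join step concrete, here is a minimal Python sketch of what join does with these two files: it pairs each line of the first file with every line of the second file that has the same first field (the stem). The function name join_on_first_field and the sample lines are made up for illustration. This version looks keys up in a dictionary, so unlike join it does not depend on the input being sorted.

```python
from collections import defaultdict


def join_on_first_field(left_lines, right_lines):
    """Pair each left line with every right line sharing its first field."""
    # Index the right-hand file by its first field (the stem).
    right_by_key = defaultdict(list)
    for line in right_lines:
        fields = line.split()
        if fields:
            right_by_key[fields[0]].append(fields[1:])

    joined = []
    for line in left_lines:
        fields = line.split()
        if not fields:
            continue
        # Emit one output line per matching right-hand line.
        for rest in right_by_key.get(fields[0], []):
            joined.append(" ".join(fields + rest))
    return joined


# Invented sample data in the "stem form" layout of dorr.stems and ptb.stems.
dorr = ["abandon abandon", "abat abate", "abhor abhor"]
ptb = ["abandon Abandoning", "abandon abandoned", "abat abated", "zoom zoomed"]

for row in join_on_first_field(dorr, ptb):
    print(row)
# abandon abandon Abandoning
# abandon abandon abandoned
# abat abate abated
```

Stems that occur in only one of the two inputs (abhor and zoom in the sample) produce no output, which is the default behaviour of join as well.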

First, make sure that both are correctly sorted

    sort ptb.stems > ptb.sort
    sort dorr.stems > dorr.sort

Now do the join operation

    join dorr.sort ptb.sort > both.stems

Check

    wc both.stems
    6538 19614 143010 both.stems

    head both.stems
    abandon abandon Abandoning
    abandon abandon abandon
    abandon abandon abandoned
    abandon abandon abandoning
    abandon abandon abandons
    abat abate abate
    abat abate abated
    abat abate abates
    abat abate abating
    abdic abdicate abdicate

(here, the stem is in field 1, the key from Dorr's database in field 2 and the PTB keys in field 3)

STEP 5 - Using both.stems, extract from PTB. Use an awk program for this and reuse the trick from step 2. We read in the stem information first in the BEGIN statement. I chose to print out more than was asked for, including an indication of the position of the word in the sentence. It might be nice to have an indication of section number and sentence number within section. It's just about OK (in a throwaway program like this) to hardwire "both.stems" into the code. This is a deficiency that could

    BEGIN {
        # Read the stem table: field 3 is the PTB form, field 1 its stem.
        # Testing "> 0" avoids looping forever if both.stems cannot be read.
        while ((getline < "both.stems") > 0) {
            stem[$3] = $1
        }
    }
    {
        # Fields alternate word, tag; odd-numbered fields are the words.
        for (i = 1; i <= NF; i++) {
            if (i % 2 == 1 && stem[$i] && $(i+1) ~ /V/) {
                print stem[$i], $i, (i+1)/2, ":", $0
            }
        }
    }

then the calling code is

    gawk -f code/extract.awk data/*.tags > occurrences

    wc occurrences
    84707 5165512 22687990 occurrences

    head occurrences
    revit Revitalized 18 : In IN an DT Oct. NNP 19 CD review NN of IN `` `` The DT Misanthrope NN '' '' at IN Chicago NNP 's POS Goodman NNP Theatre NNP -LRB- -LRB- `` `` Revitalized VBN Classics NNS Take VBP the DT Stage NN in IN Windy NNP City NNP , , '' '' Leisure NN & CC Arts NNS -RRB- -RRB- , , the DT role NN of IN Celimene NNP , , played VBN by IN Kim NNP Cattrall NNP , , was VBD mistakenly RB attributed VBN to TO Christina NNP Haag NNP . .
    take Take 20 : In IN an DT Oct. NNP 19 CD review NN of IN `` `` The DT Misanthrope NN '' '' at IN Chicago NNP 's POS Goodman NNP Theatre NNP -LRB- -LRB- `` `` Revitalized VBN Classics NNS Take VBP the DT Stage NN in IN Windy NNP City NNP , , '' '' Leisure NN & CC Arts NNS -RRB- -RRB- , , the DT role NN of IN Celimene NNP , , played VBN by IN Kim NNP Cattrall NNP , , was VBD mistakenly RB attributed VBN to TO Christina NNP Haag NNP . .
    plai played 38 : In IN an DT Oct. NNP 19 CD review NN of IN `` `` The DT Misanthrope NN '' '' at IN Chicago NNP 's POS Goodman NNP Theatre NNP -LRB- -LRB- `` `` Revitalized VBN Classics NNS Take VBP the DT Stage NN in IN Windy NNP City NNP , , '' '' Leisure NN & CC Arts NNS -RRB- -RRB- , , the DT role NN of IN Celimene NNP , , played VBN by IN Kim NNP Cattrall NNP , , was VBD mistakenly RB attributed VBN to TO Christina NNP Haag NNP . .
    attribut attributed 45 : In IN an DT Oct. NNP 19 CD review NN of IN `` `` The DT Misanthrope NN '' '' at IN Chicago NNP 's POS Goodman NNP Theatre NNP -LRB- -LRB- `` `` Revitalized VBN Classics NNS Take VBP the DT Stage NN in IN Windy NNP City NNP , , '' '' Leisure NN & CC Arts NNS -RRB- -RRB- , , the DT role NN of IN Celimene NNP , , played VBN by IN Kim NNP Cattrall NNP , , was VBD mistakenly RB attributed VBN to TO Christina NNP Haag NNP . .
    plai plays 3 : Ms. NNP Haag NNP plays VBZ Elianti NNP . .
    expect expects 7 : Rolls-Royce NNP Motor NNP Cars NNPS Inc. NNP said VBD it PRP expects VBZ its PRP$ U.S. NNP sales NNS to TO remain VB steady JJ at IN about IN 1,200 CD cars NNS in IN 1990 CD . .
    remain remain 12 : Rolls-Royce NNP Motor NNP Cars NNPS Inc. NNP said VBD it PRP expects VBZ its PRP$ U.S. NNP sales NNS to TO remain VB steady JJ at IN about IN 1,200 CD cars NNS in IN 1990 CD . .
    anticip anticipates 12 : Howard NNP Mosher NNP , , president NN and CC chief JJ executive NN officer NN , , said VBD he PRP anticipates VBZ growth NN for IN the DT luxury NN auto NN maker NN in IN Britain NNP and CC Europe NNP , , and CC in IN Far JJ Eastern JJ markets NNS . .
    increas increased 4 : BELL NNP INDUSTRIES NNP Inc. NNP increased VBD its PRP$ quarterly NN to TO 10 CD cents NNS from IN seven CD cents NNS a DT share NN . .
    be be 5 : The DT new JJ rate NN will MD be VB payable JJ Feb. NNP 15 CD . .

STEP 6 - sort these occurrences

    sort occurrences > sort.occ

    abandon Abandoning 1 : Abandoning VBG socialism NN means NNS abandoning VBG the DT East JJ German JJ state NN 's POS reason NN for IN existence NN , , and CC with IN it PRP the DT justification NN for IN its PRP$ watchdogs NNS and CC its PRP$ Wall NNP . .
    abandon abandon 15 : Such JJ dignity NN `` `` has VBZ to TO do VB crucially RB with IN a DT butler NN 's POS ability NN not RB to TO abandon VB the DT professional JJ being NN he PRP inhabits VBZ . . '' ''
    abandon abandon 19 : One CD Colombian JJ drug NN boss NN , , upon IN hearing NN in IN 1987 CD that IN Gen. NNP Noriega NNP was VBD negotiating VBG with IN the DT U.S. NNP to TO abandon VB his PRP$ command NN for IN a DT comfortable JJ exile NN , , sent VBD him PRP a DT hand-sized JJ mahogany NN coffin NN engraved VBN with IN his PRP$ name NN . .
    abandon abandon 22 : Last JJ week NN , , Ford NNP encountered VBD a DT setback NN in IN its PRP$ effort NN to TO broaden VB its PRP$ U.S. NNP luxury NN offerings NNS when WRB it PRP was VBD forced VBN to TO abandon VB a DT four-year-old JJ effort NN to TO market VB its PRP$ German-built JJ Scorpio NNP sedan NN in IN the DT U.S. NNP as IN a DT luxury NN import NN under IN the DT Merkur NNP brand NN name NN . .
    abandon abandon 27 : Insisting VBG that IN they PRP are VBP protected VBN by IN the DT Voting NNP Rights NNP Act NNP , , a DT group NN of IN whites NNS brought VBD a DT federal JJ suit NN in IN 1987 CD to TO demand VB that IN the DT city NN abandon VB at-large JJ voting NN for IN the DT nine CD member NN City NNP Council NNP and CC create VB nine CD electoral JJ districts NNS , , including VBG four CD safe JJ white JJ districts NNS . .
    abandon abandon 27 : Pretoria NN releases VBZ the DT ANC NNP leaders NNS , , most JJS of IN whom WP were VBD serving VBG life NN sentences NNS , , and CC allows VBZ them PRP to TO speak VB freely RB , , hoping VBG that IN the DT ANC NNP will MD abandon VB its PRP$ use NN of IN violence NN . .
    abandon abandon 31 : The DT committee NN is VBZ formulating VBG Hong NNP Kong NNP 's POS constitution NN for IN when WRB it PRP reverts VBZ to TO Chinese JJ control NN in IN 1997 CD , , and CC Chinese JJ lawmakers NNS said VBD the DT two CD can MD only RB return VB if IN they PRP `` `` abandon VBP their PRP$ antagonistic JJ stand NN against IN the DT Chinese JJ government NN and CC their PRP$ attempt NN to TO nullify VB the DT Sino-British JJ joint NN declaration NN on IN Hong NNP Kong NNP . . '' ''
    abandon abandon 35 : In IN what WP amounts NNS to TO an DT admission NN that IN the DT transition NN has VBZ n't RB gone VBN as RB smoothly RB as IN Sears NNS had VBD hoped VBN , , the DT giant JJ retailer NN is VBZ now RB trying VBG new JJ ways NNS to TO drum VB up RP business NN without IN appearing VBG to TO abandon VB its PRP$ seven-month-old JJ strategy NN . .
    abandon abandon 36 : In IN a DT hearing NN before IN the DT House NNP Ways NNPS and CC Means NNPS Committee NNP , , the DT General NNP Accounting NNP Office NNP and CC the DT Congressional NNP Budget NNP Office NNP , , which WDT both DT are VBP arms NNS of IN Congress NNP , , advised VBD the DT new JJ S&L NN bailout NN agency NN to TO abandon VB plans NNS to TO raise VB temporary JJ working JJ capital NN through IN debt NN issued VBN from IN an DT agency NN that WDT would MD n't RB be VB counted VBN on IN the DT federal JJ budget NN . .
    abandon abandon 37 : This DT is VBZ no DT place NN for IN pedestrians NNS , , but CC at IN 7:30 CD on IN a DT recent JJ morning NN , , when WRB construction NN choked VBD traffic NN at IN the DT famous JJ Four CD Corners NNPS intersection NN to TO one CD lane NN , , a DT taxi NN passenger NN found VBD it PRP faster JJR to TO abandon VB the DT cab NN and CC walk NN to TO her PRP$ destination NN . .
    ...

Congratulations if you got this far!
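
For comparison, the Step 5 extraction can also be sketched in Python, with the stem table passed in as an argument rather than hardwired into the code. This is only a sketch under assumptions: the function names load_stems and extract are hypothetical, the sample both.stems line is invented, and it handles one tagged sentence at a time instead of reading data/*.tags.

```python
def load_stems(lines):
    """Map each PTB form (field 3 of both.stems) to its stem (field 1)."""
    stems = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            stems[fields[2]] = fields[0]
    return stems


def extract(tagged_sentence, stems):
    """Return (stem, form, word position) for verb-tagged words with a known stem."""
    tokens = tagged_sentence.split()
    hits = []
    # Tokens alternate word, tag, so step through them two at a time.
    for k in range(0, len(tokens) - 1, 2):
        word, tag = tokens[k], tokens[k + 1]
        # Same test as the awk program: the tag contains a V and the form is known.
        if "V" in tag and word in stems:
            hits.append((stems[word], word, k // 2 + 1))
    return hits


# Invented stem table line in the "stem dorr-form ptb-form" layout.
stems = load_stems(["plai play plays"])
print(extract("Ms. NNP Haag NNP plays VBZ Elianti NNP . .", stems))
# [('plai', 'plays', 3)]
```

The position is counted in words, not fields, so it matches the (i+1)/2 value printed by the awk program.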