###########################################################################
# Wagon feature (i.e. datafile) extractor from ALL THREE CORPORA
# Written by Kyuchul Yoon ( kyoon@ling.osu.edu )
# Extracts Wagon features from one speaker (consisting of 400 TextGrids).
# For each token, it extracts features contained in its interval (or column) across multiple tiers (or rows)
# The script assumes that you already have the TextGrid files labelled by professional K-ToBI labelers.
# The script will read in all ejk-???.TextGrid.lab.wagon.36 files one by one from the subdirectory;
# ..\06.corpus-EJK\corpus-original\10.wagon-features\36.romanized.current.token
# and write an output file ejk.ALL.wagon to the current directory where the script is located.
#
# Features include  (0) PREDICTEE: types of boundaries (AP, IP, or none)
#    PREDICTORS     (1) morpheme identity (morph/POS pairs)
#                   (2) token length in syllables    <== non-syntactic features (& below)
#                   (3) distance in syllables from previous comma
#                   (4) distance in syllables to the following comma,
#                   (5) distance in eojeols from previous comma, (6) to the following comma,
#                   (7) distance in syllables from previous AP (...a)
#                   (8) distance in syllables from previous IP (...%)
#                   (9) distance in eojeols from previous AP
#                   (10) distance in eojeols from previous IP
#                   (11) distance in syllables from sentence beginning
#                   (12) distance in syllables from sentence end
#                   (13) distance in eojeols from sentence beginning
#                   (14) distance in eojeols from sentence end
#                   (15) POS of current token    <== morpho-syntactic features (& below)
#                   (16) POS of -3 preceding token
#                   (17) POS of -2 preceding token
#                   (18) POS of -1 preceding token
#                   (19) POS of +1 following token
#                   (20) POS of +2 following token
#                   (21) POS of +3 following token
#                   (22) phrasal category of terminal node of current token    <== syntactic features (& below)
#                   (23) phrasal category of terminal node of -3 preceding token
#                   (24) phrasal category of terminal node of -2 preceding token
#                   (25) phrasal category of terminal node of -1 preceding token
#                   (26) phrasal category of terminal node of +1 following token
#                   (27) phrasal category of terminal node of +2 following token
#                   (28) phrasal category of terminal node of +3 following token
#                   (29) phrasal category of pre-terminal node of current token
#                   (30) phrasal category of pre-terminal node of -3 preceding token
#                   (31) phrasal category of pre-terminal node of -2 preceding token
#                   (32) phrasal category of pre-terminal node of -1 preceding token
#                   (33) phrasal category of pre-terminal node of +1 following token
#                   (34) phrasal category of pre-terminal node of +2 following token
#                   (35) phrasal category of pre-terminal node of +3 following token
#  For convenience  (36) romanized token (Not to be used in actual training process)
# Most features are just copied from each cell (i.e., an interval in a tier).
# However, features (1), (22), and (29) are scanned and user-defined string values are chosen.
##########################################################################

# Specify files and folders
# Note for cases where the interval tier (tier 1) and phonology tier (tier 2) have not been
# synchronized in point/interval placement: if the gaps are big, a tolerance value would be
# needed. This form defines no tolerance field, and the comparison below is exact.
form Select files
    word subFolderToProcess ..\06.corpus-EJK\corpus-original\10.wagon-features\36.romanized.current.token
    word fileExtOfDoneFiles wagon.36
    integer wordTier 1
    integer predicteeTier 2
    integer startOfPredictorTiers 6
    integer morphIdentityTier 6
    integer startOfPhrCatTier 27
    integer endOfPhrCatTier 40
    word outputFileName ejk.ALL-trial-01.wagon.datafile
    word dummyFileNumPointer FILE.PROCESSED.MOMENT.AGO---
endform

# If an old output file exists, delete it first and then write out the new file
filedelete 'outputFileName$'

# Get the list of filenames of TextGrid.done files
Create Strings as file list... fileList 'subFolderToProcess$'\*.'fileExtOfDoneFiles$'
Sort
numFiles = Get number of strings
pause 'numFiles' labeled textgrids identified. Continue?

# Loop through each file
for iFile to numFiles
    select Strings fileList
    # Get the name for a TextGrid.done file
    doneFile$ = Get string... iFile
    Read from file... 'subFolderToProcess$'\'doneFile$'
    Rename... textgrid
    numTiers = Get number of tiers
    numIntervals = Get number of intervals... 1
    # Counter for phonology tier
    iPhonoTier = 1
    for iToken from 2 to (numIntervals-1)
        # Identify the predictee values first
        ################# AP/IP boundary type detector ####################
        ### Get the RHS end time of the word tier interval and compare that ###
        ### with the time point of the phonology tier. If they coincide, print it ###
        ##########################################################
        endTimeOfInterval = Get end point... wordTier iToken
        pointLabel$ = Get label of point... predicteeTier iPhonoTier
        typeOfBoundary$ = right$(pointLabel$, 1)
        timeOfPointLabel = Get time of point... predicteeTier iPhonoTier
        # If the boundary exists within the prosodic word, skip it and go to the next phono tier point
        # Go ahead and extract the point label from the phonology (= predictee) point tier
        if timeOfPointLabel = endTimeOfInterval
            # If the boundary is an accentual phrase or an intonational phrase
            if (typeOfBoundary$ = "a" or typeOfBoundary$ = "%")
                fileappend 'outputFileName$' 'pointLabel$''tab$'
                # Increase the iPhonoTier by one
                iPhonoTier = iPhonoTier +1
            endif
        # Otherwise, do not print the boundary type
        # There are two cases for this. One is the boundary exists "within" the token
        # The other is the case where the endTimeOfInterval corresponds to the token boundary
        # as in "sa-lam/eun" where the AP boundary comes after the "eun".
        # The algorithm here is to figure out which comes first, the endTimeOfInterval
        # or the timeOfPointLabel.
        else
            # If the "endTimeOfInterval" comes after the timeOfPointLabel,
            # then skip the boundary by increasing the iPhonoTier index by one.
            if endTimeOfInterval > timeOfPointLabel
                iPhonoTier = iPhonoTier +1
                fileappend 'outputFileName$' 0'tab$'
            # Otherwise, i.e. endTimeOfInterval comes before the timeOfPointLabel,
            # then proceed without increasing the iPhonoTier index.
            else
                fileappend 'outputFileName$' 0'tab$'
            endif
        endif

        # And then loop through each "tier" for the current token to extract the interval text
        for iTier from startOfPredictorTiers to numTiers
            ################################
            #### Choose which features to use ####
            #### Only applies to the tiers specified ####
            #### in the form above. ####
            ################################
            intervalText$ = Get label of interval... iTier iToken
            if iTier = morphIdentityTier
                # If the intervalText$ is one of the morphemes listed below,
                # then set it accordingly, otherwise, set it to zero
                ############ Try other categories and see the prediction performance!!!
                if (intervalText$ = "eun/PAU" or intervalText$ = "do/PAU"
                ... or intervalText$ = "i/PCA" or intervalText$ = "go/ECS"
                ... or intervalText$ = "go/PAD" or intervalText$ = "eu-myeo/ECS"
                ... or intervalText$ = "eu-myeon/ECS" or intervalText$ = "eo/ECS"
                ... or intervalText$ = "myeon-seo/ECS" or intervalText$ = "neun-de/ECS")
                    fileappend 'outputFileName$' 'intervalText$''tab$'
                else
                    fileappend 'outputFileName$' 0'tab$'
                endif
            elsif (iTier >= startOfPhrCatTier and iTier <= endOfPhrCatTier)
                ############ Try other categories and see the prediction performance!!!
                if (intervalText$ = "NP-SBJ" or intervalText$ = "NP-ADV"
                ... or intervalText$ = "S-COMP" or intervalText$ = "VP"
                ... or intervalText$ = "NP")
                    fileappend 'outputFileName$' 'intervalText$''tab$'
                else
                    fileappend 'outputFileName$' 0'tab$'
                endif
            else
                fileappend 'outputFileName$' 'intervalText$''tab$'
            endif
        endfor
        fileappend 'outputFileName$' 'newline$'
    endfor
    #### To add a blank line between sentences, uncomment the following line
    # fileappend 'outputFileName$' 'newline$'

    #### FILE COUNTER. Since the script seems to take quite some time, it's necessary to notify the user of the progress
    #### Since Praat appears to be frozen when it's working, it'd be better to create a "dummy" file number pointer file
    # If an old fileIndex file exists, delete it first and then write out the new fileIndex file
    filedelete 'fileIndex$'
    fileIndex$ = dummyFileNumPointer$ + doneFile$
    fileappend 'fileIndex$' Nothing

    # Comment out the following two lines when running the script seriously
    Edit
    pause
    Remove
endfor

select Strings fileList
Remove
#### END OF SCRIPT ####
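
#### OPTIONAL SKETCH, NOT PART OF THE ORIGINAL SCRIPT ####
# The form comments above mention a tolerance for tiers whose points and interval
# edges are not perfectly synchronized, but the boundary detector compares
# timeOfPointLabel and endTimeOfInterval for exact equality.
# The procedure below is a minimal sketch of a tolerance-based comparison.
# ASSUMPTIONS: the procedure name "withinTolerance", its arguments (.t1, .t2, .tol),
# and the 0.005-second value in the usage lines are hypothetical and do not come
# from the original script. Nothing above calls this procedure, so adding it does
# not change the behaviour of the script.
# Possible usage inside the iToken loop, in place of the exact comparison:
#     call withinTolerance timeOfPointLabel endTimeOfInterval 0.005
#     if withinTolerance.result = 1
procedure withinTolerance .t1 .t2 .tol
    # .result is 1 when the two times differ by at most .tol seconds, else 0
    .result = 0
    if abs (.t1 - .t2) <= .tol
        .result = 1
    endif
endproc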