Procedures:
At folder /home/projects/ameritech/korean/scripts/, I ran the following command
>> cat ../data/newlts/byfreq.*.txt.out | fest2hyphenated > EVAL.list &
to produce the 'festival-output' (hyphenated version) of all the files in the newlts/ folder.
I added a column of eojeol frequencies by
>> paste EVAL.list freq.col > EVAL-freq.list
where freq.col is the eojeol frequency column extracted from the eval-freq-len-rules.list file in the eval-list/ folder. Then I added a column of syllable lengths by
>> paste EVAL-freq.list len-col > EVAL-freq-len.list
where len-col is likewise pre-extracted from the eval-freq-len-rules.list file.
I did the same with the #rules column
>> paste EVAL-freq-len.list rules-col > EVAL-freq-len-rules.list
I dropped the frequency column from eval-freq.list by
>> cat eval-freq.list | awk '{print $1}' > eval.list
and then pasted it to EVAL-freq-len-rules.list file by
>> paste eval.list EVAL-freq-len-rules.list > eval-EVAL-freq-len-rules.list
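So the last file has five columns (the awk commands further down use $3, $4 and $5 for the last three):
<eval (festival input)> <EVAL (festival output)> <eojeol freq> <syllable length> <#rules>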
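Because paste joins files line by line, it silently misaligns columns when the inputs differ in length. Below is a minimal sketch of the same assembly guarded by a line-count check. The helper name same_line_count and all sample rows are made up for illustration; only the file names come from the steps above.

```shell
#!/bin/sh
# Sketch only: the sample contents are invented, not the project data.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'gug-min\nhag-gyo\n'   > eval.list    # hypothetical festival input
printf 'gung-min\nhag-ggyo\n' > EVAL.list    # hypothetical festival output
printf '12\n3\n' > freq.col
printf '2\n2\n'  > len-col
printf '1\n1\n'  > rules-col

# same_line_count FILE...: succeed only if all files have the same number of lines.
same_line_count() {
    first=$(wc -l < "$1")
    for f in "$@"; do
        if [ "$(wc -l < "$f")" -ne "$first" ]; then
            echo "line count differs: $f" >&2
            return 1
        fi
    done
    return 0
}

# Paste only when every column file lines up with the others.
if same_line_count eval.list EVAL.list freq.col len-col rules-col; then
    paste eval.list EVAL.list freq.col len-col rules-col > eval-EVAL-freq-len-rules.list
fi
cat eval-EVAL-freq-len-rules.list
```

Pasting all five files in one call gives the same column order as the step-by-step paste commands above.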
Then (in phonetizer/ folder) I deromanized eval.list (input to festival) and EVAL.list (output from festival) by
>> cat ../data/eval-list/eval.list | deromanize.awk > eval.list.deromanized
>> cat ../data/eval-list/EVAL.list | deromanize.awk > EVAL.list.deromanized
===========================SKIP===============================================
With the master file "eval-EVAL-freq-len-rules.list", I divided it into four frequency bands (1, 10, 100)
0<= ~ <= 1
1< ~ <= 10
10< ~ <=100
100< ~
>> cat eval-EVAL-freq-len-rules.list | awk '$3 > 100' > high
>> cat eval-EVAL-freq-len-rules.list | awk '$3 > 10 && $3 <= 100' > med-high
>> cat eval-EVAL-freq-len-rules.list | awk '$3 > 1 && $3 <= 10' > med-low
PROBLEM FOUND
wc of eval.list and EVAL.list do not match!!
==============================================================================
With the master file "eval-EVAL-freq-len-rules.list", I divided it into three frequency bands
1
2-9
10+
>> cat eval-EVAL-freq-len-rules.list | awk '$3 == 1' > LOW
>> cat eval-EVAL-freq-len-rules.list | awk '$3 >=2 && $3 < 10' > MED
>> cat eval-EVAL-freq-len-rules.list | awk '$3 >= 10' > HIGH
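The three band files can also be written in a single awk pass. This is a sketch using the same thresholds as the commands above; the file master.list and its rows are made-up stand-ins for the real master file.

```shell
#!/bin/sh
# Sketch only: master.list holds invented rows in the five-column layout
# (input, festival output, frequency, syllable length, #rules).
work=$(mktemp -d)
cd "$work" || exit 1
printf 'a\tA\t1\t2\t2\nb\tB\t2\t3\t3\nc\tC\t9\t2\t2\nd\tD\t10\t4\t4\ne\tE\t250\t2\t5\n' > master.list

# Route each row to its band file by the frequency in column 3.
awk '
    $3 == 1            { print > "LOW"  }
    $3 >= 2 && $3 < 10 { print > "MED"  }
    $3 >= 10           { print > "HIGH" }
' master.list

wc -l LOW MED HIGH
```

Unlike a shell redirection, awk creates a band file only when at least one row falls into it.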
Then by syllable length
>> cat HIGH | awk '$4==2' > HIGH-2_syl
>> cat HIGH | awk '$4==3' > HIGH-3_syl
>> cat HIGH | awk '$4==4' > HIGH-4_syl
>> cat HIGH | awk '$4==5' > HIGH-5_syl
>> cat HIGH | awk '$4>5' > HIGH-6_more_syl
I did the same with the MED and LOW files.
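Repeating the syllable-length split for all three bands can be sketched as a loop. The band file names and thresholds follow the commands above; the rows are invented sample data.

```shell
#!/bin/sh
# Sketch only: HIGH, MED and LOW contain made-up rows; column 4 is syllable length.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'a\tA\t10\t2\t2\nb\tB\t20\t3\t3\nc\tC\t30\t6\t2\nd\tD\t40\t7\t4\n' > HIGH
printf 'e\tE\t5\t5\t2\n' > MED
printf 'f\tF\t1\t2\t3\n' > LOW

for band in HIGH MED LOW; do
    # Exact lengths 2 to 5 each get their own file.
    for n in 2 3 4 5; do
        awk -v n="$n" '$4 == n' "$band" > "$band-${n}_syl"
    done
    # Everything longer than 5 syllables is pooled.
    awk '$4 > 5' "$band" > "$band-6_more_syl"
done

wc -l ./*_syl
```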
Then by # of rules
>> cat HIGH-2_syl | awk '$5==2' > HIGH-2_syl-1_rules
>> cat HIGH-2_syl | awk '$5==3' > HIGH-2_syl-2_rules
>> cat HIGH-2_syl | awk '$5==4' > HIGH-2_syl-3_rules
>> cat HIGH-2_syl | awk '$5>4' > HIGH-2_syl-4_more_rules
I did the same with the other files.
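The split by number of rules can be sketched as a loop over every *_syl file. The column-5 tests and the rule suffixes are copied as-is from the commands above; the output naming (input file name plus suffix) and the sample rows are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: the two *_syl files hold invented rows; column 5 is the #rules value.
work=$(mktemp -d)
cd "$work" || exit 1
printf 'a\tA\t10\t2\t2\nb\tB\t11\t2\t3\nc\tC\t12\t2\t4\nd\tD\t13\t2\t7\n' > HIGH-2_syl
printf 'e\tE\t10\t3\t2\n' > HIGH-3_syl

for f in *_syl; do
    # base is the input file name; each matching row goes to base + suffix.
    awk -v base="$f" '
        $5 == 2 { print > (base "-1_rules") }
        $5 == 3 { print > (base "-2_rules") }
        $5 == 4 { print > (base "-3_rules") }
        $5 >  4 { print > (base "-4_more_rules") }
    ' "$f"
done

ls
```

A cell with no matching rows gets no file here, whereas the shell redirections above would leave an empty one.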
Selecting evaluation tokens:
About 800 tokens from the 12 cells of the TABLE, i.e. 66 tokens from each cell (800/12, rounded down).
The total number of tokens for the first cell (upper-left) is 59957.
So 59957/66 = 908 (rounded down). At .../TABLE/ folder, to select every 908th token,
>> cat HIGH/*-1_rules | awk 'NR % 908 == 0' > SELECTED/row_1-col_1
For N tokens in a cell, to select K of them, compute
N/K = m (rounded down)
and then select every m-th token; this yields about K tokens. N.B. K=66 tokens.
The TABLE with the # of tokens for each cell follows. Numbers in () are m.
==========
high freq.   #rules        1         2         3   4 or more
==========   -----------------------------------------------
             #_syl.    59957     35552     12601        2618
                       (908)     (538)     (190)        (39)
==========
med freq.
==========   -----------------------------------------------
             #_syl.   210178    149829     60917       14826
                      (3184)    (2270)     (922)       (224)
==========
low freq.
==========   -----------------------------------------------
             #_syl.   211697    162143     70591       18652
                      (3207)    (2456)    (1069)       (282)
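The N/K = m selection can be wrapped in a small function. This is a sketch, not the script actually used: the name select_k is hypothetical, and the input is a generated stand-in for a cell with N = 2618 tokens.

```shell
#!/bin/sh
# select_k FILE K: print every m-th line of FILE, where m = N / K rounded down.
select_k() {
    n=$(wc -l < "$1")
    m=$((n / $2))
    [ "$m" -ge 1 ] || m=1        # fewer than K lines: keep every line
    awk -v m="$m" 'NR % m == 0' "$1"
}

work=$(mktemp -d)
# Stand-in cell: the numbers 1..2618, one per line (N = 2618).
awk 'BEGIN { for (i = 1; i <= 2618; i++) print i }' > "$work/cell"

select_k "$work/cell" 66 | wc -l
```

Because m is rounded down, the result can exceed K: with N = 2618 and K = 66, m is 39 and 67 lines are selected.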
For the rest of the cells,
>> cat HIGH/*-2_rules | awk 'NR % 538 == 0' > SELECTED/row_1-col_2
and likewise for the remaining cells.
I got 793 tokens from this procedure.
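The total of 793 rather than 12 x 66 = 792 follows from rounding m down. As a check, the sketch below recomputes, for each cell count in the TABLE, the step m and the number of lines with NR % m == 0. The function name selection_report is made up for this example.

```shell
#!/bin/sh
# selection_report K: for each cell size N from the TABLE, print m = int(N/K)
# and the number of selected lines int(N/m), then the grand total.
selection_report() {
    printf '%s\n' 59957 35552 12601 2618 \
                  210178 149829 60917 14826 \
                  211697 162143 70591 18652 |
    awk -v K="$1" '
        {
            m = int($1 / K)       # step size, rounded down
            got = int($1 / m)     # how many line numbers up to N are multiples of m
            total += got
            printf "N=%d m=%d selected=%d\n", $1, m, got
        }
        END { print "total", total }
    '
}

selection_report 66
```

Only the cell with N = 2618 (m = 39) yields 67 tokens; the other eleven yield 66 each.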
=======================================
Now for the preparation of deromanized evaluation file
I modified the printf call in "deromanize.awk" to insert two tabs, which makes the output easier to handle with awk later.
At .../phonetizer/ folder
>> ./deromanize-kyoon.awk ../data/eval-list/TABLE/SELECTED/EVAL-TOKENS-written-col > ../data/eval-list/TABLE/SELECTED/EVAL-TOKENS-written-col-deromanized
I did the same for the spoken column.
Then I extracted only Korean syllables by writing an awk script "hangul-col.awk" in .../script/kyoon/ folder
--------------------------
#! /opt/gnu/bin/gawk -f
# Print only the even-numbered fields of each line, each followed by a tab.
{
    for (i = 2; i <= NF; i += 2) {
        printf("%s\t", $i);
    }
    printf("\n");
}
---------------------------
which extracts only even-numbered fields.
>> cat EVAL-TOKENS-written-col-deromanized | hangul-col.awk > EVAL-TOKENS-written-hangul
>> cat EVAL-TOKENS-spoken-col-deromanized | hangul-col.awk > EVAL-TOKENS-spoken-hangul
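To illustrate what the even-field extraction does, here is a sketch that inlines the same awk logic in a shell function. The name even_fields is hypothetical. The sample line assumes that romanized and Hangul fields alternate, which is inferred from the script keeping only even-numbered fields.

```shell
#!/bin/sh
# even_fields: same logic as hangul-col.awk, inlined so the example is self-contained.
even_fields() {
    awk '{
        for (i = 2; i <= NF; i += 2) {
            printf("%s\t", $i)
        }
        printf("\n")
    }'
}

# Made-up sample line: romanized syllable, Hangul syllable, romanized, Hangul.
printf 'han\t한\tgug\t국\n' | even_fields
```

Each kept field is followed by a tab, so every output line ends with a trailing tab.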
Then I added hangul columns (written and spoken) to our original EVAL-TOKENS files for correction.
**** To get rid of the 67th token from file "row_1-col_4" of the TABLE (N/m = 2618/39 gives 67 tokens, one more than K = 66), I did
>> cat REVISED-TEST-SHEET | awk '$1 != "nam-nyeo-ca-byeol-geum-ji"' | wc
which gave me 792 tokens, i.e. one token removed from the 793.
Then I prepared an answer sheet using answer.awk script.
>> cat REVISED-TEST-SHEET | awk '{print $1}' | answer.awk > REVISED-answer-sheet
I created a random number list (792 lines) by
>> cat file_of_792_lines | awk '{print rand()}' > randnum
>> paste REVISED-TEST-SHEET REVISED-answer-sheet > FINAL-TEST-SHEET
>> paste randnum FINAL-TEST-SHEET > FINAL-TEST-SHEET-randnum
>> cat FINAL-TEST-SHEET-randnum | sort +0 > RANDOMIZED-FINAL-TEST-SHEET
I also created a Hangul-only version of this file.
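The randomization above (prefix a random key, sort on it) can be sketched as one pipeline that also strips the key afterwards. The function name shuffle_lines and the sample sheet are made up. The sketch uses `sort -k 1,1n` in place of the obsolete `+0` key syntax, which current sort implementations may reject. It also seeds rand() explicitly, since several awk implementations repeat the same sequence on every run unless srand() is called.

```shell
#!/bin/sh
# shuffle_lines SEED FILE: decorate each line with a random key, sort by the key,
# then remove the key again.
shuffle_lines() {
    awk -v seed="$1" '
        BEGIN { srand(seed) }
        { printf "%.9f\t%s\n", rand(), $0 }
    ' "$2" |
        LC_ALL=C sort -k 1,1n |
        cut -f 2-
}

work=$(mktemp -d)
# Made-up four-line test sheet: token column plus answer column.
printf 'tok1\tans1\ntok2\tans2\ntok3\tans3\ntok4\tans4\n' > "$work/sheet"

shuffle_lines 42 "$work/sheet"
```

Keeping the seed in the notes makes the shuffled order reproducible with the same awk implementation.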
======================
After a minor revision to the tensify rule section of the krlex.scm file (i.e. "nj d" becomes "n dd"), I ran our evaluation tokens through cblts twice: first with the cblts script (using krlex.scm before the revision), then with cblts2 (using krlex.scm after the revision). The files are at .../SELECTED/7.compare
I extracted the two columns (before and after the revision) and kept only the records where they differ
>> cat FINAL-compare-eng-randomized-sorted | awk '$1 != $2' > FINAL-where-two-differ
There were 44 pairs (before and after) where the two krlex.scm files produced different output.
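The before/after comparison can be sketched on a toy file. The function name differing_pairs and the two sample rows are invented; column 1 stands for the output before the revision and column 2 for the output after it.

```shell
#!/bin/sh
# differing_pairs FILE: print the rows whose first two columns differ.
differing_pairs() {
    awk '$1 != $2' "$1"
}

work=$(mktemp -d)
# Made-up before/after pairs: the first row differs, the second does not.
printf 'an-da\tan-dda\nga-da\tga-da\n' > "$work/compare"

differing_pairs "$work/compare"
differing_pairs "$work/compare" | wc -l
```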
Further analysis of 44 pairs needed here
====>>>
=======================
n-l and l-n tokens
To find any pattern for tokens containing n-l or l-n sequences, at .../lts folder, with all .txt files
>> cat *.txt | grep 'n-l' > conflict-n-l-tokens (wc = 24799)
>> cat *.txt | grep 'l-n' > conflict-l-n-tokens (wc = 1874)
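As a quick check of the two searches, the sketch below counts tokens containing a hyphen-separated n-l or l-n sequence using a fixed-string match. The function name count_seq and the sample tokens are made up.

```shell
#!/bin/sh
# count_seq PATTERN FILE: count the lines of FILE containing PATTERN literally.
count_seq() {
    grep -c -F -e "$1" "$2"
}

work=$(mktemp -d)
# Invented romanized tokens: two contain n-l, one contains l-n, one contains neither.
printf 'sin-la\ncheon-li\nseol-nal\nhan-gug\n' > "$work/tokens.txt"

count_seq 'n-l' "$work/tokens.txt"
count_seq 'l-n' "$work/tokens.txt"
```

With -F the hyphen needs no escaping, and a token counts once even if the sequence occurs in it more than once.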
Then I ran the cblts script on the two 'conflict...' files and deromanized the results.
>> ./cblts2 ../data/eval-list/TABLE/SELECTED/7.compare/conflict-n-l-tokens
which produces the file "conflict-n-l-tokens.out".
Then I ran fest2hyphenated script and deromanize-kyoon.awk piped with hangul-col2.awk. The resulting files are
conflict-n-l-tokens.out.deromanized
conflict-l-n-tokens.out.deromanized