How to turn a corpus into a ruleset and lexicon:
================================================

1) Cull unwanted sentences from corpus.
     msxsl bigcorpus.xml cullcorpus.xsl > smallcorpus.xml

2) Add order information to the corpus.
     perl addorder.perl < smallcorpus.xml > ocorpus.xml

3) Untangle the corpus, producing a more useful XML document.
     msxsl ocorpus.xml sent.xsl > useful.xml

4) Extract plain-text rules.
     msxsl useful.xml textrule.xsl > rules.txt

5) Extract lexicon.
     msxsl useful.xml lexmaker.xsl > lex.txt

6) Remove duplicate rules and duplicate lexical entries.
     e:\cygwin\bin\sort -u < rules.txt > gram.txt
     e:\cygwin\bin\sort -u < lex.txt > clex.txt

7) Pivot rules around optional categories.
     perl pivot.perl < gram.txt > mpgram.txt

8) Remove duplicate rules again.
     e:\cygwin\bin\sort -u < mpgram.txt > pgram.txt

9) Cluster the rules.
     perl ordercluster.perl < pgram.txt > clgram.txt

10) Extract LP information from clusters.
     perl orderg12n.perl < clgram.txt > gidlp.txt

11) Turn the text rules into a Prolog grammar.
     perl grammarmaker.perl < gidlp.txt > gram.pl

12) Turn the lexicon into a Prolog lexicon.
     perl lexfilter.perl < clex.txt > lex.pl

13) Generate a set of test sentences.
     msxsl smallcorpus.xml sents.xsl > sents.txt

14) Turn the test sentences into Prolog queries.
     perl sentfilter.perl < sents.txt > sents.pl
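
The steps above can be chained from a single driver script. What follows is only a sketch under stated assumptions, not part of the original toolchain. It assumes msxsl and perl are on the PATH, that the sort binary lives at the path used in step 6, and that all stylesheets and Perl scripts sit in the current directory. The helper names (xsl, perl, uniq, run) and the --run flag are hypothetical, introduced here for illustration. By default the script only prints the commands in order.

```python
import subprocess
import sys

# Assumption: path of the Unix-style sort used in steps 6 and 8; adjust as needed.
SORT = r"e:\cygwin\bin\sort"


def xsl(src, sheet, out):
    """msxsl SRC SHEET > OUT (input is passed as an argument, not on stdin)."""
    return (["msxsl", src, sheet], None, out)


def perl(script, src, out):
    """perl SCRIPT < SRC > OUT"""
    return (["perl", script], src, out)


def uniq(src, out):
    """sort -u < SRC > OUT"""
    return ([SORT, "-u"], src, out)


# One entry per command line in the steps above: (argv, stdin file, stdout file).
STEPS = [
    xsl("bigcorpus.xml", "cullcorpus.xsl", "smallcorpus.xml"),  # 1
    perl("addorder.perl", "smallcorpus.xml", "ocorpus.xml"),    # 2
    xsl("ocorpus.xml", "sent.xsl", "useful.xml"),               # 3
    xsl("useful.xml", "textrule.xsl", "rules.txt"),             # 4
    xsl("useful.xml", "lexmaker.xsl", "lex.txt"),               # 5
    uniq("rules.txt", "gram.txt"),                              # 6
    uniq("lex.txt", "clex.txt"),                                # 6
    perl("pivot.perl", "gram.txt", "mpgram.txt"),               # 7
    uniq("mpgram.txt", "pgram.txt"),                            # 8
    perl("ordercluster.perl", "pgram.txt", "clgram.txt"),       # 9
    perl("orderg12n.perl", "clgram.txt", "gidlp.txt"),          # 10
    perl("grammarmaker.perl", "gidlp.txt", "gram.pl"),          # 11
    perl("lexfilter.perl", "clex.txt", "lex.pl"),               # 12
    xsl("smallcorpus.xml", "sents.xsl", "sents.txt"),           # 13
    perl("sentfilter.perl", "sents.txt", "sents.pl"),           # 14
]


def run(steps, dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    for argv, src, out in steps:
        print(" ".join(argv) + (f" < {src}" if src else "") + f" > {out}")
        if dry_run:
            continue
        fin = open(src, "rb") if src else None
        try:
            with open(out, "wb") as fout:
                # check=True stops the pipeline at the first failing step.
                subprocess.run(argv, stdin=fin, stdout=fout, check=True)
        finally:
            if fin is not None:
                fin.close()


if __name__ == "__main__":
    # Hypothetical flag: pass --run to execute instead of just listing.
    run(STEPS, dry_run="--run" not in sys.argv[1:])
```

Keeping the steps as data makes the file dependencies easy to inspect: every input is either bigcorpus.xml or the output of an earlier step, so a failure at any point leaves the earlier intermediate files in place for debugging.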