# Goal: Turn # # A -> C B # A -> B C # - # A -> B D # - # A -> B D C # A -> B C D # - # B -> B E # - # B -> B E C D # - # A -> B B B C # A -> B B C B # A -> B C B B # - # # into # # A -> B C # A -> B D; 0 < 1 # A -> B C D; 0 < 1, 0 < 2 # B -> B E; 0 < 1 # B -> B C D E; 3 < 1, 3 < 2, 0 < 1, 0 < 2, 0 < 3, 1 < 2 # A -> B B B C; 0 < 1, 1 < 2, 0 < 2, (0 < 3) # # (Justification for the last rule: by using the same nonterminal, the # annotator is saying that the terminals are indistinguishable. So the # question of "which" B precedes the C must be moot (or else it would # have a different tag). Specifying an order among the # indistinguishables therefore does nothing but reduce spurious # ambiguities.) # # Idea: For each group: # # 1. sort the rhs: (B, C, D) # 1. determine the canonical order (lexical before phrasal) # 2. for each item in the group generate the order constraints: # (B < D, B < C, D < C) # 3. intersect the sets so generated (B < D, B < C) # 4. create a mapping from categories to indices (B=0, C=1, D=2) # 5. output the resulting rule (A -> B C D; 0 < 2, 0 < 1) use Set::Scalar; $first = 1; $curctr = 0; $lhs = ""; $constraints = (); $srhslist = (); $dupes = 0; while (<>) { if (s/=>/->/) { chop; # the space after the semicolon is required. # otherwise, the main regex in grammarmaker won't match print "$_; \n"; next; } if (m/^-\n/) { for (my $i = 1; $i < $curctr; $i++) { $constraints[0] *= $constraints[$i]; } print "$lhs -> "; print join(" ", @srhslist); print "; "; $listfirst = 1; while (defined(my $e = $constraints[0]->each)) { print ", " unless $listfirst; ($l, $r) = ($e =~ m/(.*) < (.*)/); print "$currule{$l} < $currule{$r}"; $listfirst = 0; } if ($dupes) { $curstart = 0; $curitem = $srhslist[0]; for (my $i = 0; $i < @srhslist; $i++) { if ($curitem ne $srhslist[$i]) { if ($i > $curstart + 1) { for (my $k = 0; $k < $i - 1; $k++) { for (my $l = $k + 1; $l <= $i - 1; $l++) { print ", " unless $listfirst; print "$k < $l"; $listfirst = 0; } } } $curitem = $item; $curstart = $i; } } } print "\n"; $first = 1; $curctr = 0; undef @constraints; undef %currule; $dupes = 0; } else { ($lhs, $rhs) = m/(.*) -> (.*)/; @rhslist = split(" ", $rhs); if ($first) { $first = 0; @lexlist = (); @phrlist = (); for (my $i = 0; $i < @rhslist; $i++) { if ($rhslist[$i] =~ m/AA:|AP:|AVP:|CAC:|CAP:|CAVP:|CCP:|CH:|CNP:|CO:|CPP:|CS:|CVP:|CVZ:|DL:|ISU:|MPN:|MTA:|NM:|NP:|PP:|QL:|S:|VP:|VZ:/) { push @phrlist, $rhslist[$i]; } else { push @lexlist, $rhslist[$i]; } } @srhslist = sort(@lexlist); push @srhslist, sort(@phrlist); for (my $i = 0; $i < @srhslist; $i++) { $currule{$srhslist[$i]} = $i; $dupes = 1 if (keys %currule < @rhslist); } } $constraints[$curctr] = Set::Scalar->new; $count = @rhslist - 1; for (my $i = 0; $i < $count; $i++) { for (my $j = $i + 1; $j <= $count; $j++) { if ("$rhslist[$i]" ne "$rhslist[$j]") { if ($constraints[$curctr]-> has("$rhslist[$j] < $rhslist[$i]")) { push @badlist, ("$rhslist[$j] < $rhslist[$i]"); } else { $constraints[$curctr]-> insert("$rhslist[$i] < $rhslist[$j]"); } } } } $constraints[$curctr]->delete(@badlist); @badlist = (); $curctr++; } }