CQP Documentation Addenda & New Features of the Beta Version
============================================================

IMS Corpus Workbench v2.2 beta 17

GOOD NEWS: This beta release fixes a number of serious bugs concerning
label references and targets. If you've always wanted to use labels or
get correct results from (one-dimensional) frequency distributions over
targets, now's a good time to start. :o)

BAD NEWS: This version is about 10% slower than the v2.2 release (more
for complex queries with many labels). It's the price we have to pay for
correct treatment of labels etc.

========================================================================
COMMAND-LINE FLAGS
========================================================================

You can get a full list of available command line flags with

  > cqp -h

There are two new switches. If you're running CQP in a terminal session,
it is a good idea to start it with

  > cqp -e

The other is "-c", which is useful for running CQP from a Perl script or
another controlling application. In a Perl script, use IPC::Open2 to run
"cqp -c" in the background. Then you can send CQP commands to the
background process and read information up to a blank line (which
indicates completion of a single CQP command).
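
The same send-a-command, read-up-to-a-blank-line loop can be written in
other scripting languages. Below is a minimal Python sketch, not a tested
client: it assumes that a "cqp" binary is on your PATH and that every
command is answered by output ending in a blank line, as described above.
The helper names read_reply() and run_commands() are made up for this
illustration, and anything CQP may print before the first command is not
handled.

```python
import subprocess
from typing import Iterable, List, TextIO


def read_reply(stream: TextIO) -> List[str]:
    """Collect lines up to (not including) the first blank line or end of stream."""
    lines = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line marks the end of one command's output.
            break
        lines.append(line)
    return lines


def run_commands(commands: Iterable[str], cqp_binary: str = "cqp") -> List[List[str]]:
    """Start `cqp -c` (assumed to be installed) and return one reply per command."""
    proc = subprocess.Popen(
        [cqp_binary, "-c"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered, so each command is sent promptly
    )
    replies = []
    try:
        for command in commands:
            proc.stdin.write(command + "\n")
            proc.stdin.flush()
            replies.append(read_reply(proc.stdout))
    finally:
        proc.stdin.close()
        proc.wait()
    return replies
```

A call such as run_commands(['"comput.*";']) would then return a list
with one entry holding the output lines of that query, provided a corpus
has been activated first.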
========================================================================
OPTION SETTINGS
========================================================================

The following options ("variables") are currently available:

Variable settings:
  DefaultCorpus         [dc]:
  Registry              [r]:    /home/users3/evert/registry:/corpora/c1/registry
  LocalCorpusDirectory  [lcd]:
  Pager                 [pg]:   more
  PrintMode             [pm]:
  PrintOptions          [po]:
  PrintStructures       [ps]:
  HardBoundary          [hb]:   100
  AutoSubquery          [sub]:  No
  Paging                [p]:    No
  WriteHistory          [wh]:   No
  LeftContext           [lc]    (see below)
  RightContext          [rc]    (see below)
  Context               [c]:    Left: 25 characters, Right: 25 characters
  Highlighting          [h]:    No
  AutoShow              [as]:   Yes
  Timing:                       No
  AutoSave:                     No
  SaveOnExit:                   No
  UserLevel:                    0
  LeftKWICDelim         [ld]:   <
  RightKWICDelim        [rd]:   >
  DefaultNonbrackAttr   [da]:   word
  HistoryFile:

---

Option names are case-insensitive, i.e. all of

  set PrintOptions ...;
  set printoptions ...;
  set PRINTOPTIONS ...;

will work. For most of the options, there is an abbreviated form, which
is shown in square brackets above. Thus

  set PrintOptions "html";
  set po "html";

are equivalent.

String values should be enclosed in double quotes ("). Personal
preferences can be set in the file ~/.cqprc, which CQP reads on startup.

---

Some useful options:

-- set pager "more";

   "less" (which is the default pager on many systems) does not display
   highlighted text; if you get messy output, use this option.

-- set printmode ( ascii | html | latex | sgml );

   Switch between ASCII (highlighted; this is the default setting), HTML,
   LaTeX and SGML output.

     ASCII .. for terminal output
     HTML  .. for writing CGI scripts
     LATEX .. for inclusion in LaTeX documents (NB this printmode doesn't
              convert non-ascii characters, so you'll need to specify
              "\usepackage[latin1]{inputenc}" in your LaTeX document)
     SGML  .. for post-processing (experimental)

-- set PrintOptions <options>;

   <options> is a comma-separated list of one or more of
   {wrap, table, trailer, header, border, number} and should be enclosed
   in double quotes.
   Most of these options are used for HTML and LaTeX output. Recommended
   settings are

     set printmode html;
     set printoptions "table, wrap";

   and

     set printmode latex;
     set printoptions "header, number";

   Prepend "no" to any option to reset that option. For instance

     set po "header";

   prints a query header on top of the query results. Use

     set po "noheader";

   to switch off header output. There is currently no way of resetting
   all options to default values.

-- set PrintStructures <attributes>;

   <attributes> is a list of structural attributes with annotated values
   (created using "encode -V") which will be printed with query results.
   You can find out which attributes of a corpus have values by typing

     show cd;

   ("Structure Values"). Please note that setting the "PrintStructures"
   option will destroy the list of annotated structural attributes, so
   "show cd;" won't be useful after that. You must always specify the
   full list of attributes to print, i.e.

     set ps "chapter";
     set ps "section";

   will only display the current section heading, whereas

     set ps "chapter, section";

   will display both.

-- set AutoSubquery ( on | off );

   Automatically subquery the result of the last query; same as setting
   "Last" as the active corpus.

-- set Paging ( on | off );

   Activate / deactivate pager output. This is mostly useful when running
   from an EMACS shell or so.

-- set AutoShow ( on | off );

   By default, query results are automatically shown.

     set as off;

   will only show the number of matches and remind you to use the "cat"
   command for printing the actual results.

-- set timing ( on | off );

   Set this to "on" for query benchmarking.

-- set defaultnonbrackattr <att>;

   Where <att> is a positional attribute. CQP supports a shorthand
   notation for simple queries where

     "comput.*";

   is equivalent to

     [word = "comput.*"];

   You can use this shorthand for other attributes by setting the
   "DefaultNonbrackAttr" option. For example,

     set da lemma;

   runs simple queries on the "lemma" attribute.
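
To make the shorthand equivalence concrete, here is a small Python
sketch that rewrites a bare quoted pattern into the bracketed form it
stands for under a given DefaultNonbrackAttr setting. It only models the
equivalence stated above; it is not CQP's parser, the function name
expand_shorthand() is invented for this illustration, and a single
double-quoted pattern is assumed as input.

```python
def expand_shorthand(pattern: str, default_attr: str = "word") -> str:
    """Rewrite a bare quoted pattern as the bracketed query it abbreviates.

    default_attr plays the role of the DefaultNonbrackAttr option.
    Only the outer double quotes are checked; this is not a query parser.
    """
    stripped = pattern.strip().rstrip(";").strip()
    if not (len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"')):
        raise ValueError("expected a single double-quoted pattern")
    return f"[{default_attr} = {stripped}];"


# With the default setting the pattern is matched against "word" ...
print(expand_shorthand('"comput.*";'))
# ... and after "set da lemma;" it would be matched against "lemma".
print(expand_shorthand('"comput.*";', default_attr="lemma"))
```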
========================================================================
Query Output
========================================================================

The display format of query results is mostly controlled by the "show"
command. Type "show cd;" to display the current settings:

  MLCC-EN> show cd;
  ===Context Descriptor========================
  left context:     25 characters
  right context:    25 characters
  corpus position:  shown

  Positional Attributes:    * word
                              pos
                              lemma
  Structural Attributes:      sitting
                              s
  Structure Values:
  Aligned Corpora:            mlcc-de
                              mlcc-fr
                              mlcc-es
  =============================================

Use

  show +<attr>;

to display additional attributes and

  show -<attr>;

to hide them. You can specify multiple attributes in a single show
command, separated by blanks. In the example above,

  show -word +pos +lemma +s -cpos;

will display both part_of_speech and lemma, plus sentence boundaries.
The original word forms and the corpus positions are not shown. Active
attributes are preceded by an asterisk in the Context Descriptor:

  MLCC-EN> show cd;
  ===Context Descriptor========================
  left context:     25 characters
  right context:    25 characters
  corpus position:  not shown

  Positional Attributes:      word
                            * pos
                            * lemma
  Structural Attributes:      sitting
                            * s
  Structure Values:
  Aligned Corpora:            mlcc-de
                              mlcc-fr
                              mlcc-es
  =============================================

Although the left and right context appear in the context descriptor,
they must be set using "set leftcontext" and "set rightcontext". For
convenience, "set context" sets both sides to the same value. The
context setting options take a special syntax.

  set context <n>;         ... sets context to <n> characters
  set context <n> word;    ... sets context to <n> tokens
  set context <n> <att>;   ... sets context to <n> <att> regions (inclusive).

As you may have guessed now,

  set context 1 s;

will usually display entire sentences.

---

The "show" command can also be used for alignment attributes.

  show +mlcc-de;

will display for each match of a query the corresponding region in an
aligned corpus (here MLCC-DE).

========================================================================
Variables
========================================================================

Contrary to what you might have expected, CQP variables are word lists
against which attributes can be matched in queries. Use the "define"
command to generate such lists:

-- define $var = <list>;

   Set the variable $var to a _blank_-separated list enclosed in double
   quotes (").

-- define $var += <list>;
-- define $var -= <list>;

   Add / remove entries to / from the list.

-- define $var < "file";

   Read word list from file. Each line in the file will produce one list
   entry.

-- show $var;

   Display the contents of $var.

You can use variables like strings in queries, for instance to match any
of a list of prepositions stored in /usr/local/share/cqp/prepositions.lst
you would use the following commands:

  define $prep < "/usr/local/share/cqp/prepositions.lst";
  ... [word = $prep] ...;

You can add further prepositions at run-time by typing

  define $prep += "beside";

for instance, but you cannot save the modified list to a file.

========================================================================
Targets and Frequency Distributions
========================================================================

Basically, each query returns a list of matching ranges, i.e. a
subcorpus. In addition, three single positions can be marked for each
match in a subcorpus. They are

  match     leftmost position of matching range
  target    set in query with "@"
  keyword   (see below)

and will be collectively referred to as <anchor> from now on. Please
note that 'collocate' is an obsolete and deprecated alias of 'target'.
If highlighting is active, marked positions (except match) will usually
be printed in bold face.

-- "the" @ [pos = "ADJ"] [pos = "N"];

   Will return three-word matches of a certain type of NP and mark the
   adjective in the middle as target.
   In a terminal, the target will be in bold face.

-- set <corpus> (keyword|target) (nearest|leftmost|rightmost) [ ... ]
       within (left|right) <n> (words|<att>)
       from <anchor> (exclusive|inclusive);

   Set keyword / target in a subcorpus. [ ... ] is a CQP expression
   matching a single corpus position. The syntax should be
   self-explanatory. The following example marks a preposition
   immediately to the left of the NP matched by the previous query as
   keyword:

     set Last keyword nearest [pos="PREP"]
         within left 1 words from match exclusive;

   NB 'set target' will overwrite previous targets set by the '@'
   operator!

-- delete <corpus> without <anchor>;

   Deletes all matches where the given <anchor> is not marked. For
   instance,

     delete Last without keyword;

   reduces the previous result to NPs actually preceded by a preposition.

-- subset <corpus> where <anchor>: [ ... ];

   This allows finer control of which matches to delete. In the previous
   example, we only want to consider some prepositions:

     subset Last where keyword: "in|on|of|to|from";

   [ may not work correctly; don't rely on this function ]

Marked positions can be used to create frequency distributions:

-- group <corpus> <anchor1> <att1> [ by <anchor2> <att2> ];

   Generate a frequency table for the values of <att1> at <anchor1>.
   Optionally, create separate tables for each distinct value of <att2>
   at <anchor2>.
   In the above example,

     group Last target lemma by keyword word;

   would create the following output:

#---------------------------------------------------------------------
of          new             65
#---------------------------------------------------------------------
in          near            57
            past            54
#---------------------------------------------------------------------
to          new             49
#---------------------------------------------------------------------
in          current         42
#---------------------------------------------------------------------
from        previous        34
#---------------------------------------------------------------------
in          latest          31
#---------------------------------------------------------------------
to          current         30
#---------------------------------------------------------------------
in          new             26
#---------------------------------------------------------------------
of          current         26
#---------------------------------------------------------------------
to          additional      25
#---------------------------------------------------------------------
on          open            25
#---------------------------------------------------------------------
...

   If you think this doesn't make sense, send eMail to oli@trados.com,
   or wait for the next major revision of CQP. :-)

=====================================================================
Incremental queries
=====================================================================

One way of going about corpus queries is to first create a subcorpus of
potentially interesting regions (usually sentences), and then refine the
search by successively reducing the set of sentences. This can be done
with the "!" flag. For example start with

  A = "[Ii]diosyncras.*" expand to s;

to find all sentences that mention idiosyncrasies, and then split them
into smaller sets of sentences which also mention computers, people, ...

  A;
  A_COMP = [lemma = "computer"] !;
  A_PEOPLE = [word = $first_names] [word = $last_names] !;
  ...
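
If the effect of the "!" flag is hard to picture, the following toy
model in Python may help. It treats the activated subcorpus as a list of
inclusive (start, end) ranges and the subquery hits as another such
list, then shows what comes back with and without the flag. This is only
an illustration of the behaviour described in this section, with made-up
corpus positions; it says nothing about how CQP implements it, and the
function name subquery() is an assumption of this sketch.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) corpus positions


def subquery(regions: List[Range], hits: List[Range], keep_regions: bool) -> List[Range]:
    """Toy model of a subquery on an activated subcorpus.

    regions:      ranges of the activated subcorpus (e.g. sentences)
    hits:         ranges matched by the subquery pattern
    keep_regions: True models the "!" flag, False a plain subquery
    """
    # Only hits that lie completely inside some region count.
    inside = [h for h in hits if any(s <= h[0] and h[1] <= e for s, e in regions)]
    if not keep_regions:
        # Plain subquery: return just the matched ranges.
        return inside
    # With "!": return every region that contains at least one hit.
    return [(s, e) for s, e in regions if any(s <= h[0] and h[1] <= e for h in inside)]


sentences = [(0, 9), (10, 19), (20, 29)]      # hypothetical sentence ranges
computer_hits = [(3, 3), (25, 26), (40, 40)]  # hypothetical match ranges

print(subquery(sentences, computer_hits, keep_regions=True))
print(subquery(sentences, computer_hits, keep_regions=False))
```

With the flag, the result is again a set of sentences, so it can be
activated and narrowed down further in the same way.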

Note that a general query takes the following form:

  [ :: ] [ within (words|) ] [ : [...] ] [ cut ] [ expand [left|right] to ] [ ! ] ;

If you don't put the '!' flag at the end, a subquery just returns the
regions it matches, i.e.

  A;
  A_COMP = [lemma = "computer"];

returns all occurrences of the word 'computer' in sentences that mention
idiosyncrasies.

Remember the 'marked' positions in a query result? You can use them as
so-called anchors in your subqueries. Anchors have a special syntax,
which you should regard as an idiosyncrasy of CQP's grammar :o)

  ... [<anchor>.word = ".*"] ...

This position of the matches will be aligned with the corresponding
anchor in the query corpus. Up to 4 anchors are available:

  match      leftmost position of matching range
  matchend   rightmost position of matching range
  target     target (if set in query corpus)
  keyword    keyword (if set in query corpus)

Subqueries can be processed very efficiently if the first item of the
query expression is an anchor. For example, to find idiosyncrasies of
computers and people, try the following:

  A = "[Ii]diosyncras.*" expand to s;
  A;
  A_COMP = [match.word = ".*"] "of" [lemma = "computer"] []*
           [matchend.word = ".*"];
  A_PEOPLE = [match.word = ".*"] "of" [word = $first_names]
             [word = $last_names];
  ...

A_COMP is expanded to the full query corpus range (in this example, this
is the same effect as '!' would have produced), while A_PEOPLE isn't.

=====================================================================
Flags in RegExp Matches
=====================================================================

Character-level RegExp matches can take the following flags:

  %l ... "literal"           -> match with literal string, not regexp
  %c ... "ignore case"       -> ignore case in regexp match
  %d ... "ignore diacritics" -> ignore diacritics (mainly European
                                languages) in regexp match
                                (ISO-Latin-1 character set only!)

For example, to find periods in a corpus, type

  "." %l;

which is equivalent to

  "\.";

To find capitalised word forms (in sentence initial position) as well as
the regular spelling, use "%c":

  [word = "on" %c];

The "%d" flag is useful for some European languages which contain
characters not available on your keyboard, or where electronic texts may
sometimes omit diacritics.

  [lemma = "a" %d] [lemma = "bientot" %d];

=====================================================================
Using Structural Attributes with Values
=====================================================================

If your document contains regions with parameter values such as

  <entry keyword="cat">
  <s> It's raining cats and dogs. </s>
  <s> A cat has nine lives. </s>
  ...
  </entry>
  <entry keyword="dog">
  <s> It's raining cats and dogs. </s>
  <s> A dog's life. </s>
  </entry>

and you have encoded the corresponding structural attribute using the -V
switch -- for the example above that might be:

  ... | encode -S s -V entry

-- you can now specify conditions for the parameter values in your
queries. For reasons of backward compatibility, this has to be done
through a label reference in the global constraints field. Please note
that the parameter values of a region are stored as a single string in
the encoded corpus, i.e. in the example we would have

  <entry keyword="cat">    val: `keyword="cat"'
  <entry keyword="dog">    val: `keyword="dog"'

You have to use RegExps to "parse" those strings for the information you
want. For example, the query

  [lemma = "cat"] "and" [lemma = "dog"];

would return 2 matches in the sample corpus. If you only want the match
from the "dog" entry, type

  a:[lemma = "cat"] "and" [lemma = "dog"] :: a.entry = "keyword=\"dog\"";

Note how the last RegExp "parses" the entire parameter value string. If
there is more than one parameter, you must allow for the extra material:

  a:[lemma = "cat"] "and" [lemma = "dog"]
    :: a.entry = ".*keyword=\"dog\".*";

It is a good idea to reformat the region start tags to allow simpler
access, splitting them into multiple, parallel regions if necessary:

  -->    val: ---
  -->    val: `4432'
  -->    val: `dog'
  -->    val: `full'
  ...

which are much easier to work with and can be compared to other
attributes. For instance, this would enable you to find animals from a
given list which appear in another animal's entry:

  define $animals = "cat dog mouse ...";
  a: [word = $animals] :: (a.entry_keyword = $animals)
                          & (a.entry_keyword != a.lemma);

=====================================================================
Experimental Features
=====================================================================

It is possible to test for region boundaries in queries by using an SGML
tag notation:

  <s> [pos = "ADV"];

finds adverbs at the beginning of a sentence. There is another way of
doing this using the builtin functions lbound() and rbound(), which is
kept for compatibility reasons:

  [pos = "ADV" & lbound(s)];

This syntax has been extended so that "bare" references to structural
attributes (both ordinary structural attributes and ones with values)
evaluate to True within any region of that attribute. This is useful for
attributes which are defined only for part of the corpus, such as
headers. If you want to constrain your search to headers, but also to
match only "within s", the only way of doing this is:

  ... [ ... & header ] ... within s;

---

Finally, there is a new builtin function unify() which allows a crude
treatment of agreement phenomena, such as in the following query which
finds German noun phrases whose case is unambiguously genitive.

  a:[pos = "ART"] b:[pos = "ADJA"] c:[pos = "NN"]
    :: unify(a.agr, unify(b.agr, c.agr)) = "-(Gen:.:..-)+";

This is only useful for morphologically rich languages, and it requires
a morphology program to generate the necessary corpus annotations. If
you're interested in using this experimental feature, please contact
evert@ims.uni-stuttgart.de for details.

=====================================================================
17 Sep 1999, Stefan Evert.