Brief Description: The Penn Treebank is a human-annotated and
partially `skeletally' parsed corpus consisting of over 4.5 million
words of American English. It includes the Brown Corpus (retagged)
and the Wall Street Journal Corpus, as well as Department of Energy
abstracts, Dow Jones Newswire stories, Department of Agriculture
bulletins, Library of America texts, MUC-3 messages, IBM Manual
sentences, WBUR radio transcripts, and ATIS sentences.
The directory /home/treebank/ contains the documentation, the
tools, and the fully annotated versions of several corpora, including
the Wall Street Journal Corpus. The tgrepable version of the WSJ
corpus is stored in the file
/home/treebank/tgrepabl/wsj_mrg.crp. For general information
about the treebank, look first at /home/treebank/doc/README.doc
The main gateway to using the treebank on the ling
network is the tgrep tool--which is basically a grep for trees. For
information on how to use tgrep, see the documentation below.
The Penn Treebank is available on CD-ROM to members of
the Linguistic Data Consortium, of which Ohio State is a member. It
is installed on the linguistics department computer network. For
information on using the treebank elsewhere at OSU, contact email@example.com
Platform: Any Sun on the ling network. The recommended server is
puck, especially if you are doing big searches. Please do not tie up
julius with treebank searches.
Usage: at a unix prompt, type: tgrep 'search-string'
For example, to find all instances of the word "computer"
dominated by the tag PP in the WSJ corpus, type:
tgrep 'PP << computer' /home/treebank/tgrepabl/wsj_mrg.crp
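
To make the meaning of the pattern 'PP << computer' concrete, here is a minimal Python sketch of the "dominates" relation on a single bracketed tree. It is an illustration only, not how tgrep is implemented and not a replacement for it. The function names (parse_tree, leaves, find_dominating) and the sample sentence are hypothetical, and the parser assumes every bracket has a label, which is not true of the unlabelled outer bracket found in some treebank files.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.
    Leaves (words) are plain strings. Assumes every bracket is labelled."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1]:
        out.extend(leaves(child))
    return out

def find_dominating(tree, label, word):
    """Yield every node labelled `label` that has `word` somewhere below it."""
    if isinstance(tree, str):
        return
    if tree[0] == label and word in leaves(tree):
        yield tree
    for child in tree[1]:
        yield from find_dominating(child, label, word)

# Hypothetical sample sentence, not taken from the WSJ corpus.
SAMPLE = "(S (NP (PRP He)) (VP (VBD worked) (PP (IN on) (NP (DT the) (NN computer)))))"

matches = list(find_dominating(parse_tree(SAMPLE), "PP", "computer"))
for m in matches:
    print(" ".join(leaves(m)))  # prints: on the computer
```

Only the PP node matches here: the S and VP nodes also dominate "computer", but their labels are not PP.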
- For tgrep information, on the ling network, type: man tgrepdoc
- Shorter documentation for the individual tools is available via
"man tgrep" and "man tprep".