Penn Treebank

Brief Description: The Penn Treebank is a human-annotated corpus of over 4.5 million words of American English, part of which has been "skeletally" parsed. It includes the Brown Corpus (retagged) and the Wall Street Journal Corpus, as well as Department of Energy abstracts, Dow Jones Newswire stories, Department of Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual sentences, WBUR radio transcripts, and ATIS sentences.

The directory /home/treebank/ contains the documentation, the tools, and the fully annotated versions of several corpora, including the Wall Street Journal Corpus. The file /home/treebank/tgrepabl/wsj_mrg.crp holds the tgrepable version of the WSJ corpus. For general information about the treebank, look first at: /home/treebank/doc/README.doc

The main gateway to the treebank on the ling network is the tgrep tool, which is essentially grep for trees. For information on how to use tgrep, see the documentation below.

The Penn Treebank is available on CD-ROM to members of the Linguistic Data Consortium, of which Ohio State is a member. It is installed on the linguistics department computer network. For information on using the treebank elsewhere at OSU, contact

Platform: Any Sun on the ling network. The recommended server is puck, especially for big searches; please do not tie up julius with treebank searches.

Usage: At a Unix prompt, type: tgrep 'search-string'

For example, to find every PP node in the WSJ corpus that dominates the word "computer" (at any depth, since << means "dominates" while < means "immediately dominates"),
type: tgrep 'PP << computer' /home/treebank/tgrepabl/wsj_mrg.crp
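
To make the meaning of the pattern 'PP << computer' concrete, here is a minimal Python sketch of the dominance test on a single Penn-style bracketed tree. It is only an illustration and not how tgrep is implemented; the function names (parse, dominates, find) and the SAMPLE sentence are made up for this example and do not come from the treebank or its tools.

```python
# Illustrative only: a toy version of the "A << B" (A dominates B) test.
# All names below are hypothetical; this is not tgrep's own code.

def parse(text):
    """Parse one bracketed tree into (label, children) tuples.
    Words (leaves) are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()


def dominates(tree, word):
    """True if `word` occurs anywhere below this node, at any depth."""
    _, children = tree
    for child in children:
        if isinstance(child, str):
            if child == word:
                return True
        elif dominates(child, word):
            return True
    return False


def find(tree, label, word):
    """Collect every subtree labelled `label` that dominates `word`."""
    node_label, children = tree
    hits = []
    if node_label == label and dominates(tree, word):
        hits.append(tree)
    for child in children:
        if not isinstance(child, str):
            hits.extend(find(child, label, word))
    return hits


# Made-up sentence in Penn-style bracketing (not taken from the corpus).
SAMPLE = ("(S (NP (PRP He)) (VP (VBD worked) "
          "(PP (IN on) (NP (DT the) (NN computer)))))")

matches = find(parse(SAMPLE), "PP", "computer")
for label, children in matches:
    print(label, children)
```

In this sample the PP does not immediately dominate "computer" (the word sits under NN, inside NP), which is why the any-depth relation << is the one to use for this kind of search.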