collected by Markus Dickinson and Detmar Meurers (OSU), February 2002
funding for this project provided by OSU College of Humanities Seed Grant
You can find reference documentation for tools installed at OSU here
You can find a list of our installed corpora here
LT TTT (Text Tokenisation Tool), a text tokenization system from the Language Technology Group
Segmenter segments texts into topical chunks
SATZ, an adaptive sentence boundary detector
MXTERMINATOR by Adwait Ratnaparkhi
CPAN has Text::Sentence (Ave), a module for splitting text into sentences.
Scott Piao's multilingual concordancer has a sentence splitter (I think).
The Illinois Cognitive Computation Group has a sentence splitter
Zhiping Zheng's QA system contains an online sentence segmenter
Lingua-EN-Sentence-0.25 (Shlomo) splits sentences based on regular expressions and lists of abbreviations.
Guenther(?), a sentence segmenter which is to appear, I think (site in German)
Jorg Schuster has a Test Sentencizer site which allows comparison of mxterminator, ave, and shlomo.
Oliver Mason has a tokenizer called QTOKEN
A demo from Xerox Research Centre Europe (XRCE)
WinBrill from Analyse et Traitement Informatique de la Langue Française (ATILF)
ACOPOST, a collection of POS taggers, including a maximum entropy tagger, a trigram tagger, an error-driven TBL tagger, and an example-based tagger.
Decision Tree Tagger, developed by Helmut Schmid
Online interface for TreeTagger found here.
CLAWS POS Tagger (costs). A trial version is available here.
AUTomatic Analysis SYStem (AUTASYS), using the LOB & ICE tagsets
XEROX tagger, available via FTP
TNT Tagger by Thorsten Brants. TnT = "Trigrams 'n Tags"
LT POS, a part-of-speech tagger from the Language Technology Group
Brill Tagger, a transformation-based POS tagger. Site also includes supervised & unsupervised POS taggers & a PP-attachment program. The FTP location is found here
Various demos, including one for the Brill Tagger, can be found at the Centre for Language Engineering Demonstrations
An online tagger for German can be found at the University of Zurich
Maximum Entropy POS Tagger (MXPOST) developed by Adwait Ratnaparkhi. Site also has MXTERMINATOR, a sentence boundary detector
QTAG, a probabilistic tagger roughly based on HMMs.
MuTBL, a transformation-based learning system which can train Brill taggers
fnTBL is a machine learning toolkit for NLP tasks.
MTP (Münster Tagging Project), featuring Xlex, a suite of tools including a tokenizer, segmenter, tagger, index tool, & collocation tool. An online demo of Xlex can be found here.
AMALGAM, Automatic Mapping Among Lexico-Grammatical Annotation, maps tagsets and phrase structure grammar schemes (includes a bibliography on lexico-grammatical annotation models).
In addition to a shallow parser and a sentence splitter, the Cognitive Computation Group at Illinois has a SNoW-based Tagger. SNoW papers available here
VISL has a free upload interface for automatic tagging/parsing of several languages at its website.
Hermit Crab, self-described as a "morphological parser and generator for classical generative phonology and morphology"
POSTTAG for use with Korean texts; a tagger & morphological analyzer. POSTPAR is the syntactic analyzer
Morphy, a morphological tool for German with some statistical POS tagging (site is in German)
Morphix, Günter Neumann's morphological component for inflectional languages
GERTWOL, a system for automatic recognition of German word forms, using two-level morphology
Word Manager is "a system for the acquisition and management of reusable morphological and phrasal dictionaries"
DeKo (Derivations- und Kompositionsmorphologie) analyzes complex words of the German language
John Carroll has some tools for morphological analysis (morpha), generation (morphg), and a/an insertion (ana).
PC-KIMMO is a two-level processor for morphological analysis, available from sil.org. Also available from sil is AMPLE, which breaks words into morphemes.
ALE-RA, an ALE extension with Realizational morphology and Automata Phonology
Project Deutscher Wortschatz at the University of Leipzig (site in German)
Deutsche Malaga-Morphologie (DMM) is a system for the automatic wordform recognition of German.
CISLEX from the University of Munich (site in German)
For Russian: RUSLO, a system for Russian derivational analysis and synthesis (not downloadable)
For Turkish: Turkish Morphological Analyzer is an online analyzer which treats both word formation and inflection; developed by Kemal Oflazer
Krzysztof Szafran's freeware Windows and Linux versions of a morphological analyser for Polish
ChaSen is a morphological analyzer for Japanese
Head-Corner Parser by Gertjan van Noord
LT CHUNK, a syntactic chunk parser from the Language Technology Group
CASS Parser, Steve Abney's robust partial parser
The Apple Pie Parser, from the Proteus Project at NYU, is "a bottom-up probabilistic chart parser which finds the parse tree with the best score by best-first search."
The XIP parser from XRCE
SNoW-based shallow parser from the University of Illinois Cognitive Computation Group
Michael Collins has a parser available via ftp
Eugene Charniak's nlparser is available from Brown University
LoPar, a left corner parser for head-lexicalised probabilistic context-free grammars, developed by Helmut Schmid
tCHUNK, as well as tTAG, is available from Infogistics (this is the demo page)
Walter Daelemans et al. have a Memory-Based Shallow Parser
A list of probabilistic parsers is available from Stanford here
Mark Johnson has PCFG parsers
Corpus Storage Maintenance and Access System (COSMAS), featuring virtual corpus composition, complex query language, concordancing, collocation analysis, etc.
ATLAS (Architecture and Tools for Linguistic Analysis Systems), self-described as "a generalized model, suitable for annotating signals of essentially arbitrary dimensionality with annotations having essentially arbitrary structure"
VIEW is a nice interface for searching the BNC
Corpuseye offers different searching techniques on different types of corpora and different languages.
NEGRA, an annotation tool
Test Suites for Natural Language Processing (TSNLP), an annotation scheme for use on test suites in German, French, & English
VERBMOBIL, some general annotation tools
TIGERSearch, a specialized search engine for syntactically annotated corpora
The trees in TIGERSearch are drawn as SVG (Scalable Vector Graphics) and rendered with Batik
Transcriber, a tool for segmenting, labeling and transcribing speech from the Linguistic Data Consortium (LDC)
INTEX has multiple uses, including parsing & tagging
Xlex has a variety of tools
Alembic Workbench includes customizable tagsets & evaluation tools to analyze annotated data
The Callisto annotation tool supports "linguistic annotation of textual sources for any Unicode-supported language."
WordFreak is an annotation tool for manual and automatic annotation, as well as human correction.
ACE (Automatic Content Extraction) annotation tools support multiple annotation layers.
MMAX Annotation Tool (Multi-Modal Annotation in XML) supports stand-off annotation, among other things.
NXT (NITE XML) supports linguistic annotation for highly structured or cross-annotated data.
PALinkA (Perspicuous and Adjustable Links Annotator) has been used to annotate texts for anaphora resolution, centering, summarization, and so on.
Corpus Workbench (CWB) supports extraction and search for data-driven approaches. It uses the Corpus Query Processor (CQP).
SMES, Günter Neumann's information extraction system (with chunker & morphological analyzer)
Connexor has various annotation tools and some online demos of annotating sentences in various languages
As part of the BulTreeBank, the CLaRK system is an XML-based software system for corpora development.
AGTK Annotation Graph ToolKit
TGrep, for searching through the Penn Treebank, is downloadable here. Information on using tgrep is available here.
GSearch, a search tool which uses syntactic criteria, even if the corpus is not syntactically marked up.
LingPipe does named entity recognition, as well as other processing
GATE (General Architecture for Text Engineering) offers a lot of text processing tools
The TALP research center has various analyzers for Spanish and has recently released FreeLing, an open-source C++ library providing language analysis services
LT XML, the Language Technology Group's integrated set of XML tools and a developers' tool-kit
MATE (Multilevel Annotation, Tools Engineering) addresses creating, acquiring, & maintaining large corpora
HTML Tidy tool can convert HTML to XML, among other things.
The Mannheim corpus, including links to COSMAS (COrpus Storage Maintenance and Access System), which provides links to corpora at IDS. A listing of the Mannheim corpus can be found here. (All sites in German)
International Corpus of English (ICE), a World Englishes corpus with syntactic annotation -- uses the tool ICECUP (costs)
UCREL (Lancaster) has a decent list of corpora
Linguistic Data Consortium (LDC) contains various corpora, e.g. Portuguese newspapers & Chinese Audio Treebank
(Ohio State has LDC membership for the years 1995, 1999, 2000, and 2001.)
ELRA, a listing of different corpora
Project Gutenberg for English texts. You can buy it here on CD.
And Project Gutenberg-DE is the German version
ECI (European Corpus Initiative) Multilingual Corpus including Frankfurter Rundschau and Donaukurier
ICAME has corpora available, as well as online journals
Doug Biber and Mark Davies are working on tagging a Spanish corpus. See details here.
The EMILLE corpus, containing monolingual written corpus data for 14 South Asian languages
The Lancaster Corpus of Mandarin Chinese, which is part-of-speech tagged and available free of charge
NEGRA, a syntactically annotated corpus of German newspaper texts
VERBMOBIL, a corpus among other things; this is the overview page.
TIGER Project, Linguistic Interpretation of a German Corpus, which will comprise about 50,000 sentences annotated using LFG
TUSNELDA, the Tübingen collection of reusable, empirical, linguistic data structures
Penn Treebank Project, a bank of trees, with part of speech tags, among other annotations
DEREKO (Mannheim page -- acquisition) (Tuebingen page -- annotation) (Stuttgart page -- exploitation) provides annotated German corpora
PARC has a dependency bank of 700 sentences available here.
Some searching can be done in the BNC (British National Corpus) online page. Also try BNC World Edition
LDC Online has some stuff available (for members -- see membership years above)
COBUILD has concordance & collocation samplers online, as part of the Wordbanks Online project
I believe the Mannheim corpus has portions available online
Parts of the Gutenberg project are available online.
The Oxford Text Archive (OTA) has various texts
Michael Barlow has a very nice page here, devoted to many facets of corpus linguistics
David Lee has a very extensive site devoted to corpora and corpus resources.
SFB441 has a listing of software for corpus linguistic research
Annotation: a site by Steven Bird which lists all sorts of tools for linguistic annotation. Many of them are speech-based.
Penn Tools is a listing of corpora and tools available at UPenn
TIGER lists several useful links for Treebank projects
Frequency lists of words found in the BNC can be found here
ICAME has a bibliography online, as well as in searchable form.
EAGLES (Expert Advisory Group on Language Engineering Standards) provides recommendations on corpus typology.
W3C Corpus Linguistics Page at the University of Essex
Our corpus directories (found under /home/corpora) are named after the 2-letter language codes (ISO 639) listed at The XML Cover Pages
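As a rough illustration of that naming scheme, the sketch below maps a language code to a directory under /home/corpora. It assumes one subdirectory per lowercase two-letter code (e.g. /home/corpora/de), which is only a guess at the actual layout; the function names and the small code table are hypothetical and cover just a handful of ISO 639 codes.

```python
from pathlib import Path

# Root taken from the text above; the one-subdirectory-per-code layout is an assumption.
CORPUS_ROOT = Path("/home/corpora")

# Illustrative subset of ISO 639 two-letter codes, not the full standard.
ISO_639_1 = {
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "ja": "Japanese",
    "ko": "Korean",
    "pl": "Polish",
    "pt": "Portuguese",
    "ru": "Russian",
    "tr": "Turkish",
    "zh": "Chinese",
}


def corpus_dir(code, root=CORPUS_ROOT):
    """Return the (assumed) directory for a two-letter language code."""
    code = code.lower()
    if code not in ISO_639_1:
        raise KeyError(f"unknown language code: {code!r}")
    return root / code


def list_language_dirs(root=CORPUS_ROOT):
    """Map each known language code that has a directory under root to its language name."""
    if not root.is_dir():
        return {}
    return {
        p.name: ISO_639_1[p.name]
        for p in sorted(root.iterdir())
        if p.is_dir() and p.name in ISO_639_1
    }
```

For example, `corpus_dir("de")` yields the path /home/corpora/de, and `list_language_dirs()` reports which of the listed languages actually have a directory on the local system.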