The following resources are available to students on the Ling 884
course for investigating the WSD problem. If you have a question about
resources specific to your project, please ask. Not all resources are
local, and you will need to acknowledge sources (such as converted
SemCor) in any written work.
- Processed Senseval-3 English all words task data: this has "s" tags added around sentences, lemmas added from the top 300 parses by RASP (a cutoff was needed to prevent crashing), and tags added from Elworthy's tagger (CLAWS-II tagset). Aligning the corpora, tags and parses was a lot more difficult than expected, so let me know if there are any obvious errors. The relevant parses are also available.
- Senseval-3 English all words task test data (annotated with WN 1.7.1), including a key file.
- Senseval-3 English lexical sample task training and test data (annotated with WN 1.7.1) including a key file.
- Senseval-3 official scoring software.
- Senseval-4 / Semeval-1 coarse grained English all words task test data (annotated with WN 2.1), and key file.
- WordNet format README (html), and WordNet conversion README (text).
- Automatically created conversion between WordNet 3.0 and WordNet 2.1.
- Automatically created conversion between WordNet 3.0 and WordNet 1.7.1.
- Semcor 3.0, Semcor 2.1, Semcor 1.7.1 and Semcor 1.7 conversions created by Rada Mihalcea.
- Semcor 1.7.1 parsed with RASP: 100 parses maximum, GR output only.
- Open Mind Word Expert sense tagged data.
- A web corpus annotated with WordNet 1.6 normal senses is available.
- Also available: the RASP parser, which includes a tagger, can be downloaded (free for research).
- Also available: WordNet files for WN 1.7.1 (a mapping from 1.6 to 1.7.1 is available from Princeton for mapping Semcor), WN 2.1 and WN 3.0.
- Also available: WordNet glosses corpus (for any version of WordNet); email if you want this (annotated definitions for WN 1.7.1 are available here).
- Also available: RASP parsed version of the BNC (parsed with multiple tags per word and an unlimited number of parses, in both tree and GR modes, so everything is extractable); email if you want this.
- Also available: Thesaurus built on Lin's principles from the BNC; email if you want this.