| The following table briefly describes and points to some machine-readable linguistic resources available on the net: |
|
| Resource |
|
Description |
|
Notes |
 |
|
WordNet 2.0 |
|
A general-purpose, comprehensive computational lexicon of English whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets. |
|
Coverage of about 80,000 nouns, 18,500 adjectives, 13,000 verbs and 4,600 adverbs.
Mappings to older versions (1.5, 1.6, 1.7.1) are available.
APIs in different languages (Perl, Java, etc.) are available.
|
 |
SemCor |
|
A semantically tagged subset of the Brown Corpus. Each word in a text is tagged with a synset in the WordNet inventory of senses. |
|
Domain documents (sports, novels, etc.).
Small-size corpus (200,000 content words).
|
 |
MultiWordNet |
|
A multilingual lexical database in which the Italian WordNet is strictly aligned with Princeton WordNet 1.6. The Italian synsets are created in correspondence with the Princeton WordNet synsets, whenever possible, and semantic relations are imported from the corresponding English synsets. |
|
44,400 Italian lemmas organized into 35,400 synsets (compared with about 95,000 synsets in the English WordNet).
|
 |
MultiSemCor |
|
An English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. |
|
- |
 |
Domain Labels |
|
Domain labels for WordNet synsets. |
|
Useful for text categorization, WSD or other tasks requiring domain knowledge. |
 |
CoreLex |
|
An ontology of 126 semantic types, covering around 40,000 WordNet nouns and defining a large number of systematic polysemous classes that are derived by a careful analysis of sense distributions in WordNet. |
|
Interesting for top-level (or meta-level) classification of concepts.
|
 |
FrameNet II |
|
A lexical resource that describes frames or underlying conceptual structures extracted from the British National Corpus, a large corpus of contemporary English. |
|
376 frames, and more than 6,800 lexical units (lemmata).
Low coverage of verb senses (at most two or three different frames for any given verb).
Sentences annotated with frame roles.
No mapping to WordNet is provided.
|
 |
VerbNet |
|
A verb lexicon with syntactic and semantic information for English verbs, using Levin verb classes to systematically construct lexical entries. For each syntactic frame in a verb class, there is a set of semantic predicates associated with it. |
|
Java API provided.
About 200 "frames" are provided. Roles and semantic restrictions are encoded.
|
 |
PropBank |
|
A corpus of text (the Penn TreeBank) annotated with the predicate-argument structure of the verbs. |
|
Ongoing work. The resource is available only online. |
 |
Verb Index |
|
Combines the information from the VerbNet, PropBank, and FrameNet projects. |
|
Available only online and not as a stand-alone resource. |
 |
Lexical FreeNet 2.0 |
|
A combination of a thesaurus, rhyming dictionary, pun generator, and concept navigator. |
|
Trigger words from the TTK (Trigger Toolkit) 1.0 project.
Instances from the Biographical Dictionary.
|
 |
LDC DSO Corpus 1.5 |
|
A corpus where only a (focus) word of each sentence is tagged with respect to the WordNet 1.5 sense inventory. |
|
Tagged words are either nouns or verbs taken from the Brown Corpus and the Wall Street Journal.
Tags for only 190 different content words (occurring in 192,800 sentences) are provided.
|
 |
OpenMind Commonsense (OMCS) |
|
A generic commonsense corpus where only a (focus) word of each sentence is tagged with respect to the WordNet sense inventory. |
|
Similar to the LDC DSO Corpus. |
 |
ConceptNet |
|
A semantic network consisting of over 250,000 elements of common sense knowledge. An initiative related to the OpenMind project. |
|
Encodes semantic relations like propertyOf, capableOf, subEventOf, locationOf, is-a, partOf, etc. |
 |
OpenCyc |
|
The open source version of the Cyc technology, a large general knowledge base and commonsense reasoning engine. |
|
Unclear consistency of the resource.
6,000 concepts, 60,000 assertions.
OWL files of older versions of the ontology are provided.
|
 |
dictionary.com |
|
Provides word definitions from many dictionaries (Webster, WordNet, American Heritage) as well as from domain glossaries. |
|
- |
 |
ThoughtTreasure |
|
A commonsense knowledge base and architecture for natural language processing that uses multiple representations including logic, finite automata, grids, and scripts. |
|
35,000 English words and phrases, 21,000 French words and phrases, 27,000 concepts, and 51,000 commonsense assertions about those concepts.
Unclear consistency of the resource.
Commonsense assertions encoded.
|
 |
Roget's Thesaurus (1913 edition) |
|
A popular thesaurus classifying words into 1,000 topic domains. The listings overlap and the categories are open rather than exhaustive. |
|
No mapping to WordNet is provided.
Some hierarchical taxonomy is encoded.
|
 |
Stanford LKB |
|
A grammar and lexicon development environment for use with constraint-based linguistic formalisms. |
|
- |
 |
Senseval |
|
Word Sense Disambiguation competitions providing training and test sets for different WSD tasks (lexical sample, all words, gloss WSD, semantic role tagging, etc.). |
|
For comparisons with other systems. |
 |
SUMO |
|
The Suggested Upper Merged Ontology is a proposal for a standard upper ontology to promote data interoperability, information search and retrieval, automated inferencing, and natural language processing. |
|
OWL encoding of the ontology.
Mappings to WordNet 1.5 and 1.6 are provided.
A mid-level ontology (MILO) is also provided.
|
 |
DMOZ |
|
A taxonomical classification of resources on the web. |
|
Complex and interesting taxonomy. |
 |
Legend: Downloadable resource | Corpus (downloadable or online) | Online resource |
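
Several resources in the table (WordNet, MultiWordNet, SemCor, Domain Labels) are built on synonym sets linked by semantic relations. As a rough illustration of that organization, the following Python sketch models a handful of synsets and follows the hypernym relation upward. It does not use the API or file format of any listed resource: the `Synset` class, the helper functions, the identifiers `n1` to `n4`, and the toy entries are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A synonym set: word forms that share one underlying lexical concept."""
    synset_id: str                      # hypothetical identifier, not a real WordNet offset
    pos: str                            # part of speech: noun, verb, adjective, adverb
    lemmas: List[str]                   # the synonymous word forms
    relations: Dict[str, List[str]] = field(default_factory=dict)  # relation name -> target ids


def build_lemma_index(synsets: List[Synset]) -> Dict[str, List[str]]:
    """Map each lemma to the ids of all synsets (senses) that contain it."""
    index: Dict[str, List[str]] = {}
    for synset in synsets:
        for lemma in synset.lemmas:
            index.setdefault(lemma, []).append(synset.synset_id)
    return index


def hypernym_chain(synset_id: str, by_id: Dict[str, Synset]) -> List[str]:
    """Follow the first 'hypernym' link repeatedly, returning the ids visited."""
    chain: List[str] = []
    seen = {synset_id}
    current = by_id[synset_id]
    while current.relations.get("hypernym"):
        parent_id = current.relations["hypernym"][0]
        if parent_id in seen:           # guard against cycles in malformed data
            break
        seen.add(parent_id)
        chain.append(parent_id)
        current = by_id[parent_id]
    return chain


# Toy data, invented for illustration only.
TOY_SYNSETS = [
    Synset("n1", "noun", ["entity"]),
    Synset("n2", "noun", ["animal", "beast"], {"hypernym": ["n1"]}),
    Synset("n3", "noun", ["dog", "domestic dog"], {"hypernym": ["n2"]}),
    Synset("n4", "noun", ["dog", "cad"], {"hypernym": ["n1"]}),
]
BY_ID = {s.synset_id: s for s in TOY_SYNSETS}
LEMMA_INDEX = build_lemma_index(TOY_SYNSETS)

if __name__ == "__main__":
    print(LEMMA_INDEX["dog"])           # the two toy senses of "dog"
    print(hypernym_chain("n3", BY_ID))  # path from the animal sense up to the root
```

In this sketch a polysemous word such as "dog" simply appears in more than one synset, which is the property that sense-tagged corpora like SemCor rely on when they attach one synset to each word occurrence.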