Download
Latest extracts
The latest extracts are available for download in Turtle format. The data is modeled using the ontolex vocabulary. For each language, the data is split across several files.
ll_dbnary_ontolex | The core data as it has been extracted from Wiktionary. This file contains Pages, Entries, Senses, Lexical Relations, Translations, … modeled using ontolex vocabulary. |
ll_dbnary_enhanced | Data computed from the core data. Mainly links that have been computed from Translations to the Sense(s) for which they are valid. |
ll_dbnary_morphology | Extensive morphology, i.e. a set of alternate forms for lexical entries, along with their linguistic annotations. This data is not available for all languages. |
ll_dbnary_lime | The metadata description of each language edition, modeled using ontolex’s LIME submodule. |
ll_dbnary_statistics | Statistics on the extracted data. These are modeled using the datacube vocabulary. |
ll_dbnary_etymology | Etymological data (currently only available for the English language edition). |
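As an illustration of the naming scheme in the table above, here is a minimal sketch that builds the candidate file names for one language edition. It assumes that the `ll` prefix stands for a language code and that the files carry a `.ttl` extension; neither is stated on this page, and the helper `dbnary_file_names` is hypothetical.

```python
# Hypothetical helper: the "ll" prefix is ASSUMED to be a language code and
# ".ttl" is an ASSUMED extension; check the download folder for actual names.
DATASETS = ("ontolex", "enhanced", "morphology", "lime", "statistics", "etymology")


def dbnary_file_names(lang_code: str, extension: str = ".ttl") -> list[str]:
    """Return the candidate file names for one language edition."""
    return [f"{lang_code}_dbnary_{name}{extension}" for name in DATASETS]


if __name__ == "__main__":
    # Print one candidate file name per line for the French edition.
    for file_name in dbnary_file_names("fr"):
        print(file_name)
```

Not every candidate will exist for every language: as noted in the table, morphology is not available for all languages and etymology is currently limited to the English edition.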
Core data
The data is provided as a set of Turtle files (one per language) and may be downloaded here. This link also gives access to all previous versions (in either lemon or ontolex format).
The data uses an extended version of the LEMON vocabulary. The OWL description of the data is available here. A human-readable (HTML+RDFa) description is available here.
The Turtle files are updated each time a Wiktionary dump is made available (roughly once every 10 days for each language). The latest data is available in the folder “latest”, while every extraction version is available under each language's folder.
Disambiguated Translations
At LREC 2014 (in Reykjavik, see the publications section), we presented an experiment in which additional links are provided to disambiguate the source of translations. This experiment produces a set of links from a Translation to a LexicalSense. Note that in the original dataset, translations are linked to lexical entries, and that these new links are established using an imperfect heuristic with state-of-the-art accuracy. This additional dataset is meant to be used in conjunction with the core dataset; it is available alongside the core dataset and is computed in sync with core updates.
Morphology
Since December 2014, morphological data has been extracted from the French and German language editions. This data is currently stored in an exhaustive form, meaning that every inflected form may be found in a lemon:otherForm property.
Non-Wiktionary data
Since March 2016, we also provide data in lemon format that comes from other available datasets. The first such dataset comes from the DILAF project (Dictionaries for African Languages).
DILAF | Bambara | Hausa | Kanuri | Tamashek | Zarma |