Dataset – Dbnary

The DBnary dataset is registered on the datahub.

The dataset contains extracts from 22 Wiktionary language editions. It also contains a set of additional data computed from the extracted content, which I call enhancements. So far, the main enhancement (and the one you can reasonably count on) is a set of disambiguated translations (see the files named ll_dbnary_enhancement.ttl.bz2, where ll is the language code). These files contain links from translation pairs to the specific word sense(s) for which the translation is valid.
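As a minimal sketch of how this naming convention might be used, the Python below builds the enhancement file name for a language code and streams text lines from a copy that has already been downloaded. The function names, the local directory argument and the example code "fr" are assumptions made for illustration; the sketch only decompresses the file and does not parse the Turtle content.

```python
import bz2
from pathlib import Path


def enhancement_filename(lang_code: str) -> str:
    """Return the enhancement file name for a language code (e.g. "fr")."""
    code = lang_code.strip().lower()
    if not code.isalpha():
        raise ValueError(f"not a language code: {lang_code!r}")
    # Pattern taken from the text: ll_dbnary_enhancement.ttl.bz2
    return f"{code}_dbnary_enhancement.ttl.bz2"


def iter_enhancement_lines(directory, lang_code):
    """Yield decoded text lines from a locally stored enhancement file.

    `directory` is a hypothetical local folder holding downloaded files.
    """
    path = Path(directory) / enhancement_filename(lang_code)
    # Open the bz2 archive in text mode so lines come out as str.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

The resulting lines could then be handed to any Turtle parser; streaming them avoids decompressing the whole file into memory first.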

The dataset may be downloaded or accessed online.

Statistics

The Dashboard shows the number of Pages, Entries, Senses, Translations, Lexical Relations and so on that are available globally or in each language edition.

A Short History of DBnary

In August 2012, the first version of DBnary was released as an entry in the Monnet Challenge for Lexical Linked Data. At that time, only a few languages were extracted (mainly English, French, German, Italian and Portuguese).

From the beginning, the extraction process has been designed as an ongoing process where each Wiktionary dump is extracted as it is produced. This way, the dataset evolves with the Wiktionary data (hence it also follows the evolution of languages). Moreover, new languages were introduced from time to time and we now maintain 22 different extractors.

In practice, this means that the dataset evolves twice a month.

Until July 2017, the dataset was modeled using the lemon vocabulary. At that date, all extractors switched to the ontolex vocabulary, which extends lemon and is now a W3C specification.

All extracted versions are still available for download if anybody wants to study the evolution of the extracted data. From now on, the early versions of DBnary (modeled using lemon) are only available on Zenodo, as we have difficulties maintaining the full history on our servers. The later versions (from July 2017 onward) are still available for download on this server.

As the extraction process has been running for years, the extractors and the original data can drift out of sync, and the extracted data then no longer faithfully reflects the Wiktionary information. To cope with this, statistics are computed on each extracted version, and we use a dashboard where the extraction history of each language can be studied. Usually, when the number of elements (pages, entries, translations, relations, etc.) decreases, it means that the Wiktionary community of the corresponding language edition has decided to change the way it represents lexical information. When we detect such a decrease, we try to adapt the extractor and re-synchronize it with the Wiktionary data. In the beginning, these statistics were maintained in CSV format and were external to the dataset. Now, the whole history of statistics is available in RDF (using the RDF Data Cube vocabulary). These statistics are available online and may be queried along with DBnary data through the SPARQL endpoint.
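The idea of spotting a decrease between successive extractions can be illustrated with a short Python sketch. This is not the project's actual monitoring code: the function name, the input shape (ISO date strings paired with counts) and the sample numbers in the usage note are all invented for illustration.

```python
from typing import List, Tuple


def find_decreases(history: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (date, previous_count, current_count) for every drop.

    `history` holds (ISO date, element count) pairs for one language
    and one kind of element (pages, entries, translations, ...).
    """
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as strings.
    ordered = sorted(history)
    drops = []
    # Compare each extraction with the one immediately before it.
    for (_, prev), (date, curr) in zip(ordered, ordered[1:]):
        if curr < prev:
            drops.append((date, prev, curr))
    return drops
```

For instance, with the made-up history `[("2020-01-01", 100), ("2020-01-20", 120), ("2020-02-01", 90)]`, the function reports a single drop on 2020-02-01, from 120 to 90. A drop only signals that an extractor may need attention; whether the Wiktionary edition really changed its layout still has to be checked by hand.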