DBnary – Wiktionary as Linguistic Linked Open Data

DBnary is an effort to provide multilingual lexical data extracted from Wiktionary. The extracted data is made available as LLOD (Linguistic Linked Open Data). This dataset won the Monnet Challenge in 2012.

Linguistic data currently includes Bulgarian, Catalan, Chinese, Dutch, English, Finnish, French, Irish, German, Greek, Indonesian, Italian, Japanese, Kurdish, Latin, Lithuanian, Malagasy, Norwegian, Polish, Portuguese, Russian, Serbo-Croat, Spanish, Swedish and Turkish.

Licence

DBnary is derived from Wiktionary and is distributed under the Creative Commons Attribution-ShareAlike 3.0 licence.

Attribution

If you use DBnary in one way or another, please link to this web page. When citing this work in a scientific article, please cite:

Sérasset, Gilles (2014). DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF. To appear in Semantic Web Journal (special issue on Multilingual Linked Open Data).

Dataset

The DBnary dataset is registered on the Datahub.

The dataset contains extracts from 22 Wiktionary language editions. It also contains additional data computed from the extracted content, which I call enhancements. So far, the main enhancement (and the one you can reasonably count on) is a set of disambiguated translations (see the files named ll_dbnary_enhancement.ttl.bz2, where ll is the language code). Each of these files contains links from translation pairs to the specific word sense(s) for which the translation is valid.
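As a rough illustration of working with one of these files, the following Python sketch builds the file name for a given language code and streams the compressed Turtle text line by line, using only the standard library. The helper names (enhancement_filename, iter_turtle_lines) are hypothetical and not part of DBnary, and the file is assumed to have been downloaded locally already.

```python
import bz2
from typing import Iterator


def enhancement_filename(lang: str) -> str:
    """Return the enhancement file name for a language code (e.g. 'fr')."""
    return f"{lang}_dbnary_enhancement.ttl.bz2"


def iter_turtle_lines(path: str) -> Iterator[str]:
    """Yield the non-empty lines of a bz2-compressed Turtle file.

    The file is decompressed on the fly, so the archive never has to be
    unpacked on disk. This reads raw text only; it does not parse RDF.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line
```

This only shows that the archive can be read without unpacking it first. To actually follow the links from translations to word senses, the Turtle content would have to be loaded with an RDF parser.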

The dataset may be downloaded or accessed online.

Statistics

The Dashboard shows the number of Pages, Entries, Senses, Translations, Lexical Relations, etc. that are available globally or in each language edition.

A Short History of DBnary

In August 2012, the first version of DBnary was released as an entry in the Monnet Challenge for Lexical Linked Data. At that time, only a few languages were extracted (mainly English, French, German, Italian and Portuguese).

From the beginning, the extraction process has been designed as an ongoing process where each Wiktionary dump is extracted as it is produced. This way, the dataset evolves with the Wiktionary data (hence it also follows the evolution of languages). Moreover, new languages were introduced from time to time and we now maintain 22 different extractors.

In practice, this means that the dataset evolves twice a month.

Until July 2017, the dataset was modeled using the lemon vocabulary. At that date, all extractors switched to the ontolex vocabulary, which extends lemon and is now a W3C specification.

All extracted versions are still available for download if anybody wants to study the evolution of the extracted data. From now on, the early versions of DBnary (modeled using lemon) are only available on Zenodo, as we have difficulties maintaining the full history on our servers. The later versions (from July 2017 onward) are still available for download on this server.

As the extraction process has been running for years, the extractors and the original data can drift out of sync, in which case the extracted data no longer faithfully reflects the Wiktionary information. To cope with this, statistics are computed on each extracted version, and we use a dashboard where the extraction history of each language may be studied. Usually, when the number of elements (pages, entries, translations, relations, etc.) decreases, it means that the Wiktionary community of the corresponding language edition has decided to change the way it represents the lexical information. When we detect such a decrease, we try to adapt the extractors and re-synchronize them with the Wiktionary data. In the beginning, these statistics were maintained in CSV format and were external to the dataset. Now, the whole history of statistics is available in RDF (using the datacube vocabulary). These statistics are available online and may be queried along with the DBnary data through the SPARQL endpoint.
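The idea of watching for decreasing counts can be sketched in a few lines of Python. This is only an illustration of the principle, not the procedure actually used for DBnary: the function name find_decreases is hypothetical, and the sample figures are made up rather than taken from the real statistics.

```python
from typing import List, Tuple


def find_decreases(history: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (label, previous_count, new_count) for each extraction whose count dropped.

    `history` is a chronologically ordered list of (label, count) pairs,
    for example the number of entries extracted from each successive dump.
    """
    drops = []
    # Compare each extraction with the one immediately before it.
    for (_, previous), (label, current) in zip(history, history[1:]):
        if current < previous:
            drops.append((label, previous, current))
    return drops


# Made-up counts for illustration only; not real DBnary statistics.
sample = [("dump-1", 1000), ("dump-2", 1040), ("dump-3", 620), ("dump-4", 1050)]
print(find_decreases(sample))
```

With the sample above, only the third dump is flagged, since it is the only one whose count is lower than that of its predecessor. A real check would probably tolerate small fluctuations and flag only substantial drops.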
