We are now handling Thesaurus entries in the English dataset

The English extractor has recently been extended to handle Thesaurus entries. This yields more lexico-semantic relations. However, such relations currently originate from and lead to Pages rather than LexicalEntries. This shortcoming will be addressed during data enhancement (still to come). Happy Linking!

Disambiguated Translations are now systematically computed

After a long wait, I finally managed to integrate the “Source Translation Disambiguation” experiment into the DBnary extractor. Hence, the sources of some translations are now disambiguated each time a new dump is extracted.

What is this?

In Wiktionary, translations are sometimes included in a box that corresponds to a word sense. This makes it possible to distinguish between the translations of different senses. For instance, the French noun “bleu” may be translated as “blue” when it means the color, as “bruise” when it is used in the “small injury” sense, and as “rookie” when it designates an inexperienced soldier.

In the original DBnary dataset, the Translation object is linked to the lexical entry and accompanied by a “gloss” that helps humans disambiguate. Note that in all these “bleu” examples, the lexical entry is the same.
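To make this concrete, here is a minimal Jena sketch that lists a few Translation objects together with their glosses from a locally downloaded dump. The file name and the exact property names (dbnary:isTranslationOf, dbnary:gloss, dbnary:writtenForm) are assumptions on my part; adapt them to the vocabulary actually used in the dump you download.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;

    public class ListTranslations {
        public static void main(String[] args) {
            // Load a locally downloaded DBnary dump (file name is an assumption).
            Model model = ModelFactory.createDefaultModel();
            RDFDataMgr.read(model, "fr_dbnary.ttl");

            // Property names below are assumptions; check the dump's vocabulary.
            String q = "PREFIX dbnary: <http://kaiko.getalp.org/dbnary#> "
                     + "SELECT ?translation ?gloss ?written WHERE { "
                     + "  ?translation a dbnary:Translation ; "
                     + "               dbnary:isTranslationOf ?entry ; "
                     + "               dbnary:gloss ?gloss ; "
                     + "               dbnary:writtenForm ?written . "
                     + "} LIMIT 10";

            try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) System.out.println(rs.next());
            }
        }
    }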

The “Disambiguated Translation” dataset complements the core DBnary dataset by linking each translation object to its correct source word sense(s). In this case, “bruise” will be attached to the 12th word sense of the entry “bleu”, while “rookie” will be attached to word sense 10.a (this word sense numbering is correct at the time I write this post, but may evolve).
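Assuming the extension file reuses dbnary:isTranslationOf to point translations at lexical senses (this is my reading, not a documented guarantee), retrieving the disambiguated links could look like the following sketch, where both the file names and the sense class are assumptions:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;

    public class ListDisambiguated {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            // Load the core dump plus the extension file (names are assumptions).
            RDFDataMgr.read(m, "fr_dbnary.ttl");
            RDFDataMgr.read(m, "fr_dbnary_disambiguated_translations.ttl");

            // Assumes the extension points translations at lexical senses
            // via the same dbnary:isTranslationOf property.
            String q = "PREFIX dbnary: <http://kaiko.getalp.org/dbnary#> "
                     + "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#> "
                     + "SELECT ?translation ?sense WHERE { "
                     + "  ?translation dbnary:isTranslationOf ?sense . "
                     + "  ?sense a ontolex:LexicalSense . } LIMIT 10";

            try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
                qe.execSelect().forEachRemaining(System.out::println);
            }
        }
    }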

Why is it not in the core dataset?

Well, doing this “disambiguation” is not always easy, and the program may not always be correct. We try to do our best, but the process is not error-free. Hence, we provide this data in an extension file; you may choose to use it or not.

Ok, but how does it work?

Well, the disambiguation process uses the “glosses” that are given in the original Wiktionary page. There are two kinds of glosses:

  • Some are in a numeric form: “10.a”, i.e. a sense number as given in Wiktionary. In this case, we are able to attach the translation to the designated word sense (hoping the Wiktionary data is correct).
  • Others are in a textual form: “Ecchymose” (French for “bruise”), i.e. a small explanation (gloss) that is to be understood as a shortcut for a definition (“12. (Par métonymie) Ecchymose, résultat d’un choc sur une région du corps humain, conduisant à l’apparition d’une couleur bleutée.” in our example, roughly “(By metonymy) Bruise, the result of a blow to a part of the human body, leading to the appearance of a bluish color.”). In this case, we compute a distance between the gloss and all the definitions, then attach the translation to the closest one(s). This is where we are most likely to make mistakes; a toy sketch of this matching step is given after the list.
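As an illustration only, here is a toy version of that matching step. The real extractor uses a more elaborate string distance; the Jaccard similarity over word tokens below merely shows the principle.

    import java.util.*;

    public class GlossMatcher {
        // Toy Jaccard similarity over word tokens; the real DBnary measure
        // is more elaborate, this only illustrates the idea.
        static double similarity(String a, String b) {
            Set<String> ta = tokens(a), tb = tokens(b);
            if (ta.isEmpty() || tb.isEmpty()) return 0.0;
            Set<String> inter = new HashSet<>(ta);
            inter.retainAll(tb);
            Set<String> union = new HashSet<>(ta);
            union.addAll(tb);
            return (double) inter.size() / union.size();
        }

        static Set<String> tokens(String s) {
            return new HashSet<>(Arrays.asList(s.toLowerCase().split("[^\\p{L}]+")));
        }

        // Index of the definition closest to the textual gloss.
        static int closestSense(String gloss, List<String> definitions) {
            int best = -1;
            double bestScore = -1.0;
            for (int i = 0; i < definitions.size(); i++) {
                double score = similarity(gloss, definitions.get(i));
                if (score > bestScore) { bestScore = score; best = i; }
            }
            return best;
        }

        public static void main(String[] args) {
            List<String> defs = List.of(
                "Couleur bleue.",                                          // color
                "Ecchymose, résultat d'un choc sur une région du corps.",  // bruise
                "Soldat nouvellement incorporé.");                         // rookie
            System.out.println(closestSense("Ecchymose", defs));           // prints 1
        }
    }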

Ok, but how do I know if the data is correct for my language?

Well, we are able to evaluate the process when we have glosses that are both numeric and textual (e.g. “10.a Ecchymose”). In this case, we first disambiguate with the textual gloss, then with the numeric gloss. We compare both results and count how many times the textual disambiguation yields the same result as the numeric disambiguation. We keep the numeric result (which we consider the correct one) in the dataset.
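In code, this evaluation loop could look roughly like the sketch below, which reuses the GlossMatcher toy class from the previous sketch. The MixedGloss record and the 0-based sense index are simplifications of mine, not the actual evaluation code.

    import java.util.List;

    public class DisambEval {
        // A gloss carrying both a sense number and a text, e.g. "10.a Ecchymose".
        // Here the sense number is simplified to a 0-based definition index.
        record MixedGloss(int numericSense, String textualGloss, List<String> definitions) {}

        public static void main(String[] args) {
            List<MixedGloss> gold = List.of(
                new MixedGloss(1, "Ecchymose", List.of(
                    "Couleur bleue.",
                    "Ecchymose, résultat d'un choc sur une région du corps.",
                    "Soldat nouvellement incorporé.")));

            // The numeric part is taken as the gold answer; count how often
            // the textual matcher (sketched above) agrees with it.
            int agree = 0;
            for (MixedGloss g : gold) {
                if (GlossMatcher.closestSense(g.textualGloss(), g.definitions())
                        == g.numericSense()) agree++;
            }
            System.out.printf("agreement: %.2f%n", (double) agree / gold.size());
        }
    }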

Using the following chart, you may see whether your language has many glosses, and of which kind. Note that we cannot disambiguate translations that do not have glosses.

From this chart, you can see that the language editions use different strategies: the German Wiktionary exclusively uses numeric glosses, the Russian edition uses only textual glosses, while the French edition uses textual or mixed glosses.

The confidence we have in our algorithm can only be computed for languages with a significant number of glosses carrying both numeric and textual data; but as the algorithm is the same for all languages, we hope its performance is language-independent.

This CSV file gives a view of the confidence computed for each language: it gives the precision and recall of our method, compared to the precision and recall of a random choice among word senses. Values with a precision of 0.0 are usually due to missing evaluation data.


DBnary is now using the W3C Ontolex format

The DBnary data is now available using the ontolex vocabulary. New bug fixes and additional extracted data will from now on only be available in the ontolex vocabulary.

Of course, as usual, previously extracted data remains available in the lemon format.

For a limited time, the data will be extracted in both the lemon and ontolex formats. However, online access will only be available in the ontolex format.
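As a quick sanity check of an ontolex-format dump, you can count lexical entries locally with Jena. The file name is an assumption of mine; the namespace <http://www.w3.org/ns/lemon/ontolex#> is the standard W3C one.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;

    public class CountOntolexEntries {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            // Any ontolex-format DBnary dump; the file name is an assumption.
            RDFDataMgr.read(m, "en_dbnary_ontolex.ttl");

            // ontolex is the W3C namespace <http://www.w3.org/ns/lemon/ontolex#>.
            String q = "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#> "
                     + "SELECT (COUNT(?e) AS ?n) WHERE { ?e a ontolex:LexicalEntry }";
            try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
                System.out.println(qe.execSelect().next());
            }
        }
    }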

The DBnary-specific vocabulary will shortly be adapted to this new base vocabulary. Stay tuned for more information.

The DBnary extraction program is now on Bitbucket

Due to demand, we decided to migrate the DBnary programs from our own forge to Bitbucket, and to switch to git.

If you want to develop a new extractor or improve the existing ones, go to https://bitbucket.org/serasset/dbnary.

Happy back-to-school period in France, by the way 😉

DBnary is offering a new dataset for African languages

We have just taken a first step towards expanding the DBnary dataset with dictionaries provided by the DILAF project. We have extracted a lemon version of the DILAF Bambara dictionary and made it available on the DBnary server. As usual, it uses the lemon model, and its URIs are dereferenceable. Check, for instance, the http://kaiko.getalp.org/dilaf/bam/daɲɛgafe1__n entry.
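Since the URIs are dereferenceable, you can load an entry's RDF description straight off the web, for instance with Jena. This sketch assumes the server returns an RDF serialization through content negotiation:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;

    public class DereferenceEntry {
        public static void main(String[] args) {
            // Jena asks for an RDF serialization when dereferencing the IRI,
            // so this relies on the server's content negotiation.
            Model m = ModelFactory.createDefaultModel();
            RDFDataMgr.read(m, "http://kaiko.getalp.org/dilaf/bam/daɲɛgafe1__n");
            m.write(System.out, "TURTLE");
        }
    }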


Kaiko’s going evasive

The recent failure of the Kaiko web server was due to a flood of SPARQL requests to DBnary. In less than a week, a client launched 46 million requests to the SPARQL server; this overflowed the log file, which filled up the root partition and broke the server.

May I remind everybody that the DBnary data is easily available by downloading the Turtle files, which you may either use directly with adequate libraries (e.g. JENA in Java, or equivalents in other programming languages) or load into a local DBnary mirror.
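For instance, a minimal Jena workflow that loads a downloaded dump into memory and queries it locally, without touching the public server, might look like this (the file name is an assumption):

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;

    public class LocalQuery {
        public static void main(String[] args) {
            // Load a downloaded dump once (file name is an assumption), then
            // run as many queries as you like against the local copy.
            Model m = ModelFactory.createDefaultModel();
            RDFDataMgr.read(m, "fr_dbnary.ttl");

            String q = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";
            try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
                System.out.println(qe.execSelect().next());
            }
        }
    }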

If you overload the public server, it will not be available to serve others.

In order to avoid future problems, the public server is going evasive, meaning that such flooding clients will be temporarily blocked, allowing the server to remain available to others.

If this new setting breaks your app, do not hesitate to contact me.

We are back online!

After 3 days offline due to a major server failure, we are back online!

Major bug discovered and fixed

While using DBnary in conjunction with other lemon resources (mainly during the LIDER datathon in Madrid), we discovered a small but major problem with the DBnary data.

Until now, the lemon prefix used by DBnary was http://www.lemon-model.net/ while the official prefix is http://lemon-model.net/, leading to a poor mapping between DBnary data and other lemon datasets.

The prefix has been fixed in the extractor, and I have also fixed ALL PREVIOUS VERSIONS of the dataset. This means that, from now on, even if you use an older dataset (provided that you re-download it), you will have the correct mapping.
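If for some reason you cannot re-download a fixed dump, a local copy could be patched along the following lines. This is a sketch of mine, not the official fix; it simply copies every statement while rewriting any URI that starts with the wrong prefix:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.riot.RDFDataMgr;

    public class FixLemonPrefix {
        static final String BAD  = "http://www.lemon-model.net/";
        static final String GOOD = "http://lemon-model.net/";

        // Rewrite a URI if it uses the wrong lemon prefix.
        static String fix(String uri) {
            return uri.startsWith(BAD) ? GOOD + uri.substring(BAD.length()) : uri;
        }

        public static void main(String[] args) {
            Model in = ModelFactory.createDefaultModel();
            RDFDataMgr.read(in, "old_dump.ttl"); // input file name is an assumption

            // Copy every statement, fixing subjects, predicates, and objects.
            Model out = ModelFactory.createDefaultModel();
            in.listStatements().forEachRemaining(st -> {
                Resource s = st.getSubject().isURIResource()
                        ? out.createResource(fix(st.getSubject().getURI()))
                        : st.getSubject();
                Property p = out.createProperty(fix(st.getPredicate().getURI()));
                RDFNode  o = st.getObject().isURIResource()
                        ? out.createResource(fix(st.getObject().asResource().getURI()))
                        : st.getObject();
                out.add(s, p, o);
            });

            out.write(System.out, "TURTLE");
        }
    }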

The SPARQL endpoint data will also be updated soon.

METEOR with DBnary

The DBnary dataset has been used in an experiment on a machine translation quality measure based on METEOR. The research paper will be presented at MT-Summit 2016.

To allow replication of this work, we provide the sources of the experiment:

21 languages are now available

Latin is now part of the extracted languages. It is a rather small language edition but, while we were participating in the first Summer Datathon (SD-LLOD2015) in Cercedilla (Spain), this language seemed to be eagerly awaited by some participants.
