After a long wait, I finally managed to integrate the “Source Translation Disambiguation” experiment into the DBnary extractor. As a result, the sources of some translations are now disambiguated each time a new dump is extracted.
What is this?
In Wiktionary, translations are sometimes grouped in a box that corresponds to a word sense, which makes it possible to distinguish between different translations. For instance, the French noun “bleu” may be translated as “blue” when it means a color, as “bruise” when it is used in the “small injury” sense, and as “rookie” when it designates an inexperienced soldier.
In the original DBnary dataset, the Translation object is linked to the lexical entry and accompanied by a “gloss” that helps humans disambiguate. Note that in all these “bleu” examples, the lexical entry is the same.
The “Disambiguated Translation” dataset complements the core DBnary dataset by linking each translation object to its correct source word sense(s). In this case, “bruise” will be attached to the 12th word sense of the entry “bleu”, while “rookie” will be attached to word sense 10a (this word sense numbering is correct at the time of writing, but may evolve).
Why is it not in the core dataset?
Well, this “disambiguation” is not always easy, and the program may not always get it right. We try to do our best, but the process is not error-free. Hence we publish this data in an extension file. You may choose to use it or not.
OK, but how does it work?
Well, the disambiguation process uses the “glosses” that are given in the original Wiktionary page. There are two kinds of glosses:
- Some are in numeric form: “10.a”, i.e. a sense number as given in Wiktionary. In this case, we can attach the translation to the designated sense number (hoping the Wiktionary data is correct).
- Others are in textual form: “Ecchymose”, i.e. a short explanation (gloss) that is to be understood as a shortcut for a definition (“12. (Par métonymie) Ecchymose, résultat d’un choc sur une région du corps humain, conduisant à l’apparition d’une couleur bleutée.” in our example). In this case, we compute a distance between the gloss and all definitions, then attach the translation to the closest one(s). This is where we are more likely to make mistakes.
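The two cases above can be sketched in a few lines of Python. This is only an illustration, not the DBnary extractor's code: the sense inventory, the function names and the regular expression are hypothetical, the definitions for senses 1 and 10a are invented for the example, and the token-overlap score is a simple stand-in, since this post does not say which distance is actually used.

```python
import re

# Hypothetical sense inventory for "bleu": sense number -> definition.
# Only the definition of sense 12 is (a shortened form of) the real one.
SENSES = {
    "1": "Qui est de la couleur du ciel sans nuages.",
    "10a": "Soldat sans expérience, nouvelle recrue.",
    "12": "Ecchymose, résultat d'un choc sur une région du corps humain.",
}

# Matches purely numeric glosses such as "12", "12." or "10.a".
NUMERIC_GLOSS = re.compile(r"^\s*(\d+)(?:\.?([a-z]))?\.?\s*$")


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(gloss, definition):
    """Share of gloss tokens found in the definition (stand-in measure)."""
    gloss_tokens = tokens(gloss)
    if not gloss_tokens:
        return 0.0
    return len(gloss_tokens & tokens(definition)) / len(gloss_tokens)


def disambiguate(gloss, senses):
    """Return the sense number(s) a gloss points to, or [] if none."""
    match = NUMERIC_GLOSS.match(gloss)
    if match:
        # Numeric gloss: attach directly to the designated sense, if it exists.
        number = match.group(1) + (match.group(2) or "")
        return [number] if number in senses else []
    # Textual gloss: keep the closest definition(s); ties are all kept.
    scores = {n: similarity(gloss, d) for n, d in senses.items()}
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []
    return [n for n, score in scores.items() if score == best]
```

With this toy inventory, `disambiguate("10.a", SENSES)` follows the numeric path and `disambiguate("Ecchymose", SENSES)` follows the textual path. A gloss that shares no word with any definition is left unattached rather than guessed.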
OK, but how do I know if the data is correct for my language?
Well, we are able to evaluate the process when we have specific glosses that are both numeric and textual (e.g. “10.a Ecchymose”). In this case, we first disambiguate with the textual gloss, then with the numeric gloss. We compare both results and count how many times the textual disambiguation gives the same result as the numerical disambiguation. We keep the numerical result (which we consider to be the correct one) in the dataset.
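As a rough sketch of this evaluation step, the Python below splits a mixed gloss into its numeric and textual parts and counts how often two disambiguation results agree. The names `split_mixed_gloss` and `agreement` and the gloss pattern are assumptions made for this illustration, and the results being compared are assumed to be lists of sense numbers.

```python
import re

# Hypothetical pattern for mixed glosses such as "10.a Ecchymose":
# a sense number, optionally followed by a letter, then some text.
MIXED_GLOSS = re.compile(r"^\s*(\d+)(?:\.?([a-z]))?\.?\s+(\S.*)$")


def split_mixed_gloss(gloss):
    """Return (sense_number, text) for a mixed gloss, or None otherwise."""
    match = MIXED_GLOSS.match(gloss)
    if not match:
        return None
    number = match.group(1) + (match.group(2) or "")
    return number, match.group(3).strip()


def agreement(pairs):
    """Count how often the textual result equals the numeric one.

    `pairs` is an iterable of (numeric_result, textual_result), each a
    list of sense numbers. Returns (number_of_agreements, total).
    """
    same = total = 0
    for numeric, textual in pairs:
        total += 1
        if set(numeric) == set(textual):
            same += 1
    return same, total
```

For example, `split_mixed_gloss("10.a Ecchymose")` yields the sense number and the text separately, so that each part can be disambiguated on its own before the two results are compared.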
Using the following chart, you can see whether your language has many glosses, and of which kind. Note that we cannot disambiguate translations that do not have glosses.
From this chart, you can see that language editions use different strategies. The German Wiktionary exclusively uses numerical glosses, the Russian edition uses only textual glosses, while the French edition uses textual or mixed glosses.
The confidence we have in our algorithm can only be computed for languages that use a significant number of glosses with both numerical and textual data, but since the algorithm is the same for all languages, we hope its performance is language independent.
This CSV file gives a view of the confidence computed for each language. It gives the precision and recall of our method, compared to the precision and recall of a random choice among word senses. Values with a precision of 0.0 are usually due to missing evaluation data.
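To make these figures concrete, here is a minimal sketch of how precision and recall can be computed when both the predicted and the reference answers are sets of sense numbers, together with the expected precision of one uniform random pick. The exact formulas behind the CSV file are not given in this post, so the micro-averaging, the function names and the random baseline below are assumptions.

```python
def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over translations.

    `predicted` and `gold` are parallel lists; each item is the list of
    sense numbers attached to one translation.
    """
    true_positives = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    n_predicted = sum(len(set(p)) for p in predicted)
    n_gold = sum(len(set(g)) for g in gold)
    # With no data at all, fall back to 0.0 instead of dividing by zero.
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


def random_baseline_precision(entries):
    """Expected precision of picking one sense uniformly at random.

    `entries` is a list of (number_of_senses, number_of_correct_senses),
    one pair per translation.
    """
    if not entries:
        return 0.0
    return sum(correct / senses for senses, correct in entries) / len(entries)
```

In this sketch, an empty evaluation set yields a precision of 0.0, which is one way such values could arise when evaluation data is missing.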