4 more languages immediately available

Thanks to Malick Diagne and Steve Roques, DBnary now extracts data from the Dutch, Lithuanian, Serbo-Croatian and Swedish editions. The Serbo-Croatian extractor also extracts morphological information.

WikDict, a web service based on DBnary data

Karl Bartel has created a very simple web service to look up translations in many languages.

The data powering WikDict is provided by DBnary. The service is still preliminary, but I’m quite sure it will improve over time.

Go to wikDict home page…

Improvement in Russian extractor

Several bug fixes were made on the Russian extractor. Definitions should now be extracted in a more complete way and empty definitions should not be extracted anymore.

Translation extraction has also been modified to avoid (rather infrequent) special cases where links and usage notes were not extracted correctly.

French and German morphology is now extracted

Since December 2014, morphological data has been extracted from the French and German language editions. This data is currently stored in an exhaustive form, meaning that every inflected form may be found in a lemon:otherForm property.

German morphology extraction is still rather preliminary, but the French version should already be in a form useful for DBnary users.
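
As a rough illustration of how the exhaustive lemon:otherForm data might be queried, the shell sketch below builds a SPARQL query listing the forms attached to one lexical entry. The entry IRI is a placeholder, and reading each form's spelling through lemon:writtenRep is an assumption based on the lemon model rather than something stated in this post.

```shell
#!/bin/sh
# Sketch: print a SPARQL query listing the forms attached to a lexical
# entry through lemon:otherForm.
# Assumption: each form carries its spelling in lemon:writtenRep.
other_forms_query() {
    # $1: IRI of the lexical entry (placeholder, supply a real one)
    cat <<EOF
PREFIX lemon: <http://lemon-model.net/lemon#>
SELECT ?form ?written WHERE {
  <$1> lemon:otherForm ?form .
  OPTIONAL { ?form lemon:writtenRep ?written }
}
EOF
}

# Print the query for a placeholder entry IRI.
other_forms_query "http://example.org/some-entry"
```

The printed query can then be sent to any SPARQL endpoint serving DBnary data, for instance with curl -G --data-urlencode "query=$(other_forms_query "$ENTRY_IRI")" "$ENDPOINT", where ENTRY_IRI and ENDPOINT are variables you set yourself.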

English extraction improved

English extraction has been improved slightly.

  • More parts of speech are now extracted (mainly affixes, proverbs, …),
  • Extracted data is now more precisely typed and described, using the lexinfo vocabulary.

Additionally, all lexical entries are now explicitly typed as LexicalEntry, while they were previously typed as lemon:Word or lemon:Phrase or lemon:Adjective (all of which are subclasses of lemon:LexicalEntry).

Usage examples are now extracted and attached to word senses

During the French “TALN” workshop, several French researchers asked me whether I could add usage examples to the extracted data. This has been done for French, with the addition of a new property (named dbnary:exampleSource) that gives the source of the example when available.

The French extractor still shows some problems when examples are split over several lines using the HTML br tag, but this is a minor issue (it appears a thousand times in a set of more than 252 thousand usage examples).

Examples in other languages still need to be verified. The extraction of English usage examples has been disabled, as the English language edition uses a quite tedious representation involving several lines…

Stay tuned for more/better data, and do not hesitate to go to the forge and submit feature requests or bugs.

The Polish Language Extractor is now online

DBnary now contains extractions from 13 Wiktionary language editions, with Polish being added today. Polish data is available in the very same format as other languages. The extraction work has been really tedious, as the Polish language edition uses a rather different micro-structure for its pages. For instance, the definitions of all lexical entries are gathered in a single definitions section (with the different parts of speech gathered as subsections). This makes the extractor trickier.

Moreover, the part-of-speech information is rather detailed in the Polish language edition, with many subcategorization options. Almost all of this subcategorization information is extracted and rendered using the lexinfo standard vocabulary.

Setting up a virtuoso-opensource server to mirror DBnary data

This post details the steps necessary to install a virtuoso server and bulk load the DBnary data. You’ll have to adapt them to your own settings.

Download and Compile virtuoso-opensource

First, you’ll have to install several development libraries (this is for Debian):

sudo aptitude install autoconf autoheader automake bison flex gawk gperf libtool build-essential autotools-dev 
sudo aptitude install libssl-dev libxml2 libxml2-dev imagemagick libreadline-dev libldap-dev libmagickwand5 libmagickwand-dev libwbxml2-dev libwbxml2-0

Then, clone the dbpedia and virtuoso-opensource git repositories, set up the i18n version of dbpedia (not sure this is really useful…), configure, make, make install, pour coffee…

git clone https://github.com/dbpedia/dbpedia-vad-i18n
git clone https://github.com/openlink/virtuoso-opensource.git
cd virtuoso-opensource/
git checkout develop/7
cd binsrc/
mv dbpedia dbpedia.orig
cp -r ../../dbpedia-vad-i18n/dbpedia .
# autogen.sh lives at the top of the source tree, so go back up first
cd ..
./autogen.sh
CFLAGS="-O2 -m64"
export CFLAGS
./configure --prefix=/opt/virtuoso-opensource --with-readline --enable-dbpedia-vad --enable-fct-vad --enable-rdfmappers-vad --with-port=2222
make
sudo make install

The --with-port=2222 option is only needed if you compile a new version of virtuoso while another instance is already running with default settings.

Patch virtuoso facetted browser if necessary

Display labels for all languages

The description.vsp program installed by default in virtuoso does not display literal strings that are in a language other than the language of the user.

The language of the user is either taken from the HTTP header (Accept-Language) or from the URL (lang=xx in the URL GET arguments). A value of “*” will display all languages.

This is inconvenient when using this linked data viewer for multilingual dictionary data (as in DBnary). Hence you have to modify this part of the description.vsp page.

To force virtuoso to display all languages, you can patch the binsrc/b3s/rdfdesc/description.vsp file.

Find the part that tries to get the language of the user, something like:

 langs := http_request_header_full (lines, 'Accept-Language', 'en');
 ua := http_request_header (lines, 'User-Agent');
 all_langs := b3s_get_lang_acc (lines);
 lang_parm := get_keyword ('lang', params, '');
 if (length (lang_parm))
 {
   all_langs := vector (lang_parm, 1.0);
   langs := lang_parm;
 }

Add the following lines just before the “if (length (lang_parm))”:

-- GS: force all language strings to be displayed
lang_parm := '*';

Alternatively, you may edit the file after server deployment by using the DAV browser to navigate to the folder DAV/VAD/fct/rdfdesc and edit the file named “description.vsp”.

Fix some remaining encoding issues

DBnary uses IRI negotiation. This allows international characters to be used inside node names in RDF (i.e. URIs/IRIs). However, the facetted browser does not tolerate such use very well.

Among other problems, the navigation used in facetted browsing uses an ill-formed, non-UTF-8-encoded value as the IRI. The symptom: when you browse an entry that has a non-ASCII character in its IRI and click the “Next” button, you get “no other information”.

To fix this, also modify the description.vsp file and change the line:

 <input type="hidden" name="url" value="<?V gr ?>" />

to

 <input type="hidden" name="url" value="<?V page_resource_uri ?>" />

Setup the virtuoso database directory

mkdir -p /opt/virtuoso/
cd /opt/virtuoso-opensource/var/lib/virtuoso/
mv db /opt/virtuoso/
ln -s /opt/virtuoso/db .
cd /opt/virtuoso/db
vim virtuoso.ini

Edit the .ini file:

  • no need to change the db file declaration (a symbolic link has been used).
  • DirsAllowed                     = ., /opt/virtuoso-opensource/share/virtuoso/vad, /opt/datasets/dbnary/
  • adjust memory settings to fit your computer’s configuration
  • add “ShortenLongURIs = 1” in the SPARQL section
  • set MaxCheckpointRemap in the Database section to one quarter of NumberOfBuffers
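
The last rule of thumb is simple integer arithmetic. A minimal shell sketch, where the helper name and the sample NumberOfBuffers value are purely illustrative and not a tuning recommendation:

```shell
#!/bin/sh
# Hypothetical helper: derive MaxCheckpointRemap from NumberOfBuffers
# using the "one quarter" rule of thumb given above.
max_checkpoint_remap() {
    # $1: the NumberOfBuffers value set in virtuoso.ini
    echo $(( $1 / 4 ))
}

# 680000 is only an example value; use the one from your own virtuoso.ini.
echo "MaxCheckpointRemap = $(max_checkpoint_remap 680000)"
```

The printed line can be pasted into virtuoso.ini next to the NumberOfBuffers setting it was derived from.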

Setup automatic startup

sudo cp debian/init.d /etc/init.d/virtuoso-opensource
sudo chmod +x /etc/init.d/virtuoso-opensource
sudo vim /etc/init.d/virtuoso-opensource

Modify:

PATH, DAEMON (put the prefix you use at configure step…)
DBBASE: use the folder you configured in the previous step…

sudo update-rc.d virtuoso-opensource defaults

Setup apache for an external server

Put the following proxy directives in your Apache configuration file.

# The URL of the explanatory website
  Alias /about-dbnary /opt/www/kaiko/dbnary/
  ProxyPass /describe http://localhost:8890/describe
  ProxyPassReverse /describe http://localhost:8890/describe
  ProxyPass /conductor http://localhost:8890/conductor
  ProxyPassReverse /conductor http://localhost:8890/conductor
  ProxyPass /dbnary http://localhost:8890/dbnary
  ProxyPassReverse /dbnary http://localhost:8890/dbnary
# This is mandatory as the virtuoso server redirects to this url (that should be handled by apache).
  ProxyPassReverse /about-dbnary http://localhost:8890/about-dbnary
  ProxyPass /sparql http://localhost:8890/sparql connectiontimeout=300 timeout=300
  ProxyPassReverse /sparql http://localhost:8890/sparql
  ProxyPass /isparql http://localhost:8890/isparql
  ProxyPassReverse /isparql http://localhost:8890/isparql
  ProxyRequests Off
  #ProxyHTMLLogVerbose On
  #LogLevel Debug
<Location /fct>
    ProxyPass               http://localhost:8890/fct
    ProxyPassReverse        /fct
    # SetOutputFilter proxy-html
    # ProxyHTMLEnable         On
    # Apply rewrite rule to css and javascripts
    # ProxyHTMLExtended On
    # convert URLs in CSS and JS
    # ProxyHTMLURLMap "localhost:8890" "kaiko.getalp.org"
    # ProxyHTMLURLMap http://localhost:8890 http://kaiko.getalp.org

    # convert URLs in CSS and JS
    #ProxyHTMLURLMap "\"/fct" "\"/dbnary/fct" 
    #  Enable rewrite rules
    #ProxyHTMLURLMap         /fct /dbnary/fct
    #ProxyHTMLURLMap         http://localhost:8890/fct /dbnary/fct
    # Uncomment this when EnabledGzipContent=1 in virtuoso.ini
    #SetOutputFilter         INFLATE;DEFLATE
</Location>

Prepare database

Launch virtuoso-opensource and go to http://localhost:8890/conductor/
  • System Admin -> User account: modify the dav and dba passwords (default values are dav and dba…)
  • System Admin -> Packages: install package “fct”
  • System Admin -> Packages: install package “isparql” (to get an advanced SPARQL interface…)

The script below performs the remaining setup automatically:

  • set up the /dbnary path for linked data access, with content negotiation;
  • add the DBnary namespaces to the list of known namespaces.
DB.DBA.VHOST_REMOVE (
lhost=>'*ini*',
vhost=>'*ini*',
lpath=>'/dbnary'
);

DB.DBA.VHOST_DEFINE (
lhost=>'*ini*',
vhost=>'*ini*',
lpath=>'/dbnary',
ppath=>'/DAV/',
is_dav=>1,
def_page=>'',
vsp_user=>'dba',
ses_vars=>0,
opts=>vector ('browse_sheet', '', 'url_rewrite', 'http_rule_list_1'),
is_default_host=>0
);

DB.DBA.URLREWRITE_CREATE_RULELIST (
'http_rule_list_1', 1,
vector ('http_rule_1', 'http_rule_2', 'http_rule_3', 'http_rule_4'));

DB.DBA.URLREWRITE_CREATE_REGEX_RULE (
'http_rule_1', 1,
'^/(.*)$',
vector ('par_1'),
1,
'/sparql?query=DESCRIBE%%20%%3Chttp%%3A%%2F%%2Fkaiko.getalp.org%%2F%U%%3E&format=%U',
vector ('par_1', '*accept*'),
NULL,
'(text/rdf.n3)|(application/rdf.xml)',
2,
303,
''
);

DB.DBA.URLREWRITE_CREATE_REGEX_RULE (
'http_rule_2', 1,
'^/(.*)$',
vector ('par_1'),
1,
'/describe/?url=http%%3A%%2F%%2Fkaiko.getalp.org%%2F%s',
vector ('par_1'),
NULL,
'(text/html)|(\\*/\\*)',
0,
303,
''
);

DB.DBA.URLREWRITE_CREATE_REGEX_RULE (
'http_rule_3', 1,
'^/dbnary/*$',
vector (),
0,
'/about-dbnary/lemon/dbnary-doc/index.html',
vector (),
NULL,
'(text/html)|(\\*/\\*)',
0,
303,
''
);

DB.DBA.URLREWRITE_CREATE_REGEX_RULE (
'http_rule_4', 1,
'^/dbnary/*$',
vector (),
0,
'/about-dbnary/lemon/latest/dbnary.owl',
vector (),
NULL,
'(text/rdf.n3)|(application/rdf.xml)',
0,
303,
''
);
-- Create namespaces for dbnary

DB.DBA.XML_SET_NS_DECL ('lexinfo', 'http://www.lexinfo.net/ontology/2.0/lexinfo#', 2);
DB.DBA.XML_SET_NS_DECL ('lexvo', 'http://lexvo.org/id/iso639-3/', 2);
DB.DBA.XML_SET_NS_DECL ('dcterms', 'http://purl.org/dc/terms/', 2);
DB.DBA.XML_SET_NS_DECL ('lemon', 'http://lemon-model.net/lemon#', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary', 'http://kaiko.getalp.org/dbnary#', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-fra', 'http://kaiko.getalp.org/dbnary/fra/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-eng', 'http://kaiko.getalp.org/dbnary/eng/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-ita', 'http://kaiko.getalp.org/dbnary/ita/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-rus', 'http://kaiko.getalp.org/dbnary/rus/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-deu', 'http://kaiko.getalp.org/dbnary/deu/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-por', 'http://kaiko.getalp.org/dbnary/por/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-fin', 'http://kaiko.getalp.org/dbnary/fin/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-ell', 'http://kaiko.getalp.org/dbnary/ell/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-tur', 'http://kaiko.getalp.org/dbnary/tur/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-jpn', 'http://kaiko.getalp.org/dbnary/jpn/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-spa', 'http://kaiko.getalp.org/dbnary/spa/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-bul', 'http://kaiko.getalp.org/dbnary/bul/', 2);
DB.DBA.XML_SET_NS_DECL ('dbnary-pol', 'http://kaiko.getalp.org/dbnary/pol/', 2);

You may now stop virtuoso and duplicate the database directory; the copy can be reused later as a bootstrap for a new version of DBnary.
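
The duplication step can be sketched as a small shell function. The function name and the target directory are hypothetical, and the check for a virtuoso.lck lock file is an assumption about how a running server marks its database directory:

```shell
#!/bin/sh
# Sketch: copy a stopped Virtuoso database directory to a bootstrap copy.
# Assumption: a running server leaves a virtuoso.lck file in the directory.
snapshot_db() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "no such database directory: $src" >&2
        return 1
    fi
    if [ -e "$src/virtuoso.lck" ]; then
        echo "lock file found, stop virtuoso first" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    # -a copies recursively and keeps permissions and timestamps
    cp -a "$src" "$dest"
}
```

With the layout used in this post, a call could look like snapshot_db /opt/virtuoso/db /opt/virtuoso/db.bootstrap, where db.bootstrap is just an example name.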

Load DBnary data

Go to /opt/datasets/dbnary and uncompress all Turtle files there. Create xxx.ttl.graph files, each containing the URI of the graph into which the corresponding xxx.ttl file will be loaded. E.g.: http://kaiko.getalp.org/dbnary/fra may be put into fr_dbnary_lemon.ttl.graph.
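
Writing the .graph files by hand gets tedious with 13 languages, so here is a minimal shell sketch that generates them. It assumes every dump follows the xx_dbnary_lemon.ttl naming of the example above and that each graph URI ends with the three-letter code used in the namespace declarations; both function names are hypothetical.

```shell
#!/bin/sh
# Map the two-letter prefix of a dump file to the three-letter code
# assumed to be used in the graph URI.
lang3() {
    case "$1" in
        bg) echo bul ;;
        de) echo deu ;;
        el) echo ell ;;
        en) echo eng ;;
        es) echo spa ;;
        fi) echo fin ;;
        fr) echo fra ;;
        it) echo ita ;;
        ja) echo jpn ;;
        pl) echo pol ;;
        pt) echo por ;;
        ru) echo rus ;;
        tr) echo tur ;;
        *) return 1 ;;
    esac
}

# Write one <file>.ttl.graph next to each <xx>_dbnary_lemon.ttl in $1.
write_graph_files() {
    dir=$1
    for ttl in "$dir"/*_dbnary_lemon.ttl; do
        [ -e "$ttl" ] || continue      # no match: the glob stays literal
        base=$(basename "$ttl")
        code=${base%%_*}               # "fr_dbnary_lemon.ttl" gives "fr"
        if iso=$(lang3 "$code"); then
            printf '%s\n' "http://kaiko.getalp.org/dbnary/$iso" > "$ttl.graph"
        else
            echo "no graph URI known for $base, skipping" >&2
        fi
    done
}
```

Calling write_graph_files /opt/datasets/dbnary before registering the files leaves dumps with an unknown prefix untouched, so they can be handled manually.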

For the remaining steps, you’ll have to launch isql (inside screen or an NX session, as the load may take a long time).

screen isql
-- we are in sql mode now

ld_dir ('/opt/datasets/dbnary/', '*.ttl', 'http://kaiko.getalp.org/dbnary');

-- do the following to see which files were registered to be added:
SELECT * FROM DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
rdf_loader_run();

-- do nothing too heavy while data is loading
checkpoint;
commit WORK;
checkpoint;
EXIT;

This will take a long time. Do not overload your server while the data is loading. After this, relaunch isql to update caches and set up facetted browsing:

isql
sparql SELECT COUNT(*) WHERE { ?s ?p ?o } ;
sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;

-- Build Full Text Indexes by running the following commands using the Virtuoso isql program 
RDF_OBJ_FT_RULE_ADD (null, null, 'All');
VT_INC_INDEX_DB_DBA_RDF_OBJ ();
-- Run the following procedure using the Virtuoso isql program to populate label lookup tables periodically and activate the Label text box of the Entity Label Lookup tab:
urilbl_ac_init_db();
-- Run the following procedure using the Virtuoso isql program to calculate the IRI ranks. Note this should be run periodically as the data grows to re-rank the IRIs.
s_rank();

Translations are now connected to word senses

A translation is supposed to connect a source word sense to a target word sense. However, current DBnary data connects a lexical entry to a target string. By using available glosses, we are able to connect translations to their source word sense. We estimate the accuracy of these relations at over 80%. Hence, we provide these new relations as an additional dataset available here.

DBnary is now supporting 12 language editions

With the addition of Bulgarian and Spanish, the DBnary data now contains 12 language editions. More to come, we hope!
