DBnary extractor’s Command Line Interface

The DBnary Maven project produces a jar file that bundles all dependencies, so every command line tool can be invoked as follows:

java -cp path/to/dbnary-VERSIONNUMBER-jar-with-dependencies.jar org.getalp.dbnary.cli.COMMANDNAME ARGS ...

where VERSIONNUMBER should be substituted by the current DBnary version number and COMMANDNAME and ARGS are described below.
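Typing the full class path for every call gets tedious. The following is a minimal sketch of a wrapper, assuming a POSIX shell; the names `dbnary_cli`, `DBNARY_JAR` and `DBNARY_DRY_RUN` are illustrative and not part of DBnary.

```shell
# Placeholder path (an assumption); point it at the jar built by Maven.
DBNARY_JAR="${DBNARY_JAR:-path/to/dbnary-VERSIONNUMBER-jar-with-dependencies.jar}"

# dbnary_cli COMMANDNAME [ARGS...]
# Runs org.getalp.dbnary.cli.COMMANDNAME with the given arguments.
# With DBNARY_DRY_RUN=1 the command line is printed instead of executed.
dbnary_cli() {
  command_name="$1"
  shift
  # Prepend the fixed part of the command line to the remaining arguments.
  set -- java -cp "$DBNARY_JAR" "org.getalp.dbnary.cli.$command_name" "$@"
  if [ "${DBNARY_DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example: dbnary_cli ExtractWiktionary -h
```

The dry-run switch is only there to check which command line would be executed before running a long extraction.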

Downloading and preparing the dump files

Most DBnary commands assume that a language dump file is available to extract data from. Wiktionary dumps can be downloaded from dumps.wikimedia.org or from a mirror; they must then be converted to UTF-16.
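The conversion step can be sketched with the standard `iconv` utility. This assumes the dump has already been decompressed and is UTF-8 encoded (as Wikimedia XML dumps are); the helper name `to_utf16` and the source file name in the example are hypothetical.

```shell
# to_utf16 SRC DST
# Re-encodes the UTF-8 file SRC as UTF-16 and writes the result to DST.
to_utf16() {
  iconv -f UTF-8 -t UTF-16 "$1" > "$2"
}

# Example (hypothetical source file name):
# to_utf16 downloaded-dump.xml mainfolder/dumps/en/enwkt-20141223.xml
```

The exact UTF-16 variant (byte order, byte order mark) produced by `iconv` depends on the platform, so it is worth checking that the extractor accepts the resulting file on a small sample first.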

The usual file layout is as follows:

mainfolder
 +- dumps
 |   +- en
 |   |   +- enwkt-20141223.xml
 |   +- fr
 |   |   +- frwkt-20141221.xml
 |   ...
 +- extracts
     +- lemon
         +- en
         |  +- en_dbnary_lemon_20141125.ttl
         |  +- en_dbnary_lemon_20141223.ttl
         +- fr
         |  +- fr_dbnary_lemon_20141201.ttl
         |  +- fr_dbnary_lemon_20141221.ttl
         ...
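To prepare this layout by hand, a small helper along the following lines can be used. It is only a sketch: the function name `make_layout` is hypothetical, and it assumes the default lemon model for the extracts folder.

```shell
# make_layout MAINFOLDER LANG...
# Creates the dumps/LANG and extracts/lemon/LANG folders for each language code.
make_layout() {
  main="$1"
  shift
  for lang in "$@"; do
    mkdir -p "$main/dumps/$lang" "$main/extracts/lemon/$lang"
  done
}

# Example: make_layout mainfolder en fr
```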

Since this is a repetitive process, we provide the UpdateAndExtractDumps command, which automates it.

UpdateAndExtractDumps

Usage

usage: java -cp /path/to/dbnary.jar org.getalp.dbnary.cli.UpdateAndExtractDumps [OPTIONS]
       languageCode...
With OPTIONS in:
 -d <arg>            directory containing the wiktionary dumps and
                     extracts. . by default
 --enable <feature>  Enable additional extraction features.
 -f                  force the updating even if a file with the same
                     name already exists in the output directory.
                     false by default.
 -h                  Prints usage and exits.
 -k <arg>            number of dumps to be kept in dump directory. 5
                     by default
 -m <arg>            model of the extracts (LMF or LEMON) extracts.
                     lemon by default
 -n                  Do not use the ftp network, but decompress and
                     extract.
 -s <arg>            give the URL pointing to a wikimedia mirror.
                     ftp://ftpmirror.your.org/pub/wikimedia/dumps/ by
                     default.
 -z                  compress the output file using bzip2. true by
                     default
languageCode is the wiktionary code for a language (usually a 2 letter
code).

UpdateAndExtractDumps looks in the dump directory and checks (using ftp) whether a newer dump is available online (unless the -n option is given). If one is available, it downloads and prepares it. It then extracts the dump, unless an extraction file already exists for that dump.

The --enable option currently recognizes only the “morpho” feature, which extracts morphological information if the extractor supports it.
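As an illustration, the sketch below assembles a typical call that updates English and French, keeps two dumps and enables the morpho feature. The helper name `update_cmd` and the jar path are placeholders; the helper only prints the command line so that it can be checked before being run.

```shell
# Placeholder path (an assumption); replace with the jar built by Maven.
DBNARY_JAR="path/to/dbnary-VERSIONNUMBER-jar-with-dependencies.jar"

# update_cmd DIR KEEP LANG...
# Prints (does not run) an UpdateAndExtractDumps command line that works in
# DIR, keeps KEEP dumps and enables the "morpho" feature.
update_cmd() {
  dir="$1"
  keep="$2"
  shift 2
  echo "java -cp $DBNARY_JAR org.getalp.dbnary.cli.UpdateAndExtractDumps" \
       "--enable morpho -d $dir -k $keep $*"
}

update_cmd mainfolder 2 en fr
```

Adding -n to the printed command would skip the network check and only decompress and extract the dumps that are already present.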

ExtractWiktionary

To fully extract a single, already prepared dump, use the ExtractWiktionary command.

usage: java -cp /path/to/dbnary.jar
            org.getalp.dbnary.cli.ExtractWiktionary [OPTIONS] dumpFile
With OPTIONS in:
 -f <arg>            Output format (graphml, raw, rdf, turtle, ntriple,
                     n3, ttl or rdfabbrev). ttl by default.
 -h,--help           Prints usage and exits.
 -l <arg>            Language (fra, eng, deu or por). fra by default.
 -M,--morpho <file>  Output file for morphology data. Undefined by
                     default.
 -m <arg>            Ontology Model used (lmf or lemon). Only useful
                     with rdf base formats. lemon by default.
 -o <arg>            Output file. extract by default
 -s                  Add a unique suffix to output file.
 -x                  Extract foreign Languages
 -z <arg>            Compress the output using bzip2 (value: yes/no or
                     true/false). no by default.
dumpFile must be a Wiktionary dump file in UTF-16 encoding. dumpFile
directory must be writable to store the index.

If the -M or --morpho argument is provided then the extractor will try to extract morphological data and dump the resulting graph in the provided file.
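When ExtractWiktionary is called by hand, the output file has to be named explicitly with -o. The sketch below derives a name that follows the pattern visible in the layout example (for instance `en_dbnary_lemon_20141223.ttl` for `enwkt-20141223.xml`). That pattern is inferred from the example rather than guaranteed by the tool, and the helper name `extract_name` is hypothetical.

```shell
# extract_name DUMPFILE
# Maps e.g. dumps/en/enwkt-20141223.xml to en_dbnary_lemon_20141223.ttl.
extract_name() {
  base=$(basename "$1" .xml)   # enwkt-20141223
  lang=${base%%wkt-*}          # en
  stamp=${base##*-}            # 20141223
  echo "${lang}_dbnary_lemon_${stamp}.ttl"
}

dump=mainfolder/dumps/en/enwkt-20141223.xml
target=mainfolder/extracts/lemon/en/$(extract_name "$dump")

# Print the resulting command line; -l, -f and -o are the options listed
# in the usage text of ExtractWiktionary.
echo "java -cp /path/to/dbnary.jar org.getalp.dbnary.cli.ExtractWiktionary -l eng -f ttl -o $target $dump"
```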

GetExtractedSemnet

To extract the lexical network from a specific entry or set of entries, use the GetExtractedSemnet command.

usage: java -cp /path/to/dbnary.jar
            org.getalp.dbnary.cli.GetExtractedSemnet [OPTIONS] dumpFile
            entryname ...
With OPTIONS in:
 -f <arg>      Output format (graphml, raw, rdf, turtle, ntriple, n3, ttl
               or rdfabbrev). ttl by default.
 -h            Prints usage and exits.
 -l <arg>      Language (fr, en, it, pt, de, fi or ru). fra by default.
 -M,--morpho   extract morphology data.
 -m <arg>      Ontology Model used (lmf or lemon). Only useful with rdf
               base formats. lemon by default.
 -x            Extract foreign languages
dumpFile must be a Wiktionary dump file in UTF-16 encoding. dumpFile
directory must be writable to store the index.
Displays the extracted semnet of the wiktionary page(s) named "entryname",
...

This command writes the lexical network to stdout in the requested format. If the -M or --morpho option is given, the extractor also tries to extract morphological data and appends the result to stdout.

This command is mainly useful for debugging.

GetRawEntry

To display the raw text of the wiktionary page named “entryname”, use the GetRawEntry command.

Usage:
java -cp /path/to/dbnary.jar
            org.getalp.dbnary.cli.GetRawEntry [OPTIONS] wiktionaryDumpFile entryname ...
OPTIONS:
  --all (-a): Display all the xml elements defining the page.
  --        : Stops the sequence of options and starts the sequence of entrynames.
              This option is useful when the wiktionaryDumpFile begins with a "-".

This command is useful for viewing the content of a page.

GrepInWiktionary

To display the titles of the wiktionary pages whose text contains “pattern”, use the GrepInWiktionary command.

Usage:
java -cp /path/to/dbnary.jar
		org.getalp.dbnary.cli.GrepInWiktionary pattern wiktionaryDumpFile
OPTIONS:
  --all (-a): Display all the xml elements defining the page.
  --        : Stops the sequence of options and starts the sequence of entrynames.
              This option is useful when the wiktionaryDumpFile begins with a "-".

This command is useful for finding the pages that contain “pattern”.