DBnary extractor’s Command Line Interface

Usage of the command line

The easiest way to run the extractor is through its command line interface, which is packaged as a Java application.

Either install the DBnary extractor using Homebrew or, if you are debugging the extractor, make sure that the dbnary shell script at YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin/dbnary comes first in your PATH. Note that this file only exists after a full mvn package run from the root folder of the dbnary source code.
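For a development build, the PATH setup described above could look like the following sketch. The variable names DBNARY_SRC and DBNARY_BIN are illustrative assumptions, and YOUR_SOURCE_DIRECTORY remains a placeholder to replace with your own checkout location.

```shell
# Assumption: DBNARY_SRC points at the root of your dbnary checkout.
# YOUR_SOURCE_DIRECTORY is a placeholder, not a real path.
DBNARY_SRC="${DBNARY_SRC:-YOUR_SOURCE_DIRECTORY/dbnary}"
DBNARY_BIN="$DBNARY_SRC/dbnary-commands/target/appassembler/bin"

# The launcher script is only generated by a full `mvn package` run.
if [ ! -x "$DBNARY_BIN/dbnary" ]; then
  echo "dbnary launcher not found; run 'mvn package' in $DBNARY_SRC first" >&2
fi

# Prepend the directory so this build is found before any other dbnary.
PATH="$DBNARY_BIN:$PATH"
export PATH

# Show which dbnary the shell will now pick up, if any.
command -v dbnary || echo "dbnary is not on the PATH yet" >&2
```

Prepending, rather than appending, matters here: a Homebrew-installed dbnary elsewhere on the PATH would otherwise shadow the development build.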

$ dbnary --version
3.0.9
$ dbnary help
Usage: dbnary [-hvV] [--dir=<dbnaryDir>] [--debug=<debug>[,<debug>...]]...
              [--trace=<trace>]... [@<filename>...] [COMMAND]
DBnary is a set of tools used to extract lexical data from several editions of
wiktionaries. All extracted data is made available as Linked Open Data, using
ontolex, lexinfo, olia and several other specialized vocabularies.
      [@<filename>...]    One or more argument files containing options.
      --debug=<debug>[,<debug>...]

      --dir=<dbnaryDir>
  -h, --help              Show this help message and exit.
      --trace=<trace>
  -v                      Print extra information.
  -V, --version           Print version information and exit.

Commands:

The dbnary commands are:
  check    check the mediawiki syntax of all pages of a dump.
  extract  extract all pages from a dump and write resulting RDF files.
  help     Displays help information about the specified command
  update   Update dumps for all specified languages, then extract them.
  sample   extract the specified pages from a dump and write resulting RDF
             files to stdout.
  tree     Parse the specified entries wikitext and display the parse tree to
             stdout.
  source   get the wikitext source of the specified pages.
  compare  fetch and compare extracts from different dates.
  grep     grep a given pattern in all pages of a dump.

Every subcommand is also documented through the help subcommand, for example:

$ dbnary help grep
grep a given pattern in all pages of a dump.
Usage: dbnary grep [-hlvV] [--all-matches] [--[no-]compress] [--[no-]tdb]
                   [--plain] [--dir=<dbnaryDir>] [-F=NUMBER] [-T=NUMBER]
                   [--debug=<debug>[,<debug>...]]... [--trace=<trace>]...
                   <dumpFile> <pattern>
This command looks for a given pattern in all pages of a dump and output the
matching pages.
      <dumpFile>          The dump file of the wiki to be extracted.
      <pattern>           The pattern to be searched for.
      --all-matches       show all matches.
      --debug=<debug>[,<debug>...]

      --dir=<dbnaryDir>
  -F, --frompage=NUMBER   Begin the extraction at the specified page number.
  -h, --help              Show this help message and exit.
  -l, --pagename          only show the name of the page.
      --[no-]compress     Compress the resulting extracted files using BZip2.
                            set by default.
      --[no-]tdb          Use TDB2 (temporary file storage for extracted
                            models, usefull/necessary for big dumps. set by
                            default.
      --plain             match is displayed without specific formatting.
  -T, --topage=NUMBER     Stop the extraction at the specified page number.
      --trace=<trace>
  -v                      Print extra information.
  -V, --version           Print version information and exit.
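As an illustration of how the options listed above combine, the sketch below wraps dbnary grep in a small shell function that prints only the names of matching pages within a page range. The function name grep_page_names and its argument order are hypothetical; the options themselves (--pagename, --frompage, --topage) are taken from the help output above, and the exact pattern syntax accepted by the command is not specified here.

```shell
# Hypothetical wrapper, not part of DBnary.
# Usage: grep_page_names DUMP_FILE PATTERN FROM_PAGE TO_PAGE
grep_page_names() {
  dump_file=$1
  pattern=$2
  from=$3
  to=$4
  # --pagename: only show the name of each matching page.
  # --frompage / --topage: restrict the scan to a page number range.
  dbnary grep --pagename --frompage="$from" --topage="$to" \
    "$dump_file" "$pattern"
}

# Example call (file name and pattern are placeholders):
#   grep_page_names dumps/fr/frwkt-20141221.xml 'some pattern' 1 1000
```

Restricting the page range is mostly useful when testing a pattern, since scanning a complete dump can take a while.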

Organisation of the dump files

Most DBnary commands assume that a language dump file is available to extract data from. Wiktionary dumps can be downloaded from dumps.wikimedia.org or from a mirror, and must then be converted to UTF-16. The dbnary update subcommand does this automatically.

The usual file layout is as follows:

mainfolder
 +- dumps
 |   +- en
 |   |   +- enwkt-20141223.xml
 |   +- fr
 |   |   +- frwkt-20141221.xml
 |   ...
 +- extracts
     +- ontolex
         +- en
         |  +- en_dbnary_ontolex_20201120.ttl
         |  +- en_dbnary_ontolex_20201220.ttl
         +- fr
         |  +- fr_dbnary_ontolex_20201201.ttl
         |  +- fr_dbnary_ontolex_20201221.ttl
         ...
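The naming scheme in this layout can be captured in a few shell helpers, which is handy when scripting around the extractor. This is only a sketch that mirrors the tree shown above: the function names dump_path, extract_path and make_layout are hypothetical and are not provided by DBnary.

```shell
# Hypothetical helpers mirroring the layout above; not part of DBnary.

# dump_path MAINFOLDER LANG DATE
# Prints the expected location of a dump, e.g. dumps/en/enwkt-20141223.xml
dump_path() {
  printf '%s/dumps/%s/%swkt-%s.xml\n' "$1" "$2" "$2" "$3"
}

# extract_path MAINFOLDER LANG DATE
# Prints the expected location of an ontolex extract.
extract_path() {
  printf '%s/extracts/ontolex/%s/%s_dbnary_ontolex_%s.ttl\n' "$1" "$2" "$2" "$3"
}

# make_layout MAINFOLDER LANG...
# Creates the per-language dump and extract directories.
make_layout() {
  root=$1
  shift
  for lang in "$@"; do
    mkdir -p "$root/dumps/$lang" "$root/extracts/ontolex/$lang"
  done
}

# Example:
#   make_layout mainfolder en fr
#   dump_path mainfolder en 20141223
```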