DBnary extractor’s Command Line Interface
Usage of the command line
The easiest way to run the extractor is through its command line interface, which is packaged as a Java application.
Either install the DBnary extractor using Homebrew or, if you are debugging the extractor, make sure the dbnary shell script located at YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin/dbnary is the first one found on your PATH. Note that this file only exists after a full mvn package run from the root folder of the dbnary source code.
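For the debugging setup, the PATH adjustment can be sketched as a small shell function. This is only an illustration: use_dev_dbnary is a hypothetical helper that does not ship with DBnary, and the checkout location passed to it is an assumption. It refuses to touch PATH when the launcher has not been built yet.

```shell
#!/bin/sh
# Hypothetical helper (not part of DBnary): put the development launcher
# first on PATH. $1 is the dbnary checkout, i.e. YOUR_SOURCE_DIRECTORY/dbnary.
use_dev_dbnary() {
  bin_dir="$1/dbnary-commands/target/appassembler/bin"
  if [ ! -x "$bin_dir/dbnary" ]; then
    # The launcher is only generated by a full build from the source root.
    echo "no launcher in $bin_dir; run 'mvn package' in $1 first" >&2
    return 1
  fi
  # Prepend so this script wins over any other dbnary already on PATH.
  PATH="$bin_dir:$PATH"
  export PATH
  # Drop any cached lookup of a previously found dbnary.
  hash -r 2>/dev/null || true
}

# Usage (the checkout location is an assumption):
#   use_dev_dbnary "$HOME/src/dbnary" && command -v dbnary
```

Running command -v dbnary afterwards should print the path under target/appassembler/bin; if it prints another location, a different installation is still being picked up.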
$ dbnary --version
3.0.9
$ dbnary help
Usage: dbnary [-hvV] [--dir=<dbnaryDir>] [--debug=<debug>[,<debug>...]]...
              [--trace=<trace>]... [@<filename>...] [COMMAND]
DBnary is a set of tools used to extract lexical data from several editions of
wiktionaries. All extracted data is made available as Linked Open Data, using
ontolex, lexinfo, olia and several other specialized vocabularies.
      [@<filename>...]   One or more argument files containing options.
      --debug=<debug>[,<debug>...]
      --dir=<dbnaryDir>
  -h, --help             Show this help message and exit.
      --trace=<trace>
  -v                     Print extra information.
  -V, --version          Print version information and exit.
Commands:
The dbnary commands are:
  check    check the mediawiki syntax of all pages of a dump.
  extract  extract all pages from a dump and write resulting RDF files.
  help     Displays help information about the specified command
  update   Update dumps for all specified languages, then extract them.
  sample   extract the specified pages from a dump and write resulting RDF
             files to stdout.
  tree     Parse the specified entries wikitext and display the parse tree to
             stdout.
  source   get the wikitext source of the specified pages.
  compare  fetch and compare extracts from different dates.
  grep     grep a given pattern in all pages of a dump.
All subcommands are also documented using the help subcommand. E.g.
$ dbnary help grep
grep a given pattern in all pages of a dump.
Usage: dbnary grep [-hlvV] [--all-matches] [--[no-]compress] [--[no-]tdb]
                   [--plain] [--dir=<dbnaryDir>] [-F=NUMBER] [-T=NUMBER]
                   [--debug=<debug>[,<debug>...]]... [--trace=<trace>]...
                   <dumpFile> <pattern>
This command looks for a given pattern in all pages of a dump and output the
matching pages.
      <dumpFile>              The dump file of the wiki to be extracted.
      <pattern>               The pattern to be searched for.
      --all-matches           show all matches.
      --debug=<debug>[,<debug>...]
      --dir=<dbnaryDir>
  -F, --frompage=NUMBER       Begin the extraction at the specified page number.
  -h, --help                  Show this help message and exit.
  -l, --pagename              only show the name of the page.
      --[no-]compress         Compress the resulting extracted files using BZip2.
                                set by default.
      --[no-]tdb              Use TDB2 (temporary file storage for extracted
                                models, usefull/necessary for big dumps. set by
                                default.
      --plain                 match is displayed without specific formatting.
  -T, --topage=NUMBER         Stop the extraction at the specified page number.
      --trace=<trace>
  -v                          Print extra information.
  -V, --version               Print version information and exit.
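Since every subcommand is documented through dbnary help, the help texts can be collected in one pass for offline reading. The sketch below assumes dbnary is on your PATH; dump_dbnary_help and its default output directory are hypothetical names, and the command list is copied from the help output shown earlier (help itself is left out).

```shell
#!/bin/sh
# Hypothetical helper (not part of DBnary): save the help text of each
# subcommand to <out_dir>/<command>.txt.
dump_dbnary_help() {
  out_dir="${1:-dbnary-help}"
  mkdir -p "$out_dir" || return 1
  for cmd in check extract update sample tree source compare grep; do
    # "dbnary help <command>" is the documented way to get per-command help.
    dbnary help "$cmd" > "$out_dir/$cmd.txt" || echo "help failed for $cmd" >&2
  done
}

# Usage:
#   dump_dbnary_help ./dbnary-help && ls ./dbnary-help
```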
Organisation of the dump files
Most DBnary commands assume that a language dump file is available to extract data from. Wiktionary dumps can be downloaded from dumps.wikimedia.org or a mirror, and should then be converted to UTF-16 format. This is done automatically by the dbnary update subcommand.
The usual file layout is the following:
mainfolder
+- dumps
|  +- en
|  |  +- enwkt-20141223.xml
|  +- fr
|  |  +- frwkt-20141221.xml
|  ...
+- extracts
   +- ontolex
      +- en
      |  +- en_dbnary_ontolex_20201120.ttl
      |  +- en_dbnary_ontolex_20201220.ttl
      +- fr
      |  +- fr_dbnary_ontolex_20201201.ttl
      |  +- fr_dbnary_ontolex_20201221.ttl
      ...
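The layout can be expressed as a few path-building helpers, which is convenient when scripting around the extractor. This is a minimal sketch: the function names are hypothetical, and the file name patterns are inferred only from the example layout, so actual names may differ (for instance when output is compressed). Whether dbnary creates these folders itself is not stated here; init_layout simply prepares an empty skeleton.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DBnary). File name patterns are inferred
# from the example layout only.

# Path of the dump for a language edition and dump date (YYYYMMDD).
dump_path() {      # usage: dump_path ROOT LANG DATE
  printf '%s/dumps/%s/%swkt-%s.xml\n' "$1" "$2" "$2" "$3"
}

# Path of the ontolex extract for a language edition and extraction date.
extract_path() {   # usage: extract_path ROOT LANG DATE
  printf '%s/extracts/ontolex/%s/%s_dbnary_ontolex_%s.ttl\n' "$1" "$2" "$2" "$3"
}

# Create the empty directory skeleton for the given languages.
init_layout() {    # usage: init_layout ROOT LANG...
  root="$1"; shift
  for lang in "$@"; do
    mkdir -p "$root/dumps/$lang" "$root/extracts/ontolex/$lang" || return 1
  done
}

# Usage:
#   init_layout mainfolder en fr
#   dump_path mainfolder en 20141223
```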