Setting up the development environment for the DBnary extractor

These are quick instructions to help you set up the development environment with which you may launch DBnary extraction programs and eventually debug the extractors.

Install Java 8 + IntelliJ IDEA

The IDE is not strictly required: as the DBnary project uses maven, you can choose to use any code editor (emacs, Sublime Text, you name it) and simply launch maven commands in the terminal. You may also use any IDE you like, but IDEA already contains most of the tools you need.

You’ll need

  • Java 8
  • maven 3
  • git
  • a Java IDE supporting git and maven, or a text editor + console

I won’t help you set up these tools as they are mainstream tools and usually, if you cannot install them, you cannot use them…

Clone the project

DBnary is available on Bitbucket: https://bitbucket.org/serasset/dbnary/

Note that I use the git flow development scheme, hence the most up-to-date version is available in the development branch and the master branch is usually out of sync with the kaiko server.

Compile

Command line version

cd dbnary
mvn install

The good thing with maven is that it pulls all the necessary libraries and I use tons of them.

IDE version

In IntelliJ you just have to open the pom.xml file in the root of the project, or open the project folder. As it supports maven, it will fetch all libraries and allow you to browse the code without any problem.

Sometimes you’ll have to run mvn install in a command line to allow the IDE to work correctly.

Run

Running the extractor is a matter of launching the command line interface (CLI) programs provided by DBnary. Here is an (almost correct) description of the commands. When launched without any argument, these commands fail with an explanation of the arguments they expect.

Debug

If you can run it in the IDE, you just have to debug it after adding some breakpoints. There are command lines to launch a full extraction and others to extract a set of specific pages. The latter are particularly well suited to debugging the extraction process.

Understanding the program

Well, this will depend on your skill in Java. However, there are a few things to understand in order to find your way in this rather complicated program.

  • One project, several sub-projects
    • dbnary-ontology : makes the DBnary, ontolex, and other RDF vocabularies available as Java constants to ensure correct production of data.
    • build-tools : tools that allow the previous project to compile (you should not go there unless you know what you are doing).
    • dbnary-commons : a common API to manipulate languages
    • dbnary-extractor : all extraction work is done here !!! This is probably what you are looking for.
    • dbnary-enhancer : adds some enhancement to the extracted data after it has been extracted.
  • dbnary-extractor module, org.getalp.dbnary namespace
    • cli : the command line programs
    • tools, wiki: contain some common classes to maintain the original Wiktionary data or parse the MediaWiki content
    • eng, fra, deu, …: contain the code specific to the extraction of one language.
      The classes found here are subclasses of the generic extraction classes. They are automatically instantiated when the corresponding language is extracted.

      • WiktionaryExtractor: the main extraction class, it is supposed to process a page (identify the different sections of the page and extract the corresponding data)
      • WiktionaryDataHandler: this implements the high level calls of the preceding class that are supposed to create elements and relations of the extracted data.
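
To make the split between the two classes more concrete, here is a minimal, self-contained sketch of the pattern described above: a generic extractor walks a page and delegates to a data handler, and a language-specific subclass decides what each section means. Every name in it (DataHandlerSketch, RecordingHandler, ExtractorSketch, EnglishExtractorSketch, ExtractorFactorySketch) is hypothetical and made up for illustration. It is not the actual DBnary API, the section parsing is deliberately naive, and the way a language code selects a subclass is only an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout: this is NOT the real DBnary API.

/** Small demo: extract a toy page and print the calls received by the handler. */
class ExtractionSketchDemo {
  public static void main(String[] args) {
    RecordingHandler handler = new RecordingHandler();
    ExtractorSketch extractor = ExtractorFactorySketch.forLanguage("eng", handler);
    String page = "==English==\n===Noun===\n# A small domesticated feline.\n"
        + "===Verb===\n# To hoist an anchor.";
    extractor.extractPage("cat", page);
    for (String event : handler.events) {
      System.out.println(event);
    }
  }
}

/** High-level calls made by the extractor; implementations build the output data. */
interface DataHandlerSketch {
  void startPage(String title);
  void addPartOfSpeech(String pos);
  void addDefinition(String text);
}

/** Toy handler that records the calls as strings instead of producing RDF. */
class RecordingHandler implements DataHandlerSketch {
  final List<String> events = new ArrayList<>();
  @Override public void startPage(String title) { events.add("page:" + title); }
  @Override public void addPartOfSpeech(String pos) { events.add("pos:" + pos); }
  @Override public void addDefinition(String text) { events.add("def:" + text); }
}

/** Generic extractor: finds section headers and definition lines in a page. */
class ExtractorSketch {
  // Matches wiki-style headers such as "==English==" or "===Noun===".
  private static final Pattern HEADER = Pattern.compile("^==+\\s*(.+?)\\s*==+$");
  protected final DataHandlerSketch handler;

  ExtractorSketch(DataHandlerSketch handler) { this.handler = handler; }

  void extractPage(String title, String wikitext) {
    handler.startPage(title);
    for (String line : wikitext.split("\n")) {
      Matcher m = HEADER.matcher(line.trim());
      if (m.matches()) {
        handleSection(m.group(1));
      } else if (line.startsWith("# ")) {
        handler.addDefinition(line.substring(2).trim());
      }
    }
  }

  /** Language-specific subclasses override this to interpret section titles. */
  protected void handleSection(String sectionTitle) { /* generic: ignore */ }
}

/** Language-specific subclass with toy rules for English section titles. */
class EnglishExtractorSketch extends ExtractorSketch {
  EnglishExtractorSketch(DataHandlerSketch handler) { super(handler); }

  @Override protected void handleSection(String sectionTitle) {
    if (sectionTitle.equals("Noun") || sectionTitle.equals("Verb")) {
      handler.addPartOfSpeech(sectionTitle);
    }
  }
}

/** Assumed lookup from a language code to its extractor subclass. */
class ExtractorFactorySketch {
  static ExtractorSketch forLanguage(String lang, DataHandlerSketch handler) {
    if ("eng".equals(lang)) {
      return new EnglishExtractorSketch(handler);
    }
    return new ExtractorSketch(handler); // fallback: generic behaviour
  }
}
```

In this sketch, debugging a language amounts to putting a breakpoint in the overridden handleSection method and running the extractor on a single page, which mirrors the advice given in the Debug section.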

Happy coding and debugging.
