Class AbstractWiktionaryExtractor

    • Method Detail

      • getWiktionaryPageName

        protected String getWiktionaryPageName()
      • setWiktionaryPageName

        protected void setWiktionaryPageName​(String wiktionaryPageName)
      • removeXMLComments

        public static String removeXMLComments​(String s)
      • filterOutPage

        public boolean filterOutPage​(String pagename)
        Parameters:
        pagename - the name of the page
        Returns:
        true iff the page should be ignored during extraction
      • extractData

        public abstract void extractData()
      • extractDefinitions

        protected void extractDefinitions​(int startOffset,
                                          int endOffset)
      • extractDefinition

        public void extractDefinition​(Matcher definitionMatcher)
      • extractDefinition

        public void extractDefinition​(String definition,
                                      int defLevel)
      • cleanUpMarkup

        public static String cleanUpMarkup​(String group)
      • extractExample

        public void extractExample​(Matcher definitionMatcher)
      • validateAndStandardizeLanguageCode

        protected String validateAndStandardizeLanguageCode​(String language)
        Standardizes a Wiktionary language code into a "valid" language code. Language editions use codes that may differ from ISO 639-3. Some of these codes refer to languages that are not represented in the ISO standard, and some may lead to invalid Turtle dumps, either because IRIs become malformed or because the language tag of string values is invalid.

        In this common implementation, we only consider ISO language codes.

        Language extractors may refine this method or simply add new languages to the NON_STANDARD_LANGUAGE_MAPPINGS map.

        Parameters:
        language - the language code to be checked
        Returns:
        the standardized representation of the language code (usable as a language tag in RDF), or null if the language is invalid
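        The contract described above can be illustrated with a minimal sketch. This is not the library's implementation: the class name, the tiny KNOWN_ISO_CODES set, and the single mapping entry are hypothetical stand-ins chosen for illustration, and the lookup order (mapping first, then ISO check) is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LanguageCodeSketch {
    // Hypothetical stand-in for a full ISO 639-3 lookup (only three codes here).
    private static final Set<String> KNOWN_ISO_CODES =
            new HashSet<>(Arrays.asList("eng", "fra", "deu"));

    // Hypothetical stand-in for NON_STANDARD_LANGUAGE_MAPPINGS.
    // The entry below is invented for illustration only.
    private static final Map<String, String> NON_STANDARD_MAPPINGS = new HashMap<>();
    static {
        NON_STANDARD_MAPPINGS.put("my-edition-code", "eng");
    }

    /** Returns a code usable as an RDF language tag, or null if the code is invalid. */
    static String standardize(String language) {
        if (language == null) {
            return null;
        }
        String code = language.trim();
        // Assumption: edition-specific codes are translated before the ISO check.
        String mapped = NON_STANDARD_MAPPINGS.get(code);
        if (mapped != null) {
            code = mapped;
        }
        return KNOWN_ISO_CODES.contains(code) ? code : null;
    }

    public static void main(String[] args) {
        System.out.println(standardize("eng"));             // eng
        System.out.println(standardize("my-edition-code")); // eng
        System.out.println(standardize("not a code"));      // null
    }
}
```

        Returning null rather than throwing lets a caller skip an entry with an unusable language code instead of aborting the whole extraction.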
      • extractExample

        public void extractExample​(String example)
      • cleanUpMarkup

        public static String cleanUpMarkup​(String str,
                                           boolean humanReadable)
        Cleans up the Wiktionary markup in a string. The result depends on the value of humanReadable.

        Wiktionary macros are always discarded. XML/XHTML comments are always discarded. Wiktionary links are rewritten depending on the value of humanReadable.

        For example, given str = "{{a Macro}} will be [[discard]]ed and [[feed|fed]] to the [[void]].":
        if humanReadable is true, the result is "will be discarded and fed to the void."
        if humanReadable is false, the result is "will be #{discard|discarded}# and #{feed|fed}# to the #{void|void}#."
        Parameters:
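        str - the String to be cleaned up
        humanReadable - if true, links are replaced by their display text; if false, links are rewritten in the #{target|text}# form
        Returns:
        the cleaned-up String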
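        The documented behavior can be reproduced with a short regex-based sketch. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, nested macros are not handled, and the result is trimmed so that the documented example output is matched exactly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CleanUpMarkupSketch {
    // XML/XHTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Non-nested macros such as {{a Macro}} (assumption: nesting is ignored).
    private static final Pattern MACRO = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[target]] or [[target|text]], plus letters glued to the link ("[[discard]]ed").
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]*))?\\]\\](\\p{L}*)");

    static String cleanUp(String str, boolean humanReadable) {
        String s = COMMENT.matcher(str).replaceAll("");
        s = MACRO.matcher(s).replaceAll("");

        Matcher m = LINK.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String target = m.group(1);
            // Display text defaults to the target; trailing letters join the text.
            String text = (m.group(2) != null ? m.group(2) : target) + m.group(3);
            String replacement = humanReadable ? text : "#{" + target + "|" + text + "}#";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String str = "{{a Macro}} will be [[discard]]ed and [[feed|fed]] to the [[void]].";
        // prints: will be discarded and fed to the void.
        System.out.println(cleanUp(str, true));
        // prints: will be #{discard|discarded}# and #{feed|fed}# to the #{void|void}#.
        System.out.println(cleanUp(str, false));
    }
}
```

        Matcher.quoteReplacement is needed because the non-human-readable form contains characters that appendReplacement would otherwise interpret.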
      • convertToHumanReadableForm

        public static String convertToHumanReadableForm​(String def)
      • getHumanReadableForm

        public static String getHumanReadableForm​(String id)
      • extractOrthoAlt

        protected void extractOrthoAlt​(int startOffset,
                                       int endOffset)
      • computeRegionEnd

        protected int computeRegionEnd​(int blockStart,
                                       Matcher m)
      • extractNyms

        protected void extractNyms​(String synRelation,
                                   int startOffset,
                                   int endOffset)
      • stripParentheses

        public static String stripParentheses​(String s)
      • postProcessModel

        public void postProcessModel​(org.apache.jena.rdf.model.Model enhancementModel,
                                     org.apache.jena.rdf.model.Model sourceModel,
                                     String dumpFileVersion)