Package org.getalp.dbnary.languages
Class AbstractWiktionaryExtractor
- java.lang.Object
  - org.getalp.dbnary.languages.AbstractWiktionaryExtractor
- All Implemented Interfaces:
IWiktionaryExtractor
- Direct Known Subclasses:
FunctionalWiktionaryExtractor, WiktionaryExtractor (25 entries share this simple name; the package qualifiers that distinguish them are not shown)
public abstract class AbstractWiktionaryExtractor extends Object implements IWiktionaryExtractor
-
-
Field Summary
Modifier and Type                      Field
protected static String                debutOrfinDecomPatternString
protected static Map<String,String>    NON_STANDARD_LANGUAGE_MAPPINGS
protected String                       pageContent
protected IWiktionaryDataHandler       wdh
protected WiktionaryPageSource         wi
protected static Pattern               xmlCommentPattern
-
Constructor Summary
Constructor
AbstractWiktionaryExtractor(IWiktionaryDataHandler wdh)
-
Method Summary
Modifier and Type   Method and Description
static String       cleanUpMarkup(String group)
static String       cleanUpMarkup(String str, boolean humanReadable)
                    Cleans up the Wiktionary markup from a string (str is the string to be cleaned up).
protected int       computeRegionEnd(int blockStart, Matcher m)
void                computeStatistics(String dumpVersion)
static String       convertToHumanReadableForm(String def)
abstract void       extractData()
void                extractData(String wiktionaryPageName, String pageContent)
void                extractDefinition(String definition, int defLevel)
void                extractDefinition(Matcher definitionMatcher)
protected void      extractDefinitions(int startOffset, int endOffset)
void                extractExample(String example)
void                extractExample(Matcher definitionMatcher)
protected void      extractNyms(String synRelation, int startOffset, int endOffset)
protected void      extractOrthoAlt(int startOffset, int endOffset)
boolean             filterOutPage(String pagename)
static String       getHumanReadableForm(String id)
protected String    getWiktionaryPageName()
void                populateMetadata(String dumpFilename, String extractorVersion)
void                postProcessData(String dumpFileVersion)
void                postProcessModel(org.apache.jena.rdf.model.Model enhancementModel, org.apache.jena.rdf.model.Model sourceModel, String dumpFileVersion)
static String       removeXMLComments(String s)
void                setWiktionaryIndex(WiktionaryPageSource wi)
protected void      setWiktionaryPageName(String wiktionaryPageName)
static String       stripParentheses(String s)
protected String    validateAndStandardizeLanguageCode(String language)
                    Standardizes a Wiktionary language code into a "valid" language code.
-
-
-
Field Detail
-
pageContent
protected String pageContent
-
wdh
protected IWiktionaryDataHandler wdh
-
wi
protected WiktionaryPageSource wi
-
debutOrfinDecomPatternString
protected static final String debutOrfinDecomPatternString
-
xmlCommentPattern
protected static final Pattern xmlCommentPattern
-
-
Constructor Detail
-
AbstractWiktionaryExtractor
public AbstractWiktionaryExtractor(IWiktionaryDataHandler wdh)
-
-
Method Detail
-
setWiktionaryIndex
public void setWiktionaryIndex(WiktionaryPageSource wi)
- Specified by:
setWiktionaryIndex
in interface IWiktionaryExtractor
-
getWiktionaryPageName
protected String getWiktionaryPageName()
-
setWiktionaryPageName
protected void setWiktionaryPageName(String wiktionaryPageName)
-
extractData
public void extractData(String wiktionaryPageName, String pageContent)
- Specified by:
extractData
in interface IWiktionaryExtractor
-
filterOutPage
public boolean filterOutPage(String pagename)
- Parameters:
pagename - the name of the page
- Returns:
true iff the page should be ignored during extraction
-
extractData
public abstract void extractData()
-
extractDefinitions
protected void extractDefinitions(int startOffset, int endOffset)
-
extractDefinition
public void extractDefinition(Matcher definitionMatcher)
-
extractDefinition
public void extractDefinition(String definition, int defLevel)
-
extractExample
public void extractExample(Matcher definitionMatcher)
-
validateAndStandardizeLanguageCode
protected String validateAndStandardizeLanguageCode(String language)
Standardizes a Wiktionary language code into a "valid" language code. Language editions use codes that may differ from ISO 639-3. Some of these codes refer to languages that are not represented in the ISO standard, and some may lead to invalid Turtle dumps, either because IRIs become malformed or because the language tag of string values is invalid. This common implementation only considers ISO language codes.
Language-specific extractors may override this method or simply add new languages to the NON_STANDARD_LANGUAGE_MAPPINGS map.
- Parameters:
language - the language code to be checked
- Returns:
the String holding the standardized representation of the language (usable as a language tag in RDF), or null if language is invalid
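The contract described above (return a standardized code, or null for an invalid one) can be illustrated with a minimal sketch. This is not the DBnary implementation: the class name, the KNOWN_ISO_CODES sample set, the lookup order, and the single mapping entry are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; none of these names come from DBnary.
class LanguageCodeSketch {

  // Stand-in for a NON_STANDARD_LANGUAGE_MAPPINGS-style map (contents are illustrative).
  static final Map<String, String> NON_STANDARD = new HashMap<>();

  // Tiny sample of accepted codes; a real check would cover the full ISO 639-3 table.
  static final Set<String> KNOWN_ISO_CODES =
      new HashSet<>(Arrays.asList("eng", "fra", "deu", "nan"));

  static {
    // Illustrative entry: a wiki-style code mapped to an ISO 639-3 code.
    NON_STANDARD.put("zh-min-nan", "nan");
  }

  // Returns a standardized code, or null when the code is not recognized.
  // Assumption: explicit mappings are consulted before the ISO check.
  static String standardize(String language) {
    if (language == null) {
      return null;
    }
    String code = language.trim().toLowerCase(Locale.ROOT);
    String mapped = NON_STANDARD.get(code);
    if (mapped != null) {
      return mapped;
    }
    return KNOWN_ISO_CODES.contains(code) ? code : null;
  }

  public static void main(String[] args) {
    System.out.println(standardize("fra"));
    System.out.println(standardize("zh-min-nan"));
    System.out.println(standardize("not a code"));
  }
}
```

Returning null rather than throwing mirrors the documented return value, so callers can skip entries whose language tag would otherwise produce an invalid RDF literal.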
-
extractExample
public void extractExample(String example)
-
cleanUpMarkup
public static String cleanUpMarkup(String str, boolean humanReadable)
Cleans up the Wiktionary markup from a string. str is the string to be cleaned up; the result depends on the value of humanReadable.
Wiktionary macros are always discarded.
XML/XHTML comments are always discarded.
Wiktionary links are modified depending on the value of humanReadable.
For example, given str = "{{a Macro}} will be [[discard]]ed and [[feed|fed]] to the [[void]].":
if humanReadable is true, the result is "will be discarded and fed to the void."
if humanReadable is false, the result is "will be #{discard|discarded}# and #{feed|fed}# to the #{void|void}#."
- Parameters:
str - the String to be cleaned up
humanReadable - if true, links are rendered as plain text; if false, they are rewritten in the #{target|text}# form
- Returns:
the cleaned-up String
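The behavior documented above can be approximated with a few regular expressions. The following is a minimal sketch that reproduces the documented example, not the actual DBnary code: the class name, the patterns, and the handling of link trails (letters directly after the closing brackets) are assumptions, and nested macros are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented clean-up rules; not the DBnary implementation.
class MarkupCleanupSketch {

  // Non-nested {{...}} macros only (an assumption made to keep the sketch short).
  private static final Pattern MACRO = Pattern.compile("\\{\\{[^{}]*\\}\\}");

  // XML/XHTML comments, possibly spanning several lines.
  private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

  // [[target]] or [[target|label]], followed by an optional run of letters (link trail).
  private static final Pattern LINK =
      Pattern.compile("\\[\\[([^\\]|]*)(?:\\|([^\\]]*))?\\]\\](\\p{L}*)");

  static String cleanUp(String str, boolean humanReadable) {
    String s = COMMENT.matcher(str).replaceAll("");
    s = MACRO.matcher(s).replaceAll("");

    Matcher m = LINK.matcher(s);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String target = m.group(1);
      // The displayed text is the label (or the target) plus any trailing letters.
      String label = (m.group(2) != null ? m.group(2) : target) + m.group(3);
      String replacement = humanReadable ? label : "#{" + target + "|" + label + "}#";
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString().trim();
  }

  public static void main(String[] args) {
    String str = "{{a Macro}} will be [[discard]]ed and [[feed|fed]] to the [[void]].";
    System.out.println(cleanUp(str, true));
    System.out.println(cleanUp(str, false));
  }
}
```

On the documented input, the two calls in main print the two result strings given in the description above.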
-
extractOrthoAlt
protected void extractOrthoAlt(int startOffset, int endOffset)
-
computeRegionEnd
protected int computeRegionEnd(int blockStart, Matcher m)
-
extractNyms
protected void extractNyms(String synRelation, int startOffset, int endOffset)
-
postProcessData
public void postProcessData(String dumpFileVersion)
- Specified by:
postProcessData
in interface IWiktionaryExtractor
-
postProcessModel
public void postProcessModel(org.apache.jena.rdf.model.Model enhancementModel, org.apache.jena.rdf.model.Model sourceModel, String dumpFileVersion)
-
computeStatistics
public void computeStatistics(String dumpVersion)
- Specified by:
computeStatistics
in interface IWiktionaryExtractor
-
populateMetadata
public void populateMetadata(String dumpFilename, String extractorVersion)
- Specified by:
populateMetadata
in interface IWiktionaryExtractor
-
-