Class RefactoredTableExtractor

    • Constructor Detail

      • RefactoredTableExtractor

        public RefactoredTableExtractor​(String entryName,
                                        String language,
                                        List<String> context)
    • Method Detail

      • getCell

        protected org.jsoup.nodes.Element getCell​(int i,
                                                  int j)
      • parseTable

        public Set<LexicalForm> parseTable​(org.jsoup.nodes.Element tableElement)
      • shouldProcessCell

        protected boolean shouldProcessCell​(org.jsoup.nodes.Element cell)
        Returns true if the cell should be processed by the extractor. This method is called for normal cells only, never for header cells; it allows subclasses to further filter out cells based on their content.
        Parameters:
        cell - the td element to be examined
        Returns:
        true if the cell should be processed
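        The filtering hook described above could be overridden along these lines. This is only a sketch: the class names are hypothetical, the cell is modelled as its plain text rather than an org.jsoup.nodes.Element, and the placeholder rule (skip empty cells and cells containing only "-") is an assumption chosen for illustration.

```java
// Hypothetical stand-in for the extractor: the real method receives a jsoup Element.
class BaseExtractorSketch {
    // Default behaviour in this sketch: every non-header cell is processed.
    protected boolean shouldProcessCell(String cellText) {
        return true;
    }
}

// A subclass that filters out cells based on their content.
final class SkipPlaceholderExtractorSketch extends BaseExtractorSketch {
    @Override
    protected boolean shouldProcessCell(String cellText) {
        String text = cellText.trim();
        // Assumed rule: empty cells and "-" placeholders carry no word form.
        return !text.isEmpty() && !text.equals("-");
    }
}
```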
      • handleSimpleCell

        protected Set<LexicalForm> handleSimpleCell​(int i,
                                                    int j,
                                                    org.jsoup.nodes.Element cell,
                                                    List<String> context)
      • handleNestedTables

        protected Set<LexicalForm> handleNestedTables​(int i,
                                                      int j,
                                                      org.jsoup.nodes.Element cell,
                                                      List<String> context)
      • isHeaderCell

        protected boolean isHeaderCell​(org.jsoup.nodes.Element cell)
      • getRowAndColumnContext

        protected List<String> getRowAndColumnContext​(int nrow,
                                                      int ncol,
                                                      ArrayMatrix<org.jsoup.nodes.Element> columnHeaders)
      • addToContext

        protected boolean addToContext​(ArrayMatrix<org.jsoup.nodes.Element> columnHeaders,
                                       int i,
                                       int j,
                                       List<String> res)
      • getLexicalFormsFromCell

        protected Set<LexicalForm> getLexicalFormsFromCell​(int i,
                                                           int j,
                                                           org.jsoup.nodes.Element cell,
                                                           List<String> context)
        Returns the set of lexical forms that correspond to the current cell and its context.

        The context is a list of Strings corresponding to all the column headers, row headers and section headers under which the cell appears.

        Parameters:
        i - the row number of the cell in the table
        j - the column number of the cell in the table
        context - a list of Strings that represent the cell context
        Returns:
        The set of lexical forms corresponding to the context
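        To make the notion of cell context concrete, here is a minimal sketch of how such a list might be assembled. It is not the class's actual algorithm: the table is modelled as a String grid, header cells are marked with a leading '#', and the lookup (headers above the cell in its column, then headers to its left in its row) is an assumption. Section headers are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class CellContextSketch {
    // Assumption for this toy grid: a header cell starts with '#'.
    static boolean isHeader(String cell) {
        return cell.startsWith("#");
    }

    static List<String> contextOf(String[][] grid, int i, int j) {
        List<String> context = new ArrayList<>();
        // Column headers: header cells above row i in column j.
        for (int r = 0; r < i; r++) {
            if (isHeader(grid[r][j])) {
                context.add(grid[r][j].substring(1));
            }
        }
        // Row headers: header cells to the left of column j in row i.
        for (int c = 0; c < j; c++) {
            if (isHeader(grid[i][c])) {
                context.add(grid[i][c].substring(1));
            }
        }
        return context;
    }

    public static void main(String[] args) {
        String[][] grid = {
            {"#", "#singular", "#plural"},
            {"#nominative", "rosa", "rosae"},
            {"#genitive", "rosae", "rosarum"},
        };
        // Context of the genitive singular cell.
        System.out.println(contextOf(grid, 2, 1));
    }
}
```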
      • getInflectionSchemeFromContext

        protected abstract InflectionScheme getInflectionSchemeFromContext​(List<String> context)
        Returns the inflection scheme that corresponds to the current cell context.

        The cell context is a list of Strings corresponding to all the column headers, row headers and section headers under which the cell appears.

        Parameters:
        context - a list of Strings that represent the cell context
        Returns:
        The inflection scheme corresponding to the context
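        Since this method is abstract, each subclass decides how header strings map to an inflection. A minimal sketch of one possible mapping follows; the feature table, the class name and the use of a Set of Strings in place of InflectionScheme are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ContextMappingSketch {
    // Hypothetical header-to-feature table; a real subclass defines its own.
    private static final Map<String, String> FEATURES = Map.of(
        "singular", "number=singular",
        "plural", "number=plural",
        "nominative", "case=nominative",
        "genitive", "case=genitive");

    // Stand-in for getInflectionSchemeFromContext: returns a set of feature strings.
    static Set<String> schemeFromContext(List<String> context) {
        Set<String> scheme = new TreeSet<>();
        for (String header : context) {
            String feature = FEATURES.get(header.trim().toLowerCase(Locale.ROOT));
            if (feature != null) {
                scheme.add(feature); // headers without a known feature are ignored
            }
        }
        return scheme;
    }
}
```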
      • getInflectedForms

        protected Set<LexicalForm> getInflectedForms​(org.jsoup.nodes.Element cell,
                                                     InflectionScheme infl)
        Extracts word forms from a table cell.
        Splits the cell content on <br/> tags or commas and removes HTML formatting.
        Parameters:
        cell - the current cell in the inflection table
        infl - the inflection scheme corresponding to the current cell
        Returns:
        The set of word forms extracted from this cell, as LexicalForm objects
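        The splitting step described above can be sketched with the standard library alone. This is an approximation under stated assumptions: it works on a raw HTML string instead of a jsoup Element, strips tags with a simple regular expression, and returns plain Strings rather than LexicalForm objects. The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CellSplitSketch {
    // Splits cell HTML into candidate word forms, preserving their order.
    static Set<String> splitForms(String cellHtml) {
        Set<String> forms = new LinkedHashSet<>();
        // Split on <br>, <br/> or <br /> (case-insensitive), or on commas.
        for (String part : cellHtml.split("(?i)<br\\s*/?>|,")) {
            // Drop any remaining tags, then trim surrounding whitespace.
            String text = part.replaceAll("<[^>]*>", "").trim();
            if (!text.isEmpty()) {
                forms.add(text);
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        // prints [chante, chantes, chantons]
        System.out.println(splitForms("<b>chante</b>, <i>chantes</i><br/>chantons"));
    }
}
```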
      • elementIsAValidForm

        protected boolean elementIsAValidForm​(org.jsoup.nodes.Element anchor)
      • standardizeValue

        protected String standardizeValue​(String value)