cmkumar87/corvid

corvid

Table semantics and aggregation!

Installation

This project requires Python 3.6. We recommend you set up a conda environment:

conda create -n corvid python=3.6
source activate corvid

The dependencies are listed in the requirements.in file:

pip install -r requirements.in

After installing, you can run all the unit tests:

pytest tests/

Project structure

|-- corvid/
|   |-- table/
|   |   |-- table.py
|   |   |-- table_loader.py
|   |-- semantic_table/
|   |   |-- semantic_table.py
|   |   |-- evaluate.py
|   |-- table_aggregation/
|   |   |-- schema_matcher.py
|   |   |-- evaluate.py
|-- tests/
|-- requirements.in

A few important things:

  • table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.

  • semantic_table.py contains the SemanticTable class. It takes a Table object as input and learns a normalization of it, which can be accessed via .normalized_table.

  • schema_matcher.py contains the SchemaMatcher class. The .map_tables() method takes a list of Table objects and a target schema, and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .aggregate_tables() method uses these alignments to build a single aggregate Table.

  • evaluate.py contains an evaluate() function, which computes a suite of performance metrics for a given pair of gold and predicted Tables. The semantic_table and table_aggregation modules each have their own evaluate.py.

Usage / API

table

First, instantiate a Table object:

from corvid.table.table import Cell, Table
cells = [
    Cell(tokens=['a'], index_topleft_row=0, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['b'], index_topleft_row=0, index_topleft_col=1, rowspan=1, colspan=1),
    Cell(tokens=['c'], index_topleft_row=1, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['d'], index_topleft_row=1, index_topleft_col=1, rowspan=1, colspan=1),
]
table = Table(cells=cells, nrow=2, ncol=2)

You can inspect the Table and access its elements by indexing, as you would a 2D array:

# visualize
print(table)

# shape
table.nrow; table.ncol; table.dim

# indexing via grid
first_row = table[0,:]
first_col = table[:,0]
bottom_right_element = table[-1, -1]

# indexing via cells
first_cell = table[0]
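
To make the grid view concrete, here is a minimal, self-contained sketch that does not use corvid. ToyCell and build_grid are hypothetical names, and the sketch assumes that a cell with rowspan or colspan greater than 1 occupies every grid position it covers. That is one plausible reading of the Cell arguments above, not confirmed behavior of Table.

```python
class ToyCell:
    """Hypothetical stand-in for a table cell (not corvid's Cell)."""

    def __init__(self, tokens, index_topleft_row, index_topleft_col,
                 rowspan=1, colspan=1):
        self.tokens = tokens
        self.index_topleft_row = index_topleft_row
        self.index_topleft_col = index_topleft_col
        self.rowspan = rowspan
        self.colspan = colspan


def build_grid(cells, nrow, ncol):
    """Place each cell at every [i][j] position covered by its span."""
    grid = [[None] * ncol for _ in range(nrow)]
    for cell in cells:
        row_end = cell.index_topleft_row + cell.rowspan
        col_end = cell.index_topleft_col + cell.colspan
        for i in range(cell.index_topleft_row, row_end):
            for j in range(cell.index_topleft_col, col_end):
                if grid[i][j] is not None:
                    raise ValueError(
                        f'position ({i}, {j}) is covered by two cells')
                grid[i][j] = cell
    return grid


# A header cell spanning two columns, followed by two ordinary cells.
toy_cells = [
    ToyCell(['scores'], index_topleft_row=0, index_topleft_col=0, colspan=2),
    ToyCell(['p'], index_topleft_row=1, index_topleft_col=0),
    ToyCell(['r'], index_topleft_row=1, index_topleft_col=1),
]
toy_grid = build_grid(toy_cells, nrow=2, ncol=2)

first_row = toy_grid[0]                    # both positions hold the same cell
first_col = [row[0] for row in toy_grid]   # column access on nested lists
bottom_right = toy_grid[-1][-1]
```

In this sketch the grid has four positions but only three cells, which illustrates why indexing by grid position and indexing by cell can differ.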

You can serialize this object to JSON:

import json
with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)

You can load it back in from JSON using the Loader classes:

from corvid.table.table_loader import CellLoader, TableLoader
cell_loader = CellLoader(cell_type=Cell)
table_loader = TableLoader(table_type=Table, cell_loader=cell_loader)

with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))

You can extend all of these classes to contain augmented information:

class ColorfulCell(Cell):
    def __init__(self, color: str, ...):
        super().__init__(...)
        self.color = color

class ColorfulTableWithCaption(Table):
    def __init__(self, caption: str, ...):
        super().__init__(...)
        self.caption = caption

cells = [ColorfulCell(color='red', ...), ColorfulCell(color='blue', ...), ...]
table = ColorfulTableWithCaption(cells=cells, nrow=2, ncol=2, caption='red and blue cells')

Serialization of these objects works the same way, but loading them back requires specifying the correct Cell and Table types:

with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)
    
cell_loader = CellLoader(cell_type=ColorfulCell)
table_loader = TableLoader(table_type=ColorfulTableWithCaption, cell_loader=cell_loader)

with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))
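
The sketch below shows one reason a loader might need to be told the type: if the JSON stores only field values, nothing in it names the class to rebuild. It is self-contained and does not use corvid; ToyCell, ToyColorfulCell and ToyCellLoader are hypothetical names, and the JSON layout is an assumption rather than corvid's actual format.

```python
import json


class ToyCell:
    """Hypothetical base cell (not corvid's Cell)."""

    def __init__(self, tokens):
        self.tokens = tokens

    def to_json(self):
        return {'tokens': self.tokens}


class ToyColorfulCell(ToyCell):
    """Hypothetical subclass carrying one extra field."""

    def __init__(self, tokens, color):
        super().__init__(tokens)
        self.color = color

    def to_json(self):
        data = super().to_json()
        data['color'] = self.color
        return data


class ToyCellLoader:
    """Rebuilds cells of the class it was configured with."""

    def __init__(self, cell_type):
        self.cell_type = cell_type

    def from_json(self, data):
        # The JSON dict carries no class name, so the loader must be told
        # which constructor receives the stored fields.
        return self.cell_type(**data)


serialized = json.dumps(ToyColorfulCell(tokens=['a'], color='red').to_json())
loader = ToyCellLoader(cell_type=ToyColorfulCell)
restored = loader.from_json(json.loads(serialized))
```

In this sketch, a loader configured with the base ToyCell would raise a TypeError on the extra color field, which is the kind of mismatch that passing the matching types avoids.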

semantic_table

Normalize an existing Table object by creating a SemanticTable object:

from corvid.semantic_table.semantic_table import SemanticTable
semantic_table = SemanticTable(raw_table=table)

print(semantic_table.normalized_table)

table_aggregation

Aggregate Table objects using a SchemaMatcher:

from corvid.table_aggregation.schema_matcher import ColNameSchemaMatcher
schema_matcher = ColNameSchemaMatcher()

First, construct a list of Tables. For best results, use the normalized_table of each SemanticTable, but everything works on raw Tables as well.

normalized_source_tables = [SemanticTable(raw_table=t).normalized_table for t in tables]

Second, build a "Schema": a Table object with a single row that contains the column header strings. For example:

schema_cells = [Cell(tokens=['header1'], ...), Cell(tokens=['header2'], ...)]    
schema_table = Table(cells=schema_cells, nrow=1, ncol=2)

Third, build a list of PairwiseMappings, which indicate the column alignments between pairs of Tables.

pairwise_mappings = schema_matcher.map_tables(
    tables=normalized_source_tables,
    target_schema=schema_table
)

Finally, use these PairwiseMappings to build a single Table object that has the columns specified by the "Schema" Table.

aggregate_table = schema_matcher.aggregate_tables(
    pairwise_mappings=pairwise_mappings,
    target_schema=schema_table
)
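
The two steps above (find column alignments, then build one table under the schema) can be sketched without corvid as follows. This is a toy illustration, not the algorithm ColNameSchemaMatcher uses: tables are plain lists of rows, and ALIASES, canonical, map_columns and aggregate are hypothetical names introduced here.

```python
# Hypothetical alias table: header variants mapped to a canonical name.
ALIASES = {'p': 'precision', 'prec': 'precision',
           'r': 'recall', 'rec': 'recall'}


def canonical(header):
    """Lowercase and trim a header, then resolve known aliases."""
    key = header.strip().lower()
    return ALIASES.get(key, key)


def map_columns(source_headers, schema_headers):
    """Return {schema column index: source column index} for matches."""
    source_index = {}
    for j, header in enumerate(source_headers):
        # Keep the first source column seen for each canonical name.
        source_index.setdefault(canonical(header), j)
    mapping = {}
    for k, header in enumerate(schema_headers):
        name = canonical(header)
        if name in source_index:
            mapping[k] = source_index[name]
    return mapping


def aggregate(tables, schema_headers):
    """Stack the body rows of every table under the schema headers.

    Schema columns with no match in a source table are filled with None.
    """
    rows = [list(schema_headers)]
    for header, *body in tables:
        mapping = map_columns(header, schema_headers)
        for row in body:
            rows.append([row[mapping[k]] if k in mapping else None
                         for k in range(len(schema_headers))])
    return rows


table_1 = [['model', 'p', 'r'], ['A', '0.9', '0.8']]
table_2 = [['Model', 'Precision'], ['B', '0.7']]
aggregated = aggregate([table_1, table_2], ['model', 'precision', 'recall'])
```

Here "p" in the first table and "Precision" in the second both land in the schema's "precision" column, and the second table's missing "recall" value is left as None.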

To evaluate this aggregation, use:

from corvid.table_aggregation.evaluate import evaluate
evaluate(gold_table=gold_table, pred_table=aggregate_table)
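
This README does not list the metrics that evaluate() computes. As an illustration of what comparing a gold and a predicted table can look like, here is a self-contained sketch of one simple metric, cell-level exact-match accuracy. The function cell_accuracy is hypothetical and is not part of corvid.

```python
def cell_accuracy(gold_rows, pred_rows):
    """Fraction of gold cells whose predicted value matches exactly."""
    same_shape = len(gold_rows) == len(pred_rows) and all(
        len(g) == len(p) for g, p in zip(gold_rows, pred_rows))
    if not same_shape:
        raise ValueError('gold and predicted tables must have the same shape')
    total = sum(len(row) for row in gold_rows)
    correct = sum(
        g == p
        for gold_row, pred_row in zip(gold_rows, pred_rows)
        for g, p in zip(gold_row, pred_row)
    )
    return correct / total if total else 0.0


gold = [['model', 'precision'], ['A', '0.9'], ['B', '0.7']]
pred = [['model', 'precision'], ['A', '0.9'], ['B', None]]
score = cell_accuracy(gold, pred)  # 5 of the 6 cells match
```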

TODO

semantic_table

  • cell-wise classification of raw_table Cells
  • evaluation for semantic table

Future

  • latex source to table (for training/evaluation)
