corvid

Table semantics and aggregation!

Installation

This project requires Python 3.6. We recommend you set up a conda environment:

conda create -n corvid python=3.6
source activate corvid

The dependencies are listed in the requirements.in file:

pip install -r requirements.in

After installing, you can run all the unit tests:

pytest tests/

Project structure

|-- corvid/
|   |-- table/
|   |   |-- table.py
|   |   |-- table_loader.py
|   |-- semantic_table/
|   |   |-- semantic_table.py
|   |   |-- evaluate.py
|   |-- table_aggregation/
|   |   |-- schema_matcher.py
|   |   |-- evaluate.py
|-- tests/
|-- requirements.in

A few important things:

table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.
semantic_table.py contains the SemanticTable class. It takes a Table object as input and learns a normalization of it, which can be accessed via .normalized_table.
schema_matcher.py contains the SchemaMatcher class. The .map_tables() method takes a list of Table objects and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .aggregate_tables() method uses these alignments to build a single aggregate Table.
evaluate.py contains a function evaluate() which computes a suite of performance metrics on a given Gold Table and Predicted Table pair. The semantic_table and table_aggregation modules each have their own evaluation methods.
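
To make the "wrapper around a 2D array" picture concrete, here is a minimal pure-Python sketch. It is not corvid's implementation: SketchCell and build_grid are hypothetical names, and the idea that a cell spanning several rows or columns occupies every grid position it covers is an assumption. Only the field names mirror the Cell constructor arguments used in the Usage section.

```python
from typing import List, Optional


class SketchCell:
    """Hypothetical stand-in for corvid's Cell (not the real class)."""

    def __init__(self, tokens: List[str], index_topleft_row: int,
                 index_topleft_col: int, rowspan: int = 1, colspan: int = 1):
        self.tokens = tokens
        self.index_topleft_row = index_topleft_row
        self.index_topleft_col = index_topleft_col
        self.rowspan = rowspan
        self.colspan = colspan


def build_grid(cells: List[SketchCell], nrow: int,
               ncol: int) -> List[List[Optional[SketchCell]]]:
    """Place each cell at every [i][j] position it covers (assumed behaviour)."""
    grid = [[None] * ncol for _ in range(nrow)]
    for cell in cells:
        row_end = cell.index_topleft_row + cell.rowspan
        col_end = cell.index_topleft_col + cell.colspan
        for i in range(cell.index_topleft_row, row_end):
            for j in range(cell.index_topleft_col, col_end):
                if grid[i][j] is not None:
                    raise ValueError('cells overlap at [{},{}]'.format(i, j))
                grid[i][j] = cell
    return grid


# A 2x2 table whose first row is a single header cell spanning both columns.
cells = [
    SketchCell(tokens=['header'], index_topleft_row=0, index_topleft_col=0, colspan=2),
    SketchCell(tokens=['c'], index_topleft_row=1, index_topleft_col=0),
    SketchCell(tokens=['d'], index_topleft_row=1, index_topleft_col=1),
]
grid = build_grid(cells, nrow=2, ncol=2)
```

In this sketch grid[0][0] and grid[0][1] refer to the same cell object, which is one simple way to represent merged cells in a rectangular grid.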

Usage / API

`table`

First, instantiate a Table object:

from corvid.table.table import Cell, Table
cells = [
    Cell(tokens=['a'], index_topleft_row=0, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['b'], index_topleft_row=0, index_topleft_col=1, rowspan=1, colspan=1),
    Cell(tokens=['c'], index_topleft_row=1, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['d'], index_topleft_row=1, index_topleft_col=1, rowspan=1, colspan=1),
]
table = Table(cells=cells, nrow=2, ncol=2)

You can inspect the Table and access its elements by indexing as you would a 2D array:

# visualize
print(table)

# shape
table.nrow; table.ncol; table.dim

# indexing via grid
first_row = table[0,:]
first_col = table[:,0]
bottom_right_element = table[-1, -1]

# indexing via cells
first_cell = table[0]

You can serialize this object to JSON:

import json
with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)

You can load it back in from JSON using the Loader classes:

from corvid.table.table_loader import CellLoader, TableLoader
cell_loader = CellLoader(cell_type=Cell)
table_loader = TableLoader(table_type=Table, cell_loader=cell_loader)

with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))

You can extend all of these classes to contain augmented information:

class ColorfulCell(Cell):
    def __init__(self, color: str, ...):
        super().__init__(...)
        self.color = color

class ColorfulTableWithCaption(Table):
    def __init__(self, caption: str, ...):
        super().__init__(...)
        self.caption = caption

cells = [ColorfulCell(color='red', ...), ColorfulCell(color='blue', ...), ...]
table = ColorfulTableWithCaption(cells=cells, nrow=2, ncol=2, caption='red and blue cells')

Serialization of these objects is similar, but requires specification of the correct Cell and Table types:

with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)
    
cell_loader = CellLoader(cell_type=ColorfulCell)
table_loader = TableLoader(table_type=ColorfulTableWithCaption, cell_loader=cell_loader)

with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))
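
A plain JSON dictionary does not record which Python class produced it, which is presumably why the loaders are told the Cell and Table types up front. The sketch below illustrates that pattern in isolation. Every name in it (SketchCell, SketchColorfulCell, SketchCellLoader, and the row/col fields) is hypothetical, and forwarding the stored fields as keyword arguments is an assumption, not a description of how CellLoader works internally.

```python
import json


class SketchCell:
    """Hypothetical base cell with a JSON-friendly dict representation."""

    def __init__(self, tokens, row, col):
        self.tokens = tokens
        self.row = row
        self.col = col

    def to_json(self):
        return {'tokens': self.tokens, 'row': self.row, 'col': self.col}


class SketchColorfulCell(SketchCell):
    """Hypothetical subclass that carries one extra field."""

    def __init__(self, color, **kwargs):
        super().__init__(**kwargs)
        self.color = color

    def to_json(self):
        json_dict = super().to_json()
        json_dict['color'] = self.color  # the subclass adds its own field
        return json_dict


class SketchCellLoader:
    """Hypothetical loader: rebuilds cells using whichever class it was given."""

    def __init__(self, cell_type):
        self.cell_type = cell_type

    def from_json(self, json_dict):
        # Forward every stored field as a keyword argument to the chosen class.
        return self.cell_type(**json_dict)


cell = SketchColorfulCell(color='red', tokens=['a'], row=0, col=0)
payload = json.dumps(cell.to_json())

loader = SketchCellLoader(cell_type=SketchColorfulCell)
restored = loader.from_json(json.loads(payload))
```

In this sketch, a loader configured with the base SketchCell type would fail on the same payload, because the extra 'color' key has no matching constructor argument.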

`semantic_table`

Normalize an existing Table object by creating a SemanticTable object:

from corvid.semantic_table.semantic_table import SemanticTable
semantic_table = SemanticTable(raw_table=table)

print(semantic_table.normalized_table)

`table_aggregation`

Aggregate Table objects using a SchemaMatcher:

from corvid.table_aggregation.schema_matcher import ColNameSchemaMatcher
schema_matcher = ColNameSchemaMatcher()

First, construct a list of Tables. For best results, use normalized_tables from SemanticTable, but everything works on raw_tables as well.

normalized_source_tables = [SemanticTable(raw_table=t).normalized_table for t in tables]

Second, build a "Schema": a Table object with a single row containing the column header strings. For example:

schema_cells = [Cell(tokens=['header1'], ...), Cell(tokens=['header2'], ...)]    
schema_table = Table(cells=schema_cells, nrow=1, ncol=2)

Third, build a list of PairwiseMappings, which indicate the column alignments between pairs of Tables.

pairwise_mappings = schema_matcher.map_tables(
    tables=normalized_source_tables,
    target_schema=schema_table
)

Finally, use these PairwiseMappings to build a single Table object that has the columns specified by the "Schema" Table.

aggregate_table = schema_matcher.aggregate_tables(
    pairwise_mappings=pairwise_mappings,
    target_schema=schema_table
)
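
The mapping and aggregation steps can be pictured with a small pure-Python sketch. It does not use corvid and is not how ColNameSchemaMatcher is implemented: map_columns and aggregate are hypothetical helpers, tables are plain lists of string rows, and matching headers by case-insensitive equality is an assumption suggested only by the class name.

```python
from typing import Dict, List, Optional

Rows = List[List[str]]  # the first row holds the column headers


def map_columns(source: Rows, target_headers: List[str]) -> Dict[int, Optional[int]]:
    """Map each target column index to the source column with the same header."""
    source_headers = [header.strip().lower() for header in source[0]]
    mapping = {}
    for target_idx, header in enumerate(target_headers):
        key = header.strip().lower()
        # None marks a target column that this source table cannot fill.
        mapping[target_idx] = source_headers.index(key) if key in source_headers else None
    return mapping


def aggregate(sources: List[Rows], target_headers: List[str]) -> Rows:
    """Stack the data rows of every source table under the target headers."""
    result = [list(target_headers)]
    for source in sources:
        mapping = map_columns(source, target_headers)
        for row in source[1:]:
            result.append([
                row[mapping[t]] if mapping[t] is not None else ''
                for t in range(len(target_headers))
            ])
    return result


table_1 = [['model', 'precision'],
           ['ours', '0.91']]
table_2 = [['Precision', 'recall', 'model'],
           ['0.85', '0.80', 'baseline']]

aggregated = aggregate([table_1, table_2],
                       target_headers=['model', 'precision', 'recall'])
# aggregated == [['model', 'precision', 'recall'],
#                ['ours', '0.91', ''],
#                ['baseline', '0.85', '0.80']]
```

Exact name matching like this would not align a column "p" with "precision"; handling such cases needs a fuzzier notion of header similarity than the one sketched here.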

To evaluate this aggregation, use:

from corvid.table_aggregation.evaluate import evaluate
evaluate(gold_table=gold_table, pred_table=aggregate_table)
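
The README does not say which metrics evaluate() reports. As one plausible example of a gold-versus-predicted comparison, the sketch below computes cell-wise exact-match accuracy over two tables of the same shape. The function cell_accuracy is hypothetical and works on plain lists of strings; it is an illustration of the kind of metric involved, not a description of corvid's evaluate().

```python
from typing import List

Rows = List[List[str]]


def cell_accuracy(gold: Rows, pred: Rows) -> float:
    """Fraction of positions where the predicted cell equals the gold cell."""
    same_shape = len(gold) == len(pred) and all(
        len(gold_row) == len(pred_row) for gold_row, pred_row in zip(gold, pred)
    )
    if not same_shape:
        raise ValueError('gold and predicted tables must have the same shape')
    total = sum(len(row) for row in gold)
    if total == 0:
        return 0.0  # convention chosen for this sketch: empty tables score 0
    matches = sum(
        gold_cell == pred_cell
        for gold_row, pred_row in zip(gold, pred)
        for gold_cell, pred_cell in zip(gold_row, pred_row)
    )
    return matches / total


gold = [['model', 'precision'],
        ['ours', '0.91']]
pred = [['model', 'precision'],
        ['ours', '0.90']]

accuracy = cell_accuracy(gold, pred)  # 3 of 4 cells match -> 0.75
```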

TODO

`semantic_table`

cell-wise classification of raw_table Cells
evaluation for semantic table

Future

latex source to table (for training/evaluation)
