TableSemantifier

Annotates a table with the type of each column, the binary relations between columns, and the disambiguation of cell text against a knowledge base.
In short, the software adds meaning to your tables.

Introduction

The implementation is loosely based on "Annotating and Searching Web Tables Using Entities, Types and Relationships", a 2010 VLDB publication. Unlike the paper, I did not perform any training to learn the parameters. I prefer a data-agnostic model when it performs equally well. Besides, I am not convinced that training is required, except to push up the numbers for a publication.
Also, unlike the system described in the paper, which uses YAGO, this one uses Wikidata, because Freebase recently migrated to Wikidata (need I say more?).

Usage

The system works by making several requests to Wikidata in parallel, which can choke the endpoint. Please use it sparingly. If you want to semantify a large number of tables, consider setting up your own instance of the server. It is nice of Wikidata not to impose query limits, so let's behave!
Start with TableSemantifier.main for some test examples. In the meantime, I will work towards a better API.

Examples

Some example tables and their semantified versions are shown below. Each output table adds the five most likely annotations for each of: cell text, headers and binary relations.

Authors and their notable work.

Input

Rowling	Harry Potter
Tolkien	Lord of the Rings
Martin	Game of Thrones
Dickens	Great Expectations
Mark	Huckleberry Finn
Leo	War and Peace
Conan Doyle	Sherlock Holmes
William	Hamlet
Dante	Inferno

Output

"languages spoken, written or signed"@en:::"English"@en "occupation"@en:::"writer"@en "occupation"@en:::"screenwriter"@en "instance of"@en:::"male given name"@en "instance of"@en:::"human"@en	"original language of work"@en:::"English"@en "instance of"@en:::"book"@en "country of origin"@en:::"United States of America"@en "genre"@en:::"film adaptation"@en "instance of"@en:::"film"@en
JK Rowling "Joanne K. Rowling"@en null "Magic Beyond Words"@en "Scorpius Malfoy"@en null	Harry Potter "Harry Potter"@en "Harry Potter and the Philosopher's Stone"@en null "Harry Potter universe"@en "Harry Potter fandom"@en
Tolkien "J. R. R. Tolkien"@en "Christopher Tolkien"@en null "Tolkien's legendarium"@en "Tim Tolkien"@en	Lord of the Rings "The Lord of the Rings"@en "The Lord of the Rings: The Fellowship of the Ring"@en "The Lord of the Rings: The Two Towers"@en "The Lord of the Rings: The Return of the King"@en "The History of The Lord of the Rings"@en
Martin null "Martin"@en "Martín"@en "Martin"@en "Martin"@en	Game of Thrones "Game of Thrones"@en null "Game of Thrones character"@en "Game of Thrones (season 1)"@en "Game of Thrones (season 2)"@en
Dickens "Charles Dickens"@en "Kim Dickens"@en null "Mary Dickens"@en "Catherine Dickens"@en	Great Expectations "Great Expectations"@en "Great Expectations"@en "Genome sequences and great expectations"@en null "Great Expectations"@en
Mark "Mark Twain"@en null "Mark"@en "Márk"@en "Mark Municipality"@en	Huckleberry Finn "Adventures of Huckleberry Finn"@en "Huckleberry Finn"@en null "Huckleberry Finn"@en "Jim"@en
Leo "Leo McGarry"@en "Leo McCarey"@en null "Leo"@en "Léo"@en	War and Peace "War and Peace"@en null "War and Peace"@en "War and Peace: 1796–1815"@en "War and Peace"@en
Conan Doyle "Arthur Conan Doyle"@en "Adrian Conan Doyle"@en null "Jean Conan Doyle"@en "Denis Conan Doyle"@en	Sherlock Holmes "canon of Sherlock Holmes"@en "Sherlock Holmes Baffled"@en "Sherlock Holmes"@en "The Adventures of Sherlock Holmes"@en null
William "William Shakespeare"@en "William Johnson"@en null "William"@en "William Arthur Jobson Archbold"@en	Hamlet "Hamlet"@en null "hamlet"@en "Hamlet, Prince of Denmark"@en "Hamlet"@en
Dante "Dante Gabriel Rossetti"@en "Joe Dante"@en null "Dante"@en "Dante Alighieri"@en	Inferno null "Inferno"@en "Inferno"@en "Phoenix Inferno"@en "Inferno"@en

Possible relations between columns 0 and 1 of the books table are ["notable work"@en, "fictional universe described in"@en, "instance of"@en, "from fictional universe"@en, "present in work"@en]
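
The header annotations in the output above take the form "property"@en:::"value"@en, separated by spaces. The following is a minimal sketch of how such a header cell could be split back into property/value pairs. HeaderAnnotationParser and its Annotation type are hypothetical names introduced for illustration and are not part of this repository; the sketch also assumes that labels never contain a double quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of TableSemantifier: parses header annotations
// of the form "property"@en:::"value"@en into (property, value) pairs.
class HeaderAnnotationParser {

    /** A property/value pair such as ("occupation", "cricketer"). */
    static final class Annotation {
        final String property;
        final String value;

        Annotation(String property, String value) {
            this.property = property;
            this.value = value;
        }

        @Override
        public String toString() {
            return property + " = " + value;
        }
    }

    // Matches "property"@lang:::"value"@lang.
    // Assumption: labels do not contain double quotes.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]*)\"@[A-Za-z-]+:::\"([^\"]*)\"@[A-Za-z-]+");

    /** Returns the annotations found in one header cell, in their original order. */
    static List<Annotation> parse(String headerCell) {
        List<Annotation> result = new ArrayList<>();
        Matcher m = PAIR.matcher(headerCell);
        while (m.find()) {
            result.add(new Annotation(m.group(1), m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        String cell = "\"occupation\"@en:::\"cricketer\"@en \"sport\"@en:::\"cricket\"@en";
        for (Annotation a : parse(cell)) {
            System.out.println(a);
        }
    }
}
```

Since the order is preserved, the first pair returned corresponds to the first annotation listed in the header cell.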

Indian Cricketers

Input

Sachin
Sehwag
Mr. Dependable
Yuvi
Dhoni
lakshmipathi

Output

"occupation"@en:::"cricketer"@en "sport"@en:::"cricket"@en "award received"@en:::"Arjuna Award"@en "award received"@en:::"Padma Shri"@en "country of citizenship"@en:::"India"@en
Sachin "Sachin Tendulkar"@en null "Sachin"@en "Sachin Bhowmick"@en "Sachin"@en
Sehwag "Virender Sehwag"@en null "Template:User Virender Sehwag"@en "Cricket: Sehwag dominates"@en null
Mr. Dependable "Rahul Dravid"@en null
Yuvi "Yuvraj Singh"@en null null null
Dhoni "Mahendra Singh Dhoni"@en null "M.S.Dhoni"@en "Dhoni"@en "Sakshi Dhoni"@en
lakshmipathi "Lakshmipathy Balaji"@en null "Rukmini Lakshmipathi"@en "Lakshmipati"@en

Padma Vibhushan winners in 2016 (except Krishnamurthy)

Input

Krishnamurthy
Avinash Dixit
Rajinikanth
Ramoji
Ravi Shankar
Ambani

Output

"award received"@en:::"Padma Vibhushan"@en "occupation"@en:::"actor"@en "country of citizenship"@en:::"India"@en "occupation"@en:::"film producer"@en "sex or gender"@en:::"male"@en
Krishnamurthy null "Kalki Krishnamurthy"@en "Hunsur Krishnamurthy"@en "Kavita Krishnamurthy"@en "K. E. Krishnamurthy"@en
Avinash Dixit "Avinash Dixit"@en null
Rajinikanth "Rajinikanth"@en null "Soundarya Rajinikanth"@en "Latha Rajinikanth"@en "Rajinikanth: The Definitive Biography"@en
Ramoji "Ramoji Rao"@en null "Ramoji Film City"@en "Ramoji Group"@en
Ravi Shankar "Ravi Shankar"@en null "K. Ravi Shankar"@en "P. Ravi Shankar"@en "Ravi Shankar Vyas"@en
Ambani "Dhirubhai Ambani"@en null "Tina Ambani"@en "Mukesh Ambani"@en "Nita Ambani"@en

Nobel prize winning physicists

Input

Feynman
Einstein
Peter
Raman
Boyle

Output

"languages spoken, written or signed"@en:::"English"@en "occupation"@en:::"physicist"@en "sex or gender"@en:::"male"@en "instance of"@en:::"human"@en "occupation"@en:::"actor"@en
Feynman "Richard Feynman"@en null "7495 Feynman"@en "Joan Feynman"@en "Surely You're Joking, Mr. Feynman!"@en
Einstein "Albert Einstein"@en null "Eduard Einstein"@en "Hans Albert Einstein"@en "Carl Einstein"@en
Peter "Peter Goldblatt"@en null "Peter"@en "Péter"@en "Hans-Peter"@en
Raman "C. V. Raman"@en null "Raman"@en "Vimala Raman"@en "Raman"@en
Boyle "Danny Boyle"@en null "Peter Boyle"@en "Lara Flynn Boyle"@en "Boyle"@en

Time complexity

The bottleneck is network access to fetch related content. All network requests are parallelized and cached in this implementation. The typical running time depends on your machine and network. On my 10 Mbps network and 2.5 GHz Intel x64 Core i5 machine, the books table took around 2 minutes and the rest took less than a minute.
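
The parallel, cached request pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this repository: CachedParallelFetcher, the lookup function and the pool size are hypothetical, and the lookup in main is a local stub rather than a real Wikidata call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: runs lookups in parallel on a bounded thread pool and
// caches each result, so repeated cell text triggers only one request.
class CachedParallelFetcher {
    private final Map<String, CompletableFuture<List<String>>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> lookup;
    private final ExecutorService pool;

    /** maxParallel caps concurrent requests, which also keeps load on the endpoint down. */
    CachedParallelFetcher(Function<String, List<String>> lookup, int maxParallel) {
        this.lookup = lookup;
        this.pool = Executors.newFixedThreadPool(maxParallel);
    }

    /** Starts a lookup for cellText, or returns the one already cached or in flight. */
    CompletableFuture<List<String>> fetch(String cellText) {
        return cache.computeIfAbsent(cellText,
                key -> CompletableFuture.supplyAsync(() -> lookup.apply(key), pool));
    }

    /** Submits every cell first, then waits, so the lookups overlap in time. */
    List<List<String>> fetchAll(List<String> cells) {
        List<CompletableFuture<List<String>>> pending = new ArrayList<>();
        for (String cell : cells) {
            pending.add(fetch(cell));
        }
        List<List<String>> results = new ArrayList<>();
        for (CompletableFuture<List<String>> future : pending) {
            results.add(future.join());
        }
        return results;
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stub lookup standing in for a network request to the knowledge base.
        CachedParallelFetcher fetcher = new CachedParallelFetcher(text -> {
            calls.incrementAndGet();
            return Arrays.asList(text + " (candidate)");
        }, 4);
        System.out.println(fetcher.fetchAll(Arrays.asList("Dickens", "Dante", "Dickens")));
        System.out.println("lookups performed: " + calls.get());
        fetcher.shutdown();
    }
}
```

Caching the future rather than the finished result means that a duplicate cell submitted while the first lookup is still running shares that lookup instead of issuing a second request.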

Need to fix

The caching implemented in Wikidata has issues. It sometimes throws java.io.OptionalDataException, so the cache is not read completely and some requests are not cached at all.
The suggestions retrieved in the initial stage of the pipeline, based on the cell text, are crucial. Suggestions for entities work fine but fail for common nouns such as red, alpha (the Greek letter), etc. One possible fix is to use the column header text, if available, to direct the search query. It should also be possible to fetch suggestions in the context of other cells in the column. For example, correct suggestions for Alpha can be fetched from the fact that it appears in the same column as Beta and Gamma.
Several corner cases are not handled, such as when processing requests to the knowledge base or when printing a table.
Needs a lot more documentation and a proper interface for easier access.
The system fetches suggestions for text spanning the entire cell; it is straightforward to extend it to select the right chunk in the cell text. For example, the cell text "Stacy loves red color" should be treated as two cells, since there are two possible chunks: Stacy and red.
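
The idea of using the other cells in a column as context (the Alpha, Beta, Gamma example above) could look roughly like the sketch below. It is one possible approach and not something the system currently does: each candidate is scored by how many cells in the column have a candidate sharing one of its types, and the best-scoring candidate per cell is kept. ColumnContextReranker and Candidate are hypothetical names, and the candidate labels and types in main are made up for illustration rather than fetched from Wikidata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: re-rank the candidates of each cell using the types
// that candidates of the other cells in the same column have in common.
class ColumnContextReranker {

    static final class Candidate {
        final String label;
        final Set<String> types;

        Candidate(String label, String... types) {
            this.label = label;
            this.types = new HashSet<>(Arrays.asList(types));
        }
    }

    /** For each type, counts the cells that have at least one candidate of that type. */
    static Map<String, Integer> typeSupport(List<List<Candidate>> column) {
        Map<String, Integer> support = new HashMap<>();
        for (List<Candidate> cell : column) {
            Set<String> seen = new HashSet<>();
            for (Candidate c : cell) {
                seen.addAll(c.types);
            }
            for (String type : seen) {
                support.merge(type, 1, Integer::sum);
            }
        }
        return support;
    }

    /** Picks, per cell, the candidate whose best type is supported by the most cells. */
    static List<String> rerank(List<List<Candidate>> column) {
        Map<String, Integer> support = typeSupport(column);
        List<String> chosen = new ArrayList<>();
        for (List<Candidate> cell : column) {
            Candidate best = null;
            int bestScore = -1;
            for (Candidate c : cell) {
                int score = 0;
                for (String type : c.types) {
                    score = Math.max(score, support.get(type));
                }
                // Strict comparison: on a tie the earlier (originally higher-ranked) candidate wins.
                if (score > bestScore) {
                    best = c;
                    bestScore = score;
                }
            }
            chosen.add(best == null ? null : best.label);
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Made-up candidates and types, for illustration only.
        List<List<Candidate>> column = Arrays.asList(
                Arrays.asList(new Candidate("Alpha (film)", "film"),
                              new Candidate("Alpha (letter)", "Greek letter")),
                Arrays.asList(new Candidate("Beta (release stage)", "software release stage"),
                              new Candidate("Beta (letter)", "Greek letter")),
                Arrays.asList(new Candidate("Gamma (letter)", "Greek letter"),
                              new Candidate("Gamma (film)", "film")));
        System.out.println(rerank(column));
    }
}
```

With the made-up data in main, "Greek letter" is supported by all three cells, so the letter reading is chosen for each cell even where another candidate was ranked first.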
