vihari/TableSemantifier

TableSemantifier

Annotates a table with the type of each column, the binary relations between columns, and a disambiguation of each cell's text to a knowledge base.
The software adds meaning to your tables.

Introduction

The implementation is loosely based on Annotating and Searching Web Tables Using Entities, Types and Relationships, a 2010 VLDB publication. Unlike the publication, I did not perform any training to learn parameters. I prefer a data-agnostic model when it performs equally well. Besides, I am not fully convinced that training is required, except to push up the numbers for a publication.
Also, unlike the system described in the paper, which uses YAGO, this one uses Wikidata, because Freebase recently migrated to Wikidata (need I say more?).

Usage

The system works by making several requests to Wikidata in parallel, which can choke the endpoint. Please use it sparingly. If you want to semantify a large number of tables, consider setting up your own copy of the server. It is nice of Wikidata not to impose query limits, so let's behave!
Start with TableSemantifier.main for some test examples. In the meantime, I will work towards a better API.
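
Since parallel requests can choke the endpoint, one way to stay polite is to cap how many run at once. The sketch below is only an illustration under assumptions: ThrottledLookups and runAll are hypothetical names that do not exist in this repository, and each Callable stands in for whatever code actually queries Wikidata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of TableSemantifier: runs lookups on a
// fixed-size pool so that at most maxParallel of them execute at once.
class ThrottledLookups {
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll waits for every task and returns futures in input order.
            for (Future<T> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A small maxParallel trades running time for a lighter load on the shared endpoint; the right value is a judgment call and is not something this project prescribes.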

Examples

Some example tables and their semantified versions are shown below. Each table is annotated with the five most likely annotations for each of: cell text, headers and binary relations.

Authors and their notable work.

Input:
Rowling | Harry Potter
Tolkien | Lord of the Rings
Martin | Game of Thrones
Dickens | Great Expectations
Mark | Huckleberry Finn
Leo | War and Peace
Conan Doyle | Sherlock Holmes
William | Hamlet
Dante | Inferno

Output:
Column 0 annotations:
"languages spoken, written or signed"@en:::"English"@en
"occupation"@en:::"writer"@en
"occupation"@en:::"screenwriter"@en
"instance of"@en:::"male given name"@en
"instance of"@en:::"human"@en

Column 1 annotations:
"original language of work"@en:::"English"@en
"instance of"@en:::"book"@en
"country of origin"@en:::"United States of America"@en
"genre"@en:::"film adaptation"@en
"instance of"@en:::"film"@en

Cell annotations (each cell text is followed by its candidates):
JK Rowling
"Joanne K. Rowling"@en
null
"Magic Beyond Words"@en
"Scorpius Malfoy"@en
null
Harry Potter
"Harry Potter"@en
"Harry Potter and the Philosopher's Stone"@en
null
"Harry Potter universe"@en
"Harry Potter fandom"@en
Tolkien
"J. R. R. Tolkien"@en
"Christopher Tolkien"@en
null
"Tolkien's legendarium"@en
"Tim Tolkien"@en
Lord of the Rings
"The Lord of the Rings"@en
"The Lord of the Rings: The Fellowship of the Ring"@en
"The Lord of the Rings: The Two Towers"@en
"The Lord of the Rings: The Return of the King"@en
"The History of The Lord of the Rings"@en
Martin
null
"Martin"@en
"Martín"@en
"Martin"@en
"Martin"@en
Game of Thrones
"Game of Thrones"@en
null
"Game of Thrones character"@en
"Game of Thrones (season 1)"@en
"Game of Thrones (season 2)"@en
Dickens
"Charles Dickens"@en
"Kim Dickens"@en
null
"Mary Dickens"@en
"Catherine Dickens"@en
Great Expectations
"Great Expectations"@en
"Great Expectations"@en
"Genome sequences and great expectations"@en
null
"Great Expectations"@en
Mark
"Mark Twain"@en
null
"Mark"@en
"Márk"@en
"Mark Municipality"@en
Huckleberry Finn
"Adventures of Huckleberry Finn"@en
"Huckleberry Finn"@en
null
"Huckleberry Finn"@en
"Jim"@en
Leo
"Leo McGarry"@en
"Leo McCarey"@en
null
"Leo"@en
"Léo"@en
War and Peace
"War and Peace"@en
null
"War and Peace"@en
"War and Peace: 1796–1815"@en
"War and Peace"@en
Conan Doyle
"Arthur Conan Doyle"@en
"Adrian Conan Doyle"@en
null
"Jean Conan Doyle"@en
"Denis Conan Doyle"@en
Sherlock Holmes
"canon of Sherlock Holmes"@en
"Sherlock Holmes Baffled"@en
"Sherlock Holmes"@en
"The Adventures of Sherlock Holmes"@en
null
William
"William Shakespeare"@en
"William Johnson"@en
null
"William"@en
"William Arthur Jobson Archbold"@en
Hamlet
"Hamlet"@en
null
"hamlet"@en
"Hamlet, Prince of Denmark"@en
"Hamlet"@en
Dante
"Dante Gabriel Rossetti"@en
"Joe Dante"@en
null
"Dante"@en
"Dante Alighieri"@en
Inferno
null
"Inferno"@en
"Inferno"@en
"Phoenix Inferno"@en
"Inferno"@en
The possible relations between columns 0 and 1 of the books table are ["notable work"@en, "fictional universe described in"@en, "instance of"@en, "from fictional universe"@en, "present in work"@en]
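
The column annotations above appear to be printed as a property label and a value label joined by `:::`, each followed by a language tag such as `@en`. If you want to post-process that output, a minimal parser could look like the sketch below. It is inferred from the printed examples only; the Annotation class and its methods are hypothetical and are not the project's own types.

```java
// Hypothetical value type for one printed annotation line; an assumption
// based on the example output, not a class from TableSemantifier.
class Annotation {
    final String property;
    final String value;

    Annotation(String property, String value) {
        this.property = property;
        this.value = value;
    }

    // Parses a line such as "occupation"@en:::"writer"@en.
    static Annotation parse(String line) {
        String[] parts = line.split(":::", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected property:::value but got: " + line);
        }
        return new Annotation(label(parts[0]), label(parts[1]));
    }

    // Strips the surrounding quotes and the trailing language tag (for example @en).
    // Input that does not match that shape is returned trimmed but otherwise unchanged.
    static String label(String literal) {
        String s = literal.trim();
        int closing = s.lastIndexOf("\"@");
        if (s.startsWith("\"") && closing > 0) {
            return s.substring(1, closing);
        }
        return s;
    }
}
```

The same label helper also works on the candidate lines printed under each cell, for example "J. R. R. Tolkien"@en.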

Indian Cricketers

Input:
Sachin
Sehwag
Mr. Dependable
Yuvi
Dhoni
lakshmipathi

Output:
Column annotations:
"occupation"@en:::"cricketer"@en
"sport"@en:::"cricket"@en
"award received"@en:::"Arjuna Award"@en
"award received"@en:::"Padma Shri"@en
"country of citizenship"@en:::"India"@en

Cell annotations (each cell text is followed by its candidates):
Sachin
"Sachin Tendulkar"@en
null
"Sachin"@en
"Sachin Bhowmick"@en
"Sachin"@en
Sehwag
"Virender Sehwag"@en
null
"Template:User Virender Sehwag"@en
"Cricket: Sehwag dominates"@en
null
Mr. Dependable
"Rahul Dravid"@en
null
Yuvi
"Yuvraj Singh"@en
null
null
null
Dhoni
"Mahendra Singh Dhoni"@en
null
"M.S.Dhoni"@en
"Dhoni"@en
"Sakshi Dhoni"@en
lakshmipathi
"Lakshmipathy Balaji"@en
null
"Rukmini Lakshmipathi"@en
"Lakshmipati"@en

Padma Vibhushan winners in 2016 (except Krishnamurthy)

Input:
Krishnamurthy
Avinash Dixit
Rajinikanth
Ramoji
Ravi Shankar
Ambani

Output:
Column annotations:
"award received"@en:::"Padma Vibhushan"@en
"occupation"@en:::"actor"@en
"country of citizenship"@en:::"India"@en
"occupation"@en:::"film producer"@en
"sex or gender"@en:::"male"@en

Cell annotations (each cell text is followed by its candidates):
Krishnamurthy
null
"Kalki Krishnamurthy"@en
"Hunsur Krishnamurthy"@en
"Kavita Krishnamurthy"@en
"K. E. Krishnamurthy"@en
Avinash Dixit
"Avinash Dixit"@en
null
Rajinikanth
"Rajinikanth"@en
null
"Soundarya Rajinikanth"@en
"Latha Rajinikanth"@en
"Rajinikanth: The Definitive Biography"@en
Ramoji
"Ramoji Rao"@en
null
"Ramoji Film City"@en
"Ramoji Group"@en
Ravi Shankar
"Ravi Shankar"@en
null
"K. Ravi Shankar"@en
"P. Ravi Shankar"@en
"Ravi Shankar Vyas"@en
Ambani
"Dhirubhai Ambani"@en
null
"Tina Ambani"@en
"Mukesh Ambani"@en
"Nita Ambani"@en

Nobel prize winning physicists

Input:
Feynman
Einstein
Peter
Raman
Boyle

Output:
Column annotations:
"languages spoken, written or signed"@en:::"English"@en
"occupation"@en:::"physicist"@en
"sex or gender"@en:::"male"@en
"instance of"@en:::"human"@en
"occupation"@en:::"actor"@en

Cell annotations (each cell text is followed by its candidates):
Feynman
"Richard Feynman"@en
null
"7495 Feynman"@en
"Joan Feynman"@en
"Surely You're Joking, Mr. Feynman!"@en
Einstein
"Albert Einstein"@en
null
"Eduard Einstein"@en
"Hans Albert Einstein"@en
"Carl Einstein"@en
Peter
"Peter Goldblatt"@en
null
"Peter"@en
"Péter"@en
"Hans-Peter"@en
Raman
"C. V. Raman"@en
null
"Raman"@en
"Vimala Raman"@en
"Raman"@en
Boyle
"Danny Boyle"@en
null
"Peter Boyle"@en
"Lara Flynn Boyle"@en
"Boyle"@en

Time complexity

The bottleneck is network access to fetch related content. All network requests are parallelized and cached in this implementation. The typical running time varies depending on your machine and network. On my 10 Mbps network and 2.5 GHz Intel x64 Core i5 machine, the books table took around 2 minutes and the rest took less than a minute.
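
To illustrate how parallel fetching and caching can combine, here is a minimal in-memory sketch. It is not the caching code used in this repository: RequestCache is a hypothetical name, and the fetcher function is a placeholder for a real network call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-memory cache, not the project's Wikidata cache.
// Each distinct key is fetched at most once; repeated and concurrent
// callers share the same future instead of issuing a new request.
class RequestCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> fetcher;

    RequestCache(Function<K, V> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the future for key, starting the asynchronous fetch only on
    // the first request for that key.
    CompletableFuture<V> get(K key) {
        return entries.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> fetcher.apply(k)));
    }
}
```

Note that a failed fetch stays in the map in this sketch, so a real version would probably evict futures that completed exceptionally.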

Need to fix

  1. The caching implemented in Wikidata has issues. It sometimes throws java.io.OptionalDataException, so the cache is not read completely, and as a result some requests are not cached at all.
  2. The suggestions retrieved in the initial stage of the pipeline, based on the cell text, are crucial. Suggestions for entities work fine, but they fail for nouns such as red or alpha (the Greek letter). One possible fix is to use the column header text, if available, to direct the search query. It should also be possible to fetch suggestions in the context of other cells in the column. For example, correct suggestions for Alpha can be fetched from the fact that it appears in the same column as Beta and Gamma.
  3. Several corner cases are not handled, such as when processing requests to the knowledge base or when printing a table.
  4. Needs a lot more documentation and a proper interface for easier access.
  5. The system fetches suggestions for text spanning the entire cell; it is straightforward to extend it to select the right chunk in the cell text. For example, the cell text Stacy loves red color should be treated as two cells, since there are two possible chunks: Stacy and red.
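
For the last item, a first step could be to enumerate every contiguous word span of a cell as a candidate chunk. The sketch below shows only that enumeration; CellChunks is a hypothetical name, and picking the right chunk would still need each candidate to be scored against the knowledge base, which is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TableSemantifier.
class CellChunks {
    // Returns every contiguous run of whitespace-separated words in the
    // cell text, longest spans first and left to right within each length.
    static List<String> candidates(String cellText) {
        List<String> out = new ArrayList<>();
        String trimmed = cellText.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int len = words.length; len >= 1; len--) {
            for (int start = 0; start + len <= words.length; start++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + len)));
            }
        }
        return out;
    }
}
```

For Stacy loves red color this yields ten candidates, including the full text, Stacy and red. The number of spans grows quadratically with the word count, which matters because each candidate would cost a lookup.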
