T2K Match

T2K Match [1] is a matching algorithm optimized to match millions of web tables against a central knowledge base.

Many web sites provide data in the form of HTML tables. Millions of such data tables have been extracted from the CommonCrawl web corpus by the Web Data Commons project [3]. Data from these tables can be used to fill missing values in large cross-domain knowledge bases such as DBpedia [2]. This project is an example of how pre-defined building blocks from the WInte.r framework are combined into an advanced, use-case-specific integration method. The algorithm is optimized to match millions of web tables against a central knowledge base describing millions of instances that belong to hundreds of different classes (such as people or locations) [2].

How to run

To run T2K Match, use the run_t2k_match script in the scripts directory.

Copy the compiled T2K Match jar file to the lib/ directory in your home directory, or change the JAR path in the script file:

JAR="$HOME/lib/t2kmatch-2.0-jar-with-dependencies.jar"

Decompress the files in the data directory:

gunzip data/dbpedia/*
gunzip data/*.gz

Run the script:

./scripts/run_t2k_match
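The steps above can be checked before launching a long run. The following is a minimal sketch, not part of the T2K Match repository: the function name check_t2k_setup is hypothetical, and it assumes that every file under the data directory is meant to be decompressed, so any remaining .gz file indicates a skipped step.

```shell
#!/bin/sh
# Hypothetical pre-flight check (an assumption, not shipped with T2K Match).
# Usage: check_t2k_setup JAR_PATH DATA_DIR
# Returns 0 if the jar file exists and DATA_DIR contains no .gz files.
check_t2k_setup() {
    jar="$1"
    data_dir="$2"

    # The run script expects the jar at the path stored in its JAR variable.
    if [ ! -f "$jar" ]; then
        echo "missing jar: $jar" >&2
        return 1
    fi

    if [ ! -d "$data_dir" ]; then
        echo "missing data directory: $data_dir" >&2
        return 1
    fi

    # Assumption: leftover .gz files mean the gunzip step was skipped.
    if [ -n "$(find "$data_dir" -type f -name '*.gz' | head -n 1)" ]; then
        echo "compressed files remain in $data_dir; run gunzip first" >&2
        return 1
    fi

    return 0
}
```

One possible use, run from the repository root, is `check_t2k_setup "$HOME/lib/t2kmatch-2.0-jar-with-dependencies.jar" data && ./scripts/run_t2k_match`, which starts the matcher only if both checks pass.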

Acknowledgements

This project re-implements, using the WInte.r framework, the original T2K Match algorithm developed at the Data and Web Science Group at the University of Mannheim.

License

T2K Match can be used under the Apache 2.0 License.

References

[1] Ritze, D., Lehmberg, O., & Bizer, C. (2015, July). Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (p. 10). ACM.

[2] Ritze, D., Lehmberg, O., Oulabi, Y., & Bizer, C. (2016, April). Profiling the potential of web tables for augmenting cross-domain knowledge bases. In Proceedings of the 25th International Conference on World Wide Web (pp. 251-261). International World Wide Web Conferences Steering Committee.

[3] Lehmberg, O., Ritze, D., Meusel, R., & Bizer, C. (2016, April). A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 75-76). International World Wide Web Conferences Steering Committee.
