
Synthesizing N-ary relations from Web tables (SNoW)

Web tables have been used in many approaches to extend existing data sources with additional data. So far, however, all approaches have been limited to the extraction of binary relations. We find that a large number of web tables are in fact n-ary relations, which means that, in many cases, the data can only be understood if more than two columns of a web table are considered.

The SNoW system identifies and synthesizes n-ary relations from web tables and integrates them with a knowledge base.

Given a set of web tables and a target knowledge base, the SNoW method extends each web table with additional context columns, stitches matching web tables into larger tables, and applies functional dependency discovery to identify the relations that are represented in the web tables. Further, it normalises the stitched tables, guided by the schema of the knowledge base, to create an integrated schema.

The SNoW system has been described and applied in [1,2]; please cite these references if you use it in any publication.

System Overview

The following figure visualises the integration process of the SNoW system. First, each web table is extended with context columns, which are generated from the web page that contains the web table. Then, web tables with identical schema are stitched into union tables. These union tables contain more tuples than the original web tables and improve the performance of schema matchers, which are executed in the next step. The schema matchers create a mapping among the union tables, to identify matching schemata which are expressed with different column headers, and a mapping from the union tables to the knowledge base. Based on the mapping among the union tables, the stitching is repeated to merge all union tables with matching schemata into stitched union tables. The stitched union tables contain the tuples of all web tables with the respective schema and enable the discovery of functional dependencies, which is infeasible on the much smaller original tables. Finally, all stitched union tables which are mapped to the same knowledge base class are merged into a stitched universal relation, which is normalised to create an integrated schema. This normalisation is guided by the mapping to the knowledge base and produces a star-shaped schema with the existing classes from the knowledge base at the centres and the n-ary relations that were discovered in the web tables referring to them.

Figure: SNoW process overview.
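To make the stitching step concrete, the following minimal Python sketch unions tables that share an identical schema. It is purely illustrative: the (header, rows) representation and the function name are assumptions made for this example, not SNoW's actual data model or API, and SNoW's matchers later also merge tables whose schemata use different column headers.

from collections import defaultdict

def stitch_identical_schemata(tables):
    """Union all tables that share exactly the same header into one table.

    tables: list of (header, rows) pairs, where header is a tuple of column
    names and rows is a list of value tuples. Returns one (header, rows)
    pair per distinct schema.
    """
    unions = defaultdict(list)
    for header, rows in tables:
        unions[header].extend(rows)
    return [(header, rows) for header, rows in unions.items()]

# Two small web tables with the same schema are stitched into one union table.
t1 = (("state", "employment"), [("Ohio", "12,030"), ("Texas", "25,160")])
t2 = (("state", "employment"), [("Utah", "1,980")])
print(stitch_identical_schemata([t1, t2]))

The resulting union table contains the tuples of all input tables, which is what gives the schema matchers and the functional dependency discovery in the later steps enough evidence to work with.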

As an example, consider the web table in the next figure, which contains employment statistics for a certain profession in several U.S. states at a given time. A human can easily understand this table, but it is hard for an algorithm: the table contains only very few examples (the figure shows the complete table), and important attributes, such as the profession and the date, are missing from the table itself. Existing approaches use recognisers or matchers to find column pairs which represent an existing or an unknown binary relation. In the example, such systems would extract a binary relation {state, employment}, which can be found in thousands of web tables from the same web site. As a result, the extracted binary relation contains about 2,000 different employment numbers for a single state, with no way to choose a single, correct value or to interpret any individual value. To extract a meaningful relation for the employment attribute, we must include all attributes which determine its value, i.e., discover the functional dependency {state, page title, URI 1, URI 2}→{employment}. This is only possible if we take additional data from the context of the web table into account and combine the observations from multiple web tables, which contain employment numbers for different professions and dates.

Figure: Example web table.
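A functional dependency X→Y holds on a table exactly when no two rows agree on X but differ on Y, and this can be checked mechanically. The sketch below illustrates the check on invented toy rows (not the actual table from the figure):

def fd_holds(rows, lhs, rhs):
    """Return True iff the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # two rows agree on lhs but differ on rhs
    return True

# Toy rows mimicking the example: same state, but tables from different
# profession pages, so employment is not determined by the state alone.
rows = [
    {"state": "Ohio", "page title": "Dentists", "URI 1": "oes", "URI 2": "2015", "employment": "5,150"},
    {"state": "Ohio", "page title": "Nurses", "URI 1": "oes", "URI 2": "2015", "employment": "129,000"},
]
print(fd_holds(rows, ["state"], ["employment"]))                               # False
print(fd_holds(rows, ["state", "page title", "URI 1", "URI 2"], ["employment"]))  # True

This is why the example needs the context attributes: only with page title and the URI fragments on the left-hand side does the dependency on employment hold.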

Project Structure

SNoW Datasets

The datasets and annotations used for evaluation can be found in the datasets/ directory. Every sub-directory is a dataset for a single web site and contains the pre-processed union tables in union_dedup_json/ and the annotations in the evaluation/ sub-directory.
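A small Python snippet can be used to inspect this layout. It assumes, as an illustration, that it is run from the repository root and that each union table is stored as an individual file inside union_dedup_json/:

from pathlib import Path

# Run from the repository root, after extracting the datasets.
for dataset in sorted(Path("datasets").iterdir()):
    union_dir = dataset / "union_dedup_json"
    if not union_dir.is_dir():
        continue
    union_tables = list(union_dir.iterdir())
    print(f"{dataset.name}: {len(union_tables)} union table files, "
          f"annotations in {dataset / 'evaluation'}")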

SNoW Evaluation

We run the SNoW system on our datasets in different configurations. The first three configurations do not apply any matchers and instead use the schema correspondences from the ground truth.

  • Binary detects an entity label column in each table and assumes that all other attributes depend only on this column. This configuration uses the manually annotated schema mapping as input, extracts a binary relation for each reference relation, and thus resembles a perfect extractor for binary relations, serving as the baseline.

  • N-Ary uses functional dependency discovery instead of binary relations, but does not generate context attributes.

  • N-Ary + C uses functional dependency discovery and generates context attributes.

  • Match runs the SNoW system (N-Ary + C), but uses the matching components instead of the correspondences from the annotations.

  • Norm is the end-to-end configuration of the SNoW system: it uses functional dependency discovery, generates context attributes, uses the matching components, and applies normalisation.

Table 1: Dataset statistics and evaluation results. Web Tables and Union Tables give the number of tables per web site; A_O = original attributes, A_T = total attributes (original & context attributes), and A_U = universal attributes refer to the union tables; FDs is the number of annotated functional dependencies. The columns Schema to Norm report the F1-measure of the experiments.

Host               Web Tables   Union Tables    A_O     A_T    A_U   FDs   Schema   Binary   N-Ary   N-Ary+C   Match    Norm
d3football.com         40,584             12     63     145     41    18    0.779    0.495   0.480     0.851   0.765   0.765
data.bls.gov           10,824             12     61     181     51    12    0.895    0.415   0.485     0.984   0.900   0.900
flightaware.com         2,888              6     22      72     35    13    0.982    0.459   0.405     1.000   0.942   0.733
itunes.apple.com       42,729             76    470   1,095     59     6    0.827    0.680   0.674     0.953   0.809   0.763
seatgeek.com          157,578             72    266     714     71    30    0.999    0.988   0.992     0.984   0.962   0.962
www.amoeba.com          5,529             65    227     712     42    13    0.933    0.812   0.394     0.907   0.885   0.885
www.cia.gov            30,569            213    562   2,225    323   162    0.976    0.858   0.836     0.821   0.660   0.635
www.nndb.com           23,522             29    123     299     29    10    1.000    1.000   1.000     1.000   0.999   0.999
www.vgchartz.com       23,258              8     39      87     36    13    1.000    0.448   0.253     1.000   1.000   1.000
Sum / Macro Avg.      337,481            493  1,833   5,530    687   277    0.932    0.684   0.613     0.944   0.880   0.849

Running SNoW

After cloning the repository, use the unzip script to extract the datasets:

./unzip

Before running SNoW, download an implementation of the TANE functional dependency discovery algorithm to your machine and adjust the paths in SET_VARS for your environment:

J8="path to your java 8 executable"
JAR="path to the SNoW jar and its dependencies"
VMARGS="-Xmx300G"
TANE_HOME="path to the TANE implementation"

Note that VMARGS sets the maximum Java heap size (here 300 GB); adjust it to the memory available on your machine.

To run the SNoW system in different configurations, use one of the following scripts, which accept the path to the dataset as a parameter:

./run_snow datasets/d3football.com

  • run_snow_b: configuration Binary
  • run_snow_noContextColumns: configuration N-Ary
  • run_snow: configuration N-Ary + C
  • run_snow_match: configuration Match
  • run_snow_match_normalised: configuration Norm
  • run_snow_reference: Creates a reference extraction based on the annotations

To run the evaluation, use evaluate_containment and provide the path to the dataset as a parameter:

./evaluate_containment datasets/d3football.com

Or use evaluate_containment_all to run the evaluation on all datasets and configurations.

Creating New Datasets

To create new datasets, the clustered union tables must be created before running SNoW. Copy all web tables for a single web site into a directory and use create_clustered_union to create the union tables:

./create_clustered_union directory_containing_a_single_web_site

If you want to create the union tables for multiple web sites, create a directory with one sub-directory for every web site and run run_clustered_union with the directory as a parameter:

./run_clustered_union directory_containing_all_web_sites

During the creation of the union tables, a directory clustered_union_correspondences is created, which contains schema correspondences among the generated context columns. These correspondences must be copied into the evaluation sub-directory of the dataset under the file name context_correspondences.tsv.
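This copy step can be scripted. The sketch below assumes, as an illustration, that the correspondences are stored as one or more .tsv files inside clustered_union_correspondences/ (the exact layout of that directory may differ) and concatenates them into the expected target file:

from pathlib import Path

dataset = Path("datasets/my_new_site")  # hypothetical dataset directory
src = dataset / "clustered_union_correspondences"
dst = dataset / "evaluation" / "context_correspondences.tsv"
dst.parent.mkdir(parents=True, exist_ok=True)
with dst.open("w") as out:
    for part in sorted(src.glob("*.tsv")):  # assumption: .tsv files per correspondence set
        out.write(part.read_text())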

References

[1] Oliver Lehmberg and Christian Bizer. 2019. Synthesizing N-ary Relations from Web Tables. In 9th International Conference on Web Intelligence, Mining and Semantics (WIMS2019), June 26-28, 2019, Seoul, Republic of Korea. ACM, New York, NY, USA.

[2] Oliver Lehmberg and Christian Bizer. 2019. Profiling the Semantics of N-ary Web Table Data. In The International Workshop on Semantic Big Data (SBD’19), July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA.
