
Synthesizing N-ary relations from Web tables (SNoW)

Web tables have been used in many approaches to extend existing data sources with additional data. So far, however, all approaches have been limited to the extraction of binary relations. We find that a large number of web tables are in fact n-ary relations, which means that, in many cases, the data can only be understood if more than two columns of a web table are considered.

The SNoW system identifies and synthesizes n-ary relations from web tables and integrates them with a knowledge base.

Given a set of web tables and a target knowledge base, the SNoW method extends each web table with additional context columns, stitches matching web tables into larger tables, and applies functional dependency discovery to identify the relations that are represented in the web tables. Further, it normalises the stitched tables, guided by the schema of the knowledge base, to create an integrated schema.

The SNoW system has been described and applied in [1,2]; please cite these references if you use it in any publication.

System Overview

The following figure visualises the integration process of the SNoW system. First, each web table is extended with context columns, which are generated from the web page that contains the web table. Then, web tables with identical schema are stitched into union tables. These union tables contain more tuples than the original web tables and improve the performance of schema matchers, which are executed in the next step. The schema matchers create a mapping among the union tables, to identify matching schemata which are expressed with different column headers, and a mapping from the union tables to the knowledge base. Based on the mapping among the union tables, the stitching is repeated to merge all union tables with matching schemata into stitched union tables. The stitched union tables contain the tuples of all web tables with the respective schema and enable the discovery of functional dependencies, which is infeasible on the much smaller original tables. Finally, all stitched union tables which are mapped to the same knowledge base class are merged into a stitched universal relation, which is normalised to create an integrated schema. This normalisation is guided by the mapping to the knowledge base and produces a star-shaped schema with the existing classes from the knowledge base at the centres and the n-ary relations that were discovered in the web tables referring to them.

Figure: SNoW process overview.
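To make the stitching step concrete, the following minimal Python sketch unions tables that share an identical schema. It is purely illustrative: the (header, rows) representation and the function name are assumptions made for this example, not SNoW's actual data model or API, and SNoW's matchers later also merge tables whose schemata use different column headers.

from collections import defaultdict

def stitch_identical_schemata(tables):
    """Union all tables that share exactly the same header into one table.

    tables: list of (header, rows) pairs, where header is a tuple of column
    names and rows is a list of value tuples. Returns one (header, rows)
    pair per distinct schema.
    """
    unions = defaultdict(list)
    for header, rows in tables:
        unions[header].extend(rows)
    return [(header, rows) for header, rows in unions.items()]

# Two small web tables with the same schema are stitched into one union table.
t1 = (("state", "employment"), [("Ohio", "12,030"), ("Texas", "25,160")])
t2 = (("state", "employment"), [("Utah", "1,980")])
print(stitch_identical_schemata([t1, t2]))

The resulting union table contains the tuples of all input tables, which is what gives the schema matchers and the functional dependency discovery in the later steps enough evidence to work with.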

As an example, consider the web table in the next figure, which contains employment statistics for a certain profession in several U.S. states at a given time. A human can easily understand this table, but it is hard for an algorithm: the table contains only very few examples (the figure shows the complete table), and important attributes, such as the profession and the date, are missing from the table itself. Existing approaches use recognisers or matchers to find column pairs which represent an existing or an unknown binary relation. In the example, such systems would extract a binary relation {state, employment}, which can be found in thousands of web tables from the same web site. As a result, the extracted binary relation contains about 2,000 different employment numbers for a single state, with no way to choose a single, correct value or to interpret any individual value. To extract a meaningful relation for the employment attribute, we must include all attributes which determine its value, i.e., discover the functional dependency {state, page title, URI 1, URI 2}→{employment}. This is only possible if we take additional data from the context of the web table into account and combine the observations from multiple web tables, which contain employment numbers for different professions and dates.

Figure: Example web table.
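A functional dependency X→Y holds on a table exactly when no two rows agree on X but differ on Y, and this can be checked mechanically. The sketch below illustrates the check on invented toy rows (not the actual table from the figure):

def fd_holds(rows, lhs, rhs):
    """Return True iff the functional dependency lhs -> rhs holds on rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # two rows agree on lhs but differ on rhs
    return True

# Toy rows mimicking the example: same state, but tables from different
# profession pages, so employment is not determined by the state alone.
rows = [
    {"state": "Ohio", "page title": "Dentists", "URI 1": "oes", "URI 2": "2015", "employment": "5,150"},
    {"state": "Ohio", "page title": "Nurses", "URI 1": "oes", "URI 2": "2015", "employment": "129,000"},
]
print(fd_holds(rows, ["state"], ["employment"]))                               # False
print(fd_holds(rows, ["state", "page title", "URI 1", "URI 2"], ["employment"]))  # True

This is why the example needs the context attributes: only with page title and the URI fragments on the left-hand side does the dependency on employment hold.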

Project Structure

SNoW Datasets

The datasets and annotations used for evaluation can be found in the datasets/ directory. Every sub-directory is a dataset for a single web site and contains the pre-processed union tables in union_dedup_json/ and the annotations in the evaluation/ sub-directory.
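A small Python snippet can be used to inspect this layout. It assumes, as an illustration, that it is run from the repository root and that each union table is stored as an individual file inside union_dedup_json/:

from pathlib import Path

# Run from the repository root, after extracting the datasets.
for dataset in sorted(Path("datasets").iterdir()):
    union_dir = dataset / "union_dedup_json"
    if not union_dir.is_dir():
        continue
    union_tables = list(union_dir.iterdir())
    print(f"{dataset.name}: {len(union_tables)} union table files, "
          f"annotations in {dataset / 'evaluation'}")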

SNoW Evaluation

We run the SNoW system on our datasets in different configurations. The first three configurations do not apply any matchers and instead use the schema correspondences from the ground truth.

  • Binary detects an entity label column in each table and assumes that all other attributes depend only on this column. This configuration uses the manually annotated schema mapping as input, extracts a binary relation for each reference relation, and thus resembles a perfect extractor for binary relations, serving as the baseline.

  • N-Ary uses functional dependency discovery instead of binary relations, but does not generate context attributes.

  • N-Ary + C uses functional dependency discovery and generates context attributes.

  • Match runs the SNoW system (N-Ary + C), but uses the matching components instead of the correspondences from the annotations.

  • Norm is the end-to-end configuration of the SNoW system: it uses functional dependency discovery, generates context attributes, uses the matching components, and applies normalisation.

Table 1: Dataset statistics and evaluation results. Web Tables and Union Tables give the number of tables per web site; A_O = original attributes, A_T = total attributes (original & context attributes), and A_U = universal attributes refer to the union tables; FDs is the number of annotated functional dependencies. The columns Schema to Norm report the F1-measure of the experiments.

Host               Web Tables   Union Tables    A_O     A_T    A_U   FDs   Schema   Binary   N-Ary   N-Ary+C   Match    Norm
d3football.com         40,584             12     63     145     41    18    0.779    0.495   0.480     0.851   0.765   0.765
data.bls.gov           10,824             12     61     181     51    12    0.895    0.415   0.485     0.984   0.900   0.900
flightaware.com         2,888              6     22      72     35    13    0.982    0.459   0.405     1.000   0.942   0.733
itunes.apple.com       42,729             76    470   1,095     59     6    0.827    0.680   0.674     0.953   0.809   0.763
seatgeek.com          157,578             72    266     714     71    30    0.999    0.988   0.992     0.984   0.962   0.962
www.amoeba.com          5,529             65    227     712     42    13    0.933    0.812   0.394     0.907   0.885   0.885
www.cia.gov            30,569            213    562   2,225    323   162    0.976    0.858   0.836     0.821   0.660   0.635
www.nndb.com           23,522             29    123     299     29    10    1.000    1.000   1.000     1.000   0.999   0.999
www.vgchartz.com       23,258              8     39      87     36    13    1.000    0.448   0.253     1.000   1.000   1.000
Sum / Macro Avg.      337,481            493  1,833   5,530    687   277    0.932    0.684   0.613     0.944   0.880   0.849

Running SNoW

After cloning the repository, use the unzip script to extract the datasets:

./unzip

Before running SNoW, download an implementation of the TANE functional dependency discovery algorithm to your machine and adjust the paths in SET_VARS for your environment:

J8="path to your java 8 executable"
JAR="path to the SNoW jar and its dependencies"
VMARGS="-Xmx300G"
TANE_HOME="path to the TANE implementation"

Note that VMARGS sets the maximum Java heap size (here 300 GB); adjust it to the memory available on your machine.

To run the SNoW system in different configurations, use one of the following scripts, which accept the path to the dataset as a parameter:

./run_snow datasets/d3football.com

  • run_snow_b: configuration Binary
  • run_snow_noContextColumns: configuration N-Ary
  • run_snow: configuration N-Ary + C
  • run_snow_match: configuration Match
  • run_snow_match_normalised: configuration Norm
  • run_snow_reference: Creates a reference extraction based on the annotations

To run the evaluation, use evaluate_containment and provide the path to the dataset as a parameter:

./evaluate_containment datasets/d3football.com

Or use evaluate_containment_all to run the evaluation on all datasets and configurations.

Creating New Datasets

To create new datasets, the clustered union tables must be created before running SNoW. Copy all web tables for a single web site into a directory and use create_clustered_union to create the union tables:

./create_clustered_union directory_containing_a_single_web_site

If you want to create the union tables for multiple web sites, create a directory with one sub-directory for every web site and run run_clustered_union with the directory as a parameter:

./run_clustered_union directory_containing_all_web_sites

During the creation of the union tables, a directory clustered_union_correspondences is created, which contains schema correspondences among the generated context columns. These correspondences must be copied into the evaluation sub-directory of the dataset under the file name context_correspondences.tsv.
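This copy step can be scripted. The sketch below assumes, as an illustration, that the correspondences are stored as one or more .tsv files inside clustered_union_correspondences/ (the exact layout of that directory may differ) and concatenates them into the expected target file:

from pathlib import Path

dataset = Path("datasets/my_new_site")  # hypothetical dataset directory
src = dataset / "clustered_union_correspondences"
dst = dataset / "evaluation" / "context_correspondences.tsv"
dst.parent.mkdir(parents=True, exist_ok=True)
with dst.open("w") as out:
    for part in sorted(src.glob("*.tsv")):  # assumption: .tsv files per correspondence set
        out.write(part.read_text())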

References

[1] Oliver Lehmberg and Christian Bizer. 2019. Synthesizing N-ary Relations from Web Tables. In 9th International Conference on Web Intelligence, Mining and Semantics (WIMS2019), June 26-28, 2019, Seoul, Republic of Korea. ACM, New York, NY, USA.

[2] Oliver Lehmberg and Christian Bizer. 2019. Profiling the Semantics of N-ary Web Table Data. In The International Workshop on Semantic Big Data (SBD’19), July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA.
