Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models

Overview

This repository contains the dataset, preprocessing scripts, and experiment results of the paper Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models. Freebase is a large-scale, open-domain knowledge graph. In the paper, we analyze the challenges and impacts that three of its idiosyncrasies (reverse triples, mediator nodes, and the type system) have on knowledge graph completion tasks such as link prediction.

Freebase is among the largest public cross-domain KGs that store common facts. It possesses several data modeling idiosyncrasies rarely found in comparable datasets such as Wikidata and YAGO. Though closed in 2015, Freebase still serves as an important knowledge graph in intelligent tasks. We checked all full-length papers that were published in the 2022 editions of 12 top conferences and that use datasets commonly used for link prediction. The 12 conferences are AAAI, IJCAI, WWW, KDD, ICML, ACL, EMNLP, NAACL, SIGIR, NeurIPS, SIGMOD, and VLDB. That amounts to 53 papers. 48 of the 53 papers used datasets produced from Freebase, while only 8 used datasets from Wikidata. The papers and the datasets they use are listed in the file papers.xlsx.

Reverse Triples

When a new fact was included in Freebase, it would be added as a pair of reverse triples. For instance, (A Room With A View, /film/film/directed_by, James Ivory) and (James Ivory, /film/director/film, A Room With A View) form a pair of reverse triples. They have the same semantic meaning.
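The idea of a reverse pair can be illustrated with a small data-driven check. This is only a sketch under stated assumptions: the function name `find_reverse_relation_pairs` and the `min_overlap` threshold are hypothetical, and the heuristic (two relations whose head/tail pairs mirror each other) is not necessarily how reverse triples are identified by the scripts in this repository.

```python
from collections import defaultdict

def find_reverse_relation_pairs(triples, min_overlap=1.0):
    """Return relation pairs (r1, r2) whose (head, tail) pairs mirror each other.

    min_overlap is the fraction of the smaller relation's pairs that must
    appear flipped in the other relation (hypothetical threshold).
    """
    pairs_by_relation = defaultdict(set)
    for head, relation, tail in triples:
        pairs_by_relation[relation].add((head, tail))

    reverse_pairs = []
    relations = sorted(pairs_by_relation)
    for i, r1 in enumerate(relations):
        flipped = {(t, h) for h, t in pairs_by_relation[r1]}
        for r2 in relations[i + 1:]:
            overlap = len(flipped & pairs_by_relation[r2])
            smaller = min(len(flipped), len(pairs_by_relation[r2]))
            if smaller and overlap / smaller >= min_overlap:
                reverse_pairs.append((r1, r2))
    return reverse_pairs

triples = [
    ("A Room With A View", "/film/film/directed_by", "James Ivory"),
    ("James Ivory", "/film/director/film", "A Room With A View"),
]
print(find_reverse_relation_pairs(triples))
```

Because both triples carry the same fact, a model that sees one of them during training can trivially answer the other at test time, which is why the variants below differ in whether reverse triples are retained.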

Mediator Nodes

Mediator nodes, also called CVT nodes, are used in Freebase to represent n-ary relationships. The figure below shows a CVT node connected to an award, a nominee, and a work. This or a similar approach is necessary for accurately modeling the real world.
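To make the role of a CVT node concrete, the sketch below represents one award nomination as binary edges through a mediator node and then collapses the mediator by concatenating the incoming and outgoing edge labels. Everything here is illustrative: the node names (`award_X`, `cvt_1`, and so on), the edge labels, and the function `collapse_cvt` are assumptions, and the repository's scripts may remove CVT nodes differently.

```python
def collapse_cvt(triples, cvt_nodes):
    """Replace paths head -> CVT -> tail with one edge whose label
    concatenates the two original labels; other triples are kept as-is."""
    incoming = {}  # CVT node -> list of (head, relation)
    outgoing = {}  # CVT node -> list of (relation, tail)
    kept = []
    for head, relation, tail in triples:
        if tail in cvt_nodes:
            incoming.setdefault(tail, []).append((head, relation))
        elif head in cvt_nodes:
            outgoing.setdefault(head, []).append((relation, tail))
        else:
            kept.append((head, relation, tail))

    for cvt, heads in incoming.items():
        for head, rel_in in heads:
            for rel_out, tail in outgoing.get(cvt, []):
                if head != tail:
                    kept.append((head, rel_in + "." + rel_out, tail))
    return kept

# Hypothetical nomination: an award, a nominee, and a work share one CVT node.
triples = [
    ("award_X", "/award/award_category/nominees", "cvt_1"),
    ("cvt_1", "/award/award_nomination/award_nominee", "nominee_Y"),
    ("cvt_1", "/award/award_nomination/nominated_for", "work_Z"),
]
collapsed = collapse_cvt(triples, cvt_nodes={"cvt_1"})
for triple in collapsed:
    print(triple)
```

Note that collapsing loses information: after the CVT node is gone, the graph no longer records that the nominee and the work belong to the same nomination.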

Type System

Freebase categorizes each topic into one or more types and each type into one domain. Furthermore, the triple instances satisfy pseudo constraints as if they are governed by a rigorous type system. Specifically, 1) given a node, its types set up constraints on the labels of its properties; the type segment in the label of an edge (which is different from the edge type) in most cases belongs to one of the types of the subject node. 2) Given an edge type and its edge instances, there is almost a function that maps from the edge type to a type that all subjects in the edge instances belong to, and similarly almost such a function for objects.
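Both observations can be sketched in a few lines. The helper names `type_segment` and `dominant_subject_type` are hypothetical, the second film and its director are made-up placeholders, and taking the most frequent subject type is just one simple way to approximate the "almost a function" mapping described above.

```python
from collections import Counter, defaultdict

def type_segment(edge_label):
    """Return the type segment of an edge label,
    e.g. '/film/film/directed_by' -> '/film/film'."""
    return edge_label.rsplit("/", 1)[0]

def dominant_subject_type(triples, entity_types):
    """For each edge type, return the type most of its subjects belong to."""
    counts = defaultdict(Counter)
    for head, relation, _tail in triples:
        for entity_type in entity_types.get(head, ()):
            counts[relation][entity_type] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}

# Toy data; "Another Film" and "Some Director" are placeholders.
entity_types = {
    "A Room With A View": ["/film/film"],
    "Another Film": ["/film/film", "/award/award_winning_work"],
    "James Ivory": ["/film/director", "/people/person"],
}
triples = [
    ("A Room With A View", "/film/film/directed_by", "James Ivory"),
    ("Another Film", "/film/film/directed_by", "Some Director"),
]
print(type_segment("/film/film/directed_by"))
print(dominant_subject_type(triples, entity_types))
```

In this toy example the type segment of the edge label and the dominant subject type coincide, which is the regularity the pseudo constraints describe.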

Dataset

Four variants of the Freebase dataset are provided by the inclusion/exclusion of various data modeling idiosyncrasies, which enables researchers to leverage or avoid such features based on the nature of their tasks. The dataset can be downloaded from this link.

Dataset Statistics

variant	CVT nodes	reverse triples	#entities	#properties	#triples
FB-CVT-REV	removed	removed	46,069,321	3,055	125,124,274
FB-CVT+REV	removed	retained	46,077,533	5,028	238,981,274
FB+CVT-REV	retained	removed	59,894,890	2,641	134,213,735
FB+CVT+REV	retained	retained	59,896,902	4,425	244,112,599

Dataset Details

The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

Subject matter triples file
- fb+/-CVT+/-REV: One folder for each variant. Each folder contains 5 files: train.txt, valid.txt, test.txt, entity2id.txt, and relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, that is, domains describing real-world facts.
  - Example of a row in train.txt, valid.txt, and test.txt:
    - 2, 192, 0
  - Example of a row in entity2id.txt:
    - /g/112yfy2xr, 2
  - Example of a row in relation2id.txt:
    - /music/album/release_type, 192
  - Explanation
    - "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs we assigned to the subject, the relationship, and the object, respectively.
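A minimal sketch of decoding such rows back into MIDs and relation labels is shown below. It assumes comma-separated rows as in the examples above, a head, relation, tail column order (suggested by the `udd_hrt` format used in the experiment command), and that "/m/02lx2r" has ID 0 as the explanation implies. The function names are hypothetical, and in-memory strings stand in for the real files.

```python
import csv
import io

def load_id_map(lines):
    """Read 'label, id' rows into an {id: label} dictionary."""
    id_to_label = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if row:  # skip blank lines
            label, idx = row
            id_to_label[int(idx)] = label
    return id_to_label

def decode_triples(lines, id_to_entity, id_to_relation):
    """Yield (head MID, relation label, tail MID) for each 'h, r, t' row."""
    for row in csv.reader(lines, skipinitialspace=True):
        if row:
            head, rel, tail = (int(x) for x in row)
            yield id_to_entity[head], id_to_relation[rel], id_to_entity[tail]

# In-memory stand-ins for entity2id.txt, relation2id.txt, and train.txt.
entity2id = io.StringIO("/g/112yfy2xr, 2\n/m/02lx2r, 0\n")
relation2id = io.StringIO("/music/album/release_type, 192\n")
train = io.StringIO("2, 192, 0\n")

id_to_entity = load_id_map(entity2id)
id_to_relation = load_id_map(relation2id)
decoded = list(decode_triples(train, id_to_entity, id_to_relation))
print(decoded)
```

For the real files, the `io.StringIO` objects would be replaced by open file handles; given the size of the dataset, streaming rows as done here is preferable to loading everything into memory.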
Type system file
- freebase_endtypes: Each row maps an edge type to its required subject type and object type.
  - Example
    - 92, 47178872, 90
  - Explanation
    - "92" and "90" are the type IDs of the subject and the object, respectively, of the relationship with ID "47178872".
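One possible use of this file is checking whether a triple agrees with the type system. The sketch below assumes the column order shown in the example (subject type ID, relationship ID, object type ID); the function names and the type IDs given to the head and tail entities are hypothetical.

```python
import csv
import io

def load_endtypes(lines):
    """Read 'subject_type, relation, object_type' rows into
    {relation_id: (subject_type_id, object_type_id)}."""
    endtypes = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if row:  # skip blank lines
            subj_type, relation, obj_type = (int(x) for x in row)
            endtypes[relation] = (subj_type, obj_type)
    return endtypes

def is_type_consistent(head_types, relation, tail_types, endtypes):
    """True if the head has the required subject type and the tail has
    the required object type for this relation."""
    subj_type, obj_type = endtypes[relation]
    return subj_type in head_types and obj_type in tail_types

# In-memory stand-in for the freebase_endtypes file.
endtypes = load_endtypes(io.StringIO("92, 47178872, 90\n"))

# Hypothetical type IDs for two entities.
print(is_type_consistent({92, 5}, 47178872, {90}, endtypes))
print(is_type_consistent({5}, 47178872, {90}, endtypes))
```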
Metadata files
- object_types: Each row maps the MID of a Freebase object to a type it belongs to.
  - Example
    - /g/11b41c22g, /type/object/type, /people/person
  - Explanation
    - The entity with MID "/g/11b41c22g" has a type "/people/person"
- object_names: Each row maps the MID of a Freebase object to its textual label.
  - Example
    - /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
  - Explanation
    - The entity with MID "/g/11b78qtr5m" has the name "Viroliano Tries Jazz" in English.
- object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
  - Example
    - /m/05v3y9r, /type/object/id, "/music/live_album/concert"
  - Explanation
    - The entity with MID "/m/05v3y9r" can be interpreted by humans as a music concert live album.
- domains_id_label: Each row maps the MID of a Freebase domain to its label.
  - Example
    - /m/05v4pmy, geology, 77
  - Explanation
    - The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
- types_id_label: Each row maps the MID of a Freebase type to its label.
  - Example
    - /m/01xljxh, /government/political_party, 147
  - Explanation
    - The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
- entities_id_label: Each row maps the MID of a Freebase entity to its label.
  - Example
    - /g/11b78qtr5m, Viroliano Tries Jazz, 2234
  - Explanation
    - The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
- properties_id_label: Each row maps the MID of a Freebase property to its label.
  - Example
    - /m/010h8tp2, /comedy/comedy_group/members, 47178867
  - Explanation
    - The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge). It has the label "/comedy/comedy_group/members" and the id "47178867" in our dataset.
- uri_original2simplified and uri_simplified2original: The mapping between the original URI and simplified URI and the mapping between simplified URI and original URI respectively.
  - Example
    - uri_original2simplified
      - "http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique"
      (the URI directs to nothing because Freebase has been closed)
    - uri_simplified2original
      - "/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique"
      (the URI directs to nothing because Freebase has been closed)
  - Explanation
    - The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset.
    - The identifier "/type/property/unique" in our dataset has URI http://rdf.freebase.com/ns/type.property.unique in the original Freebase RDF dataset.
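The example mapping suggests a simple rule: drop the namespace prefix and turn dots into slashes. The sketch below implements that rule as an assumption; the actual logic in parse_triples.sh may differ, and the function names are hypothetical.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"

def simplify_uri(uri):
    """'http://rdf.freebase.com/ns/type.property.unique' -> '/type/property/unique'.
    URIs outside the Freebase namespace are returned unchanged."""
    if not uri.startswith(FREEBASE_NS):
        return uri
    return "/" + uri[len(FREEBASE_NS):].replace(".", "/")

def restore_uri(identifier):
    """Inverse of simplify_uri, assuming no segment of the original
    identifier contained a literal '/'."""
    return FREEBASE_NS + identifier.lstrip("/").replace("/", ".")

original = "http://rdf.freebase.com/ns/type.property.unique"
simplified = simplify_uri(original)
print(simplified)
print(restore_uri(simplified))
```

A rule-based inverse like `restore_uri` can be ambiguous for unusual identifiers, so the lookup files uri_original2simplified and uri_simplified2original are the safer choice when an exact round trip matters.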

Experiments & Results

We conducted all the link prediction experiments on the four datasets using the DGL-KE framework (Zheng et al., 2020).

The hyperparameters used for each experiment, its training/test time, and more details can be found in the script provided for each dataset.

The results of these experiments on our datasets are shown in the table below.

	FB-CVT-REV				FB-CVT+REV				FB+CVT-REV				FB+CVT+REV
Model	MRR	MR	H1	H10	MRR	MR	H1	H10	MRR	MR	H1	H10	MRR	MR	H1	H10
TransE	0.806	5.869	0.757	0.884	0.976	1.529	0.968	0.988	0.781	4.850	0.708	0.902	0.970	1.464	0.957	0.989
DistMult	0.703	70.498	0.664	0.775	0.952	9.239	0.941	0.970	0.612	81.841	0.562	0.704	0.927	12.924	0.913	0.951
ComplEx	0.719	67.740	0.684	0.783	0.958	8.437	0.950	0.972	0.624	83.205	0.577	0.708	0.928	13.278	0.915	0.951
TransR	0.663	58.553	0.620	0.743	0.944	5.982	0.931	0.967	0.640	47.524	0.580	0.754	0.935	6.071	0.916	0.969
RotatE	0.804	75.721	0.780	0.845	0.962	10.431	0.956	0.974	0.736	68.436	0.699	0.807	0.948	10.263	0.938	0.969
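For readers unfamiliar with the column headers, the sketch below shows how MRR (mean reciprocal rank), MR (mean rank), and H1/H10 (Hits@1 and Hits@10) are conventionally computed from the rank of the correct entity in each test prediction. The function name and the sample ranks are made up for illustration; DGL-KE computes these metrics internally, and its exact ranking protocol (for example, ranking against sampled negatives) is not reproduced here.

```python
def ranking_metrics(ranks, ks=(1, 10)):
    """Compute MRR, MR, and Hits@k from 1-based ranks of the correct entity."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / rank for rank in ranks) / n,  # higher is better
        "MR": sum(ranks) / n,                          # lower is better
    }
    for k in ks:
        # Fraction of predictions where the correct entity is in the top k.
        metrics[f"H{k}"] = sum(rank <= k for rank in ranks) / n
    return metrics

# Hypothetical ranks for four test predictions.
metrics = ranking_metrics([1, 2, 4, 20])
print(metrics)
```

MR is sensitive to a few very poor ranks, while MRR and Hits@k emphasize how often the correct entity appears near the top, which is why a model can have a high MRR and still a comparatively large MR.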

Another way of evaluating embedding models is to find their performance on triple classification. This task is the binary classification of triples regarding whether they are true or false facts. The results of our triple classification task are shown in the tables below.

	consistent h				inconsistent h
Model	Precision	Recall	Acc	F1	Precision	Recall	Acc	F1
RESCAL	0.59	0.37	0.55	0.45	0.95	0.83	0.89	0.89
TransE	0.52	0.59	0.52	0.55	0.81	0.69	0.76	0.74
DistMult	0.53	0.51	0.53	0.52	0.94	0.87	0.91	0.90
ComplEx	0.54	0.48	0.53	0.51	0.94	0.88	0.91	0.91
ConvE	0.54	0.53	0.54	0.53	0.57	0.72	0.59	0.64
RotatE	0.52	0.53	0.52	0.52	0.89	0.83	0.87	0.86
	consistent t				inconsistent t
Model	Precision	Recall	Acc	F1	Precision	Recall	Acc	F1
RESCAL	0.64	0.45	0.60	0.53	0.95	0.86	0.91	0.90
TransE	0.58	0.54	0.57	0.56	0.90	0.82	0.86	0.86
DistMult	0.59	0.55	0.58	0.57	0.95	0.89	0.92	0.92
ComplEx	0.60	0.56	0.59	0.58	0.95	0.90	0.93	0.92
ConvE	0.62	0.41	0.58	0.49	0.95	0.83	0.89	0.88
RotatE	0.60	0.47	0.58	0.53	0.87	0.78	0.83	0.82

The triple classification experiments were run using the LibKGE framework (Broscheit et al., 2020).
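The four columns in the triple classification tables follow the usual definitions for binary classification. The sketch below computes them from confusion-matrix counts; the function name and the counts are hypothetical and do not correspond to any row in the tables.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 for binary triple classification.

    tp/fn: true triples classified as true/false;
    tn/fp: false triples classified as false/true.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"Precision": precision, "Recall": recall, "Acc": accuracy, "F1": f1}

# Hypothetical counts for 50 true and 50 false triples.
scores = classification_metrics(tp=30, fp=10, tn=40, fn=20)
print(scores)
```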

Scripts

Data Preparation Scripts

The parse_triples.sh script is used for URI simplification.
FBDataDump.sh is a script that runs parse_triples.sh and creates different MySQL tables from the Freebase data dump, for example, tables for domains, types, properties, and entities. Command to run FBDataDump.sh:

./FBDataDump.sh mysql_username mysql_password

After running FBDataDump.sh, you may want to run one of the four scripts provided for each variant. All these four scripts detach the subject matter triples from the metadata and administrative triples. In addition, all these scripts create a type system for the final dataset. Command to run FBx.sh, where x ∈ {1,2,3,4}:

./FBx.sh mysql_username mysql_password
If you need to remove all the reverse triples as well as all the CVT nodes, you can run FB1.sh.
To keep the reverse triples but remove the CVT nodes, you can run FB2.sh.
To keep the CVT nodes but to remove the reverse triples, you can run FB3.sh.
To keep both CVT nodes and reverse triples, you can run script FB4.sh.
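The correspondence between the two choices (CVT nodes, reverse triples), the script to run, and the variant names used in the statistics table can be summarized as a small lookup. The helper names below are hypothetical, and the pairing of scripts with variant names is inferred from the descriptions above.

```python
def script_for_variant(keep_cvt, keep_reverse):
    """Return the script that builds the variant with the given choices."""
    scripts = {
        (False, False): "FB1.sh",  # remove CVT nodes and reverse triples
        (False, True): "FB2.sh",   # remove CVT nodes, keep reverse triples
        (True, False): "FB3.sh",   # keep CVT nodes, remove reverse triples
        (True, True): "FB4.sh",    # keep both
    }
    return scripts[(keep_cvt, keep_reverse)]

def variant_name(keep_cvt, keep_reverse):
    """Return the variant name in the style of the statistics table."""
    cvt_sign = "+" if keep_cvt else "-"
    rev_sign = "+" if keep_reverse else "-"
    return f"FB{cvt_sign}CVT{rev_sign}REV"

for keep_cvt in (False, True):
    for keep_reverse in (False, True):
        print(variant_name(keep_cvt, keep_reverse),
              script_for_variant(keep_cvt, keep_reverse))
```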

Experiments Scripts

We ran experiments on the four variants of Freebase as well as FB15k and FB15k-237 using link prediction models such as TransE, DistMult, ComplEx, and RotatE. The scripts to run the experiments are the .sh files in ExperimentsScripts/. An example of running the DistMult model on FB1 is shown below.

dglke_train --model_name DistMult --dataset Freebase --data_path ./data --format udd_hrt \
  --data_files entity2id.txt relation2id.txt train.txt valid.txt test.txt --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 \
  --lr 0.08 --batch_size_eval 1000 --test -adv --mix_cpu_gpu --num_proc 8 --gpu 0 1 --max_step 300000 --neg_sample_size_eval 1000 \
  --eval_interval 100000 --log_interval 1000 --async_update --rel_part --force_sync_interval 10000 --num_thread 4 --no_save_emb --delimiter ,

Related Work

Please feel free to check out another paper of ours related to this topic: Realistic re-evaluation of knowledge graph completion methods: An experimental study

License

The dataset and code are made available under the CC0 1.0 Universal license.

Note: The Freebase Data Dumps are provided free of charge for any purpose. They are distributed under the Creative Commons Attribution license (CC-BY), and their use is subject to the Terms of Service.
