Managing FAIR Knowledge Graphs as Polyglot Data End Points: A Benchmark based on the rdf2pg Framework and Plant Biology Data

This repository contains code to benchmark three graph databases and their query languages against plant biology datasets. The datasets are conceptually aligned, that is, they are based on the same data model across the different database and language flavours.

This is work by the KnetMiner team and Carlos Bobed.

The alignment is produced by means of the rdf2pg framework. This work helps assess the benefits of managing data in multiple data languages and formats using our rdf2pg tools.

This work is an extension of previous work by the KnetMiner team, which we presented at SWAT4LS 2018 (old presentation here).

Test settings

We have tested three combinations of graph database, graph query language and data format:

SPARQL on the Virtuoso triple store, dealing with RDF data (and the corresponding model).
Cypher on the Neo4j graph database, with data directly imported into the database from our rdf2neo tool.
Gremlin on ArcadeDB, with data imported from files in graphML format.

Details on the test settings used are in the dataset loading results report.

Test datasets

For each of the graph databases mentioned above, we have tested the loading and the query performance of three datasets:

Biopax: a small dataset, mostly containing data about the Arabidopsis model organism, including pathways from AraCyc and gene annotations from Gene Ontology.
Arabidopsis: a medium-size dataset, containing more data about Arabidopsis, including AraCyc, Gene Ontology, gene annotations from ENSEMBL Plants and TAIR, protein annotations from UniProt, scientific publications from PubMed.
Poaceae: a large dataset with integrated data about different cereals (wheat, rice and barley), obtained from a variety of sources, including the ones mentioned above, plus genome-wide study data from AraGWAS and more. Partial access to this dataset is available via KnetMiner programmatic data access endpoints.

Data schematisation

The figure below shows the main types contained in each dataset:

This model was encoded based on BioKNO, an application ontology defined within the KnetMiner platform to represent the data we deal with in that platform. BioKNO models common plant biology entities, some specific patterns used by KnetMiner applications, and mappings to existing biology ontologies and life science standards.

Test approach

We have done two types of tests:

Data Loading tests

Loading tests, where we measured the time taken to populate each database with each of the tested datasets. See the linked report for details.

Querying tests

After loading each dataset, we performed querying tests, where, for each dataset, we tested all of the chosen databases and query languages, each time timing the same set of queries. More precisely, for each of the tested query languages, we wrote conceptually equivalent queries.
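As a rough illustration of "timing the same set of queries", the sketch below runs each named query several times and records the mean wall-clock time. It is not the benchmark's actual harness: `time_queries`, `run_query`, `fake_runner` and the `repeats` parameter are hypothetical names introduced here, and a real runner would send the query text to Virtuoso, Neo4j or ArcadeDB.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_queries(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    repeats: int = 5,
) -> Dict[str, float]:
    """Return the mean wall-clock time in seconds for each named query."""
    results: Dict[str, float] = {}
    for name, text in queries.items():
        samples: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(text)  # hypothetical client call (HTTP request, driver call, ...)
            samples.append(time.perf_counter() - start)
        results[name] = statistics.mean(samples)
    return results


# Stand-in runner, used only to show the call shape (assumption: no real database).
def fake_runner(query_text: str) -> int:
    return len(query_text)


timings = time_queries(
    fake_runner,
    {"cnt": "placeholder query text", "join": "placeholder query text"},
    repeats=3,
)
```

Passing the runner as a function keeps the timing loop identical for every database, so only the client call and the query text change between runs.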

While "conceptually equivalent" is difficult to define precisely, informally, it means the best effort to search for data that have the same semantics and equivalent representations in the different technologies and formats being tested. It also means writing queries that, across different technologies, present similar levels of complexity and search engine challenges.

For example, where it is easy for Neo4j to return a node property or an empty value (because properties are attached to the nodes), we have translated this as OPTIONAL matches in SPARQL (since looking for a resource property is a triple pattern like any other).
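The difference can be mimicked on toy data. The sketch below is only an in-memory analogy, not code from the benchmark: the identifiers `gene1`, `gene2`, `ex:Gene` and `ex:description` are made up for illustration. It shows that a plain triple pattern behaves like an inner join and drops entities lacking the property, while an OPTIONAL pattern behaves like a left join and matches what a property graph returns.

```python
# Toy data: two genes, only one has a description (hypothetical values).
nodes = [
    {"id": "gene1", "description": "example description"},
    {"id": "gene2"},
]

# Property-graph view: the property lives on the node, so a missing one
# simply reads as None, much as Cypher returns null for an absent property.
pg_rows = [(n["id"], n.get("description")) for n in nodes]

# RDF view: every property is a separate triple.
triples = [
    ("gene1", "rdf:type", "ex:Gene"),
    ("gene2", "rdf:type", "ex:Gene"),
    ("gene1", "ex:description", "example description"),
]

genes = [s for s, p, o in triples if p == "rdf:type" and o == "ex:Gene"]
descriptions = {s: o for s, p, o in triples if p == "ex:description"}

# Plain triple pattern (inner join): genes without a description drop out.
required_rows = [(g, descriptions[g]) for g in genes if g in descriptions]

# OPTIONAL pattern (left join): every gene is kept, missing values stay unbound.
optional_rows = [(g, descriptions.get(g)) for g in genes]
```

Here `optional_rows` contains both genes, like `pg_rows`, whereas `required_rows` contains only `gene1`.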

Test results

The (Jupyter-based) reports linked above have more test details and detailed results.

TODO: Updates about ArcadeDB

We have started testing ArcadeDB with its SQL dialect, using the same datasets and the same queries. The results are preliminary, and this work is to be continued.

Query List

Like the data, the queries listed below are based on the already-mentioned BioKNO ontology. We have split the benchmark queries into categories that take into account both the query semantics and the kind of challenge they pose to the query engines.

Regarding the semantic motif queries, these reproduce patterns that occur often in KnetMiner, when we want to associate genes with other relevant entities (such as encoded proteins, biological processes, publications about genes or processes). In practice, a semantic motif query is a 'chain' pattern: it tries to follow a linear path from a gene to another entity, through a known chain of relations (e.g., Gene -> encodes -> Protein -> participates -> Process -> mentioned -> Publication). Details are in the KnetMiner Wiki and in the KnetMiner paper.
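To make the 'chain' idea concrete, the sketch below turns a start type and a list of (relation, type) steps into a linear Cypher pattern. This is an illustration only, not how KnetMiner or this benchmark generates its queries: `chain_to_cypher` is a hypothetical helper, and the labels and relation names are copied from the informal example above rather than from the actual BioKNO vocabulary.

```python
from typing import List, Tuple


def chain_to_cypher(start_label: str, steps: List[Tuple[str, str]]) -> str:
    """Build a linear Cypher MATCH pattern from a start label and (relation, label) steps."""
    pattern = f"(n0:{start_label})"
    for i, (relation, label) in enumerate(steps, start=1):
        # Each step extends the path by one outgoing relation and one node.
        pattern += f"-[:{relation}]->(n{i}:{label})"
    # Return only the two ends of the chain: the gene and the entity reached.
    return f"MATCH {pattern} RETURN n0, n{len(steps)}"


motif = chain_to_cypher(
    "Gene",
    [("encodes", "Protein"), ("participates", "Process"), ("mentioned", "Publication")],
)
print(motif)
```

For the example chain, this prints `MATCH (n0:Gene)-[:encodes]->(n1:Protein)-[:participates]->(n2:Process)-[:mentioned]->(n3:Publication) RETURN n0, n3`. Longer motifs only add steps to the list, which is why the short, medium and long motif queries mainly differ in path length.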

WARNING: do not edit what follows! It is automatically generated via this code.

Category: counts

Common counts of elements like number of nodes, number of relations, etc.

cnt: Counts instances, SPARQL, Cypher, Gremlin
cntType: Instances of a given type, SPARQL, Cypher, Gremlin
cntRel: Count relations, SPARQL, Cypher, Gremlin
cntRelType: Count relations of a given type, SPARQL, Cypher, Gremlin

Category: selects

Queries that select elements, including simple joins.

sel: Select entity and properties, SPARQL, Cypher, Gremlin
join: Simple Join, SPARQL, Cypher, Gremlin
joinRel: Join literal properties of reified relations, SPARQL, Cypher, Gremlin
joinFilter: Simple join + attribute filter, SPARQL, Cypher, Gremlin
joinRe: Simple join + regex search, SPARQL, Cypher, Gremlin
joinReif: Join through relation property, SPARQL, Cypher, Gremlin

Category: unions

Queries that perform graph pattern and subquery unions.

2union: 2 unions, no nesting, SPARQL, Cypher, Gremlin
2union1Nest: 2 unions, 1 nesting, SPARQL, Cypher, Gremlin
2union1Nest+: 2 unions, 1 nesting (with Cypher CALL), SPARQL, Cypher, Gremlin
pway: Complex union of paths over pathways, SPARQL, Cypher, Gremlin
exist: Not exists, SPARQL, Cypher, Gremlin
existAg: Not exists + aggregation, SPARQL, Cypher, Gremlin

Category: aggregation

Queries that perform data grouping and aggregations.

grp: Group by, SPARQL, Cypher, Gremlin
grpAg: Group by + 2 aggregation functions, SPARQL, Cypher, Gremlin
mulGrpAg: Multiple subqueries having aggregations, SPARQL, Cypher, Gremlin
nestAg: Nested and outer aggregations (see Q6 from the Berlin benchmark), SPARQL, Cypher, Gremlin

Category: paths

Queries that select and traverse paths.

varPathC: Variable path query (fixed len), SPARQL, Cypher, Gremlin
varPath: Variable path query (unbound len and restricted on top), SPARQL, Cypher, Gremlin
shrtSmf: Short Semantic Motif, SPARQL, Cypher, Gremlin
medSmf: Medium length Semantic Motif, SPARQL, Cypher, Gremlin
lngSmf: Long and Complex Semantic Motif, SPARQL, Cypher, Gremlin
