feat(gravsearch): Optimise Gravsearch queries using topological sort (DSP-1327) (#1813)
Benjamin Geer committed Mar 2, 2021
1 parent 7ce4b65 commit efbecee
Showing 19 changed files with 2,815 additions and 473 deletions.
58 changes: 58 additions & 0 deletions docs/03-apis/api-v2/query-language.md
@@ -1211,3 +1211,61 @@ CONSTRUCT {
}
ORDER BY (?int)
```

## Query Optimization by Dependency

The query performance of triplestores, such as Fuseki, is highly dependent on the order of query
patterns. To improve performance, Gravsearch automatically reorders the
statement patterns in the WHERE clause according to their dependencies on each other, to minimise
the number of possible matches for each pattern.
This optimization can be controlled using the `gravsearch-dependency-optimisation`
[feature toggle](../feature-toggles.md), which is turned on by default.

Consider the following Gravsearch query:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?letter beol:creationDate ?date .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?person1 beol:hasIAFIdentifier ?gnd1 .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?person2 beol:hasIAFIdentifier ?gnd2 .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
} ORDER BY ?date
```

Gravsearch optimises the performance of this query by moving these statements
to the top of the WHERE clause:

```
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
```

The rest of the WHERE clause then reads:

```
?person1 beol:hasIAFIdentifier ?gnd1 .
?person2 beol:hasIAFIdentifier ?gnd2 .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?letter beol:creationDate ?date .
```
164 changes: 163 additions & 1 deletion docs/05-internals/design/api-v2/gravsearch.md
@@ -332,4 +332,166 @@ replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHas

The triplestore-specific transformers in `SparqlTransformer.scala` can run optimisations on the generated SPARQL, in
the method `optimiseQueryPatterns` inherited from `WhereTransformer`. For example, `moveLuceneToBeginning` moves
Lucene queries to the beginning of the block in which they occur.

## Query Optimization by Topological Sorting of Statements

GraphDB seems to have built-in algorithms for optimizing query time. The query performance of Fuseki, however, depends heavily
on the order of the query statements. For example, a query such as the one below:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?letter beol:creationDate ?date .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?person1 beol:hasIAFIdentifier ?gnd1 .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?person2 beol:hasIAFIdentifier ?gnd2 .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
} ORDER BY ?date
```

takes a very long time with Fuseki. The performance of this query can be improved
by moving up the statements with literal objects that are not dependent on any other statement:

```
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
```

The rest of the query then reads:

```
?person1 beol:hasIAFIdentifier ?gnd1 .
?person2 beol:hasIAFIdentifier ?gnd2 .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?letter beol:creationDate ?date .
```

Since we cannot expect clients to know about performance of triplestores in order to write efficient queries, we have
implemented an optimization method to automatically rearrange the statements of the given queries.
Upon receiving the Gravsearch query, the algorithm converts the query to a graph. For each statement pattern,
the subject of the statement is the origin node, the predicate is a directed edge, and the object
is the target node. For the query above, this conversion would result in the following graph:

![query_graph](figures/query_graph.png)
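
The conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Scala implementation in the commit: the `Triple` type and the `build_graph` function are hypothetical names, and FILTER patterns are left out because they do not contribute edges.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement pattern (hypothetical helper type for this sketch)."""
    subj: str
    pred: str
    obj: str


def build_graph(triples):
    """Map each node to the set of nodes its outgoing edges point to.

    The subject is the origin node and the object is the target node.
    Objects are registered as keys too, so leaf nodes show up with an
    empty set of successors.
    """
    graph = {}
    for t in triples:
        graph.setdefault(t.subj, set()).add(t.obj)
        graph.setdefault(t.obj, set())
    return graph


# Statement patterns of the example query's WHERE clause, in their original order.
WHERE = [
    Triple("?letter", "beol:creationDate", "?date"),
    Triple("?letter", "?linkingProp1", "?person1"),
    Triple("?letter", "?linkingProp2", "?person2"),
    Triple("?person1", "beol:hasIAFIdentifier", "?gnd1"),
    Triple("?gnd1", "knora-api:valueAsString", '"(DE-588)118531379"'),
    Triple("?person2", "beol:hasIAFIdentifier", "?gnd2"),
    Triple("?gnd2", "knora-api:valueAsString", '"(DE-588)118696149"'),
]

graph = build_graph(WHERE)
```

In this sketch the predicate is not stored in the graph, since only the direction of the edges matters for sorting.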

The [Graph for Scala](http://www.scala-graph.org/) library is used to construct the graph and sort it using [Kahn's
topological sorting algorithm](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm).

The algorithm returns the nodes of the graph ordered in several layers, where the
root element `?letter` is in layer 0, `[?date, ?person1, ?person2]` are in layer 1, `[?gnd1, ?gnd2]` in layer 2, and the
leaf nodes `[(DE-588)118531379, (DE-588)118696149]` are given in the last layer (i.e. layer 3).
A graph can have more than one valid topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are listed from left to right, from the highest
order to the lowest):

- `(?letter, ?date, ?person2, ?person1, ?gnd2, ?gnd1, (DE-588)118696149, (DE-588)118531379)`
- `(?letter, ?date, ?person1, ?person2, ?gnd1, ?gnd2, (DE-588)118531379, (DE-588)118696149)`.
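
The layering described above can be reproduced with a small Python sketch of Kahn's algorithm. It is written from scratch for illustration and does not use Graph for Scala; the function name and the alphabetical ordering inside each layer are assumptions of this sketch. One way to arrive at the figure of 24 is to count the orders obtained by permuting nodes within each layer (3! · 2! · 2!), which is how the sketch computes it; that reading is an assumption, not something the text states.

```python
from math import factorial, prod

# The example query as a graph: node -> set of successor nodes.
GRAPH = {
    "?letter": {"?date", "?person1", "?person2"},
    "?date": set(),
    "?person1": {"?gnd1"},
    "?person2": {"?gnd2"},
    "?gnd1": {'"(DE-588)118531379"'},
    "?gnd2": {'"(DE-588)118696149"'},
    '"(DE-588)118531379"': set(),
    '"(DE-588)118696149"': set(),
}


def layered_topological_sort(graph):
    """Kahn's algorithm, grouping the nodes into layers.

    Layer 0 holds the nodes without incoming edges; removing a layer
    exposes the next one. Raises ValueError if the graph has a cycle.
    """
    in_degree = {node: 0 for node in graph}
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    layers = []
    seen = 0
    current = sorted(n for n, d in in_degree.items() if d == 0)
    while current:
        layers.append(current)
        seen += len(current)
        next_layer = []
        for node in current:
            for succ in graph[node]:
                in_degree[succ] -= 1
                if in_degree[succ] == 0:
                    next_layer.append(succ)
        current = sorted(next_layer)  # alphabetical order: arbitrary tie-breaking

    if seen != len(graph):
        raise ValueError("graph contains a cycle")
    return layers


layers = layered_topological_sort(GRAPH)
# Number of orders obtained by permuting nodes within each layer.
orders_within_layers = prod(factorial(len(layer)) for layer in layers)
```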

From all valid topological orders, one is chosen based on certain criteria; for example, the leaf node should not
belong to a statement that has the predicate `rdf:type`, since that could match all resources of the specified type.
Once the best order is chosen, it is used to re-arrange the query
statements. Starting from the last leaf node, i.e.
`(DE-588)118696149`, the method finds the statement pattern which has this node as its object, and brings this statement
to the top of the query. This rearrangement continues so that the statements with the fewest dependencies on other
statements are all brought to the top of the query. The resulting query is as follows:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?person2 beol:hasIAFIdentifier ?gnd2 .
?person1 beol:hasIAFIdentifier ?gnd1 .
?letter ?linkingProp2 ?person2 .
?letter ?linkingProp1 ?person1 .
?letter beol:creationDate ?date .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
} ORDER BY ?date
```

Note that the position of the FILTER statements does not play a significant role in the optimization.
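
The rearrangement step can be sketched as follows. This is a simplified Python illustration under stated assumptions: statements are plain `(subject, predicate, object)` tuples, the topological order is hard-coded to the second of the two example orders listed earlier, and the FILTER patterns are omitted (they would be appended after the statement patterns). The function name `reorder` is hypothetical.

```python
# Statement patterns of the WHERE clause in their original order.
STATEMENTS = [
    ("?letter", "beol:creationDate", "?date"),
    ("?letter", "?linkingProp1", "?person1"),
    ("?letter", "?linkingProp2", "?person2"),
    ("?person1", "beol:hasIAFIdentifier", "?gnd1"),
    ("?gnd1", "knora-api:valueAsString", '"(DE-588)118531379"'),
    ("?person2", "beol:hasIAFIdentifier", "?gnd2"),
    ("?gnd2", "knora-api:valueAsString", '"(DE-588)118696149"'),
]

# The chosen topological order, from the root to the last leaf node.
ORDER = [
    "?letter", "?date", "?person1", "?person2", "?gnd1", "?gnd2",
    '"(DE-588)118531379"', '"(DE-588)118696149"',
]


def reorder(statements, topological_order):
    """Walk the order from the last leaf back to the root and emit,
    for each node, the statements that have this node as their object."""
    reordered = []
    for node in reversed(topological_order):
        for stmt in statements:
            if stmt[2] == node and stmt not in reordered:
                reordered.append(stmt)
    # Keep any statement whose object is missing from the order.
    reordered.extend(s for s in statements if s not in reordered)
    return reordered


reordered = reorder(STATEMENTS, ORDER)
```

With these inputs, `reordered` lists the statement patterns in the same sequence as the WHERE clause of the optimised query shown above.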

If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or
`FILTER NOT EXISTS`, they are reordered
by defining a graph per block. For example, consider the following query with `UNION`:

```sparql
{
?thing anything:hasRichtext ?richtext .
FILTER knora-api:matchText(?richtext, "test")
?thing anything:hasInteger ?int .
?int knora-api:intValueAsInt 1 .
}
UNION
{
?thing anything:hasText ?text .
FILTER knora-api:matchText(?text, "test")
?thing anything:hasInteger ?int .
?int knora-api:intValueAsInt 3 .
}
```
This would result in one graph per block of the `UNION`. Each graph is then sorted, and the statements of its
block are rearranged according to the topological order of that graph. This is the result:

```sparql
{
?int knora-api:intValueAsInt 1 .
?thing anything:hasRichtext ?richtext .
?thing anything:hasInteger ?int .
FILTER(knora-api:matchText(?richtext, "test"))
} UNION {
?int knora-api:intValueAsInt 3 .
?thing anything:hasText ?text .
?thing anything:hasInteger ?int .
FILTER(knora-api:matchText(?text, "test"))
}
```
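
The per-block treatment can be illustrated with the following Python sketch, which sorts and reorders each `UNION` block independently. The function names are hypothetical, FILTER patterns are omitted, and ties within a layer are broken alphabetically, which is an arbitrary choice made for this sketch rather than the selection criteria used by Gravsearch.

```python
def topological_order(statements):
    """Kahn's algorithm over the graph of one block (subject -> object)."""
    successors, in_degree = {}, {}
    for subj, _pred, obj in statements:
        successors.setdefault(subj, []).append(obj)
        successors.setdefault(obj, [])
        in_degree.setdefault(subj, 0)
        in_degree[obj] = in_degree.get(obj, 0) + 1

    order = []
    layer = sorted(n for n, d in in_degree.items() if d == 0)
    while layer:
        order.extend(layer)
        next_layer = []
        for node in layer:
            for succ in successors[node]:
                in_degree[succ] -= 1
                if in_degree[succ] == 0:
                    next_layer.append(succ)
        layer = sorted(next_layer)
    return order


def reorder_block(statements):
    """Reorder the statements of a single block, starting from its leaves."""
    result = []
    for node in reversed(topological_order(statements)):
        result += [s for s in statements if s[2] == node]
    # Statements on a cycle are never reached; keep them at the end.
    result += [s for s in statements if s not in result]
    return result


# One list of statement patterns per block of the UNION.
union_blocks = [
    [
        ("?thing", "anything:hasRichtext", "?richtext"),
        ("?thing", "anything:hasInteger", "?int"),
        ("?int", "knora-api:intValueAsInt", "1"),
    ],
    [
        ("?thing", "anything:hasText", "?text"),
        ("?thing", "anything:hasInteger", "?int"),
        ("?int", "knora-api:intValueAsInt", "3"),
    ],
]

# Each block gets its own graph and its own ordering.
reordered_blocks = [reorder_block(block) for block in union_blocks]
```

Because each block is processed on its own, the variable `?int` can be bound to different values in the two blocks without the graphs interfering with each other.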

### Cyclic Graphs

The topological sorting algorithm can only be used for DAGs (directed acyclic graphs). However,
a Gravsearch query can contain statements that result in a cyclic graph, e.g.:

```sparql
PREFIX anything: <http://0.0.0.0:3333/ontology/0001/anything/simple/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
CONSTRUCT {
?thing knora-api:isMainResource true .
} WHERE {
?thing anything:hasOtherThing ?thing1 .
?thing1 anything:hasOtherThing ?thing2 .
?thing2 anything:hasOtherThing ?thing .
}
```

In this case, the algorithm tries to break the cycles in order to sort the graph. If this is not possible,
the query statements are not reordered.
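
The fallback behaviour can be sketched in Python as follows. The sketch only detects a cycle and leaves the statements untouched in that case; it does not attempt to break cycles, since the text does not describe how that is done. The names `has_cycle` and `reorder_if_acyclic` are hypothetical, and the `reorder` argument stands for whatever reordering function is in use.

```python
def has_cycle(statements):
    """Return True if the graph built from the statements is cyclic.

    Kahn's algorithm can only remove nodes that have no remaining
    incoming edges, so nodes on a cycle are never removed.
    """
    successors, in_degree = {}, {}
    for subj, _pred, obj in statements:
        successors.setdefault(subj, []).append(obj)
        successors.setdefault(obj, [])
        in_degree.setdefault(subj, 0)
        in_degree[obj] = in_degree.get(obj, 0) + 1

    ready = [n for n, d in in_degree.items() if d == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for succ in successors[node]:
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.append(succ)
    return removed < len(in_degree)


def reorder_if_acyclic(statements, reorder):
    """Apply `reorder` only when the graph can be sorted."""
    return list(statements) if has_cycle(statements) else reorder(statements)


CYCLIC = [
    ("?thing", "anything:hasOtherThing", "?thing1"),
    ("?thing1", "anything:hasOtherThing", "?thing2"),
    ("?thing2", "anything:hasOtherThing", "?thing"),
]

# The cyclic query keeps its original statement order.
unchanged = reorder_if_acyclic(CYCLIC, reorder=lambda s: list(reversed(s)))
```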
4 changes: 4 additions & 0 deletions third_party/dependencies.bzl
@@ -137,6 +137,9 @@ def dependencies():
# Additional Selenium libraries besides the ones pulled in during init
# of io_bazel_rules_webtesting
"org.seleniumhq.selenium:selenium-support:3.141.59",

# Graph for Scala
"org.scala-graph:graph-core_2.12:1.13.1",
],
repositories = [
"https://repo.maven.apache.org/maven2",
@@ -187,6 +190,7 @@ BASE_TEST_DEPENDENCIES = [
"@maven//:org_scalatest_scalatest_shouldmatchers_2_12",
"@maven//:org_scalatest_scalatest_compatible",
"@maven//:org_scalactic_scalactic_2_12",
"@maven//:org_scala_graph_graph_core_2_12",
]

BASE_TEST_DEPENDENCIES_WITH_JSON = BASE_TEST_DEPENDENCIES + [
15 changes: 15 additions & 0 deletions webapi/src/main/resources/application.conf
@@ -292,6 +292,21 @@ app {
"Benjamin Geer <benjamin.geer@dasch.swiss>"
]
}

gravsearch-dependency-optimisation {
description = "Optimise Gravsearch queries by reordering query patterns according to their dependencies."

available-versions = [ 1 ]
default-version = 1
enabled-by-default = yes
override-allowed = yes
expiration-date = "2021-12-01T00:00:00Z"

developer-emails = [
"Sepideh Alassi <sepideh.alassi@dasch.swiss>"
"Benjamin Geer <benjamin.geer@dasch.swiss>"
]
}
}

shacl {
@@ -44,5 +44,6 @@ scala_library(
"@maven//:org_scala_lang_scala_reflect",
"@maven//:org_slf4j_slf4j_api",
"@maven//:org_springframework_security_spring_security_core",
"@maven//:org_scala_graph_graph_core_2_12",
],
)
