feat(gravsearch): Optimise Gravsearch queries using topological sort (DSP-1327) #1813

Merged
merged 42 commits into from Mar 2, 2021
7f4aa5a
feat(gravsearch): Start implementation of topological sort.
Feb 4, 2021
a431f02
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 5, 2021
84938b9
feat(gravsearch) create a graph from statement patterns
SepidehAlassi Feb 8, 2021
b9da0ac
Merge branch 'main' into wip/DSP-1327-gravsearch
SepidehAlassi Feb 8, 2021
c01c796
fix (grvsearch): immutable graph
SepidehAlassi Feb 8, 2021
2d2afd7
fix (gravsearch) is not cyclic, sort the graph
SepidehAlassi Feb 8, 2021
41c8be2
fix (gravsearch): use input query for sorting
SepidehAlassi Feb 8, 2021
e514050
feat(gravsearch) use directed hyper edge
SepidehAlassi Feb 8, 2021
2778e57
feat(gravearch) change to DiHyperedge
SepidehAlassi Feb 9, 2021
8aaccdb
Merge branch 'main' into wip/DSP-1327-gravsearch
SepidehAlassi Feb 9, 2021
41147b5
feat(gravsearch): sort statements
SepidehAlassi Feb 9, 2021
106f335
fix (gravsearch): correctly preserve the order of statements as indic…
SepidehAlassi Feb 9, 2021
6d7168a
test (gravsearch) test the recursive function for sorting statements …
SepidehAlassi Feb 10, 2021
68a3bbd
fix the failing test
SepidehAlassi Feb 10, 2021
ce490e1
feat(gravsearch): break cycles in graph
SepidehAlassi Feb 10, 2021
29aba9d
refactor (gravsearch) clean up
SepidehAlassi Feb 11, 2021
d0290a0
move the topological sort to prequery generator
SepidehAlassi Feb 15, 2021
4eca3b5
style(gravsearch): Clean up a few things.
Feb 15, 2021
cfbc750
test(gravsearch): Improve test.
Feb 15, 2021
cde90b6
feat(gravsearch): Add utility for finding all topological orders of a…
Feb 16, 2021
f092bf3
feat(gravsearch): Fix topological sort bugs, add tests.
Feb 16, 2021
1a07c99
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 17, 2021
431bd75
test(gravsearch): Fix test.
Feb 17, 2021
bb5e48b
feat(gravsearch): Prefer topological orders that don't put rdf:type s…
Feb 17, 2021
9497151
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 17, 2021
f33a90d
fix(gravsearch): Correctly handle standoff classes in optimisations.
Feb 18, 2021
5e96dba
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 18, 2021
1339b0c
test(gravsearch): Update toplogical reordering tests.
Feb 18, 2021
47e74d4
test(gravsearch): Fix test.
Feb 19, 2021
2bf4b0a
feat(gravsearch): Add feature toggle for topological sort optimisation.
Feb 19, 2021
018dd47
test(gravsearch): Clean up test.
Feb 19, 2021
4eb807e
style(gravsearch): Use import wildcard.
Feb 19, 2021
35587e5
style(test): Add copyright.
Feb 23, 2021
420f873
style(gravsearch): Add comment.
Feb 23, 2021
2f84b69
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 25, 2021
14f0648
Merge branch 'main' into wip/DSP-1327-gravsearch
Feb 25, 2021
75e9023
feat (gravsearch) get all permutations of topological order according…
SepidehAlassi Mar 1, 2021
94f577b
Merge branch 'wip/DSP-1327-gravsearch' of https://github.com/dasch-sw…
SepidehAlassi Mar 1, 2021
f3d8de6
fix(gravsearch) fix the failing test
SepidehAlassi Mar 1, 2021
885bb33
style(TopologicalSortUtil): Improve style a little bit.
Mar 1, 2021
f9fb4aa
doc (gravsearch) documentation about optimization of gravsearches wit…
SepidehAlassi Mar 1, 2021
b2272c6
style(docs): Improve style a bit.
Mar 2, 2021
58 changes: 58 additions & 0 deletions docs/03-apis/api-v2/query-language.md
@@ -1211,3 +1211,61 @@ CONSTRUCT {
}
ORDER BY (?int)
```

## Query Optimization by Dependency

The query performance of triplestores, such as Fuseki, is highly dependent on the order of query
patterns. To improve performance, Gravsearch automatically reorders the
statement patterns in the WHERE clause according to their dependencies on each other, to minimise
the number of possible matches for each pattern.
This optimization can be controlled using the `gravsearch-dependency-optimisation`
[feature toggle](../feature-toggles.md), which is turned on by default.

Consider the following Gravsearch query:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>

CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?letter beol:creationDate ?date .

?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )

?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )

?person1 beol:hasIAFIdentifier ?gnd1 .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .

?person2 beol:hasIAFIdentifier ?gnd2 .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
} ORDER BY ?date
```

Gravsearch optimises the performance of this query by moving these statements
to the top of the WHERE clause:

```
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
```

The rest of the WHERE clause then reads:

```
?person1 beol:hasIAFIdentifier ?gnd1 .
?person2 beol:hasIAFIdentifier ?gnd2 .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )

?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?letter beol:creationDate ?date .
```
164 changes: 163 additions & 1 deletion docs/05-internals/design/api-v2/gravsearch.md
@@ -332,4 +332,166 @@ replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHas

The triplestore-specific transformers in `SparqlTransformer.scala` can run optimisations on the generated SPARQL, in
the method `optimiseQueryPatterns` inherited from `WhereTransformer`. For example, `moveLuceneToBeginning` moves
Lucene queries to the beginning of the block in which they occur.

## Query Optimization by Topological Sorting of Statements

GraphDB appears to optimise query execution internally, but the query performance of Fuseki depends heavily
on the order of the query statements. For example, a query such as the one below:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>

CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?letter beol:creationDate ?date .

?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )

?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )

?person1 beol:hasIAFIdentifier ?gnd1 .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .

?person2 beol:hasIAFIdentifier ?gnd2 .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
} ORDER BY ?date
```

takes a very long time with Fuseki. The performance of this query can be improved
by moving up the statements with literal objects that are not dependent on any other statement:

```
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
```

The rest of the query then reads:

```
?person1 beol:hasIAFIdentifier ?gnd1 .
?person2 beol:hasIAFIdentifier ?gnd2 .
?letter ?linkingProp1 ?person1 .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )

?letter ?linkingProp2 ?person2 .
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
?letter beol:creationDate ?date .
```

Since we cannot expect clients to know about performance of triplestores in order to write efficient queries, we have
implemented an optimization method to automatically rearrange the statements of the given queries.
Upon receiving the Gravsearch query, the algorithm converts the query to a graph. For each statement pattern,
the subject of the statement is the origin node, the predicate is a directed edge, and the object
is the target node. For the query above, this conversion would result in the following graph:

![query_graph](figures/query_graph.png)
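
As a rough illustration of this conversion, the sketch below builds the same graph in Python. It is not the Scala code from this PR: the tuple representation of statement patterns and the name `build_graph` are assumptions made only for this example.

```python
from collections import defaultdict

# Statement patterns of the example query as (subject, predicate, object) tuples.
# This tuple form is an assumption for the sketch, not Gravsearch's internal model.
STATEMENTS = [
    ("?letter", "beol:creationDate", "?date"),
    ("?letter", "?linkingProp1", "?person1"),
    ("?letter", "?linkingProp2", "?person2"),
    ("?person1", "beol:hasIAFIdentifier", "?gnd1"),
    ("?gnd1", "knora-api:valueAsString", "(DE-588)118531379"),
    ("?person2", "beol:hasIAFIdentifier", "?gnd2"),
    ("?gnd2", "knora-api:valueAsString", "(DE-588)118696149"),
]


def build_graph(statements):
    """Map each node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
        graph.setdefault(obj, [])  # make sure nodes without outgoing edges exist too
    return dict(graph)


graph = build_graph(STATEMENTS)
# Roots have no incoming edge; leaves have no outgoing edge.
roots = set(graph) - {obj for edges in graph.values() for _, obj in edges}
leaves = {node for node, edges in graph.items() if not edges}
```

In this sketch `?letter` is the only root, while `?date` and the two literals are the nodes without outgoing edges.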

The [Graph for Scala](http://www.scala-graph.org/) library is used to construct the graph and sort it using [Kahn's
topological sorting algorithm](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm).

The algorithm returns the nodes of the graph ordered in several layers, where the
root element `?letter` is in layer 0, `[?date, ?person1, ?person2]` are in layer 1, `[?gnd1, ?gnd2]` in layer 2, and the
leaf nodes `[(DE-588)118531379, (DE-588)118696149]` are given in the last layer (i.e. layer 3).
Because the nodes within a layer can be listed in any order, there are multiple valid permutations of the
topological order. The graph in the example above has 24 such permutations (3! × 2! × 2!). Here are two of them
(nodes are listed from left to right, from the root to the leaves):

- `(?letter, ?date, ?person2, ?person1, ?gnd2, ?gnd1, (DE-588)118696149, (DE-588)118531379)`
- `(?letter, ?date, ?person1, ?person2, ?gnd1, ?gnd2, (DE-588)118531379, (DE-588)118696149)`.
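
The layers and the count of 24 can be reproduced with a small Python sketch of Kahn's algorithm. This is an illustration only: it does not use Graph for Scala, the function names are hypothetical, and enumerating orders by permuting nodes inside each layer is an assumption based on the layer description above.

```python
from itertools import permutations, product

# Edges (subject -> object) of the example query graph, predicates omitted.
EDGES = [
    ("?letter", "?date"),
    ("?letter", "?person1"),
    ("?letter", "?person2"),
    ("?person1", "?gnd1"),
    ("?person2", "?gnd2"),
    ("?gnd1", "(DE-588)118531379"),
    ("?gnd2", "(DE-588)118696149"),
]


def topological_layers(edges):
    """Kahn's algorithm, emitting nodes layer by layer.

    Raises ValueError if the graph contains a cycle.
    """
    nodes = {n for edge in edges for n in edge}
    in_degree = {n: 0 for n in nodes}
    for _, target in edges:
        in_degree[target] += 1

    layers = []
    seen = 0
    current = sorted(n for n in nodes if in_degree[n] == 0)
    while current:
        layers.append(current)
        seen += len(current)
        next_layer = []
        for node in current:
            for source, target in edges:
                if source == node:
                    in_degree[target] -= 1
                    if in_degree[target] == 0:
                        next_layer.append(target)
        current = sorted(next_layer)  # sorting only makes the output deterministic
    if seen != len(nodes):
        raise ValueError("graph has a cycle")
    return layers


def layer_permutations(layers):
    """All orders obtained by permuting the nodes inside each layer."""
    per_layer = [list(permutations(layer)) for layer in layers]
    return [sum(choice, ()) for choice in product(*per_layer)]


layers = topological_layers(EDGES)
orders = layer_permutations(layers)
```

Both orders listed above appear in `orders`, which contains 24 tuples for this graph.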

From all valid topological orders, one is chosen based on certain criteria; for example, the leaf node should not
belong to a statement that has predicate `rdf:type`, since that could match all resources of the specified type.
Once the best order is chosen, it is used to re-arrange the query
statements. Starting from the last leaf node, i.e.
`(DE-588)118696149`, the method finds the statement pattern which has this node as its object, and brings this statement
to the top of the query. This rearrangement continues so that the statements with the fewest dependencies on other
statements are all brought to the top of the query. The resulting query is as follows:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>

CONSTRUCT {
?letter knora-api:isMainResource true .
?letter ?linkingProp1 ?person1 .
?letter ?linkingProp2 ?person2 .
?letter beol:creationDate ?date .
} WHERE {
?gnd2 knora-api:valueAsString "(DE-588)118696149" .
?gnd1 knora-api:valueAsString "(DE-588)118531379" .
?person2 beol:hasIAFIdentifier ?gnd2 .
?person1 beol:hasIAFIdentifier ?gnd1 .
?letter ?linkingProp2 ?person2 .
?letter ?linkingProp1 ?person1 .
?letter beol:creationDate ?date .
FILTER(?linkingProp1 = beol:hasAuthor || ?linkingProp1 = beol:hasRecipient )
FILTER(?linkingProp2 = beol:hasAuthor || ?linkingProp2 = beol:hasRecipient )
} ORDER BY ?date
```

Note that the position of the FILTER statements does not play a significant role in the optimization.
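
The rearrangement step can be sketched in Python as follows. This is a simplified illustration, not the PR's Scala code: statement patterns are modelled as tuples, `reorder_statements` is a hypothetical name, and FILTERs are left out because they are not part of the graph.

```python
# Statement patterns of the example query, in their original order.
STATEMENTS = [
    ("?letter", "beol:creationDate", "?date"),
    ("?letter", "?linkingProp1", "?person1"),
    ("?letter", "?linkingProp2", "?person2"),
    ("?person1", "beol:hasIAFIdentifier", "?gnd1"),
    ("?gnd1", "knora-api:valueAsString", "(DE-588)118531379"),
    ("?person2", "beol:hasIAFIdentifier", "?gnd2"),
    ("?gnd2", "knora-api:valueAsString", "(DE-588)118696149"),
]

# One of the valid topological orders listed above.
ORDER = (
    "?letter", "?date", "?person1", "?person2",
    "?gnd1", "?gnd2", "(DE-588)118531379", "(DE-588)118696149",
)


def reorder_statements(statements, order):
    """Reorder (subject, predicate, object) statements by a topological order.

    Walks the order from its last node to its first and, for each node,
    emits the statements that have that node as their object.
    """
    reordered = []
    for node in reversed(order):
        for statement in statements:
            if statement[2] == node and statement not in reordered:
                reordered.append(statement)
    # Statements whose object is not in the order keep their relative position at the end.
    reordered += [s for s in statements if s not in reordered]
    return reordered


reordered = reorder_statements(STATEMENTS, ORDER)
```

With this order, `reordered` lists the statement patterns in the same sequence as the WHERE clause of the optimised query above.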

If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or
`FILTER NOT EXISTS`, they are reordered
by defining a graph per block. For example, consider the following query with `UNION`:

```sparql
{
?thing anything:hasRichtext ?richtext .
FILTER knora-api:matchText(?richtext, "test")
?thing anything:hasInteger ?int .
?int knora-api:intValueAsInt 1 .
}
UNION
{
?thing anything:hasText ?text .
FILTER knora-api:matchText(?text, "test")
?thing anything:hasInteger ?int .
?int knora-api:intValueAsInt 3 .
}
```
This would result in one graph per block of the `UNION`. Each graph is then sorted, and the statements of its
block are rearranged according to the topological order of the graph. This is the result:

```sparql
{
?int knora-api:intValueAsInt 1 .
?thing anything:hasRichtext ?richtext .
?thing anything:hasInteger ?int .
FILTER(knora-api:matchText(?richtext, "test"))
} UNION {
?int knora-api:intValueAsInt 3 .
?thing anything:hasText ?text .
?thing anything:hasInteger ?int .
FILTER(knora-api:matchText(?text, "test"))
}
```
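
The per-block treatment can be sketched in Python as below. The sketch is an assumption-laden illustration rather than the PR's implementation: `sort_block` and `sort_union` are hypothetical names, FILTERs are omitted, and ties within a layer are broken alphabetically only to make the output deterministic.

```python
def sort_block(statements):
    """Reorder one block's (subject, predicate, object) statements, leaves first."""
    nodes = {n for s, _, o in statements for n in (s, o)}
    in_degree = {n: 0 for n in nodes}
    for _, _, obj in statements:
        in_degree[obj] += 1

    # Layered Kahn's algorithm over this block's own graph.
    order, ready = [], sorted(n for n in nodes if in_degree[n] == 0)
    while ready:
        order.extend(ready)
        next_ready = []
        for node in ready:
            for subj, _, obj in statements:
                if subj == node:
                    in_degree[obj] -= 1
                    if in_degree[obj] == 0:
                        next_ready.append(obj)
        ready = sorted(next_ready)
    if len(order) != len(nodes):
        return list(statements)  # cyclic block: leave it untouched

    # Walk the order backwards and emit the statements pointing at each node.
    result = []
    for node in reversed(order):
        result += [s for s in statements if s[2] == node and s not in result]
    return result


def sort_union(blocks):
    """Sort every block of a UNION independently of the others."""
    return [sort_block(block) for block in blocks]


UNION_BLOCKS = [
    [
        ("?thing", "anything:hasRichtext", "?richtext"),
        ("?thing", "anything:hasInteger", "?int"),
        ("?int", "knora-api:intValueAsInt", "1"),
    ],
    [
        ("?thing", "anything:hasText", "?text"),
        ("?thing", "anything:hasInteger", "?int"),
        ("?int", "knora-api:intValueAsInt", "3"),
    ],
]
sorted_blocks = sort_union(UNION_BLOCKS)
```

Each block gets its own graph, so `?int` in the first block and `?int` in the second are sorted separately, and the statement with the literal object moves to the top of its block.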

### Cyclic Graphs

The topological sorting algorithm can only be used for DAGs (directed acyclic graphs). However,
a Gravsearch query can contain statements that result in a cyclic graph, e.g.:

```sparql
PREFIX anything: <http://0.0.0.0:3333/ontology/0001/anything/simple/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>

CONSTRUCT {
?thing knora-api:isMainResource true .
} WHERE {
?thing anything:hasOtherThing ?thing1 .
?thing1 anything:hasOtherThing ?thing2 .
?thing2 anything:hasOtherThing ?thing .
}
```

In this case, the algorithm tries to break the cycles in order to sort the graph. If this is not possible,
the query statements are not reordered.
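
This section does not describe how the cycles are broken. One generic technique, shown here purely as an assumption and not as the PR's method, is to run a depth-first search and drop every edge that points back to a node still on the search path; what remains is a DAG. The name `break_cycles` is hypothetical.

```python
def break_cycles(edges):
    """Return the edges with every DFS back edge removed, which leaves a DAG."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in successors}
    back_edges = set()

    def visit(node):
        colour[node] = GREY
        for target in successors[node]:
            if colour[target] == GREY:  # this edge closes a cycle
                back_edges.add((node, target))
            elif colour[target] == WHITE:
                visit(target)
        colour[node] = BLACK

    for node in successors:
        if colour[node] == WHITE:
            visit(node)
    return [edge for edge in edges if edge not in back_edges]


# Edges of the cyclic example query above.
CYCLE = [
    ("?thing", "?thing1"),
    ("?thing1", "?thing2"),
    ("?thing2", "?thing"),
]
acyclic = break_cycles(CYCLE)
```

Which edge is dropped depends on where the search starts; here it starts at `?thing`, so the edge from `?thing2` back to `?thing` is removed and the remaining two edges can be sorted topologically.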
4 changes: 4 additions & 0 deletions third_party/dependencies.bzl
@@ -137,6 +137,9 @@ def dependencies():
# Additional Selenium libraries besides the ones pulled in during init
# of io_bazel_rules_webtesting
"org.seleniumhq.selenium:selenium-support:3.141.59",

# Graph for Scala
"org.scala-graph:graph-core_2.12:1.13.1",
],
repositories = [
"https://repo.maven.apache.org/maven2",
@@ -187,6 +190,7 @@ BASE_TEST_DEPENDENCIES = [
"@maven//:org_scalatest_scalatest_shouldmatchers_2_12",
"@maven//:org_scalatest_scalatest_compatible",
"@maven//:org_scalactic_scalactic_2_12",
"@maven//:org_scala_graph_graph_core_2_12",
]

BASE_TEST_DEPENDENCIES_WITH_JSON = BASE_TEST_DEPENDENCIES + [
15 changes: 15 additions & 0 deletions webapi/src/main/resources/application.conf
@@ -292,6 +292,21 @@ app {
"Benjamin Geer <benjamin.geer@dasch.swiss>"
]
}

gravsearch-dependency-optimisation {
description = "Optimise Gravsearch queries by reordering query patterns according to their dependencies."

available-versions = [ 1 ]
default-version = 1
enabled-by-default = yes
override-allowed = yes
expiration-date = "2021-12-01T00:00:00Z"

developer-emails = [
"Sepideh Alassi <sepideh.alassi@dasch.swiss>"
"Benjamin Geer <benjamin.geer@dasch.swiss>"
]
}
}

shacl {
@@ -44,5 +44,6 @@ scala_library(
"@maven//:org_scala_lang_scala_reflect",
"@maven//:org_slf4j_slf4j_api",
"@maven//:org_springframework_security_spring_security_core",
"@maven//:org_scala_graph_graph_core_2_12",
],
)