feat(gravsearch): improve gravsearch performance by using unions in prequery (DEV-492) #2045

Merged
merged 36 commits into `main` from `wip/DEV-492-gravsearch-performance-attempt-2` on May 10, 2022
Changes from 31 commits
Commits
36 commits
01b0409
Update SparqlTransformer.scala
BalduinLandolt Apr 16, 2022
ebfa4a6
feat: add superPropertyOf map to ontology cache
BalduinLandolt Apr 16, 2022
b0fcf11
refactor: reduce logging noise
BalduinLandolt Apr 16, 2022
f5e73c0
chore: add clean-sbt target to makefile
BalduinLandolt Apr 16, 2022
71a2965
feat: replace property path query statements with unions for subPrope…
BalduinLandolt Apr 16, 2022
9c71166
feat: use unions for subclasses
BalduinLandolt Apr 16, 2022
bc273b1
refactor: tidy up some old mess
BalduinLandolt Apr 16, 2022
c2a6c7b
refactor: add more logging
BalduinLandolt Apr 16, 2022
307f0dc
add limiting param to transformer to reduce inference
BalduinLandolt Apr 16, 2022
76f93bd
ignore failing test for now
BalduinLandolt Apr 16, 2022
e433422
feat: start working on reducing union options on basis of the query
BalduinLandolt Apr 16, 2022
efc86e1
tidy up
BalduinLandolt Apr 16, 2022
d01bd0f
minor improvements
BalduinLandolt Apr 16, 2022
ac55e77
get tests to pass
BalduinLandolt Apr 16, 2022
d465807
feat: limit subclasses
BalduinLandolt Apr 19, 2022
5b2569d
feat: include optimization in count query
BalduinLandolt Apr 19, 2022
5b59d03
test: minimal test for compound objects with gravsearch
BalduinLandolt Apr 19, 2022
2ac09e6
test: test simulated inference with union patterns
BalduinLandolt Apr 19, 2022
b3605cc
refactor start tidying up
BalduinLandolt Apr 19, 2022
cb55379
refactor: more tidying up
BalduinLandolt Apr 19, 2022
5256319
refactor: tidy up
BalduinLandolt Apr 19, 2022
b7dbcef
docs: start documenting the changes
BalduinLandolt Apr 19, 2022
1602e06
refactor: remove unused code
BalduinLandolt Apr 19, 2022
471fca9
docs: update documentation
BalduinLandolt Apr 21, 2022
3171675
refactor: remove some code smells
BalduinLandolt Apr 21, 2022
3807b45
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt Apr 26, 2022
633dc40
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt Apr 28, 2022
d8dadba
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt May 5, 2022
91e966c
refactor: tidy up, improve variable naming and add documentation
BalduinLandolt May 5, 2022
a75993e
refactor: format sparqlTransformarSpec.scala
BalduinLandolt May 5, 2022
183275d
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt May 9, 2022
e133edb
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt May 10, 2022
3407500
Apply suggestions from code review
BalduinLandolt May 10, 2022
99b4b48
tidy up
BalduinLandolt May 10, 2022
faa64b3
wrap up according to review
BalduinLandolt May 10, 2022
e5f0626
Merge branch 'main' into wip/DEV-492-gravsearch-performance-attempt-2
BalduinLandolt May 10, 2022
7 changes: 7 additions & 0 deletions Makefile
@@ -280,6 +280,13 @@ clean-local-tmp:
@rm -rf .tmp
@mkdir .tmp

.PHONY: clean-metals
clean-metals: ## clean SBT and Metals related stuff
@rm -rf .bloop
@rm -rf .bsp
@rm -rf .metals
@rm -rf target

clean: docs-clean clean-local-tmp clean-docker clean-sipi-tmp ## clean build artifacts
@rm -rf .env

2 changes: 1 addition & 1 deletion docs/01-introduction/what-is-knora.md
@@ -74,7 +74,7 @@ and can regenerate the original XML document at any time.

DSP-API provides a search language, [Gravsearch](../03-apis/api-v2/query-language.md),
that is designed to meet the needs of humanities researchers. Gravsearch supports DSP-API's
humanites-focused data structures, including calendar-independent dates and standoff markup, as well
humanities-focused data structures, including calendar-independent dates and standoff markup, as well
as fast full-text searches. This allows searches to combine text-related criteria with any other
criteria. For example, you could search for a text that contains a certain word
and also mentions a person who lived in the same city as another person who is the
52 changes: 23 additions & 29 deletions docs/03-apis/api-v2/query-language.md
@@ -13,15 +13,15 @@ criteria) while avoiding their drawbacks in terms of performance and
security (see [The Enduring Myth of the SPARQL
Endpoint](https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/)).
It also has the benefit of enabling clients to work with a simpler RDF
data model than the one Knora actually uses to store data in the
data model than the one the API actually uses to store data in the
triplestore, and makes it possible to provide better error-checking.

Rather than being processed directly by the triplestore, a Gravsearch query
is interpreted by Knora, which enforces certain
is interpreted by the API, which enforces certain
restrictions on the query, and implements paging and permission
checking. The API server generates SPARQL based on the Gravsearch query
submitted, queries the triplestore, filters the results according to the
user's permissions, and returns each page of query results as a Knora
user's permissions, and returns each page of query results as an
API response. Thus, Gravsearch is a hybrid between a RESTful API and a
SPARQL endpoint.

@@ -80,14 +80,14 @@ If a gravsearch query times out, a `504 Gateway Timeout` will be returned.
A Gravsearch query can be written in either of the two
[DSP-API v2 schemas](introduction.md#api-schema). The simple schema
is easier to work with, and is sufficient if you don't need to query
anything below the level of a Knora value. If your query needs to refer to
anything below the level of a DSP-API value. If your query needs to refer to
standoff markup, you must use the complex schema. Each query must use a single
schema, with one exception (see [Date Comparisons](#date-comparisons)).

Gravsearch query results can be requested in the simple or complex schema;
see [API Schema](introduction.md#api-schema).

All examples hereafter run with Knora started locally as documented in the section [Getting Started with DSP-API](../../04-publishing-deployment/getting-started.md). If you access another Knora-Stack, you can check the IRI of the ontology you are targeting by requesting the [ontologies metadata](ontology-information.md#querying-ontology-metadata).
All examples hereafter run with the DSP stack started locally as documented in the section [Getting Started with DSP-API](../../04-publishing-deployment/getting-started.md). If you access another stack, you can check the IRI of the ontology you are targeting by requesting the [ontologies metadata](ontology-information.md#querying-ontology-metadata).

### Using the Simple Schema

@@ -100,8 +100,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
```

In the simple schema, Knora values are represented as literals, which
can be used in `FILTER` expressions
In the simple schema, DSP-API values are represented as literals, which can be used in `FILTER` expressions
(see [Filtering on Values in the Simple Schema](#filtering-on-values-in-the-simple-schema)).

### Using the Complex Schema
@@ -115,7 +114,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/v2#>
```

In the complex schema, Knora values are represented as objects belonging
In the complex schema, DSP-API values are represented as objects belonging
to subclasses of `knora-api:Value`, e.g. `knora-api:TextValue`, and have
predicates of their own, which can be used in `FILTER` expressions
(see [Filtering on Values in the Complex Schema](#filtering-on-values-in-the-complex-schema)).
@@ -182,7 +181,7 @@ permission to see a matching dependent resource, the link value is hidden.
## Paging

Gravsearch results are returned in pages. The maximum number of main
resources per page is determined by Knora (and can be configured
resources per page is determined by the API (and can be configured
in `application.conf` via the setting `app/v2/resources-sequence/results-per-page`).
If some resources have been filtered out because the user does not have
permission to see them, a page could contain fewer results, or no results.
@@ -195,25 +194,20 @@ one at a time, until the response does not contain `knora-api:mayHaveMoreResults`.
## Inference

Gravsearch queries are understood to imply a subset of
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). Depending on the
triplestore being used, this may be implemented using the triplestore's
own reasoner or by query expansion in Knora.
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). The API implements this by expanding the incoming query.

Specifically, if a statement pattern specifies a property, the pattern will
also match subproperties of that property, and if a statement specifies that
a subject has a particular `rdf:type`, the statement will also match subjects
belonging to subclasses of that type.
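
Conceptually, this expansion rewrites a single statement pattern into a `UNION` over the pattern itself plus one branch per known subproperty (or subclass). The following sketch uses purely hypothetical names (`ex:hasDocument` and its subproperties are not part of any DSP ontology):

```sparql
# As written in the Gravsearch query:
?thing ex:hasDocument ?doc .

# Conceptual expansion, assuming ex:hasDocument has exactly
# two subproperties, ex:hasReport and ex:hasLetter:
{ ?thing ex:hasDocument ?doc . }
UNION
{ ?thing ex:hasReport ?doc . }
UNION
{ ?thing ex:hasLetter ?doc . }
```

The same rewriting applies to `rdf:type` statements, with one branch per subclass of the specified type.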

If you know that reasoning will not return any additional results for
your query, you can disable it by adding this line to the `WHERE` clause:
your query, you can disable it by adding this line to the `WHERE` clause, which may improve query performance:

```sparql
knora-api:GravsearchOptions knora-api:useInference false .
```
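
For example, a complete query with inference disabled might look like this (a sketch using the prefixes and the `incunabula` ontology from the earlier examples):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>

CONSTRUCT {
    ?book knora-api:isMainResource true .
} WHERE {
    knora-api:GravsearchOptions knora-api:useInference false .
    ?book a incunabula:book .
}
```

Disabling inference is safe here only if `incunabula:book` is queried directly, rather than via a superclass, and no subclasses of it should also match.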

If Knora is implementing reasoning by query expansion, disabling it can
improve the performance of some queries.

## Gravsearch Syntax

Every Gravsearch query is a valid SPARQL 1.1
@@ -244,8 +238,8 @@ clauses use the following patterns, with the specified restrictions:
unordered set of triples. However, a Gravsearch query returns an
ordered list of resources, which can be ordered by the values of
specified properties. If the query is written in the complex schema,
items below the level of Knora values may not be used in `ORDER BY`.
- `BIND`: The value assigned must be a Knora resource IRI.
items below the level of DSP-API values may not be used in `ORDER BY`.
- `BIND`: The value assigned must be a DSP resource IRI.

### Resources, Properties, and Values

@@ -269,7 +263,7 @@ must be represented as a query variable.

#### Filtering on Values in the Simple Schema

In the simple schema, a variable representing a Knora value can be used
In the simple schema, a variable representing a DSP-API value can be used
directly in a `FILTER` expression. For example:

```
@@ -279,7 +273,7 @@ FILTER(?title = "Zeitglöcklein des Lebens und Leidens Christi")

Here the type of `?title` is `xsd:string`.

The following Knora value types can be compared with literals in `FILTER`
The following value types can be compared with literals in `FILTER`
expressions in the simple schema:

- Text values (`xsd:string`)
@@ -295,7 +289,7 @@ performing an exact match on a list node's label. Labels can be given in different languages.
If one of the given list node labels matches, it is considered a match.
Note that in the simple schema, uniqueness is not guaranteed (as opposed to the complex schema).

A Knora value may not be represented as the literal object of a predicate;
A DSP-API value may not be represented as the literal object of a predicate;
for example, this is not allowed:

```
@@ -304,9 +298,9 @@ for example, this is not allowed:

#### Filtering on Values in the Complex Schema

In the complex schema, variables representing Knora values are not literals.
In the complex schema, variables representing DSP-API values are not literals.
You must add something to the query (generally a statement) to get a literal
from a Knora value. For example:
from a DSP-API value. For example:

```
?book incunabula:title ?title .
@@ -479,7 +473,7 @@ within a single paragraph.
If you are only interested in specifying that a resource has some text
value containing a standoff link to another resource, the most efficient
way is to use the property `knora-api:hasStandoffLinkTo`, whose subjects and objects
are resources. This property is automatically maintained by Knora. For example:
are resources. This property is automatically maintained by the API. For example:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
@@ -623,7 +617,7 @@ CONSTRUCT {

### Filtering on `rdfs:label`

The `rdfs:label` of a resource is not a Knora value, but you can still search for it.
The `rdfs:label` of a resource is not a DSP-API value, but you can still search for it.
This can be done in the same ways in the simple or complex schema:

Using a string literal object:
@@ -708,8 +702,8 @@ clause but not in the `CONSTRUCT` clause, the matching resources or values
will not be included in the results.

If the query is written in the complex schema, all variables in the `CONSTRUCT`
clause must refer to Knora resources, Knora values, or properties. Data below
the level of Knora values may not be mentioned in the `CONSTRUCT` clause.
clause must refer to DSP-API resources, DSP-API values, or properties. Data below
the level of values may not be mentioned in the `CONSTRUCT` clause.

Predicates from the `rdf`, `rdfs`, and `owl` ontologies may not be used
in the `CONSTRUCT` clause. The `rdfs:label` of each matching resource is always
@@ -921,7 +915,7 @@ adding statements with the predicate `rdf:type`. The subject must be a resource
and the object must either be `knora-api:Resource` (if the subject is a resource)
or the subject's specific type (if it is a value).

For example, consider this query that uses a non-Knora property:
For example, consider this query that uses a non-DSP property:

```
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
@@ -992,7 +986,7 @@ CONSTRUCT {
Note that it only makes sense to use `dcterms:title` in the simple schema, because
its object is supposed to be a literal.

Here is another example, using a non-Knora class:
Here is another example, using a non-DSP class:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
41 changes: 12 additions & 29 deletions docs/05-internals/design/api-v2/gravsearch.md
@@ -128,7 +128,7 @@ pattern orders must be optimised by moving `LuceneQueryPatterns` to the beginning
- `ConstructToConstructTransformer` (extends `WhereTransformer`): instructions how to turn a triplestore independent Construct query into a triplestore dependent Construct query (implementation of inference).

The traits listed above define methods that are implemented in the transformer classes and called by `QueryTraverser` to perform SPARQL to SPARQL conversions.
When iterating over the statements of the input query, the transformer class's transformation methods are called to perform the conversion.
When iterating over the statements of the input query, the transformer class' transformation methods are called to perform the conversion.

### Prequery

@@ -152,7 +152,7 @@ Next, the Gravsearch query's WHERE clause is transformed and the prequery (SELECT query) is generated.
The transformation of the Gravsearch query's WHERE clause relies on the implementation of the abstract class `AbstractPrequeryGenerator`.

`AbstractPrequeryGenerator` contains members whose state is changed during the iteration over the statements of the input query.
They can then by used to create the converted query.
They can then be used to create the converted query.

- `mainResourceVariable: Option[QueryVariable]`: SPARQL variable representing the main resource of the input query. Present in the prequery's SELECT clause.
- `dependentResourceVariables: mutable.Set[QueryVariable]`: a set of SPARQL variables representing dependent resources in the input query. Used in an aggregation function in the prequery's SELECT clause (see below).
@@ -295,23 +295,10 @@ on Gravsearch). This is implemented as follows:
When the non-triplestore-specific version of a SPARQL query is generated, statements that do not need
inference are marked with the virtual named graph `<http://www.knora.org/explicit>`.

When the triplestore-specific version of the query is generated:

- If the triplestore is GraphDB, `SparqlTransformer.transformKnoraExplicitToGraphDBExplicit` changes statements
with the virtual graph `<http://www.knora.org/explicit>` so that they are marked with the GraphDB-specific graph
`<http://www.ontotext.com/explicit>`, and leaves other statements unchanged.
`SparqlTransformer.transformKnoraExplicitToGraphDBExplicit` also adds the `valueHasString` statements which GraphDB needs
for text searches.

- If Knora is not using the triplestore's inference (e.g. with Fuseki),
`SparqlTransformer.transformStatementInWhereForNoInference` removes `<http://www.knora.org/explicit>`, and expands unmarked
statements using `rdfs:subClassOf*` and `rdfs:subPropertyOf*`.

Gravsearch also provides some virtual properties, which take advantage of forward-chaining inference
as an optimisation if the triplestore provides it. For example, the virtual property
`knora-api:standoffTagHasStartAncestor` is equivalent to `knora-base:standoffTagHasStartParent*`. If Knora is not using the triplestore's inference, `SparqlTransformer.transformStatementInWhereForNoInference`
When the triplestore-specific version of the query is generated, this could make use of a triplestore's inference. Currently, no triplestore-based inference is used; instead, the API expands the prequery on the basis of the available ontologies to achieve the same results as if inference were used. For that reason, `SparqlTransformer.transformStatementInWhereForNoInference` removes `<http://www.knora.org/explicit>` and expands unmarked statements using `UNION` statements for all subclasses and subproperties (equivalent to `rdfs:subClassOf*` and `rdfs:subPropertyOf*`). Similarly, `SparqlTransformer.transformStatementInWhereForNoInference`
replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHasStartParent*`.


# Optimisation of generated SPARQL

The triplestore-specific transformers in `SparqlTransformer.scala` can run optimisations on the generated SPARQL, in
@@ -320,8 +307,7 @@ Lucene queries to the beginning of the block in which they occur.

## Query Optimization by Topological Sorting of Statements

GraphDB seems to have inherent algorithms to optimize the query time, however query performance of Fuseki highly depends
on the order of the query statements. For example, a query such as the one below:
In Jena Fuseki, the performance of a query highly depends on the order of the query statements. For example, a query such as the one below:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
@@ -370,8 +356,7 @@ The rest of the query then reads:
?letter beol:creationDate ?date .
```

Since we cannot expect clients to know about performance of triplestores in order to write efficient queries, we have
implemented an optimization method to automatically rearrange the statements of the given queries.
Since users cannot be expected to know about performance of triplestores in order to write efficient queries, an optimization method to automatically rearrange the statements of the given queries has been implemented.
Upon receiving the Gravsearch query, the algorithm converts the query to a graph. For each statement pattern,
the subject of the statement is the origin node, the predicate is a directed edge, and the object
is the target node. For the query above, this conversion would result in the following graph:
@@ -384,17 +369,16 @@ topological sorting algorithm](https://en.wikipedia.org/wiki/Topological_sorting)
The algorithm returns the nodes of the graph ordered in several layers, where the
root element `?letter` is in layer 0, `[?date, ?person1, ?person2]` are in layer 1, `[?gnd1, ?gnd2]` in layer 2, and the
leaf nodes `[(DE-588)118531379, (DE-588)118696149]` are given in the last layer (i.e. layer 3).
According to Kahn's algorithm, there are multiple valid permutations of the topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are ordered from left to right with the highest
order to the lowest):
According to Kahn's algorithm, there are multiple valid permutations of the topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are ordered from left to right with the
highest order to the lowest):

- `(?letter, ?date, ?person2, ?person1, ?gnd2, ?gnd1, (DE-588)118696149, (DE-588)118531379)`
- `(?letter, ?date, ?person1, ?person2, ?gnd1, ?gnd2, (DE-588)118531379, (DE-588)118696149)`.
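
The layering described above can be sketched in a few lines of Python (a minimal illustration of Kahn's algorithm on the example graph; the actual DSP-API implementation is in Scala and differs in detail):

```python
from collections import defaultdict

def kahn_layers(edges):
    """Group nodes into layers by repeatedly removing nodes with no
    remaining incoming edges (Kahn's topological sort)."""
    indegree = defaultdict(int)
    adjacency = defaultdict(list)
    nodes = set()
    for subj, obj in edges:
        adjacency[subj].append(obj)
        indegree[obj] += 1
        nodes.update((subj, obj))
    layer = [n for n in nodes if indegree[n] == 0]
    layers = []
    while layer:
        layers.append(sorted(layer))
        next_layer = []
        for node in layer:
            for target in adjacency[node]:
                indegree[target] -= 1
                if indegree[target] == 0:
                    next_layer.append(target)
        layer = next_layer
    return layers

# Statement patterns from the example query, as (subject, object) edges:
edges = [
    ("?letter", "?date"),
    ("?letter", "?person1"),
    ("?letter", "?person2"),
    ("?person1", "?gnd1"),
    ("?person2", "?gnd2"),
    ("?gnd1", "(DE-588)118531379"),
    ("?gnd2", "(DE-588)118696149"),
]
print(kahn_layers(edges))
# → [['?letter'], ['?date', '?person1', '?person2'], ['?gnd1', '?gnd2'],
#    ['(DE-588)118531379', '(DE-588)118696149']]
```

Picking a different node from each layer first yields the other valid permutations mentioned above.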

From all valid topological orders, one is chosen based on certain criteria; for example, the leaf should node should not
From all valid topological orders, one is chosen based on certain criteria; for example, the leaf node should not
belong to a statement that has predicate `rdf:type`, since that could match all resources of the specified type.
Once the best order is chosen, it is used to re-arrange the query
statements. Starting from the last leaf node, i.e.
Once the best order is chosen, it is used to re-arrange the query statements. Starting from the last leaf node, i.e.
`(DE-588)118696149`, the method finds the statement pattern which has this node as its object, and brings this statement
to the top of the query. This rearrangement continues so that the statements with the fewest dependencies on other
statements are all brought to the top of the query. The resulting query is as follows:
@@ -423,8 +407,7 @@ CONSTRUCT {

Note that position of the FILTER statements does not play a significant role in the optimization.

If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or
`FILTER NOT EXISTS`, they are reordered
If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or `FILTER NOT EXISTS`, they are reordered
by defining a graph per block. For example, consider the following query with `UNION`:

```sparql
Expand Down