Commit

feat(gravsearch): improve gravsearch performance by using unions in prequery (DEV-492) (#2045)

* Update SparqlTransformer.scala

* feat: add superPropertyOf map to ontology cache

* refactor: reduce logging noise

* chore: add clean-sbt target to makefile

* feat: replace property path query statements with unions for subPropertyOf*

* feat: use unions for subclasses

* refactor: tidy up some old mess

* refactor: add more logging

* add limiting param to transformer to reduce inference

* ignore failing test for now

* feat: start working on reducing union options on basis of the query

* tidy up

* minor improvements

* get tests to pass

* feat: limit subclasses

* feat: include optimization in count query

* test: minimal test for compound objects with gravsearch

* test: test simulated inference with union patterns

* refactor start tidying up

* refactor: more tidying up

* refactor: tidy up

* docs: start documenting the changes

* refactor: remove unused code

* docs: update documentation

* refactor: remove some code smells

* refactor: tidy up, improve variable naming and add documentation

* refactor: format sparqlTransformarSpec.scala

* Apply suggestions from code review

Co-authored-by: irinaschubert <irina.schubert@dasch.swiss>

* tidy up

* wrap up according to review

Co-authored-by: irinaschubert <irina.schubert@dasch.swiss>
BalduinLandolt and irinaschubert committed May 10, 2022
1 parent a9fda7e commit 40354a7
Showing 21 changed files with 985 additions and 527 deletions.
7 changes: 7 additions & 0 deletions Makefile
@@ -280,6 +280,13 @@ clean-local-tmp:
@rm -rf .tmp
@mkdir .tmp

.PHONY: clean-metals
clean-metals: ## clean SBT and Metals related stuff
@rm -rf .bloop
@rm -rf .bsp
@rm -rf .metals
@rm -rf target

clean: docs-clean clean-local-tmp clean-docker clean-sipi-tmp ## clean build artifacts
@rm -rf .env

2 changes: 1 addition & 1 deletion docs/01-introduction/what-is-knora.md
@@ -74,7 +74,7 @@ and can regenerate the original XML document at any time.

DSP-API provides a search language, [Gravsearch](../03-apis/api-v2/query-language.md),
that is designed to meet the needs of humanities researchers. Gravsearch supports DSP-API's
humanites-focused data structures, including calendar-independent dates and standoff markup, as well
humanities-focused data structures, including calendar-independent dates and standoff markup, as well
as fast full-text searches. This allows searches to combine text-related criteria with any other
criteria. For example, you could search for a text that contains a certain word
and also mentions a person who lived in the same city as another person who is the
52 changes: 23 additions & 29 deletions docs/03-apis/api-v2/query-language.md
@@ -13,15 +13,15 @@ criteria) while avoiding their drawbacks in terms of performance and
security (see [The Enduring Myth of the SPARQL
Endpoint](https://daverog.wordpress.com/2013/06/04/the-enduring-myth-of-the-sparql-endpoint/)).
It also has the benefit of enabling clients to work with a simpler RDF
data model than the one Knora actually uses to store data in the
data model than the one the API actually uses to store data in the
triplestore, and makes it possible to provide better error-checking.

Rather than being processed directly by the triplestore, a Gravsearch query
is interpreted by Knora, which enforces certain
is interpreted by the API, which enforces certain
restrictions on the query, and implements paging and permission
checking. The API server generates SPARQL based on the Gravsearch query
submitted, queries the triplestore, filters the results according to the
user's permissions, and returns each page of query results as a Knora
user's permissions, and returns each page of query results as an
API response. Thus, Gravsearch is a hybrid between a RESTful API and a
SPARQL endpoint.

@@ -80,14 +80,14 @@ If a gravsearch query times out, a `504 Gateway Timeout` will be returned.
A Gravsearch query can be written in either of the two
[DSP-API v2 schemas](introduction.md#api-schema). The simple schema
is easier to work with, and is sufficient if you don't need to query
anything below the level of a Knora value. If your query needs to refer to
anything below the level of a DSP-API value. If your query needs to refer to
standoff markup, you must use the complex schema. Each query must use a single
schema, with one exception (see [Date Comparisons](#date-comparisons)).

Gravsearch query results can be requested in the simple or complex schema;
see [API Schema](introduction.md#api-schema).

All examples hereafter run with Knora started locally as documented in the section [Getting Started with DSP-API](../../04-publishing-deployment/getting-started.md). If you access another Knora-Stack, you can check the IRI of the ontology you are targeting by requesting the [ontologies metadata](ontology-information.md#querying-ontology-metadata).
All examples hereafter run with the DSP stack started locally as documented in the section [Getting Started with DSP-API](../../04-publishing-deployment/getting-started.md). If you access another stack, you can check the IRI of the ontology you are targeting by requesting the [ontologies metadata](ontology-information.md#querying-ontology-metadata).

### Using the Simple Schema

@@ -100,8 +100,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
```

In the simple schema, Knora values are represented as literals, which
can be used `FILTER` expressions
In the simple schema, DSP-API values are represented as literals, which can be used in `FILTER` expressions
(see [Filtering on Values in the Simple Schema](#filtering-on-values-in-the-simple-schema)).
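
For instance, a text value bound to a variable can be compared directly with a string literal. A minimal sketch, using the `incunabula` ontology declared above (the full pattern is explained in the section linked above):

```sparql
?book incunabula:title ?title .
FILTER(?title = "Zeitglöcklein des Lebens und Leidens Christi")
```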

### Using the Complex Schema
@@ -115,7 +114,7 @@ PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/v2#>
```

In the complex schema, Knora values are represented as objects belonging
In the complex schema, DSP-API values are represented as objects belonging
to subclasses of `knora-api:Value`, e.g. `knora-api:TextValue`, and have
predicates of their own, which can be used in `FILTER` expressions
(see [Filtering on Values in the Complex Schema](#filtering-on-values-in-the-complex-schema)).
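
In the complex schema, the comparison therefore goes through a predicate of the value object. A minimal sketch, assuming a text value whose literal is reached via `knora-api:valueAsString`:

```sparql
?book incunabula:title ?title .
?title knora-api:valueAsString ?titleStr .
FILTER(?titleStr = "Zeitglöcklein des Lebens und Leidens Christi")
```
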
@@ -182,7 +181,7 @@ permission to see a matching dependent resource, the link value is hidden.
## Paging

Gravsearch results are returned in pages. The maximum number of main
resources per page is determined by Knora (and can be configured
resources per page is determined by the API (and can be configured
in `application.conf` via the setting `app/v2/resources-sequence/results-per-page`).
If some resources have been filtered out because the user does not have
permission to see them, a page could contain fewer results, or no results.
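
Subsequent pages are requested by resending the same query with an increasing `OFFSET`, which is interpreted as a page number rather than a number of results. A minimal sketch:

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>

CONSTRUCT {
  ?book knora-api:isMainResource true .
} WHERE {
  ?book a incunabula:book .
}
# OFFSET 0 (the default) returns the first page; OFFSET 1 requests the second page.
OFFSET 1
```
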
@@ -195,25 +194,20 @@ one at a time, until the response does not contain `knora-api:mayHaveMoreResults`
## Inference

Gravsearch queries are understood to imply a subset of
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). Depending on the
triplestore being used, this may be implemented using the triplestore's
own reasoner or by query expansion in Knora.
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). The API implements this by expanding the incoming query.

Specifically, if a statement pattern specifies a property, the pattern will
also match subproperties of that property, and if a statement specifies that
a subject has a particular `rdf:type`, the statement will also match subjects
belonging to subclasses of that type.
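
As an illustration (the subclass and subproperty mentioned in the comments are purely hypothetical; whether they exist depends on the project ontology):

```sparql
# With inference (the default), this pattern also matches resources whose
# rdf:type is a subclass of incunabula:book, if such a subclass exists.
?book a incunabula:book .

# Likewise, a property pattern also matches statements made with any
# subproperty of the given property (again hypothetical here).
?book incunabula:title ?title .
```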

If you know that reasoning will not return any additional results for
your query, you can disable it by adding this line to the `WHERE` clause:
your query, you can disable it by adding this line to the `WHERE` clause, which may improve query performance:

```sparql
knora-api:GravsearchOptions knora-api:useInference false .
```
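
For example, embedded in a complete query (a minimal sketch in the simple schema; with inference disabled, only resources whose exact `rdf:type` is `incunabula:book` are matched):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>

CONSTRUCT {
  ?book knora-api:isMainResource true .
  ?book incunabula:title ?title .
} WHERE {
  # Disable simulated RDFS inference for this query.
  knora-api:GravsearchOptions knora-api:useInference false .
  ?book a incunabula:book .
  ?book incunabula:title ?title .
}
```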

If Knora is implementing reasoning by query expansion, disabling it can
improve the performance of some queries.

## Gravsearch Syntax

Every Gravsearch query is a valid SPARQL 1.1
@@ -244,8 +238,8 @@ clauses use the following patterns, with the specified restrictions:
unordered set of triples. However, a Gravsearch query returns an
ordered list of resources, which can be ordered by the values of
specified properties. If the query is written in the complex schema,
items below the level of Knora values may not be used in `ORDER BY`.
- `BIND`: The value assigned must be a Knora resource IRI.
items below the level of DSP-API values may not be used in `ORDER BY`.
- `BIND`: The value assigned must be a DSP resource IRI.

### Resources, Properties, and Values

@@ -269,7 +263,7 @@ must be represented as a query variable.

#### Filtering on Values in the Simple Schema

In the simple schema, a variable representing a Knora value can be used
In the simple schema, a variable representing a DSP-API value can be used
directly in a `FILTER` expression. For example:

```
@@ -279,7 +273,7 @@ FILTER(?title = "Zeitglöcklein des Lebens und Leidens Christi")

Here the type of `?title` is `xsd:string`.

The following Knora value types can be compared with literals in `FILTER`
The following value types can be compared with literals in `FILTER`
expressions in the simple schema:

- Text values (`xsd:string`)
@@ -295,7 +289,7 @@ performing an exact match on a list node's label. Labels can be given in differe
If one of the given list node labels matches, it is considered a match.
Note that in the simple schema, uniqueness is not guaranteed (as opposed to the complex schema).

A Knora value may not be represented as the literal object of a predicate;
A DSP-API value may not be represented as the literal object of a predicate;
for example, this is not allowed:

```
@@ -304,9 +298,9 @@ for example, this is not allowed:

#### Filtering on Values in the Complex Schema

In the complex schema, variables representing Knora values are not literals.
In the complex schema, variables representing DSP-API values are not literals.
You must add something to the query (generally a statement) to get a literal
from a Knora value. For example:
from a DSP-API value. For example:

```
?book incunabula:title ?title .
@@ -479,7 +473,7 @@ within a single paragraph.
If you are only interested in specifying that a resource has some text
value containing a standoff link to another resource, the most efficient
way is to use the property `knora-api:hasStandoffLinkTo`, whose subjects and objects
are resources. This property is automatically maintained by Knora. For example:
are resources. This property is automatically maintained by the API. For example:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
@@ -623,7 +617,7 @@ CONSTRUCT {

### Filtering on `rdfs:label`

The `rdfs:label` of a resource is not a Knora value, but you can still search for it.
The `rdfs:label` of a resource is not a DSP-API value, but you can still search for it.
This can be done in the same ways in the simple or complex schema:

Using a string literal object:
@@ -708,8 +702,8 @@ clause but not in the `CONSTRUCT` clause, the matching resources or values
will not be included in the results.

If the query is written in the complex schema, all variables in the `CONSTRUCT`
clause must refer to Knora resources, Knora values, or properties. Data below
the level of Knora values may not be mentioned in the `CONSTRUCT` clause.
clause must refer to DSP-API resources, DSP-API values, or properties. Data below
the level of values may not be mentioned in the `CONSTRUCT` clause.
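
For example, a `CONSTRUCT` clause in the complex schema might look like this (a sketch using the `beol` ontology from the examples below; `?letter` is a resource and `?date` a value object, both of which are allowed, whereas literals below the value level are not):

```sparql
CONSTRUCT {
  ?letter knora-api:isMainResource true .
  ?letter beol:creationDate ?date .
} WHERE {
  ?letter a beol:letter .
  ?letter beol:creationDate ?date .
}
```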

Predicates from the `rdf`, `rdfs`, and `owl` ontologies may not be used
in the `CONSTRUCT` clause. The `rdfs:label` of each matching resource is always
@@ -921,7 +915,7 @@ adding statements with the predicate `rdf:type`. The subject must be a resource
and the object must either be `knora-api:Resource` (if the subject is a resource)
or the subject's specific type (if it is a value).

For example, consider this query that uses a non-Knora property:
For example, consider this query that uses a non-DSP property:

```
PREFIX incunabula: <http://0.0.0.0:3333/ontology/0803/incunabula/simple/v2#>
@@ -992,7 +986,7 @@ CONSTRUCT {
Note that it only makes sense to use `dcterms:title` in the simple schema, because
its object is supposed to be a literal.

Here is another example, using a non-Knora class:
Here is another example, using a non-DSP class:

```
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
47 changes: 13 additions & 34 deletions docs/05-internals/design/api-v2/gravsearch.md
@@ -128,7 +128,7 @@ pattern orders must be optimised by moving `LuceneQueryPatterns` to the beginnin
- `ConstructToConstructTransformer` (extends `WhereTransformer`): instructions how to turn a triplestore independent Construct query into a triplestore dependent Construct query (implementation of inference).

The traits listed above define methods that are implemented in the transformer classes and called by `QueryTraverser` to perform SPARQL to SPARQL conversions.
When iterating over the statements of the input query, the transformer class's transformation methods are called to perform the conversion.
When iterating over the statements of the input query, the transformer class' transformation methods are called to perform the conversion.

### Prequery

@@ -152,7 +152,7 @@ Next, the Gravsearch query's WHERE clause is transformed and the prequery (SELEC
The transformation of the Gravsearch query's WHERE clause relies on the implementation of the abstract class `AbstractPrequeryGenerator`.

`AbstractPrequeryGenerator` contains members whose state is changed during the iteration over the statements of the input query.
They can then by used to create the converted query.
They can then be used to create the converted query.

- `mainResourceVariable: Option[QueryVariable]`: SPARQL variable representing the main resource of the input query. Present in the prequery's SELECT clause.
- `dependentResourceVariables: mutable.Set[QueryVariable]`: a set of SPARQL variables representing dependent resources in the input query. Used in an aggregation function in the prequery's SELECT clause (see below).
@@ -288,29 +288,12 @@ to the maximum allowed page size, the predicate

## Inference

Gravsearch queries support a subset of RDFS reasoning
(see [Inference](../../../03-apis/api-v2/query-language.md#inference) in the API documentation
Gravsearch queries support a subset of RDFS reasoning (see [Inference](../../../03-apis/api-v2/query-language.md#inference) in the API documentation
on Gravsearch). This is implemented as follows:

When the non-triplestore-specific version of a SPARQL query is generated, statements that do not need
inference are marked with the virtual named graph `<http://www.knora.org/explicit>`.
To simulate RDF inference, the API expands the prequery on the basis of the available ontologies. To this end, `SparqlTransformer.transformStatementInWhereForNoInference` expands all `rdfs:subClassOf` and `rdfs:subPropertyOf` statements using `UNION` statements for all subclasses and subproperties from the ontologies (equivalent to `rdfs:subClassOf*` and `rdfs:subPropertyOf*`).
Similarly, `SparqlTransformer.transformStatementInWhereForNoInference` replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHasStartParent*`.
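
The expansion might look roughly as follows (a simplified sketch; the actual generated SPARQL, variable names, and the set of expanded classes depend on the ontologies in the cache, and `ex:SomeClass`, `ex:SubClassA`, and `ex:SubClassB` are placeholder names):

```sparql
# Instead of generating a property path such as
#   ?resource rdf:type/rdfs:subClassOf* ex:SomeClass .
# the transformer enumerates the known subclasses of ex:SomeClass explicitly:
{
  ?resource rdf:type ex:SomeClass .
} UNION {
  ?resource rdf:type ex:SubClassA .
} UNION {
  ?resource rdf:type ex:SubClassB .
}
```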

When the triplestore-specific version of the query is generated:

- If the triplestore is GraphDB, `SparqlTransformer.transformKnoraExplicitToGraphDBExplicit` changes statements
with the virtual graph `<http://www.knora.org/explicit>` so that they are marked with the GraphDB-specific graph
`<http://www.ontotext.com/explicit>`, and leaves other statements unchanged.
`SparqlTransformer.transformKnoraExplicitToGraphDBExplicit` also adds the `valueHasString` statements which GraphDB needs
for text searches.

- If Knora is not using the triplestore's inference (e.g. with Fuseki),
`SparqlTransformer.transformStatementInWhereForNoInference` removes `<http://www.knora.org/explicit>`, and expands unmarked
statements using `rdfs:subClassOf*` and `rdfs:subPropertyOf*`.

Gravsearch also provides some virtual properties, which take advantage of forward-chaining inference
as an optimisation if the triplestore provides it. For example, the virtual property
`knora-api:standoffTagHasStartAncestor` is equivalent to `knora-base:standoffTagHasStartParent*`. If Knora is not using the triplestore's inference, `SparqlTransformer.transformStatementInWhereForNoInference`
replaces `knora-api:standoffTagHasStartAncestor` with `knora-base:standoffTagHasStartParent*`.

# Optimisation of generated SPARQL

Expand All @@ -320,8 +303,7 @@ Lucene queries to the beginning of the block in which they occur.

## Query Optimization by Topological Sorting of Statements

GraphDB seems to have inherent algorithms to optimize the query time, however query performance of Fuseki highly depends
on the order of the query statements. For example, a query such as the one below:
In Jena Fuseki, the performance of a query depends heavily on the order of the query statements. For example, consider a query such as the one below:

```sparql
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>
@@ -370,8 +352,7 @@ The rest of the query then reads:
?letter beol:creationDate ?date .
```

Since we cannot expect clients to know about performance of triplestores in order to write efficient queries, we have
implemented an optimization method to automatically rearrange the statements of the given queries.
Since users cannot be expected to know about the performance characteristics of triplestores in order to write efficient queries, an optimization method has been implemented that automatically rearranges the statements of incoming queries.
Upon receiving the Gravsearch query, the algorithm converts the query to a graph. For each statement pattern,
the subject of the statement is the origin node, the predicate is a directed edge, and the object
is the target node. For the query above, this conversion would result in the following graph:
@@ -384,17 +365,16 @@ topological sorting algorithm](https://en.wikipedia.org/wiki/Topological_sorting
The algorithm returns the nodes of the graph ordered in several layers, where the
root element `?letter` is in layer 0, `[?date, ?person1, ?person2]` are in layer 1, `[?gnd1, ?gnd2]` in layer 2, and the
leaf nodes `[(DE-588)118531379, (DE-588)118696149]` are given in the last layer (i.e. layer 3).
According to Kahn's algorithm, there are multiple valid permutations of the topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are ordered from left to right with the highest
order to the lowest):
According to Kahn's algorithm, there are multiple valid permutations of the topological order. The graph in the example
above has 24 valid permutations of topological order. Here are two of them (nodes are ordered from left to right with the
highest order to the lowest):

- `(?letter, ?date, ?person2, ?person1, ?gnd2, ?gnd1, (DE-588)118696149, (DE-588)118531379)`
- `(?letter, ?date, ?person1, ?person2, ?gnd1, ?gnd2, (DE-588)118531379, (DE-588)118696149)`.

From all valid topological orders, one is chosen based on certain criteria; for example, the leaf should node should not
From all valid topological orders, one is chosen based on certain criteria; for example, the leaf node should not
belong to a statement that has predicate `rdf:type`, since that could match all resources of the specified type.
Once the best order is chosen, it is used to re-arrange the query
statements. Starting from the last leaf node, i.e.
Once the best order is chosen, it is used to re-arrange the query statements. Starting from the last leaf node, i.e.
`(DE-588)118696149`, the method finds the statement pattern which has this node as its object, and brings this statement
to the top of the query. This rearrangement continues so that the statements with the fewest dependencies on other
statements are all brought to the top of the query. The resulting query is as follows:
@@ -423,8 +403,7 @@ CONSTRUCT {

Note that the position of the FILTER statements does not play a significant role in the optimization.

If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or
`FILTER NOT EXISTS`, they are reordered
If a Gravsearch query contains statements in `UNION`, `OPTIONAL`, `MINUS`, or `FILTER NOT EXISTS`, they are reordered
by defining a graph per block. For example, consider the following query with `UNION`:

```sparql
