feat(triplestores): Support Apache Jena Fuseki (#1375)
Benjamin Geer committed Apr 2, 2020
1 parent c41bce7 commit 82f8a55
Showing 102 changed files with 5,862 additions and 3,603 deletions.
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -58,4 +58,5 @@ knora-graphdb-free
knora-graphdb-se
knora-sipi
knora-upgrade
triplestores/fuseki-tomcat/system
dump.rdb
42 changes: 21 additions & 21 deletions docs/src/paradox/03-apis/api-v2/query-language.md
Expand Up @@ -192,12 +192,12 @@ a matching dependent resource, only its IRI is returned.

## Inference

Gravsearch queries are understood to imply
Gravsearch queries are understood to imply a subset of
[RDFS reasoning](https://www.w3.org/TR/rdf11-mt/). Depending on the
triplestore being used, this may be implemented using the triplestore's
own reasoner or by query expansion in Knora.

This means that if a statement pattern specifies a property, the pattern will
Specifically, if a statement pattern specifies a property, the pattern will
also match subproperties of that property, and if a statement specifies that
a subject has a particular `rdf:type`, the statement will also match subjects
belonging to subclasses of that type.
Expand Down Expand Up @@ -352,31 +352,28 @@ text markup (see @ref:[Matching Standoff Dates](#matching-standoff-dates)).

#### Searching for Matching Words

The function `knora-api:match` searches for matching words anywhere in a
The function `knora-api:matchText` searches for matching words anywhere in a
text value, and is implemented using a full-text search index if available.
The first argument must be a variable of type `xsd:string`, and the second
argument is a string containing the words to be matched, separated by spaces.
The words to be matched are separated by spaces in a string literal.
The first argument must represent a text value (a `knora-api:TextValue` in
the complex schema, or an `xsd:string` in the simple schema). The second
argument is a string literal containing the words to be matched, separated by spaces.
The function supports the
@ref:[Lucene Query Parser syntax](../../08-lucene/index.md).
Note that Lucene's default operator is a logical OR when submitting several search terms.

For example, to search for titles that contain the words 'Zeitglöcklein' and
'Lebens' in the simple schema:

```
FILTER knora-api:match(?title, "Zeitglöcklein Lebens")
```
This function can only be used as the top-level expression in a `FILTER`.

In the complex schema:
For example, to search for titles that contain the words 'Zeitglöcklein' and
'Lebens':

```
?title knora-api:valueAsString ?titleStr .
FILTER knora-api:match(?titleStr, "Zeitglöcklein Lebens")
?book incunabule:title ?title .
FILTER knora-api:matchText(?title, "Zeitglöcklein Lebens")
```

If `knora-api:match` is used in a `FILTER`, it must be the only expression in
the `FILTER`.
Note: the `knora-api:match` function has been deprecated, and will no longer work in
a future release of Knora. Please change your Gravsearch queries to use `knora-api:matchText`
instead. Attention: the first argument is different.

#### Filtering Text by Language

Expand Down Expand Up @@ -426,11 +423,11 @@ tags in the text. You can match the tags you're interested in using

#### Matching Text in a Standoff Tag

The function `knora-api:matchInStandoff` searches for standoff tags containing certain terms.
The function `knora-api:matchTextInStandoff` searches for standoff tags containing certain terms.
The implementation is optimised using the full-text search index if available. The
function takes three arguments:

1. A variable representing the string literal value of a text value.
1. A variable representing a text value.
2. A variable representing a standoff tag.
3. A string literal containing space-separated search terms.

Expand All @@ -448,16 +445,19 @@ CONSTRUCT {
} WHERE {
?letter a beol:letter .
?letter beol:hasText ?text .
?text knora-api:valueAsString ?textStr .
?text knora-api:textValueHasStandoff ?standoffParagraphTag .
?standoffParagraphTag a standoff:StandoffParagraphTag .
FILTER knora-api:matchInStandoff(?textStr, ?standoffParagraphTag, "Grund Richtigkeit")
FILTER knora-api:matchTextInStandoff(?text, ?standoffParagraphTag, "Grund Richtigkeit")
}
```

Here we are looking for letters containing the words "Grund" and "Richtigkeit"
within a single paragraph.

Note: the `knora-api:matchInStandoff` function has been deprecated, and will no longer
work in a future release of Knora. Please change your Gravsearch queries to use
`knora-api:matchTextInStandoff` instead. Attention: the first argument is different.

#### Matching Standoff Links

If you are only interested in specifying that a resource has some text
Expand Up @@ -547,6 +547,9 @@ This is useful only if the project does not contain a large amount of data;
otherwise, you should use @ref:[Gravsearch](query-language.md) to search
using more specific criteria.

The specified class and property are used without inference; they will not
match subclasses or subproperties.

The HTTP header `X-Knora-Accept-Project` must be submitted; its value is
a Knora project IRI. In the request URL, the values of `resourceClass` and `orderByProperty`
are URL-encoded IRIs in the @ref:[complex schema](introduction.md#api-schema).
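As a minimal sketch of the URL-encoding step described above, the following Python shows how the two IRIs could be percent-encoded before being placed in the request URL. It assumes that `resourceClass` and `orderByProperty` are passed as query parameters; the route path is not shown in this excerpt and is therefore left out, and the example IRIs and project IRI are hypothetical.

```python
from urllib.parse import quote


def build_query_string(resource_class: str, order_by_property: str) -> str:
    """Percent-encode both IRIs (including ':', '/' and '#') and join them
    as query parameters. Assumption: both values are query parameters."""
    return (
        f"resourceClass={quote(resource_class, safe='')}"
        f"&orderByProperty={quote(order_by_property, safe='')}"
    )


# Hypothetical project IRI, submitted in the header named in the text above.
headers = {"X-Knora-Accept-Project": "http://rdfh.ch/projects/0001"}

# Hypothetical complex-schema IRIs.
query_string = build_query_string(
    "http://example.org/ontology#Book",
    "http://example.org/ontology#title",
)
print(query_string)
```

Passing `safe=''` matters here: by default `quote` leaves `/` unencoded, which would leave part of each IRI unescaped.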
60 changes: 47 additions & 13 deletions docs/src/paradox/05-internals/design/api-v2/gravsearch.md
Expand Up @@ -188,15 +188,16 @@ The resulting SELECT clause of the prequery looks as follows:
```sparql
SELECT DISTINCT
?page
(GROUP_CONCAT(DISTINCT(?book); SEPARATOR='') AS ?book__Concat)
(GROUP_CONCAT(DISTINCT(?seqnum); SEPARATOR='') AS ?seqnum__Concat)
(GROUP_CONCAT(DISTINCT(?book__LinkValue); SEPARATOR='') AS ?book__LinkValue__Concat)
(GROUP_CONCAT(DISTINCT(IF(BOUND(?book), STR(?book), "")); SEPARATOR='') AS ?book__Concat)
(GROUP_CONCAT(DISTINCT(IF(BOUND(?seqnum), STR(?seqnum), "")); SEPARATOR='') AS ?seqnum__Concat)
(GROUP_CONCAT(DISTINCT(IF(BOUND(?book__LinkValue), STR(?book__LinkValue), "")); SEPARATOR='') AS ?book__LinkValue__Concat)
WHERE {...}
GROUP BY ?page
ORDER BY ASC(?page)
LIMIT 25
```
`?page` represents the main resource. When accessing the prequery's result rows, `?page` contains the Iri of the main resource.

`?page` represents the main resource. When accessing the prequery's result rows, `?page` contains the IRI of the main resource.
The prequery's results are grouped by the main resource so that there is exactly one result row per matching main resource.
`?page` is also used as a sort criterion although none has been defined in the input query.
This is necessary to make paging work: results always have to be returned in the same order (the prequery is always deterministic).
Expand All @@ -205,17 +206,23 @@ Like this, results can be fetched page by page using LIMIT and OFFSET.
Grouping by main resource requires other results to be aggregated using the function `GROUP_CONCAT`.
`?book` is used as an argument of the aggregation function.
The aggregation's result is accessible in the prequery's result rows as `?book__Concat`.
The variable `?book` is bound to an Iri.
Since more than one Iri could be bound to a variable representing a dependent resource, the results have to be aggregated.
`GROUP_CONCAT` takes two arguments: a collection of strings (Iris in our use case) and a separator.
When accessing `?book__Concat` in the prequery's results containing the Iris of dependent resources, the string has to be split with the separator used in the aggregation function.
The result is a collection of Iris representing dependent resources.
The variable `?book` is bound to an IRI.
Since more than one IRI could be bound to a variable representing a dependent resource, the results have to be aggregated.
`GROUP_CONCAT` takes two arguments: a collection of strings (IRIs in our use case) and a separator
(we use the non-printing Unicode character `INFORMATION SEPARATOR ONE`).
When accessing `?book__Concat` in the prequery's results containing the IRIs of dependent resources, the string has to be split with the separator used in the aggregation function.
The result is a collection of IRIs representing dependent resources.
The same logic applies to value objects.

Each `GROUP_CONCAT` checks whether the concatenated variable is bound in each result in the group; if a variable
is unbound, we concatenate an empty string. This is necessary because, in Apache Jena (and perhaps other
triplestores), "If `GROUP_CONCAT` has an unbound value in the list of values to concat, the overall result is 'error'"
(see [this Jena issue](https://issues.apache.org/jira/browse/JENA-1856)).
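
The splitting step described above can be sketched as follows. This is an illustration in Python, not Knora's actual (Scala) implementation; the function name and the example IRIs are hypothetical. It assumes the separator is U+001F (`INFORMATION SEPARATOR ONE`) and that unbound variables were concatenated as empty strings, which therefore have to be dropped again.

```python
from typing import List

# U+001F INFORMATION SEPARATOR ONE, the separator assumed for GROUP_CONCAT.
SEPARATOR = "\u001f"


def split_concat(concatenated: str) -> List[str]:
    """Split a GROUP_CONCAT result into its parts, preserving order and
    dropping the empty strings that stand for unbound variables."""
    return [part for part in concatenated.split(SEPARATOR) if part]


# A hypothetical prequery result row; the empty string in the middle
# represents a group member in which ?book was unbound.
row = {
    "page": "http://rdfh.ch/0001/page1",
    "book__Concat": SEPARATOR.join(
        ["http://rdfh.ch/0001/book1", "", "http://rdfh.ch/0001/book2"]
    ),
}

print(split_concat(row["book__Concat"]))
```

Using a non-printing separator avoids collisions with characters that can legitimately occur in IRIs or in the string form of values.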

### Main Query

The purpose of the main query is to get all requested information about the main resource, dependent resources, and value objects.
The Iris of those resources and value objects were returned by the prequery.
The IRIs of those resources and value objects were returned by the prequery.
Since the prequery only returns resources and value objects matching the input query's criteria,
the main query can specifically ask for more detailed information on these resources and values without having to reconsider these criteria.

Expand All @@ -225,8 +232,8 @@ The classes involved in generating prequeries can be found in `org.knora.webapi.

The main query is a SPARQL CONSTRUCT query. Its generation is handled by the method `GravsearchMainQueryGenerator.createMainQuery`.
It takes three arguments: `mainResourceIris: Set[IriRef], dependentResourceIris: Set[IriRef], valueObjectIris: Set[IRI]`.
From the given Iris, statements are generated that ask for complete information on *exactly* these resources and values.
For any given resource Iri, only the values present in `valueObjectIris` are to be queried.
From the given IRIs, statements are generated that ask for complete information on *exactly* these resources and values.
For any given resource IRI, only the values present in `valueObjectIris` are to be queried.
This is achieved by using SPARQL's `VALUES` expression for the main resource and dependent resources as well as for values.

#### Processing the Main Query's results
Expand All @@ -237,7 +244,7 @@ The method `getMainQueryResultsWithFullGraphPattern` takes the main query's resu
A main resource and its dependent resources and values are only returned if the user has view permissions on all the resources and value objects present in the main query.
Otherwise the method suppresses the main resource.
To do the permission checking, the results of the main query are passed to `ConstructResponseUtilV2` which transforms a `SparqlConstructResponse` (a set of RDF triples)
into a structure organized by main resource Iris. In this structure, dependent resources and values are nested can be accessed via their main resource.
into a structure organized by main resource IRIs. In this structure, dependent resources and values are nested and can be accessed via their main resource.
`SparqlConstructResponse` suppresses all resources and values the user has insufficient permissions on.
For each main resource, a check is performed for the presence of all resources and values after permission checking.

Expand All @@ -247,3 +254,30 @@ All the resources and values not present in the input query's CONSTRUCT clause a
The main resources that have been filtered out due to insufficient permissions are represented by the placeholder `ForbiddenResource`.
This placeholder stands for a main resource that cannot be returned; nevertheless, it informs the client that such a resource exists.
This is necessary for a consistent behaviour when doing paging.

## Inference

Gravsearch queries support a subset of RDFS reasoning
(see @ref:[Inference](../../../03-apis/api-v2/query-language.md#inference) in the API documentation
on Gravsearch). This is implemented as follows:

When the non-triplestore-specific version of a SPARQL query is generated, statements that do not need
inference are marked with the virtual named graph `<http://www.knora.org/explicit>`.

When the triplestore-specific version of the query is generated:

- If the triplestore is GraphDB, `SparqlTransformer.transformKnoraExplicitToGraphDBExplicit` changes statements
with the virtual graph `<http://www.knora.org/explicit>` so that they are marked with the GraphDB-specific graph
`<http://www.ontotext.com/explicit>`, and leaves other statements unchanged.

- If Knora is not using the triplestore's inference (e.g. with Fuseki),
`SparqlTransformer.expandStatementForNoInference` removes `<http://www.knora.org/explicit>`, and expands unmarked
statements using `rdfs:subClassOf*` and `rdfs:subPropertyOf*`.
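
The expansion for the no-inference case can be sketched roughly as below. This is a simplified Python illustration, not the code of `SparqlTransformer.expandStatementForNoInference`; the function name, the generated variable names and the string-based representation of statements are assumptions made for the example. It also assumes that the predicate (or, for `rdf:type`, the object) is an IRI rather than a variable.

```python
from typing import List


def expand_statement(
    subj: str, pred: str, obj: str, explicit: bool, n: int = 0
) -> List[str]:
    """Return the SPARQL patterns for one statement when the triplestore's
    inference is not used. `n` keeps generated variable names unique."""
    if explicit:
        # Statements marked as explicit are emitted unchanged
        # (the virtual named graph is simply dropped).
        return [f"{subj} {pred} {obj} ."]
    if pred in ("a", "rdf:type"):
        # Match instances of the class or of any of its subclasses.
        sub_class = f"?subClass{n}"
        return [
            f"{subj} a {sub_class} .",
            f"{sub_class} rdfs:subClassOf* {obj} .",
        ]
    # Match the property or any of its subproperties.
    sub_prop = f"?subProp{n}"
    return [
        f"{subj} {sub_prop} {obj} .",
        f"{sub_prop} rdfs:subPropertyOf* {pred} .",
    ]


for line in expand_statement("?resource", "knora-base:hasValue", "?value", False):
    print(line)
```

Because `*` in a property path also matches a path of length zero, the expanded pattern still matches statements that use the base class or base property directly.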

Gravsearch also provides some virtual properties, which take advantage of forward-chaining inference
as an optimisation if the triplestore provides it. For example, the virtual property
`knora-api:standoffTagHasStartAncestor` is equivalent to `knora-base:standoffTagHasStartParent*`, but
with GraphDB it is implemented using a custom inference rule (in `KnoraRules.pie`) and is therefore more
efficient. If Knora is not using the triplestore's inference,
`SparqlTransformer.transformStatementInWhereForNoInference` replaces `knora-api:standoffTagHasStartAncestor`
with `knora-base:standoffTagHasStartParent*`.
93 changes: 93 additions & 0 deletions docs/src/paradox/05-internals/design/api-v2/query-design.md
Expand Up @@ -21,6 +21,99 @@ License along with Knora. If not, see <http://www.gnu.org/licenses/>.

@@toc

## Inference

Knora does not require the triplestore to perform inference, but may be able
to take advantage of inference if the triplestore provides it.

In particular, Knora's SPARQL queries currently need to do the following:

- Given a base property, find triples using a subproperty as predicate, and
return the subproperty used in each case.
- Given a base class, find triples using an instance of a subclass as subject or
object, and return the subclass used in each case.

Without inference, this can be done using property path syntax.

```sparql
CONSTRUCT {
    ?resource a ?resourceClass .
    ?resource ?resourceValueProperty ?valueObject .
} WHERE {
    ?resource a ?resourceClass .
    ?resourceClass rdfs:subClassOf* knora-base:Resource .
    ?resource ?resourceValueProperty ?valueObject .
    ?resourceValueProperty rdfs:subPropertyOf* knora-base:hasValue .
}
```

This query:

- Checks that the queried resource belongs to a subclass of `knora-base:Resource`.

- Returns the class that the resource explicitly belongs to.

- Finds the Knora values attached to the resource, and returns each value along with
the property that explicitly attaches it to the resource.

In some triplestores, it can be more efficient to use RDFS inference than to use property path syntax,
depending on how inference is implemented. For example, Ontotext GraphDB does inference when
data is inserted, and stores inferred triples in the repository
([forward chaining with full materialisation](http://graphdb.ontotext.com/documentation/standard/reasoning.html)).
Moreover, it provides a way of choosing whether to return explicit or inferred triples.
This allows the query above to be optimised as follows, querying inferred triples but returning
explicit triples:

```sparql
CONSTRUCT {
    ?resource a ?resourceClass .
    ?resource ?resourceValueProperty ?valueObject .
} WHERE {
    ?resource a knora-base:Resource . # inferred triple
    GRAPH <http://www.ontotext.com/explicit> {
        ?resource a ?resourceClass . # explicit triple
    }
    ?resource knora-base:hasValue ?valueObject . # inferred triple
    GRAPH <http://www.ontotext.com/explicit> {
        ?resource ?resourceValueProperty ?valueObject . # explicit triple
    }
}
```

By querying inferred triples that are already stored in the repository, the optimised query avoids property path
syntax and is therefore more efficient, while still only returning explicit triples in the query result.

Other triplestores use a backward-chaining inference strategy, meaning that inference is performed during
the execution of a SPARQL query, by expanding the query itself. The expanded query is likely to look like
the first example, using property path syntax, and therefore it is not likely to be more efficient. Moreover,
other triplestores may not provide a way to return explicit rather than inferred triples. To support such
a triplestore, Knora uses property path syntax rather than inference.
See @ref:[the Gravsearch design documentation](gravsearch.md#inference) for information on how this is done
for Gravsearch queries.

The support for Apache Jena Fuseki currently works in this way. However, Fuseki supports both forward-chaining
and backward-chaining rule engines, although it does not seem to have anything like
GraphDB's `<http://www.ontotext.com/explicit>`. It would be worth exploring whether Knora's query result
processing could be changed so that it could use forward-chaining inference as an optimisation, even if
nothing like `<http://www.ontotext.com/explicit>` is available. For example, the example query could be written like
this:

```sparql
CONSTRUCT {
    ?resource a ?resourceClass .
    ?resource ?resourceValueProperty ?valueObject .
} WHERE {
    ?resource a knora-base:Resource .
    ?resource a ?resourceClass .
    ?resource knora-base:hasValue ?valueObject .
    ?resource ?resourceValueProperty ?valueObject .
}
```

This would return inferred triples as well as explicit ones: a triple for each base class of the explicit
`?resourceClass`, and a triple for each base property of the explicit `?resourceValueProperty`. But since Knora knows
the class and property inheritance hierarchies, it could ignore the additional triples.
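
To illustrate how the additional inferred triples could be ignored, here is a minimal Python sketch. It is only one possible approach, not something Knora implements; the function name and the example hierarchy are hypothetical. It assumes the hierarchy is available as a map from each class (or property) to the set of all its base classes (or base properties), transitively and excluding itself.

```python
from typing import Dict, Set


def most_specific(returned: Set[str], bases: Dict[str, Set[str]]) -> Set[str]:
    """Keep only the returned entities that are not a base of another
    returned entity, i.e. the ones that were asserted explicitly."""
    inferred: Set[str] = set()
    for entity in returned:
        inferred |= bases.get(entity, set())
    return returned - inferred


# Hypothetical class hierarchy: ex:Book < ex:Thing < knora-base:Resource.
class_bases = {
    "ex:Book": {"ex:Thing", "knora-base:Resource"},
    "ex:Thing": {"knora-base:Resource"},
    "knora-base:Resource": set(),
}

# With forward-chaining inference, a query for `?resource a ?resourceClass`
# would return all three classes for a resource that is explicitly an ex:Book.
print(most_specific({"ex:Book", "ex:Thing", "knora-base:Resource"}, class_bases))
```

The same function applies to properties if `bases` maps each property to its base properties.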

## Querying Past Value Versions

Value versions are a linked list, starting with the current version. Each value points to
2 changes: 0 additions & 2 deletions knora-ontologies/knora-base.ttl
Expand Up @@ -457,8 +457,6 @@
"a lien vers"@fr ,
"ha Link verso"@it ;

rdfs:comment "Represents a direct connection between two resources"@en ;

:isEditable true ;

:objectClassConstraint :LinkValue ;
33 changes: 33 additions & 0 deletions triplestores/fuseki-tomcat/config.ttl
@@ -0,0 +1,33 @@
# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0

## Fuseki Server configuration file.

@prefix : <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] rdf:type fuseki:Server ;
# Example::
# Server-wide query timeout.
#
# Timeout - server-wide default: milliseconds.
# Format 1: "1000" -- 1 second timeout
# Format 2: "10000,60000" -- 10s timeout to first result,
# then 60s timeout for the rest of query.
#
# See javadoc for ARQ.queryTimeout for details.
# This can also be set on a per dataset basis in the dataset assembler.
#
# ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "30000" ] ;



# Add any custom classes you want to load.
# Must have a "public static void init()" method.
# ja:loadClass "your.code.Class" ;
ja:loadClass "org.apache.jena.query.text.TextQuery";

# End triples.
.
