feat(standoff)!: return XML alongside HTML for textValue with custom standoff mapping and default XSL transformation (DEV-201) #1991

Merged
merged 56 commits into from Mar 7, 2022
Changes from 47 commits
Commits
904a368
testing: add stubs. for StandoffModels
BalduinLandolt Feb 3, 2022
74452c2
test: add test data for standoff custom mapping
BalduinLandolt Feb 3, 2022
c869e84
expand standoff ontology
BalduinLandolt Feb 3, 2022
63d2da6
feat: return XML even with custom mapping
BalduinLandolt Feb 3, 2022
1416c91
test: start working on a proper E2E test for standoff with custom map…
BalduinLandolt Feb 3, 2022
2a22ffd
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 3, 2022
198f347
test: add TODO for TEI related task
BalduinLandolt Feb 3, 2022
709eb02
test: add more TODOs
BalduinLandolt Feb 3, 2022
1c636e0
test: add some more stubs for StandoffModels
BalduinLandolt Feb 3, 2022
d019f23
Update standoffModelsUtil.scala
BalduinLandolt Feb 3, 2022
5b6ddd6
remove unused StandoffModelsUtil file
BalduinLandolt Feb 3, 2022
9f774cb
test: revert unnecessary base ontology changes
BalduinLandolt Feb 7, 2022
29e90a2
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 7, 2022
cccd88a
test: fix sample xml file
BalduinLandolt Feb 7, 2022
c53afc4
test: add E2E test for standoff with custom mapping
BalduinLandolt Feb 7, 2022
64a5759
test: adjust FileModels
BalduinLandolt Feb 7, 2022
c0dd499
test: clean up Standoff E2E test
BalduinLandolt Feb 8, 2022
f93d9c5
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 8, 2022
acd79c7
test: add tests for StandoffModels
BalduinLandolt Feb 8, 2022
c429e62
refactor: minor cleaning up
BalduinLandolt Feb 8, 2022
fcbbe6d
test: add mock sipi to standoff e2e test
BalduinLandolt Feb 10, 2022
0223cbb
tests: start with standoff E2E tests
BalduinLandolt Feb 14, 2022
d174810
Update StandoffModels.scala
BalduinLandolt Feb 14, 2022
11a21d7
tests: check if sipi is available in E2E test
BalduinLandolt Feb 14, 2022
23910fe
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 14, 2022
8713b5a
tests: enable SIPI in E2E tests
BalduinLandolt Feb 14, 2022
8140e69
test: clean up
BalduinLandolt Feb 15, 2022
cd86bb6
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 17, 2022
5ca3ca3
test: move SIPI utils to E2ESpec
BalduinLandolt Feb 17, 2022
18c6d0a
test: rename E2E test from R2R to E2E
BalduinLandolt Feb 17, 2022
06ab8c9
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 21, 2022
cf749d8
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 21, 2022
73e3835
test: reasonably test creating a standoff mapping in a unit test
BalduinLandolt Feb 21, 2022
880df59
refactor: tidy up
BalduinLandolt Feb 22, 2022
1b4c869
test: add e2e test for standard mapping
BalduinLandolt Feb 22, 2022
ea41105
refactor: tidy up unit tests
BalduinLandolt Feb 22, 2022
f40cd26
refactor: remove potentially unused files
BalduinLandolt Feb 22, 2022
5378bde
testdata: add gitignore
BalduinLandolt Feb 22, 2022
2322675
test: add standoff example to freetest test data
BalduinLandolt Feb 24, 2022
08de265
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 24, 2022
782609e
refactor: clean up after merging Bazel-to-SBT PR
BalduinLandolt Feb 24, 2022
53cc863
docs: start documenting standoff
BalduinLandolt Feb 24, 2022
0a2ee42
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Feb 28, 2022
ec369fc
docs: update documentation
BalduinLandolt Mar 1, 2022
2a1ce59
docs: update documentation
BalduinLandolt Mar 1, 2022
6bcba06
docs: update documentation
BalduinLandolt Mar 1, 2022
8191224
refactor: final tidy up
BalduinLandolt Mar 1, 2022
ed66f7a
refactor: changes according to review
BalduinLandolt Mar 1, 2022
0fceab8
docs: add scaladoc
BalduinLandolt Mar 1, 2022
ea34a33
refactor: minor cleanup according to review
BalduinLandolt Mar 1, 2022
cf2965e
refactor: rename variable to be more clear what it actually is
BalduinLandolt Mar 3, 2022
d718cf4
docs: update documentation to make creating text values with custom s…
BalduinLandolt Mar 3, 2022
4d56469
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Mar 3, 2022
7841f92
docs: fix typo
BalduinLandolt Mar 3, 2022
e2608e2
refactor: move sipi messages from test into the sipi messages package
BalduinLandolt Mar 3, 2022
f563418
Merge branch 'main' into wip/DEV-201-return-custom-standoff-as-xml
BalduinLandolt Mar 7, 2022
37 changes: 16 additions & 21 deletions docs/01-introduction/standoff-rdf.md
@@ -5,15 +5,14 @@
 
 # Standoff/RDF Text Markup
 
-[Standoff markup](https://lexiconse.uantwerpen.be/index.php/lexicon/markup-standoff/)
-is text markup that is stored separately from the content it describes. Knora's
+[Standoff markup](https://lexiconse.uantwerpen.be/lexicon/markupStandoff.html)
+is text markup that is stored separately from the content it describes. DSP-API's
 Standoff/RDF markup stores content as a simple Unicode string, and represents markup
 separately as RDF data. This approach has some advantages over commonly used markup systems
 such as XML:
 
 First, XML and other hierarchical markup systems assume that a document is a hierarchy, and
-have difficulty representing
-[non-hierarchical structures](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html)
+have difficulty representing [non-hierarchical structures](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html)
 or multiple overlapping hierarchies. Standoff markup can easily represent these structures.
 
 Second, markup languages are typically designed to be used in text files. But there is no
@@ -22,43 +21,39 @@ markup. It is possible to do this in a non-standard way by using an XML database
 such as [eXist](http://exist-db.org), but this still does not allow for queries that include
 text as well as non-textual data not stored in XML.
 
-By storing markup as RDF, Knora can search for markup structures in the same way that it
+By storing markup as RDF, DSP-API can search for markup structures in the same way that it
 searches for any RDF data structure. This makes it possible to do searches that combine
 text-related criteria with other sorts of criteria. For example, if persons and events are
-represented as Knora resources, and texts are represented in Standoff/RDF, a text can contain
+represented as resources, and texts are represented in Standoff/RDF, a text can contain
 tags representing links to persons or events. You could then search for a text that mentions a
 person who lived in the same city as another person who is the author of a text that mentions an
 event that occurred during a certain time period.
 
-In Knora's Standoff/RDF, a tag is an RDF entity that is linked to a
+In DSP-API's Standoff/RDF, a tag is an RDF entity that is linked to a
 [text value](../02-knora-ontologies/knora-base.md#textvalue). Each tag points to a substring
 of the text, and has semantic properties of its own. You can define your own tag classes
 in your ontology by making subclasses of `knora-base:StandoffTag`, and attach your own
-properties to them. You can then search for those properties using Knora's search language,
+properties to them. You can then search for those properties using DSP-API's search language,
 [Gravsearch](../03-apis/api-v2/query-language.md).
 
 The built-in [knora-base](../02-knora-ontologies/knora-base.md) and `standoff` ontologies
 provide some basic tags that can be reused or extended. These include tags that represent
-Knora data types. For example, `knora-base:StandoffDateTag` represents a date in exactly the
-same way as a Knora [date value](../02-knora-ontologies/knora-base.md#datevalue), i.e. as a
+DSP-API data types. For example, `knora-base:StandoffDateTag` represents a date in exactly the
+same way as a [date value](../02-knora-ontologies/knora-base.md#datevalue), i.e. as a
 calendar-independent astronomical date. You can use this tag as-is, or extend it by making
 a subclass, to represent dates in texts. Gravsearch includes built-in functionality for
 searching for these data type tags. For example, you can search for text containing a date that
 falls within a certain [date range](../03-apis/api-v2/query-language.md#matching-standoff-dates).
 
-Knora's APIs support automatic conversion between XML and Standoff/RDF. To make this work,
+DSP-API supports automatic conversion between XML and Standoff/RDF. To make this work,
 Standoff/RDF stores the order of tags and their hierarchical relationships. You must define an
 [XML-to-Standoff Mapping](../03-apis/api-v2/xml-to-standoff-mapping.md) for your standoff tag classes and properties.
-Then you can import an XML document into Knora, which will store it as Standoff/RDF. The text and markup
-can then be searched using Gravsearch. When you retrieve the document, Knora converts it back to the
+Then you can import an XML document into DSP-API, which will store it as Standoff/RDF. The text and markup
+can then be searched using Gravsearch. When you retrieve the document, DSP-API converts it back to the
 original XML.
 
-To represent overlapping or non-hierarchical markup in exported and imported XML, Knora supports
-[CLIX](http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html#t6) tags.
+To represent overlapping or non-hierarchical markup in exported and imported XML, DSP-API supports
+[CLIX](https://web.archive.org/web/20171222112655/http://conferences.idealliance.org/extreme/html/2004/DeRose01/EML2004DeRose01.html) tags.
 
-Future plans for Standoff/RDF include:
-
-- Creation and retrieval of standoff markup as such via the DSP-API,
-without using XML as an input/output format.
-- A user interface for editing standoff markup.
-- The ability to create resources that cite particular standoff tags in other resources.
+As XML-to-Standoff has proved to be overly complicated and not very well performing, the use of standoff with custom mappings is discouraged.
+Improved integration of text with XML mark up, particularly TEI-XML, is in planning.
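
For readers unfamiliar with the idea, the documentation in this diff describes standoff markup as a plain Unicode string plus separately stored tags, each pointing to a substring and carrying its own properties. The following is only a minimal Python sketch of that concept, not DSP-API's RDF model or API: the `StandoffTag` class, its field names, the tag names, and the helper functions are all hypothetical. It also shows why overlapping ranges, which a single XML hierarchy cannot nest, are unproblematic when markup is kept apart from the text.

```python
from dataclasses import dataclass, field


@dataclass
class StandoffTag:
    """Hypothetical stand-in for a standoff tag (not the knora-base class)."""
    name: str
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    properties: dict = field(default_factory=dict)


def tag_substring(text: str, name: str, substring: str, **properties) -> StandoffTag:
    """Create a tag covering the first occurrence of `substring` in `text`."""
    start = text.index(substring)
    return StandoffTag(name, start, start + len(substring), dict(properties))


def covered_text(text: str, tag: StandoffTag) -> str:
    """Return the part of the text that the tag points to."""
    return text[tag.start:tag.end]


def tags_at(tags: list, offset: int) -> list:
    """Return the names of all tags whose range contains the given offset."""
    return [t.name for t in tags if t.start <= offset < t.end]


def partially_overlap(a: StandoffTag, b: StandoffTag) -> bool:
    """True if the ranges cross without one fully containing the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


# The text itself stays an unmarked string; all markup lives in `tags`.
text = "Leonhard Euler was born in Basel in 1707."

person = tag_substring(text, "person", "Leonhard Euler", link="hypothetical-resource-id")
birth = tag_substring(text, "birth", "born in Basel")
place_date = tag_substring(text, "place_and_date", "Basel in 1707")
tags = [person, birth, place_date]

# "Basel" is covered by two tags whose ranges cross each other.
print(covered_text(text, person))
print(tags_at(tags, text.index("Basel")))
print(partially_overlap(birth, place_date))
```

Here `birth` and `place_and_date` cross each other, so neither could be nested inside the other as an XML element; in the standoff representation they are simply two independent ranges over the same string.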