feat(id-to-iri): extend xmlupload to allow references to existing resources (DEV-60) #108

Merged
merged 60 commits into from Nov 22, 2021
Changes from 56 commits
e8855bb
improve code structure
Oct 27, 2021
db523fd
improve and activate user tests
Oct 28, 2021
cc1fc80
remove code smells in user tests
Oct 28, 2021
adc71fe
remove remnants of id_to_iri feature
Oct 28, 2021
a3a623e
improve user test
Oct 28, 2021
600913b
fix typo in Makefile
Oct 28, 2021
dc23cf4
Update test_user.py
Nov 1, 2021
8ffbc5d
fix failing list tests
Nov 2, 2021
c4199ff
fix typo
Nov 2, 2021
b87a0e6
add more comments to lists test data
Nov 2, 2021
4409227
fix failing list node test
Nov 3, 2021
7b995fc
improve test_listnode
Nov 3, 2021
172d947
improve test_connection
Nov 3, 2021
d5ee2e4
improve test_group
Nov 3, 2021
5995ada
improve test_ontology
Nov 3, 2021
fdd556f
improve test_project
Nov 3, 2021
2807c6c
improve test_propertyclass
Nov 3, 2021
89ce041
improve test_resource
Nov 3, 2021
bb82135
improve test_resourceclass
Nov 3, 2021
bf83733
improve test_tools
Nov 3, 2021
4935636
move unit tests to separate folder
Nov 4, 2021
20e41cb
Update Makefile
Nov 4, 2021
10723ff
update GitHub CI actions
Nov 4, 2021
f23fc79
Update test.yml
Nov 4, 2021
9272bd1
fix code smells in test_langstring
Nov 4, 2021
6d612a2
fix failing GitHub action
Nov 4, 2021
651b660
improve code
Nov 4, 2021
f4d0557
write id2iri to json file after xmlupload
Oct 26, 2021
2458b80
add incremental option
Oct 26, 2021
58d29a3
add feature to replace internal IDs with IRIs in XML file
Oct 26, 2021
16562a5
add optional output file path
Oct 26, 2021
04b9e3d
add verbose option
Oct 26, 2021
51f4774
improve setup
Oct 26, 2021
87bbca6
add documentation for incremental xmlupload
Oct 26, 2021
140c7ca
add documentation for incremental xmlupload
Oct 26, 2021
06923e5
add incremental option to test
Oct 27, 2021
b7dd7d8
update documentation
Oct 27, 2021
9eb2a8c
add test for id2iri
Oct 27, 2021
6dd779b
add simple unit test
Oct 27, 2021
304fd2a
add unit test
Oct 27, 2021
3de3635
improve unit tests
Oct 27, 2021
7f07414
Update test_id_to_iri.py
Nov 4, 2021
3564df8
Collect failed uploads
Nov 8, 2021
4522a8c
Delete dsp-tools-id2iri.md
Nov 8, 2021
6452810
Merge branch 'main' into wip/dev-60-incremental-xmlupload
Nov 8, 2021
fec6dee
Add separate warning for IRIs
Nov 8, 2021
a82237f
improve file naming
Nov 8, 2021
14b7c7d
Update requirements.txt
Nov 9, 2021
84bc037
Update requirements.txt
Nov 9, 2021
f6eb34e
add test
Nov 9, 2021
d3fca33
add resource label to error message
Nov 17, 2021
d4b0274
Merge branch 'main' into wip/dev-60-incremental-xmlupload
Nov 17, 2021
2bf2137
add documentation to incremental xmlupload
Nov 18, 2021
04ab974
Update dsp-tools-usage.md
Nov 18, 2021
f9eeda0
Merge branch 'main' into wip/dev-60-incremental-xmlupload
Nov 22, 2021
bb75a7f
Use verbose=False in tests
Nov 22, 2021
6322a1e
Merge branch 'main' into wip/dev-60-incremental-xmlupload
Nov 22, 2021
6bd4abb
code improvements after review
Nov 22, 2021
7e9a46a
Merge branch 'wip/dev-60-incremental-xmlupload' of https://github.com…
Nov 22, 2021
d76bdf7
Update knora/dsplib/utils/xml_upload.py
Nov 22, 2021
1 change: 1 addition & 0 deletions Makefile
@@ -53,6 +53,7 @@ install-requirements: ## install requirements

.PHONY: install
install: ## install from source (runs setup.py)
	python3 -m pip install --upgrade pip
	pip3 install .

.PHONY: test
28 changes: 28 additions & 0 deletions docs/dsp-tools-usage.md
@@ -82,6 +82,7 @@ The following options are available:
- `-p` | `--password` _password_: password used for authentication with the DSP API (default: test)
- `-i` | `--imgdir` _dirpath_: path to the directory where the bitstream objects are stored (default: .)
- `-S` | `--sipi` _SIPIserver_: URL of the SIPI IIIF server (default: http://0.0.0.0:1024)
- `--incremental` : If set, IRIs instead of internal IDs are expected as reference to already existing resources on DSP
- `-v` | `--verbose`: If set, more information about the uploaded resources is printed to the console.

The command is used to upload data defined in an XML file onto a DSP server. The following example shows how to upload
@@ -96,6 +97,13 @@ dsp-tools xmlupload -s https://api.dsl.server.org -u root@example.com -p test -S

The description of the expected XML format can be found [here](./dsp-tools-xmlupload.md).

An internal ID is used in the `<resptr>` tag of an XML file used for `xmlupload` to reference resources inside the same
XML file. Once data is uploaded to DSP it cannot be referenced by this internal ID anymore. Instead, the resource's IRI
has to be used. The mapping of internal IDs to their respective IRIs is written to a file
called `id2iri_mapping_[timestamp].json` after a successful `xmlupload`.
See [`dsp-tools id2iri`](./dsp-tools-usage.md#replace-internal-ids-with-iris-in-xml-file) for more information about how
to use this file to replace internal IDs in an existing XML file to reference existing resources.
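
The mapping file described above can be read back with a few lines of Python. This is a minimal sketch, assuming the file holds a flat JSON object that maps each internal ID to one IRI (which is how `id_to_iri.py` in this PR looks values up). The helper name `load_mapping`, the file name, and the ID and IRI in the example are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_mapping(path: str) -> dict[str, str]:
    """Read an ID-to-IRI mapping file and return it as a dict."""
    with open(path, encoding="utf-8") as f:
        mapping = json.load(f)
    # Assumption: the file contains a single JSON object (internal ID -> IRI).
    if not isinstance(mapping, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return mapping


# Hypothetical mapping content; the ID and the IRI are made up for illustration.
example_mapping = {"238807": "http://rdfh.ch/0001/example-resource-iri"}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "id2iri_mapping_example.json"
    mapping_path.write_text(json.dumps(example_mapping), encoding="utf-8")
    loaded = load_mapping(str(mapping_path))

print(loaded["238807"])
```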

## Create a JSON list file from one or several Excel files

```bash
@@ -161,3 +169,23 @@ dsp-tools excel2properties Properties.xlsx properties.json
More information about the usage of this command can be found
[here](./dsp-tools-excel.md#create-the-properties-for-a-data-model-from-an-excel-file)
.

## Replace internal IDs with IRIs in XML file

```bash
dsp-tools id2iri xml_file.xml mapping_file.json --outfile xml_out_file.xml
```

When uploading data with `dsp-tools xmlupload` an internal ID is used in the `<resptr>` tag of the XML file to reference
resources inside the same XML file. Once data is uploaded to DSP it cannot be referenced by this internal ID anymore.
Instead, the resource's IRI has to be used.

With `dsp-tools id2iri` internal IDs can be replaced with their corresponding IRIs within a provided XML. The output is
written to a new XML file called `id2iri_replaced_[timestamp].xml` (the file path and name can be overwritten with
option `--outfile`). If all internal IDs were replaced, the newly created XML can be used
with `dsp-tools xmlupload --incremental id2iri_replaced_20211026_120247263754.xml` to upload the data.

Note that internal IDs and IRIs cannot be mixed. The input XML file has to be provided as well as the JSON file which
contains the mapping from internal IDs to IRIs. This JSON file is generated after each successful `xmlupload`.

In order to upload data incrementally the procedure described [here](dsp-tools-xmlupload.md#incremental-xml-upload) is recommended.
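
The replacement step can be illustrated with the standard library alone. The following is a rough sketch, not the implementation shipped in this PR (which uses `lxml`): it assumes an XML document without namespaces, treats every value starting with `http://rdfh.ch/` as an IRI that is already resolved, and collects IDs that are missing from the mapping. The function name `replace_ids` and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

IRI_PREFIX = "http://rdfh.ch/"  # prefix that id_to_iri.py treats as "already an IRI"


def replace_ids(xml_text: str, mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Replace internal IDs inside <resptr> elements and report unresolved IDs."""
    root = ET.fromstring(xml_text)
    unresolved = []
    for resptr in root.iter("resptr"):
        value = (resptr.text or "").strip()
        if value in mapping:
            resptr.text = mapping[value]
        elif not value.startswith(IRI_PREFIX):
            # Neither a known internal ID nor an IRI: the result would mix IDs and IRIs.
            unresolved.append(value)
    return ET.tostring(root, encoding="unicode"), unresolved


# Hypothetical input: one known ID and one ID that is not in the mapping.
xml_in = (
    '<knora><resource label="r" restype=":Postcard" id="new_1">'
    '<resptr-prop name=":hasReferenceTo">'
    "<resptr>238807</resptr><resptr>unknown_id</resptr>"
    "</resptr-prop></resource></knora>"
)
xml_out, unresolved = replace_ids(xml_in, {"238807": "http://rdfh.ch/0001/example-resource-iri"})
print(unresolved)
```

If `unresolved` is not empty, the output still contains internal IDs and is therefore not suitable for `xmlupload --incremental`.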
30 changes: 26 additions & 4 deletions docs/dsp-tools-xmlupload.md
@@ -3,7 +3,9 @@
# DSP XML file format for importing data

With dsp-tools data can be imported into a DSP repository (on a DSP server) from an XML file. The import file is a
standard XML file as described on this page.
standard XML file as described on this page. After a successful upload of the data, an output file is written (called
`id2iri_mapping_[timestamp].json`) with the mapping of internal IDs used inside the XML and their corresponding IRIs which
uniquely identify them inside DSP. This file should be kept if data is later added with the `--incremental` [option](#incremental-xml-upload).

The import file must start with the standard XML header:

@@ -578,7 +580,9 @@ Attributes:

#### `<resptr>`

The `<resptr>` element contains the internal ID of another resource.
The `<resptr>` element contains either the internal ID of another resource inside the XML or the IRI of an already
existing resource on DSP. Inside the same XML file a mixture of the two is not possible. If referencing existing
resources, `xmlupload --incremental` has to be used.

Attributes:

@@ -587,8 +591,8 @@ Attributes:

Example:

If there is a resource defined as `<resource label="EURUS015a" restype=":Postcard" unique_id="238807">...</resource>`,
it can be referenced as:
If there is a resource defined as `<resource label="EURUS015a" restype=":Postcard" id="238807">...</resource>`, it can
be referenced as:

```xml
<resptr-prop name=":hasReferenceTo">
@@ -712,6 +716,24 @@ Example:
</boolean-prop>
```

## Incremental XML Upload

After a successful upload of the data, an output file is written (called `id2iri_mapping_[timestamp].json`) with the
mapping of internal IDs used inside the XML and their corresponding IRIs which uniquely identify them inside DSP. This
file should be kept if data is later added with the `--incremental` option.

To do an incremental XML upload, one of the following procedures is recommended.

Incremental XML upload with use of internal IDs:

1. Initial XML upload with internal IDs.
2. The file `id2iri_mapping_[timestamp].json` is created.
3. Create new XML file(s) with the same pattern of internal IDs.
4. Run `dsp-tools id2iri new_data.xml id2iri_mapping_[timestamp].json` to replace the internal IDs in `new_data.xml` with IRIs. Only internal IDs inside the `<resptr>` tag are replaced.
5. Run `dsp-tools xmlupload --incremental new_data.xml` to upload the data to DSP.

Incremental XML Upload with the use of IRIs: Use IRIs in the XML to reference existing data on the DSP server.
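
The two command-line steps of the first procedure can be sketched as follows. This only builds the command lines and does not run them. The helper name and all file names are hypothetical, the connection options (`-s`, `-u`, `-p`) are omitted, and the sketch assumes that the file written by `id2iri` (here set with `--outfile`) is the one passed to `xmlupload --incremental`.

```python
import shlex


def incremental_upload_commands(new_xml: str, mapping_json: str, replaced_xml: str) -> list[list[str]]:
    """Build the id2iri and xmlupload command lines for an incremental upload."""
    return [
        # Replace internal IDs with IRIs and write the result to replaced_xml.
        ["dsp-tools", "id2iri", new_xml, mapping_json, "--outfile", replaced_xml],
        # Upload the file that now references existing resources by IRI.
        ["dsp-tools", "xmlupload", "--incremental", replaced_xml],
    ]


commands = incremental_upload_commands(
    "new_data.xml", "id2iri_mapping_example.json", "new_data_replaced.xml"
)
for command in commands:
    print(shlex.join(command))
```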

## Complete example

```xml
5 changes: 4 additions & 1 deletion docs/index.md
@@ -19,7 +19,7 @@ dsp-tools helps you with the following tasks:
- [`dsp-tools get`](./dsp-tools-usage.md#get-a-data-model-from-a-dsp-server) reads a data model from a DSP server and
writes it into a JSON file.
- [`dsp-tools xmlupload`](./dsp-tools-usage.md#upload-data-to-a-dsp-server) uploads data from a provided XML file (bulk
data import).
data import) and writes the mapping from internal IDs to IRIs into a local file.
- [`dsp-tools excel`](./dsp-tools-usage.md#create-a-json-list-file-from-one-or-several-excel-files)
creates a JSON or XML file from one or several Excel files. The created data can either be integrated into an ontology
or be uploaded directly to a DSP server with `dsp-tools create`.
@@ -29,4 +29,7 @@ dsp-tools helps you with the following tasks:
- [`dsp-tools excel2properties`](./dsp-tools-usage.md#create-properties-from-an-excel-file)
creates the ontology's properties section from an Excel file. The resulting section can be integrated into an ontology
and then be uploaded to a DSP server with `dsp-tools create`.
- [`dsp-tools id2iri`](./dsp-tools-usage.md#replace-internal-ids-with-iris-in-xml-file)
takes an XML file for bulk data import and replaces referenced internal IDs with IRIs. The mapping has to be provided
with a JSON file.

18 changes: 17 additions & 1 deletion knora/dsp_tools.py
@@ -9,6 +9,7 @@
from knora.dsplib.utils.excel_to_json_lists import list_excel2json, validate_list_with_schema
from knora.dsplib.utils.excel_to_json_properties import properties_excel2json
from knora.dsplib.utils.excel_to_json_resources import resources_excel2json
from knora.dsplib.utils.id_to_iri import id_to_iri
from knora.dsplib.utils.onto_create_lists import create_lists
from knora.dsplib.utils.onto_create_ontology import create_ontology
from knora.dsplib.utils.onto_get import get_ontology
@@ -76,6 +77,7 @@ def program(user_args: list[str]) -> None:
parser_upload.add_argument('-i', '--imgdir', type=str, default='.', help='Path to folder containing the images')
parser_upload.add_argument('-S', '--sipi', type=str, default='http://0.0.0.0:1024', help='URL of SIPI server')
parser_upload.add_argument('-v', '--verbose', action='store_true', help='Verbose feedback')
parser_upload.add_argument('--incremental', action='store_true', help='Incremental XML upload')
parser_upload.add_argument('xmlfile', help='path to xml file containing the data', default='data.xml')

parser_excel_lists = subparsers.add_parser('excel',
@@ -113,6 +115,14 @@ def program(user_args: list[str]) -> None:
parser_excel_properties.add_argument('outfile', help='Path to the output JSON file containing the properties data',
default='properties.json')

parser_id2iri = subparsers.add_parser('id2iri',
help='Replace internal IDs in an XML with their corresponding IRIs from a provided JSON file.')
parser_id2iri.set_defaults(action='id2iri')
parser_id2iri.add_argument('xmlfile', help='Path to the XML file containing the data to be replaced')
parser_id2iri.add_argument('jsonfile', help='Path to the JSON file containing the mapping of internal IDs and their respective IRIs')
parser_id2iri.add_argument('--outfile', default=None, help='Path to the XML output file containing the replaced IDs (optional)')
parser_id2iri.add_argument('-v', '--verbose', action='store_true', help='Verbose feedback')

args = parser.parse_args(user_args)

if not hasattr(args, 'action'):
@@ -160,7 +170,8 @@ def program(user_args: list[str]) -> None:
imgdir=args.imgdir,
sipi=args.sipi,
verbose=args.verbose,
validate_only=args.validate)
validate_only=args.validate,
incremental=args.incremental)
elif args.action == 'excel':
list_excel2json(listname=args.listname,
excelfolder=args.excelfolder,
@@ -171,6 +182,11 @@ def program(user_args: list[str]) -> None:
elif args.action == 'excel2properties':
properties_excel2json(excelfile=args.excelfile,
outfile=args.outfile)
elif args.action == 'id2iri':
id_to_iri(xml_file=args.xmlfile,
json_file=args.jsonfile,
out_file=args.outfile,
verbose=args.verbose)


def main() -> None:
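
The `id2iri` subcommand wiring added in the hunks above can be tried in isolation. This is a reduced sketch that mirrors only the arguments shown in this diff. The wrapper `build_parser`, the shortened help texts, and the sample file names are hypothetical and not part of `dsp_tools.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that contains only the id2iri subcommand."""
    parser = argparse.ArgumentParser(prog="dsp-tools")
    subparsers = parser.add_subparsers(title="Subcommands")

    parser_id2iri = subparsers.add_parser("id2iri", help="Replace internal IDs in an XML with IRIs")
    # The 'action' default lets the caller dispatch on the chosen subcommand.
    parser_id2iri.set_defaults(action="id2iri")
    parser_id2iri.add_argument("xmlfile", help="Path to the XML file")
    parser_id2iri.add_argument("jsonfile", help="Path to the JSON mapping file")
    parser_id2iri.add_argument("--outfile", default=None, help="Path to the XML output file (optional)")
    parser_id2iri.add_argument("-v", "--verbose", action="store_true", help="Verbose feedback")
    return parser


args = build_parser().parse_args(["id2iri", "new_data.xml", "mapping.json", "--outfile", "out.xml"])
print(args.action, args.xmlfile, args.jsonfile, args.outfile, args.verbose)
```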
9 changes: 9 additions & 0 deletions knora/dsplib/utils/BUILD.bazel
@@ -124,3 +124,12 @@ py_library(
imports = [".", ".."],
)

py_library(
    name = "id_to_iri",
    visibility = ["//visibility:public"],
    srcs = ["id_to_iri.py"],
    deps = [
        requirement("lxml")
    ]
)

78 changes: 78 additions & 0 deletions knora/dsplib/utils/id_to_iri.py
@@ -0,0 +1,78 @@
"""
This module handles the replacement of internal IDs with their corresponding IRIs from DSP.
"""
import json
import os
from datetime import datetime
from pathlib import Path

from lxml import etree


def id_to_iri(xml_file: str, json_file: str, out_file: str, verbose: bool) -> None:
    """
    This function replaces all occurrences of internal IDs with their respective IRIs inside an XML file. It gets the
    mapping from the JSON file provided as parameter for this function.

    Args:
        xml_file : the XML file with the data to be replaced
        json_file : the JSON file with the mapping (dict) of internal IDs to IRIs
        out_file: path to the output XML file with replaced IDs (optional), default: name of the XML file + "_replaced_" + timestamp + ".xml"
        verbose: verbose feedback if set to True

    Returns:
        None
    """

    # check that provided files exist
    if not os.path.isfile(xml_file):
        print(f"File {xml_file} could not be found.")
        exit(1)

    if not os.path.isfile(json_file):
        print(f"File {json_file} could not be found.")
        exit(1)

    # load JSON from provided json file to dict (the context manager closes the file afterwards)
    with open(json_file) as f:
        mapping = json.load(f)

    # parse XML from provided xml file
    tree = etree.parse(xml_file)

    # iterate through all XML elements and strip the namespace from each tag
    for elem in tree.iter():
        # skip comments and processing instructions as they do not have namespaces
        if not (
            isinstance(elem, etree._Comment)
            or isinstance(elem, etree._ProcessingInstruction)
        ):
            # keep only the local name of the tag
            elem.tag = etree.QName(elem).localname

    resource_elements = tree.xpath("/knora/resource/resptr-prop/resptr")
    for resptr_prop in resource_elements:
        try:
            value_before = resptr_prop.text
            value_after = mapping[resptr_prop.text]
            resptr_prop.text = value_after
            if verbose:
                print(f"Replaced internal ID '{value_before}' with IRI '{value_after}'")
        except KeyError:
            # values that already are IRIs are left untouched
            if resptr_prop.text.startswith("http://rdfh.ch/"):
                if verbose:
                    print(f"Skipping '{resptr_prop.text}'")
            else:
                print(f"WARNING: Could not find internal ID '{resptr_prop.text}' in mapping file {json_file}. "
                      f"Skipping...")

    # write xml with replaced IDs to file with timestamp
    if not out_file:
        timestamp_now = datetime.now()
        timestamp_str = timestamp_now.strftime("%Y%m%d-%H%M%S")

        file_name = Path(xml_file).stem
        out_file = file_name + "_replaced_" + timestamp_str + ".xml"

    et = etree.ElementTree(tree.getroot())
    et.write(out_file, pretty_print=True)
    print(f"XML with replaced IDs was written to file {out_file}.")
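
The default output name used when `--outfile` is not given can be checked in isolation. This is a small sketch of the naming logic at the end of `id_to_iri`, with the current time passed in as a parameter so that the result is reproducible. The helper name `default_out_file` and the sample path are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def default_out_file(xml_file: str, now: datetime) -> str:
    """Return '<stem of xml_file>_replaced_<timestamp>.xml' for the given time."""
    timestamp_str = now.strftime("%Y%m%d-%H%M%S")
    # Path.stem drops both the directory part and the file extension.
    return Path(xml_file).stem + "_replaced_" + timestamp_str + ".xml"


name = default_out_file("data/new_data.xml", datetime(2021, 10, 26, 12, 2, 47))
print(name)
```

Because only the stem of the input path is used, the default output file is written to the current working directory rather than next to the input file.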