feat(id-to-iri): extend xmlupload to allow references to existing resources (DEV-60) (#108)

* improve code structure

* improve and activate user tests

* remove code smells in user tests

* remove remnants of id_to_iri feature

* improve user test

* fix typo in Makefile

* Update test_user.py

* fix failing list tests

* fix typo

* add more comments to lists test data

* fix failing list node test

* improve test_listnode

* improve test_connection

* improve test_group

* improve test_ontology

* improve test_project

* improve test_propertyclass

* improve test_resource

* improve test_resourceclass

* improve test_tools

* move unit tests to separate folder

* Update Makefile

* update GitHub CI actions

* Update test.yml

* fix code smells in test_langstring

* fix failing GitHub action

* improve code

* write id2iri to json file after xmlupload

* add incremental option

* add feature to replace internal IDs with IRIs in XML file

* add optional output file path

* add verbose option

* improve setup

* add documentation for incremental xmlupload

* add documentation for incremental xmlupload

* add incremental option to test

* update documentation

* add test for id2iri

* add simple unit test

* add unit test

* improve unit tests

* Update test_id_to_iri.py

* Collect failed uploads

* Delete dsp-tools-id2iri.md

* Add separate warning for IRIs

* improve file naming

* Update requirements.txt

* Update requirements.txt

* add test

* add resource label to error message

* add documentation to incremental xmlupload

* Update dsp-tools-usage.md

* Use verbose=False in tests

* code improvements after review

* Update knora/dsplib/utils/xml_upload.py

Co-authored-by: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>

irinaschubert and BalduinLandolt committed Nov 22, 2021
1 parent 08effdf commit 40b01db
Showing 19 changed files with 549 additions and 52 deletions.
1 change: 1 addition & 0 deletions Makefile
@@ -53,6 +53,7 @@ install-requirements: ## install requirements

.PHONY: install
install: ## install from source (runs setup.py)
	python3 -m pip install --upgrade pip
	pip3 install .

.PHONY: test
28 changes: 28 additions & 0 deletions docs/dsp-tools-usage.md
@@ -82,6 +82,7 @@ The following options are available:
- `-p` | `--password` _password_: password used for authentication with the DSP API (default: test)
- `-i` | `--imgdir` _dirpath_: path to the directory where the bitstream objects are stored (default: .)
- `-S` | `--sipi` _SIPIserver_: URL of the SIPI IIIF server (default: http://0.0.0.0:1024)
- `-I` | `--incremental`: If set, IRIs instead of internal IDs are expected as references to already existing resources on DSP.
- `-v` | `--verbose`: If set, more information about the uploaded resources is printed to the console.

The command is used to upload data defined in an XML file onto a DSP server. The following example shows how to upload
@@ -96,6 +97,13 @@ dsp-tools xmlupload -s https://api.dsl.server.org -u root@example.com -p test -S

The description of the expected XML format can be found [here](./dsp-tools-xmlupload.md).

An internal ID is used in the `<resptr>` tag of an XML file used for `xmlupload` to reference resources inside the same
XML file. Once data is uploaded to DSP, it can no longer be referenced by this internal ID. Instead, the resource's IRI
has to be used. The mapping of internal IDs to their respective IRIs is written to a file
called `id2iri_mapping_[timestamp].json` after a successful `xmlupload`.
See [`dsp-tools id2iri`](./dsp-tools-usage.md#replace-internal-ids-with-iris-in-xml-file) for more information about how
to use this file to replace the internal IDs in an XML file with the IRIs of existing resources.
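
As a rough illustration of what such a mapping file holds, the following sketch writes and reads back a flat JSON object that maps internal IDs to IRIs. The IRIs, the file name and the helper `load_mapping` are made up for this example; only the general shape (a dict of internal ID to IRI) comes from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mapping content: internal ID -> IRI (values invented for illustration)
example_mapping = {
    "238807": "http://rdfh.ch/0001/hypothetical-iri-1",
    "238808": "http://rdfh.ch/0001/hypothetical-iri-2",
}


def load_mapping(path: str) -> dict[str, str]:
    """Read an ID-to-IRI mapping (a flat JSON object) from disk."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the id2iri_mapping_[timestamp].json file written by xmlupload
    mapping_path = Path(tmp_dir) / "id2iri_mapping_example.json"
    mapping_path.write_text(json.dumps(example_mapping, indent=4), encoding="utf-8")
    mapping = load_mapping(str(mapping_path))

iri = mapping.get("238807")          # IRI of a resource that was uploaded earlier
missing = mapping.get("unknown_id")  # None: this ID is not part of the mapping
```

A lookup that returns `None` corresponds to an internal ID that was never uploaded with the XML file the mapping belongs to.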

## Create a JSON list file from one or several Excel files

```bash
@@ -161,3 +169,23 @@ dsp-tools excel2properties Properties.xlsx properties.json
More information about the usage of this command can be found
[here](./dsp-tools-excel.md#create-the-properties-for-a-data-model-from-an-excel-file)
.

## Replace internal IDs with IRIs in XML file

```bash
dsp-tools id2iri xml_file.xml mapping_file.json --outfile xml_out_file.xml
```

When uploading data with `dsp-tools xmlupload`, an internal ID is used in the `<resptr>` tag of the XML file to reference
resources inside the same XML file. Once data is uploaded to DSP, it can no longer be referenced by this internal ID.
Instead, the resource's IRI has to be used.

With `dsp-tools id2iri`, internal IDs can be replaced with their corresponding IRIs within a provided XML file. By
default, the output is written to a new XML file called `[input_file_name]_replaced_[timestamp].xml` in the current
working directory (the file path and name can be overridden with the `--outfile` option). If all internal IDs were
replaced, the newly created XML file can be uploaded with `dsp-tools xmlupload --incremental`, e.g.
`dsp-tools xmlupload --incremental xml_out_file.xml` for the command above.

Note that internal IDs and IRIs cannot be mixed. The input XML file has to be provided as well as the JSON file which
contains the mapping from internal IDs to IRIs. This JSON file is generated after each successful `xmlupload`.

In order to upload data incrementally the procedure described [here](dsp-tools-xmlupload.md#incremental-xml-upload) is recommended.
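
The replacement step can be sketched as follows. This is a simplified stand-in, not the shipped implementation: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, ignores XML namespaces, and the XML snippet, the IRI and the helper `replace_ids` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Minimal, namespace-free input (invented for illustration)
XML_INPUT = """<knora>
    <resource label="Letter" restype=":Letter" id="letter_1">
        <resptr-prop name=":hasReferenceTo">
            <resptr>238807</resptr>
            <resptr>not_in_mapping</resptr>
        </resptr-prop>
    </resource>
</knora>"""

# Hypothetical mapping of internal IDs to IRIs
MAPPING = {"238807": "http://rdfh.ch/0001/hypothetical-iri-1"}


def replace_ids(xml_text: str, mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Replace internal IDs in <resptr> elements; return the new XML and the unresolved IDs."""
    root = ET.fromstring(xml_text)
    unresolved = []
    for resptr in root.iterfind("./resource/resptr-prop/resptr"):
        iri = mapping.get(resptr.text)
        if iri:
            # known internal ID: swap it for the IRI
            resptr.text = iri
        elif not (resptr.text or "").startswith("http://rdfh.ch/"):
            # neither in the mapping nor already an IRI: report it
            unresolved.append(resptr.text)
    return ET.tostring(root, encoding="unicode"), unresolved


new_xml, unresolved_ids = replace_ids(XML_INPUT, MAPPING)
```

Collecting the unresolved IDs makes it easy to check whether all internal IDs were replaced before running `xmlupload --incremental` on the result.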
30 changes: 26 additions & 4 deletions docs/dsp-tools-xmlupload.md
@@ -3,7 +3,9 @@
# DSP XML file format for importing data

With dsp-tools data can be imported into a DSP repository (on a DSP server) from an XML file. The import file is a
standard XML file as described on this page.
standard XML file as described on this page. After a successful upload of the data, an output file is written (called
`id2iri_mapping_[timestamp].json`) with the mapping of internal IDs used inside the XML and their corresponding IRIs which
uniquely identify them inside DSP. This file should be kept if data is later added with the `--incremental` [option](#incremental-xml-upload).

The import file must start with the standard XML header:

@@ -578,7 +580,9 @@ Attributes:

#### `<resptr>`

The `<resptr>` element contains the internal ID of another resource.
The `<resptr>` element contains either the internal ID of another resource inside the XML or the IRI of an already
existing resource on DSP. Inside the same XML file a mixture of the two is not possible. If referencing existing
resources, `xmlupload --incremental` has to be used.

Attributes:

@@ -587,8 +591,8 @@ Attributes:

Example:

If there is a resource defined as `<resource label="EURUS015a" restype=":Postcard" unique_id="238807">...</resource>`,
it can be referenced as:
If there is a resource defined as `<resource label="EURUS015a" restype=":Postcard" id="238807">...</resource>`, it can
be referenced as:

```xml
<resptr-prop name=":hasReferenceTo">
@@ -712,6 +716,24 @@ Example:
</boolean-prop>
```

## Incremental XML Upload

After a successful upload of the data, an output file is written (called `id2iri_mapping_[timestamp].json`) with the
mapping of internal IDs used inside the XML and their corresponding IRIs which uniquely identify them inside DSP. This
file should be kept if data is later added with the `--incremental` option.

To do an incremental XML upload, one of the following procedures is recommended.

- Incremental XML upload with use of internal IDs:

1. Initial XML upload with internal IDs.
2. The file `id2iri_mapping_[timestamp].json` is created.
3. Create new XML file(s) with resources referencing other resources by their internal IDs in `<resptr>` (using the same IDs as in the initial XML upload).
4. Run `dsp-tools id2iri new_data.xml id2iri_mapping_[timestamp].json --outfile new_data_replaced.xml` to replace the internal IDs in `new_data.xml` with IRIs. Only internal IDs inside the `<resptr>` tag are replaced. The result is written to `new_data_replaced.xml`; `new_data.xml` itself is left unchanged.
5. Run `dsp-tools xmlupload --incremental new_data_replaced.xml` to upload the data to DSP.

- Incremental XML upload with use of IRIs: Use IRIs in the XML to reference existing data on the DSP server.
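
Because internal IDs and IRIs must not be mixed within one XML file, it can help to check the `<resptr>` values before choosing between a regular and an incremental upload. The sketch below is only an illustration: the function `classify_references` is hypothetical, and it assumes that every IRI starts with `http://rdfh.ch/`, which is the prefix that `id_to_iri.py` in this commit uses to recognise IRIs.

```python
IRI_PREFIX = "http://rdfh.ch/"  # assumption: all DSP resource IRIs start with this prefix


def classify_references(values: list[str]) -> str:
    """Return 'ids', 'iris', 'mixed' or 'empty' for a list of <resptr> values."""
    if not values:
        return "empty"
    iri_flags = [value.startswith(IRI_PREFIX) for value in values]
    if all(iri_flags):
        return "iris"   # candidate for `xmlupload --incremental`
    if not any(iri_flags):
        return "ids"    # regular upload, or run `id2iri` first
    return "mixed"      # not allowed within one XML file


only_ids = classify_references(["238807", "238808"])
only_iris = classify_references(["http://rdfh.ch/0001/hypothetical-iri-1"])
mixed = classify_references(["238807", "http://rdfh.ch/0001/hypothetical-iri-1"])
```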

## Complete example

```xml
5 changes: 4 additions & 1 deletion docs/index.md
@@ -19,7 +19,7 @@ dsp-tools helps you with the following tasks:
- [`dsp-tools get`](./dsp-tools-usage.md#get-a-data-model-from-a-dsp-server) reads a data model from a DSP server and
writes it into a JSON file.
- [`dsp-tools xmlupload`](./dsp-tools-usage.md#upload-data-to-a-dsp-server) uploads data from a provided XML file (bulk
data import).
data import) and writes the mapping from internal IDs to IRIs into a local file.
- [`dsp-tools excel`](./dsp-tools-usage.md#create-a-json-list-file-from-one-or-several-excel-files)
creates a JSON or XML file from one or several Excel files. The created data can either be integrated into an ontology
or be uploaded directly to a DSP server with `dsp-tools create`.
@@ -29,4 +29,7 @@ dsp-tools helps you with the following tasks:
- [`dsp-tools excel2properties`](./dsp-tools-usage.md#create-properties-from-an-excel-file)
creates the ontology's properties section from an Excel file. The resulting section can be integrated into an ontology
and then be uploaded to a DSP server with `dsp-tools create`.
- [`dsp-tools id2iri`](./dsp-tools-usage.md#replace-internal-ids-with-iris-in-xml-file)
takes an XML file for bulk data import and replaces referenced internal IDs with IRIs. The mapping has to be provided
with a JSON file.

18 changes: 17 additions & 1 deletion knora/dsp_tools.py
@@ -9,6 +9,7 @@
from knora.dsplib.utils.excel_to_json_lists import list_excel2json, validate_list_with_schema
from knora.dsplib.utils.excel_to_json_properties import properties_excel2json
from knora.dsplib.utils.excel_to_json_resources import resources_excel2json
from knora.dsplib.utils.id_to_iri import id_to_iri
from knora.dsplib.utils.onto_create_lists import create_lists
from knora.dsplib.utils.onto_create_ontology import create_ontology
from knora.dsplib.utils.onto_get import get_ontology
@@ -76,6 +77,7 @@ def program(user_args: list[str]) -> None:
    parser_upload.add_argument('-i', '--imgdir', type=str, default='.', help='Path to folder containing the images')
    parser_upload.add_argument('-S', '--sipi', type=str, default='http://0.0.0.0:1024', help='URL of SIPI server')
    parser_upload.add_argument('-v', '--verbose', action='store_true', help='Verbose feedback')
    parser_upload.add_argument('-I', '--incremental', action='store_true', help='Incremental XML upload')
    parser_upload.add_argument('xmlfile', help='path to xml file containing the data', default='data.xml')

parser_excel_lists = subparsers.add_parser('excel',
@@ -113,6 +115,14 @@ def program(user_args: list[str]) -> None:
    parser_excel_properties.add_argument('outfile', help='Path to the output JSON file containing the properties data',
                                         default='properties.json')

    parser_id2iri = subparsers.add_parser('id2iri',
                                          help='Replace internal IDs in an XML with their corresponding IRIs from a provided JSON file.')
    parser_id2iri.set_defaults(action='id2iri')
    parser_id2iri.add_argument('xmlfile', help='Path to the XML file containing the data to be replaced')
    parser_id2iri.add_argument('jsonfile', help='Path to the JSON file containing the mapping of internal IDs and their respective IRIs')
    parser_id2iri.add_argument('--outfile', default=None, help='Path to the XML output file containing the replaced IDs (optional)')
    parser_id2iri.add_argument('-v', '--verbose', action='store_true', help='Verbose feedback')

args = parser.parse_args(user_args)

if not hasattr(args, 'action'):
@@ -160,7 +170,8 @@ def program(user_args: list[str]) -> None:
imgdir=args.imgdir,
sipi=args.sipi,
verbose=args.verbose,
validate_only=args.validate)
validate_only=args.validate,
incremental=args.incremental)
elif args.action == 'excel':
list_excel2json(listname=args.listname,
excelfolder=args.excelfolder,
@@ -171,6 +182,11 @@ def program(user_args: list[str]) -> None:
    elif args.action == 'excel2properties':
        properties_excel2json(excelfile=args.excelfile,
                              outfile=args.outfile)
    elif args.action == 'id2iri':
        id_to_iri(xml_file=args.xmlfile,
                  json_file=args.jsonfile,
                  out_file=args.outfile,
                  verbose=args.verbose)


def main() -> None:
9 changes: 9 additions & 0 deletions knora/dsplib/utils/BUILD.bazel
@@ -124,3 +124,12 @@ py_library(
    imports = [".", ".."],
)

py_library(
    name = "id_to_iri",
    visibility = ["//visibility:public"],
    srcs = ["id_to_iri.py"],
    deps = [
        requirement("lxml")
    ]
)

80 changes: 80 additions & 0 deletions knora/dsplib/utils/id_to_iri.py
@@ -0,0 +1,80 @@
"""
This module handles the replacement of internal IDs with their corresponding IRIs from DSP.
"""
import json
import os
from datetime import datetime
from pathlib import Path

from lxml import etree


def id_to_iri(xml_file: str, json_file: str, out_file: str, verbose: bool) -> None:
    """
    This function replaces all occurrences of internal IDs with their respective IRIs inside an XML file. It gets the
    mapping from the JSON file provided as parameter for this function.
    Args:
        xml_file : the XML file with the data to be replaced
        json_file : the JSON file with the mapping (dict) of internal IDs to IRIs
        out_file: path to the output XML file with replaced IDs (optional), default: name of the input XML file (without extension) + "_replaced_" + timestamp + ".xml"
        verbose: verbose feedback if set to True
    Returns:
        None
    """

    # check that provided files exist
    if not os.path.isfile(xml_file):
        print(f"File {xml_file} could not be found.")
        exit(1)

    if not os.path.isfile(json_file):
        print(f"File {json_file} could not be found.")
        exit(1)

    # load JSON from provided json file to dict
    with open(json_file, encoding="utf-8", mode='r') as file:
        mapping = json.load(file)

    # parse XML from provided xml file
    tree = etree.parse(xml_file)

    # iterate through all XML elements and remove namespace declarations
    for elem in tree.getiterator():
        # skip comments and processing instructions as they do not have namespaces
        if not (
            isinstance(elem, etree._Comment)
            or isinstance(elem, etree._ProcessingInstruction)
        ):
            # remove namespace declarations
            elem.tag = etree.QName(elem).localname

    resource_elements = tree.xpath("/knora/resource/resptr-prop/resptr")
    for resptr_prop in resource_elements:
        value_before = resptr_prop.text
        value_after = mapping.get(resptr_prop.text)
        if value_after:
            resptr_prop.text = value_after
            if verbose:
                print(f"Replaced internal ID '{value_before}' with IRI '{value_after}'")

        else:  # if value couldn't be found in mapping file
            if value_before.startswith("http://rdfh.ch/"):
                if verbose:
                    print(f"Skipping '{value_before}'")
            else:
                print(f"WARNING Could not find internal ID '{value_before}' in mapping file {json_file}. "
                      f"Skipping...")

    # write xml with replaced IDs to file with timestamp
    if not out_file:
        timestamp_now = datetime.now()
        timestamp_str = timestamp_now.strftime("%Y%m%d-%H%M%S")

        file_name = Path(xml_file).stem
        out_file = file_name + "_replaced_" + timestamp_str + ".xml"

    et = etree.ElementTree(tree.getroot())
    et.write(out_file, pretty_print=True)
    print(f"XML with replaced IDs was written to file {out_file}.")
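
The default output name built at the end of `id_to_iri` can be reproduced in isolation. The sketch below mirrors the two relevant lines with a fixed timestamp so that the result is predictable; the helper name `default_out_file` and the input path are made up for the example.

```python
from datetime import datetime
from pathlib import Path


def default_out_file(xml_file: str, now: datetime) -> str:
    """Build the default output file name the same way id_to_iri does."""
    timestamp_str = now.strftime("%Y%m%d-%H%M%S")
    # Path.stem drops both the directory and the extension of the input path
    return Path(xml_file).stem + "_replaced_" + timestamp_str + ".xml"


example_name = default_out_file("data/new_data.xml", datetime(2021, 10, 26, 12, 2, 47))
# example_name == "new_data_replaced_20211026-120247.xml"
```

Since only the stem of the input path is kept, the default output file lands in the current working directory rather than next to the input file; `--outfile` can be used to choose another location.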
