fix: improve excel2xml (DEV-1361) (#232)
jnussbaum committed Sep 28, 2022
1 parent 1036acd commit a7e9d85
Showing 8 changed files with 428 additions and 501 deletions.
6 changes: 3 additions & 3 deletions docs/assets/templates/excel2xml_sample_script.py
@@ -47,9 +47,9 @@
# and if it's not there, look in "category_dict_fallback"
category_values = [category_dict.get(x.strip(), category_dict_fallback[x.strip()]) for x in
row["Category"].split(",")]
resource.append(excel2xml.make_list_prop("category", ":hasCategory", values=category_values))
resource.append(excel2xml.make_list_prop("category", ":hasCategory", category_values))
if excel2xml.check_notna(row["Complete?"]):
resource.append(excel2xml.make_boolean_prop(name=":isComplete", value=row["Complete?"]))
resource.append(excel2xml.make_boolean_prop(":isComplete", row["Complete?"]))
if excel2xml.check_notna(row["Color"]):
resource.append(excel2xml.make_color_prop(":colorprop", row["Color"]))
if pd.notna(row["Date discovered"]):
@@ -93,7 +93,7 @@

link = excel2xml.make_link("Link between Resource 0 and 1", "link_res_0_res_1")
link.append(excel2xml.make_text_prop("hasComment", "This is a comment"))
link.append(excel2xml.make_resptr_prop("hasLinkTo", values=["res_0", "res_1"]))
link.append(excel2xml.make_resptr_prop("hasLinkTo", ["res_0", "res_1"]))
root.append(link)

# write file
196 changes: 154 additions & 42 deletions docs/dsp-tools-excel2xml.md
@@ -5,30 +5,72 @@ dsp-tools assists you in converting a data source in CSV/XLS(X) format to an XML
transformation from Excel/CSV to XML:

- The CLI command `dsp-tools excel2xml` creates an XML file from an Excel/CSV file which is already structured
according to the DSP specifications. This is mostly used for DaSCH-internal data migration. The CLI command is
documented [here](dsp-tools-excel.md#cli-command-excel2xml).
according to the DSP specifications. This is mostly used for DaSCH-internal data migration. **The CLI command is
documented [here](dsp-tools-excel.md#cli-command-excel2xml).**
- The module `excel2xml` can be imported into a custom Python script that transforms any tabular data into an XML. This
use case is more frequent, because data from research projects have a variety of formats/structures. **This document
only treats the `excel2xml` module.**
use case is more frequent, because data from research projects have a variety of formats/structures. **The
`excel2xml` module is documented on this page.**

<br>
**The following example shows how to use the module `excel2xml`:**

## How to use the module excel2xml
At the end of this document, you find a sample Python script. In the following, it is explained how to use it.
Save the following files into a directory, and run the Python script:

- sample data: [excel2xml_sample_data.csv](./assets/templates/excel2xml_sample_data.csv)
- sample ontology: [excel2xml_sample_onto.json](./assets/templates/excel2xml_sample_onto.json)
- sample script: [excel2xml_sample_script.py](./assets/templates/excel2xml_sample_script.py)

### General preparation
Insert your ontology name, project shortcode, and the path to your data source. If necessary, activate one of the lines
that are commented out.
Then, the `root` element is created, which represents the `<knora>` tag of the XML document. As first children of
`<knora>`, some standard permissions are added. At the end, please carefully check the permissions of the finished XML
file if they meet your requirements, and adapt them if necessary.
The standard permission of a resource is "res-default", and of a property "prop-default". If you don't specify it
otherwise, all resources and properties get these permissions. With excel2xml, it is not possible to create resources/
properties that don't have permissions, because they would be invisible for all users except project admins and system
admins. Read more about permissions [here](./dsp-tools-xmlupload.md#how-to-use-the-permissions-attribute-in-resourcesproperties).
The Python script follows this simplified pattern:

```
1 main_df = pd.read_csv("excel2xml_sample_data.csv", dtype="str", sep=",")
2 root = excel2xml.make_root(...)
3 root = excel2xml.append_permissions(root)
4 # if necessary: create list mappings, according to explanation below
5 for index, row in main_df.iterrows():
6     resource = excel2xml.make_resource(...)
7     resource.append(excel2xml.make_text_prop(...))
8     root.append(resource)
9 excel2xml.write_xml(root, "data.xml")
```
```
1 read in your data source with the pandas library (https://pandas.pydata.org/)
2 create the root element `<knora>`
3 append the permissions
4 if necessary: create list mappings (see below)
5 iterate through the rows of your data source:
6     create the `<resource>` tag
7     append properties to it
8     append the resource to the root tag `<knora>`
9 save the finished XML file
```
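
The nine steps can be mirrored with the Python standard library alone, which may help to see what kind of tree the helper methods build. The following is a minimal sketch, not the `excel2xml` implementation: `csv` and `xml.etree.ElementTree` stand in for pandas and `excel2xml`, steps 3 and 4 (permissions and list mappings) are left out, and the column names, the resource type `:SampleThing`, the property `:hasTitle`, and the resource attributes `label`/`restype`/`id` are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row data source (stands in for the CSV file of step 1)
SAMPLE_CSV = "ID,Label,Title\nres_0,First resource,A title\nres_1,Second resource,Another title\n"


def build_xml_sketch(csv_text: str) -> ET.Element:
    # step 1: read in the data source
    rows = csv.DictReader(io.StringIO(csv_text))
    # step 2: create the root element <knora>
    root = ET.Element("knora")
    # step 5: iterate through the rows
    for row in rows:
        # step 6: create the <resource> tag (attribute names are assumptions)
        resource = ET.Element(
            "resource",
            {"label": row["Label"], "restype": ":SampleThing", "id": row["ID"], "permissions": "res-default"},
        )
        # step 7: append a property to it
        text_prop = ET.SubElement(resource, "text-prop", {"name": ":hasTitle"})
        text = ET.SubElement(text_prop, "text", {"encoding": "utf8", "permissions": "prop-default"})
        text.text = row["Title"]
        # step 8: append the resource to the root
        root.append(resource)
    return root


root = build_xml_sketch(SAMPLE_CSV)
# step 9: serialise the tree (ET.ElementTree(root).write("data.xml") would save it to a file)
xml_string = ET.tostring(root, encoding="unicode")
```

In a real script, each of these hand-written steps would be replaced by the corresponding `excel2xml` helper, which also takes care of validation.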

### Create list mappings
<br>
These steps are now explained in-depth:


## 1. Read in your data source
In the first paragraph of the sample script, insert your ontology name, project shortcode, and the path to your data
source. If necessary, activate one of the lines that are commented out.


## 2. Create root element `<knora>`
Then, the root element is created, which represents the `<knora>` tag of the XML document.


## 3. Append the permissions
As first children of `<knora>`, some standard permissions are added. At the end, please carefully check the permissions
of the finished XML file to ensure that they meet your requirements, and adapt them if necessary.

The standard permission of a resource is `res-default`, and of a property `prop-default`. If you don't specify it
otherwise, all resources and properties get these permissions.

With `excel2xml`, it is not possible to create resources/properties that don't have permissions, because they would be
invisible for all users except project admins and system admins. [Read more about permissions
here](./dsp-tools-xmlupload.md#how-to-use-the-permissions-attribute-in-resourcesproperties).


## 4. Create list mappings
Let's assume that your data source has a column containing list values named after the "label" of the JSON project list,
instead of the "name" which is needed for the `dsp-tools xmlupload`. You need a way to get the names from the labels.
If your data source uses the labels correctly, this is an easy task: The method `create_json_list_mapping()` creates a
@@ -39,38 +81,117 @@ correct JSON project node name. This happens based on string similarity. Please
no false matches!
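
To illustrate the idea behind such a mapping, here is a minimal sketch that does not use `excel2xml` at all. It assumes that a JSON project list consists of nested nodes with the keys `name`, `labels`, and `nodes`; the sample list and both function names are hypothetical. The string-similarity step is approximated with `difflib.get_close_matches` from the standard library, which is not necessarily the algorithm that dsp-tools uses.

```python
import difflib
from typing import Any, Dict, Optional

# Hypothetical excerpt of a JSON project list; the keys "name", "labels" and "nodes" are assumed
SAMPLE_LIST: Dict[str, Any] = {
    "name": "category",
    "labels": {"en": "Category"},
    "nodes": [
        {"name": "artwork", "labels": {"en": "Artwork"}},
        {
            "name": "nature",
            "labels": {"en": "Nature"},
            "nodes": [{"name": "humans", "labels": {"en": "Humans"}}],
        },
    ],
}


def label_to_name_sketch(node: Dict[str, Any], lang: str = "en") -> Dict[str, str]:
    """Recursively collect {label: name} for all nodes below the given node."""
    mapping: Dict[str, str] = {}
    for child in node.get("nodes", []):
        mapping[child["labels"][lang]] = child["name"]
        mapping.update(label_to_name_sketch(child, lang))
    return mapping


def fuzzy_lookup_sketch(cell: str, mapping: Dict[str, str], cutoff: float = 0.6) -> Optional[str]:
    """Return the node name whose label is most similar to the cell, or None if nothing is close."""
    matches = difflib.get_close_matches(cell.strip(), list(mapping), n=1, cutoff=cutoff)
    return mapping[matches[0]] if matches else None


exact = label_to_name_sketch(SAMPLE_LIST)
```

With a similarity-based lookup, a misspelled cell such as `"Artwrok"` still resolves to `artwork`. This is exactly why the result should be checked by hand for false matches.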


### Create all resources
With the help of the [Python pandas library](https://pandas.pydata.org/), you can then iterate through the rows of your
Excel/CSV, and create resources and properties. Some examples of useful helper methods are:
## 5. Iterate through the rows of your data source
With the help of Pandas, you can then iterate through the rows of your Excel/CSV, and create resources and properties.


### 6. Create the `<resource>` tag
There are four kinds of resources that can be created:

| super | tag | method |
|--------------|----------------|---------------------|
| `Resource` | `<resource>` | `make_resource()` |
| `Annotation` | `<annotation>` | `make_annotation()` |
| `Region` | `<region>` | `make_region()` |
| `LinkObj` | `<link>` | `make_link()` |

`<resource>` is the most frequent of them. The other three are [explained
here](./dsp-tools-xmlupload.md#dsp-base-resources--base-properties-to-be-used-directly-in-the-xml-file).

Special care is needed when creating the ID of a resource. Every resource must have an ID that is unique in the file
and that meets the constraints of xsd:ID. The simplest way to achieve this is to use the method
`make_xsd_id_compatible()`.
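
What "compatible with xsd:ID" means can be sketched as follows. This is a hypothetical illustration, not the implementation of `make_xsd_id_compatible()`: it restricts itself to the ASCII subset of the characters that xsd:ID allows, and it assumes that uniqueness is achieved with a random UUID suffix.

```python
import re
import uuid


def make_id_sketch(label: str) -> str:
    """Turn an arbitrary string into a string that is a valid xsd:ID (ASCII subset) and unique."""
    # replace every character that is not an ASCII letter, digit, underscore, dot or hyphen
    cleaned = re.sub(r"[^A-Za-z0-9_.\-]", "_", label)
    # an xsd:ID must start with a letter or an underscore
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    # a random suffix keeps IDs unique even if two rows have the same label
    return f"{cleaned}_{uuid.uuid4()}"
```

Because of the random suffix, two calls with the same input return different IDs. An ID that is needed later (for example as the target of a link) should therefore be stored in a variable rather than generated a second time.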


### 7. Append the properties
For every property, there is a helper function that explains itself when you hover over it. So you no longer need to
worry about how to construct a certain XML value for a certain property.

Here's how the docstrings assist you:

- method signature: names of the parameters and accepted types
- short explanation how the method behaves
- usage examples
- link to the dsp-tools documentation of this property
- a short description for every parameter
- short description of the returned object.
- Note: `etree._Element` is a type annotation of an underlying library. You don't have to care about it, as long as
you proceed as described (append the returned object to the parent resource).

![docstring example](./assets/images/img-excel2xml-module-docstring.png)


#### Fine-tuning with `PropertyElement`
There are two ways to create a property: the value can be passed as is, or wrapped in a `PropertyElement`. If it
is passed as is, the `permissions` are assumed to be `prop-default`, texts are assumed to be encoded as `utf8`, and
the value won't have a comment:
```
make_text_prop(":testproperty", "first text")
```
```
<text-prop name=":testproperty">
    <text encoding="utf8" permissions="prop-default">first text</text>
</text-prop>
```

If you want to change these defaults, you have to use a `PropertyElement` instead:
```
make_text_prop(
    ":testproperty",
    PropertyElement(
        value="first text",
        permissions="prop-restricted",
        encoding="xml",
        comment="some comment"
    )
)
```
```
<text-prop name=":testproperty">
    <text encoding="xml" permissions="prop-restricted" comment="some comment">first text</text>
</text-prop>
```
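
The relationship between a plain value, a `PropertyElement`, and the resulting XML can be made concrete with a small stand-in. The sketch below is hypothetical: `PropertyElementSketch` and `make_text_prop_sketch` are illustration names, the defaults are assumed from the description above, and a `ValueError` replaces the library's own error class.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class PropertyElementSketch:
    """Hypothetical stand-in for PropertyElement (not the library class)."""
    value: Union[str, int, float, bool]
    permissions: str = "prop-default"
    comment: Optional[str] = None
    encoding: Optional[str] = None

    def __post_init__(self) -> None:
        # if an encoding is provided, it must be "utf8" or "xml"
        if self.encoding not in ("utf8", "xml", None):
            raise ValueError(f"'{self.encoding}' is not a valid encoding")


def make_text_prop_sketch(name: str, value: Union[str, PropertyElementSketch]) -> ET.Element:
    # a plain value is wrapped, so that both cases share one code path
    elem = value if isinstance(value, PropertyElementSketch) else PropertyElementSketch(value)
    prop = ET.Element("text-prop", {"name": name})
    attribs = {"encoding": elem.encoding or "utf8", "permissions": elem.permissions}
    if elem.comment is not None:
        attribs["comment"] = elem.comment
    text = ET.SubElement(prop, "text", attribs)
    text.text = str(elem.value)
    return prop


plain = make_text_prop_sketch(":testproperty", "first text")
tuned = make_text_prop_sketch(
    ":testproperty",
    PropertyElementSketch("first text", permissions="prop-restricted", encoding="xml", comment="some comment"),
)
```

Wrapping the plain value first means that the defaults live in one place only, which is one possible reason for a design in which both forms are accepted by the same helper.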


#### Supported boolean formats
For `make_boolean_prop(cell)`, the following formats are supported:

- true: True, "true", "True", "1", 1, "yes", "Yes"
- false: False, "false", "False", "0", 0, "no", "No"

#### Create an ID for a resource
The method `make_xsd_id_compatible(string)` makes a string compatible with the constraints of xsd:ID, so that it can be
used as ID of a resource.
N/A-like values raise an error. So if your cell is empty, this method will not count it as false, but will raise an
error. If you want N/A-like values to be counted as false, you can use a construct like this:

```python
if excel2xml.check_notna(cell):
    # the cell contains usable content
    excel2xml.make_boolean_prop(":hasBoolean", cell)
else:
    # the cell is empty: you can decide to count this as "False"
    excel2xml.make_boolean_prop(":hasBoolean", False)
```
```
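
The supported boolean formats listed above amount to a simple lookup. The following sketch shows one way to express it; `parse_boolean_sketch` is a hypothetical name, and the code only reproduces the documented lists, not the actual logic inside `make_boolean_prop()`.

```python
from typing import Any

TRUE_VALUES = (True, "true", "True", "1", 1, "yes", "Yes")
FALSE_VALUES = (False, "false", "False", "0", 0, "no", "No")


def parse_boolean_sketch(cell: Any) -> bool:
    """Map a cell to True/False, raising ValueError for anything that is not in the lists above."""
    # note: `in` compares with ==, so the floats 1.0 and 0.0 are accepted as well
    if cell in TRUE_VALUES:
        return True
    if cell in FALSE_VALUES:
        return False
    raise ValueError(f"{cell!r} is not a supported boolean format")
```

An empty string or a NaN falls through both checks and raises, which mirrors the documented behaviour that N/A-like values are not silently counted as false.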

#### Create a property
For every property, there is a helper function that explains itself when you hover over it. It also has a link to
the dsp-tools documentation of this property. So you don't need to worry how to construct a certain XML value for a
certain property.

For `make_boolean_prop(cell)`, the following formats are supported:
### 8. Append the resource to root
At the end of the for-loop, remember to append the finished resource to the root.

- true: True, "true", "True", "1", 1, "yes", "Yes"
- false: False, "false", "False", "0", 0, "no", "No"

## 9. Save the file
At the very end, save the file under a name that you can choose yourself.


#### Check if a cell contains a usable value
## Other helper methods
### Check if a cell contains a usable value
The method `check_notna(cell)` checks whether a value is usable in the context of data archiving. A value is considered
usable if it is

- a number (integer or float, but not numpy.nan)
- a boolean
- a string with at least one Unicode letter, underscore, or number, but not "None", "<NA>", "N/A", or "-"
- a string with at least one Unicode letter (matching the regex `\p{L}`), underscore, ?, !, or number, but not "None",
"<NA>", "N/A", or "-"
- a PropertyElement whose "value" fulfills the above criteria
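
These criteria can be approximated with the standard library, as in the sketch below. It is not the implementation of `check_notna()`: `str.isalpha()` and `str.isdigit()` stand in for the regex classes `\p{L}` and `\d` (they are close, but not identical), the `PropertyElement` case is left out, and `is_usable_sketch` is a hypothetical name.

```python
import math
import re
from typing import Any


def is_usable_sketch(value: Any) -> bool:
    """Approximate the criteria listed above with the standard library only."""
    # booleans and integers are always usable
    if isinstance(value, (bool, int)):
        return True
    # floats are usable, except for NaN
    if isinstance(value, float):
        return not math.isnan(value)
    if isinstance(value, str):
        # at least one letter, digit, underscore, "!" or "?"
        has_content = any(ch.isalpha() or ch.isdigit() or ch in "_!?" for ch in value)
        # strings that only say "there is nothing here" are not usable
        is_na_like = re.fullmatch(r"none|<NA>|-|n/a", value, flags=re.IGNORECASE) is not None
        return has_content and not is_na_like
    return False
```

Note that `0` and `False` count as usable: the check is about whether a cell carries information, not about whether it is truthy.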


#### Calendar date parsing
### Calendar date parsing
The method `find_date_in_string(string)` tries to find a calendar date in a string. If successful, it
returns the DSP-formatted date string.

@@ -99,12 +220,3 @@ Currently supported date formats:
| 1849/1850 | GREGORIAN:CE:1849:CE:1850 |
| 1849/50 | GREGORIAN:CE:1849:CE:1850 |
| 1845-50 | GREGORIAN:CE:1845:CE:1850 |
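
How a year range is turned into a DSP-formatted date string can be sketched for the three rows shown above. The function below is a hypothetical illustration that handles only these year-range formats; it is not the implementation of `find_date_in_string()`, and it says nothing about the other supported formats.

```python
import re
from typing import Optional


def find_year_range_sketch(string: str) -> Optional[str]:
    """Find a year range such as 1849/1850, 1849/50 or 1845-50 and return it DSP-formatted."""
    # the lookarounds prevent matches inside longer expressions such as 2001-12-01
    match = re.search(r"(?<![\d/-])(\d{4})[/-](\d{4}|\d{2})(?![\d/-])", string)
    if not match:
        return None
    start, end = match.group(1), match.group(2)
    if len(end) == 2:
        # an abbreviated end year inherits the century of the start year
        end = start[:2] + end
    # a range that does not move forward in time is treated as "no date found"
    if int(end) <= int(start):
        return None
    return f"GREGORIAN:CE:{start}:CE:{end}"
```

For example, `find_year_range_sketch("painted 1845-50 in Basel")` returns `GREGORIAN:CE:1845:CE:1850`, while a full date such as `2001-12-01` is deliberately not interpreted as a range.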


## Complete example
Save the following files into a directory, and run the Python script. The features discussed in this document are
contained therein.

- sample data: [excel2xml_sample_data.csv](assets/templates/excel2xml_sample_data.csv)
- sample ontology: [excel2xml_sample_onto.json](assets/templates/excel2xml_sample_onto.json)
- sample script: [excel2xml_sample_script.py](assets/templates/excel2xml_sample_script.py)
18 changes: 3 additions & 15 deletions knora/dsplib/models/propertyelement.py
@@ -1,6 +1,4 @@
from typing import Union, Optional
import pandas as pd
import regex
from dataclasses import dataclass
from knora.dsplib.models.helpers import BaseError

@@ -10,13 +8,13 @@ class PropertyElement:
"""
A PropertyElement object carries more information about a property value than the value itself.
The "value" is the value that could be passed to a method as plain string/int/float/bool. Use a PropertyElement
instead to define more precisely what attributes your <text> tag (for example) will have.
instead to define more precisely what attributes your value tag (e.g. <text>, <uri>, ...) will have.
Args:
value: This is the content that will be written between the <text></text> tags (for example)
value: This is the content that will be written into the value tag (e.g. <text>, <uri>, ...)
permissions: These are the permissions that your <text> tag (for example) will have
comment: This is the comment that your <text> tag (for example) will have
encoding: For <text> tags only. Can be "xml" or "utf8".
encoding: For <text> tags only. If provided, it must be "xml" or "utf8".
Examples:
See the difference between the first and the second example:
@@ -40,15 +38,5 @@ class PropertyElement:
encoding: Optional[str] = None

def __post_init__(self) -> None:
if not any([
isinstance(self.value, int),
isinstance(self.value, float) and pd.notna(self.value), # necessary because isinstance(np.nan, float)
isinstance(self.value, bool),
isinstance(self.value, str) and all([
regex.search(r"\p{L}|\d|_", self.value, flags=regex.UNICODE),
not bool(regex.search(r"^(none|<NA>|-|n/a)$", self.value, flags=regex.IGNORECASE))
])
]):
raise BaseError(f"'{self.value}' is not a valid value for a PropertyElement")
if self.encoding not in ["utf8", "xml", None]:
raise BaseError(f"'{self.encoding}' is not a valid encoding for a PropertyElement")
5 changes: 3 additions & 2 deletions knora/dsplib/utils/shared.py
@@ -174,7 +174,8 @@ def check_notna(value: Optional[Any]) -> bool:
Check a value if it is usable in the context of data archiving. A value is considered usable if it is
- a number (integer or float, but not np.nan)
- a boolean
- a string with at least one Unicode letter, underscore, or number, but not "None", "<NA>", "N/A", or "-"
- a string with at least one Unicode letter (matching the regex ``\\p{L}``), underscore, !, ?, or number, but not
"None", "<NA>", "N/A", or "-"
- a PropertyElement whose "value" fulfills the above criteria
Args:
@@ -195,7 +196,7 @@ def check_notna(value: Optional[Any]) -> bool:
return True
elif isinstance(value, str):
return all([
regex.search(r"\p{L}|\d|_", value, flags=regex.UNICODE),
regex.search(r"[\p{L}\d_!?]", value, flags=regex.UNICODE),
not bool(regex.search(r"^(none|<NA>|-|n/a)$", value, flags=regex.IGNORECASE))
])
else: