feat: add command excel2json to create JSON project file from folder with Excel files (DEV-960) (#248)
jnussbaum committed Nov 9, 2022
1 parent de182dc commit e8e05e4
Showing 27 changed files with 1,403 additions and 140 deletions.
6 changes: 3 additions & 3 deletions Makefile
@@ -10,7 +10,7 @@ CURRENT_DIR := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))
.PHONY: dsp-stack
dsp-stack: ## clone the dsp-api git repository and run the dsp-stack
@mkdir -p .tmp
@git clone --branch main --single-branch --depth 1 https://github.com/dasch-swiss/dsp-api.git .tmp/dsp-stack
@git clone --branch v24.0.8 --single-branch https://github.com/dasch-swiss/dsp-api.git .tmp/dsp-stack
$(MAKE) -C .tmp/dsp-stack env-file
$(MAKE) -C .tmp/dsp-stack init-db-test
$(MAKE) -C .tmp/dsp-stack stack-up
@@ -51,7 +51,7 @@ install: ## install from source (runs setup.py)

.PHONY: test
test: dsp-stack ## run all tests located in the "test" folder (intended for local usage)
-pytest test/
-pytest test/ # ignore errors, continue anyway with stack-down
$(MAKE) stack-down

.PHONY: test-no-stack
@@ -60,7 +60,7 @@ test-no-stack: ## run all tests located in the "test" folder, without starting t

.PHONY: test-end-to-end
test-end-to-end: dsp-stack ## run e2e tests (intended for local usage)
-pytest test/e2e/
-pytest test/e2e/ # ignore errors, continue anyway with stack-down
$(MAKE) stack-down

.PHONY: test-end-to-end-ci
File renamed without changes.
File renamed without changes.
3 changes: 1 addition & 2 deletions docs/dsp-tools-create.md
@@ -437,8 +437,7 @@ To do so, it would be necessary to place the following two files into the folder
![Colors_en](./assets/images/img-list-english-colors.png)
![Farben_de](./assets/images/img-list-german-colors.png)

The expected format of the Excel files is documented
[here](./dsp-tools-excel.md#create-the-lists-section-of-a-json-project-file-from-excel-files). The only difference to
The expected format of the Excel files is documented [here](./dsp-tools-excel2json.md#lists-section). The only difference to
the explanations there is that column A of the Excel worksheet is not interpreted as list name (root node), but as
node name of the first children level below the root node.

96 changes: 51 additions & 45 deletions docs/dsp-tools-excel.md → docs/dsp-tools-excel2json.md
@@ -1,22 +1,62 @@
[![PyPI version](https://badge.fury.io/py/dsp-tools.svg)](https://badge.fury.io/py/dsp-tools)

# Excel files for data modelling and data import
# `excel2json`: Create a data model (JSON project file) from Excel

dsp-tools is able to process Excel files and output the appropriate JSON or XML file. The JSON/XML file can then be
used to create the ontology on the DSP server or import data to the DSP repository. dsp-tools can also be used to
create a list from an Excel file.
With dsp-tools, a JSON project file can be created from Excel files. The command for this is documented
[here](./dsp-tools-usage.md#create-a-json-project-file-from-excel-files).

A JSON project consists of

- 0-1 "lists" sections
- 1-n ontologies, each containing
- 1 "properties" section
- 1 "resources" section

For each of these 3 sections, one or several Excel files are necessary. The Excel files and their format are described
below. If you want to convert the Excel files to JSON, it is possible to invoke a command for each of these sections
separately (as described below).

## JSON project file: "resources" section from Excel file
But it is more convenient to use the command that creates the entire JSON project file. In order to do so, put all
involved files into a folder with the following structure:
```
data_model_files
|-- lists
| |-- de.xlsx
| `-- en.xlsx
`-- onto_name (onto_label)
|-- properties.xlsx
`-- resources.xlsx
```

Conventions for the folder names:

- The "lists" folder must have exactly this name, if it exists. It can also be omitted.
- Replace "onto_name" by your ontology's name, and "onto_label" by your ontology's label.
- The only name that can be chosen freely is the name of the topmost folder ("data_model_files" in this example).

Then, use the following command:
```
dsp-tools excel2json data_model_files project.json
```

This will create a file `project.json` with the lists, properties, and resources from the Excel files.
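
The folder conventions above can be checked before running the command. The following is only a minimal sketch, not part of dsp-tools: the helper `check_layout` is hypothetical, and it assumes that every ontology folder is named `onto_name (onto_label)` and contains `properties.xlsx` and `resources.xlsx`, and that an existing "lists" folder holds at least one `.xlsx` file.

```python
import re
from pathlib import Path

# Hypothetical helper (not part of dsp-tools).
# Assumption: ontology folders are named "onto_name (onto_label)".
ONTO_FOLDER = re.compile(r"^(?P<name>.+) \((?P<label>.+)\)$")


def check_layout(root: Path) -> list[str]:
    """Return a list of problems found in the folder layout (empty if none)."""
    problems = []
    onto_count = 0
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        if child.name == "lists":
            # Assumption: an existing "lists" folder should hold at least one Excel file
            if not any(child.glob("*.xlsx")):
                problems.append("lists: no .xlsx file found")
            continue
        if not ONTO_FOLDER.match(child.name):
            problems.append(f"{child.name}: expected 'onto_name (onto_label)'")
            continue
        onto_count += 1
        for required in ("properties.xlsx", "resources.xlsx"):
            if not (child / required).is_file():
                problems.append(f"{child.name}: missing {required}")
    if onto_count == 0:
        problems.append("no ontology folder found")
    return problems
```

Calling `check_layout(Path("data_model_files"))` returns an empty list if the layout matches these assumptions. It does not inspect the content of the Excel files.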

Please note that the "header" of the resulting JSON file is empty and thus invalid. It is necessary to add the project
shortcode, name, description, keywords, etc. by hand.
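
Filling in the header by hand can also be scripted. The sketch below rests on assumptions that should be verified against the actual output: it assumes that the header fields live in a top-level `"project"` object, and the key names in the usage note are illustrative. The function `fill_header` is hypothetical and not part of dsp-tools.

```python
import json
from pathlib import Path


def fill_header(project_file: Path, header: dict) -> dict:
    """Merge header fields into the "project" object of a JSON project file.

    Assumption: the header fields live in a top-level "project" object.
    """
    data = json.loads(project_file.read_text(encoding="utf-8"))
    project = data.setdefault("project", {})
    for key, value in header.items():
        # Only fill fields that are missing or empty; never overwrite generated content
        if not project.get(key):
            project[key] = value
    project_file.write_text(
        json.dumps(data, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return data
```

A call might look like `fill_header(Path("project.json"), {"shortcode": "0123", "keywords": ["example"]})`, where both keys are assumed names. Existing non-empty entries, such as the generated lists, stay untouched.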

Continue reading the following paragraphs to learn more about the expected structure of the Excel files.




## "resources" section

With dsp-tools, the `resources` section used in a data model (JSON) can be created from an Excel file. The command for
this is documented [here](./dsp-tools-usage.md#create-the-resources-section-of-a-json-project-file-from-an-excel-file).
Only `XLSX` files are allowed. The `resources` section can be inserted into the ontology file and then be uploaded onto
a DSP server.

**An Excel file template can be found [here](assets/templates/resources_template.xlsx). It is recommended to work from
**An Excel file template can be found [here](assets/data_model_templates/onto_name (onto_label)/resources.xlsx). It is recommended to work from
the template.**

The expected worksheets of the Excel file are:
@@ -51,14 +91,14 @@ For further information about resources, see [here](./dsp-tools-create-ontologie



## JSON project file: "properties" section from Excel file
## "properties" section

With dsp-tools, the `properties` section used in a data model (JSON) can be created from an Excel file. The command for
this is documented [here](./dsp-tools-usage.md#create-the-properties-section-of-a-json-project-file-from-an-excel-file).
Only the first worksheet of the Excel file is considered and only XLSX files are allowed. The `properties` section can
be inserted into the ontology file and then be uploaded onto a DSP server.

**An Excel file template can be found [here](assets/templates/properties_template.xlsx). It is recommended to work
**An Excel file template can be found [here](assets/data_model_templates/onto_name (onto_label)/properties.xlsx). It is recommended to work
from the template.**

The Excel sheet must have the following structure:
@@ -84,7 +124,7 @@ For further information about properties, see [here](./dsp-tools-create-ontologi



## JSON project file: "lists" section from Excel file(s)
## "lists" section

With dsp-tools, the "lists" section of a JSON project file can be created from one or several Excel files. The lists can
then be inserted into a JSON project file and uploaded to a DSP server. The command for this is documented
@@ -116,8 +156,8 @@ Some notes:
printed out if the list is not valid.

**It is recommended to work from the following templates:
[en.xlsx](assets/templates/lists/en.xlsx): File with the English labels
[de.xlsx](assets/templates/lists/de.xlsx): File with the German labels**
[en.xlsx](assets/data_model_templates/lists/en.xlsx): File with the English labels
[de.xlsx](assets/data_model_templates/lists/de.xlsx): File with the German labels**

The output of the above command, with the template files, is:

@@ -190,37 +230,3 @@ The output of the above command, with the template files, is:
]
}
```



## XML data file from Excel/CSV file

There are two use cases for a transformation from Excel/CSV to XML:

- The CLI command `dsp-tools excel2xml` creates an XML file from an Excel/CSV file which is already structured
according to the DSP specifications. This is mostly used for DaSCH-internal data migration.
- The module `excel2xml` can be imported into a custom Python script that transforms any tabular data into an XML. This
use case is more frequent, because data from research projects have a variety of formats/structures. The module
`excel2xml` is documented [here](./dsp-tools-excel2xml.md).


### CLI command `excel2xml`

The command line tool is used as follows:
```bash
dsp-tools excel2xml data-source.xlsx 1234 shortname
```

There are no flags/options for this command.

The Excel file must be structured as in this image:
![img-excel2xml.png](assets/images/img-excel2xml.png)

Some notes:

- The special tags `<annotation>`, `<link>`, and `<region>` are represented as resources of restype `Annotation`,
`LinkObj`, and `Region`.
- The columns "ark", "iri", and "creation_date" are only used for DaSCH-internal data migration.
- If `file` is provided, but no `file permissions`, an attempt will be started to deduce them from the resource
permissions (`res-default` --> `prop-default` and `res-restricted` --> `prop-restricted`). If this attempt is not
successful, a `BaseError` will be raised.
12 changes: 7 additions & 5 deletions docs/dsp-tools-excel2xml.md
@@ -1,11 +1,13 @@
[![PyPI version](https://badge.fury.io/py/dsp-tools.svg)](https://badge.fury.io/py/dsp-tools)

# `excel2xml`: Convert a data source to XML
dsp-tools assists you in converting a data source in CSV/XLS(X) format to an XML file.
# Module `excel2xml`: Convert a data source to XML

| **Hint** |
|-------------------------------------------------------------------------------------------------------------------------------------------|
| This page is about the **module** `excel2xml`. The CLI command is documented [here](dsp-tools-excel.md#xml-data-file-from-excelcsv-file). |
This page is about the module `excel2xml`, which can be imported into a custom Python script that transforms any tabular
data into XML.

There is also a CLI command `dsp-tools excel2xml` that creates an XML file from an Excel/CSV file which is already
structured according to the DSP specifications. The CLI command is documented
[here](./dsp-tools-usage.md#use-the-module-excel2xml-to-convert-a-data-source-to-xml).

To demonstrate the usage of the `excel2xml` module, there is a GitHub repository named `0123-import-scripts`. It
contains:
99 changes: 56 additions & 43 deletions docs/dsp-tools-usage.md
@@ -32,13 +32,13 @@ dsp-tools create [options] project_definition.json

The following options are available:

- `-s` | `--server` _server_: URL of the DSP server (default: 0.0.0.0:3333)
- `-u` | `--user` _username_: username used for authentication with the DSP API (default: root@example.com)
- `-p` | `--password` _password_: password used for authentication with the DSP API (default: test)
- `-V` | `--validate-only`: If set, only the validation of the JSON file is performed.
- `-l` | `--lists-only`: If set, only the lists are created. Please note that in this case the project must already exist.
- `-v` | `--verbose`: If set, more information about the progress is printed to the console.
- `-d` | `--dump`: If set, dump test files for DSP-API requests.
- `-s` | `--server` (optional, default: `0.0.0.0:3333`): URL of the DSP server
- `-u` | `--user` (optional, default: `root@example.com`): username used for authentication with the DSP API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP API
- `-V` | `--validate-only` (optional): If set, only the validation of the JSON file is performed.
- `-l` | `--lists-only` (optional): If set, only the lists are created. Please note that in this case the project must already exist.
- `-v` | `--verbose` (optional): If set, more information about the progress is printed to the console.
- `-d` | `--dump` (optional): If set, dump test files for DSP-API requests.

The command is used to read the definition of a project with its data model(s) (provided in a JSON file) and create it
on the DSP server. The following example shows how to upload the project defined in `project_definition.json` to the DSP
@@ -61,12 +61,12 @@ dsp-tools get [options] output_file.json

The following options are available:

- `-s` | `--server`: URL of the DSP server (default: 0.0.0.0:3333)
- `-u` | `--user`: username used for authentication with the DSP API (default: root@example.com)
- `-p` | `--password`: password used for authentication with the DSP API (default: test)
- `-P` | `--project`: shortcode, shortname or
[IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier) of the project (mandatory)
- `-v` | `--verbose`: If set, some information about the progress is printed to the console.
- `-s` | `--server` (optional, default: `0.0.0.0:3333`): URL of the DSP server
- `-u` | `--user` (optional, default: `root@example.com`): username used for authentication with the DSP API
- `-p` | `--password` (optional, default: `test`): password used for authentication with the DSP API
- `-P` | `--project` (mandatory): shortcode, shortname or
[IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier) of the project
- `-v` | `--verbose` (optional): If set, some information about the progress is printed to the console.

The command is used to get the definition of a project with its data model(s) from a DSP server and write it into a JSON
file. This JSON file can then be used to create the same project on another DSP server. The following example shows how
@@ -131,21 +131,34 @@ to use this file to replace internal IDs in an existing XML file to reference ex



## Create the "lists" section of a JSON project file from Excel files
## Create a JSON project file from Excel files

```
dsp-tools excel2json data_model_files project.json
```

The expected file and folder structures are described [here](./dsp-tools-excel2json.md#json-project-file-from-excel).




### Create the "lists" section of a JSON project file from Excel files

```bash
dsp-tools excel2lists folder output.json
dsp-tools excel2lists [options] folder output.json
```

Arguments:
- `folder` (optional, default: "lists"): folder with the Excel file(s)
- `output.json` (optional, default: "lists.json"): Output file
The following options are available:

- `-v` | `--verbose` (optional): If set, more information about the progress is printed to the console.

The expected Excel format is [documented here](./dsp-tools-excel.md#create-the-lists-section-of-a-json-project-file-from-excel-files).
The expected Excel format is [documented here](./dsp-tools-excel2json.md#lists-section).

**Tip: The command [`excel2json`](#create-a-json-project-file-from-excel-files) might be more convenient to use.**


## Create the "resources" section of a JSON project file from an Excel file

### Create the "resources" section of a JSON project file from an Excel file

```bash
dsp-tools excel2resources excel_file.xlsx output_file.json
@@ -154,20 +167,14 @@ dsp-tools excel2resources excel_file.xlsx output_file.json
The command creates the resources section of an ontology from an Excel file. The Excel file must contain the data in
its first worksheet.

The following example shows how to create the resources section from an Excel file called `Resources.xlsx`. The output
is written to a file called `resources.json`.

```bash
dsp-tools excel2resources Resources.xlsx resources.json
```
The expected Excel format is [documented here](./dsp-tools-excel2json.md#resources-section).

More information about the usage of this command can be
found [here](./dsp-tools-excel.md#create-the-resources-for-a-data-model-from-an-excel-file).
**Tip: The command [`excel2json`](#create-a-json-project-file-from-excel-files) might be more convenient to use.**




## Create the "properties" section of a JSON project file from an Excel file
### Create the "properties" section of a JSON project file from an Excel file

```bash
dsp-tools excel2properties excel_file.xlsx output_file.json
@@ -176,32 +183,38 @@ dsp-tools excel2properties excel_file.xlsx output_file.json
The command creates the properties section of an ontology from an Excel file. The Excel file must contain the data in
its first worksheet.

The following example shows how to create the properties section from an Excel file called `Properties.xlsx`. The output
is written to a file called `properties.json`.

```bash
dsp-tools excel2properties Properties.xlsx properties.json
```
The expected Excel format is [documented here](./dsp-tools-excel2json.md#properties-section).

More information about the usage of this command can be found
[here](./dsp-tools-excel.md#create-the-properties-for-a-data-model-from-an-excel-file).
**Tip: The command [`excel2json`](#create-a-json-project-file-from-excel-files) might be more convenient to use.**



## Create an XML file from Excel/CSV

If your data source is already structured according to the DSP specifications, but it is not in XML format yet, the
command `excel2xml` will transform it into XML. This is mostly used for DaSCH-internal data migration.

```bash
dsp-tools excel2xml data-source.xlsx project_shortcode ontology_name
```

Arguments:

- data-source.xlsx: An Excel/CSV file that is structured according to [these requirements](dsp-tools-excel.md#cli-command-excel2xml)
- project_shortcode: The four-digit hexadecimal shortcode of the project
- ontology_name: the name of the ontology that the data belongs to
- data-source.xlsx (mandatory): An Excel/CSV file that is structured as explained below
- project_shortcode (mandatory): The four-digit hexadecimal shortcode of the project
- ontology_name (mandatory): the name of the ontology that the data belongs to
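
The three positional arguments can be assembled and checked in a script before the command is run. This is a minimal sketch under assumptions: `build_excel2xml_command` is a hypothetical helper, and "four-digit hexadecimal" is interpreted as exactly four characters from 0-9 and A-F, in either case.

```python
import re

# Assumption: exactly four hexadecimal characters, upper or lower case.
SHORTCODE_PATTERN = re.compile(r"[0-9A-Fa-f]{4}")


def build_excel2xml_command(data_source: str, shortcode: str, ontology_name: str) -> list[str]:
    """Return the argument list for `dsp-tools excel2xml`, after checking the shortcode."""
    if SHORTCODE_PATTERN.fullmatch(shortcode) is None:
        raise ValueError(f"not a four-digit hexadecimal shortcode: {shortcode!r}")
    return ["dsp-tools", "excel2xml", data_source, shortcode, ontology_name]
```

The returned list can be passed to `subprocess.run(..., check=True)`, which avoids shell quoting problems with file names that contain spaces.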

If your data source is already structured according to the DSP specifications, but it is not in XML format yet, the
command `excel2xml` will transform it into XML. This is mostly used for DaSCH-internal data migration. There are no
flags/options for this command. The details of this command are documented [here](dsp-tools-excel.md#cli-command-excel2xml).
The Excel file must be structured as in this image:
![img-excel2xml.png](assets/images/img-excel2xml.png)

Some notes:

- The special tags `<annotation>`, `<link>`, and `<region>` are represented as resources of restype `Annotation`,
`LinkObj`, and `Region`.
- The columns "ark", "iri", and "creation_date" are only used for DaSCH-internal data migration.
- If `file` is provided, but no `file permissions`, dsp-tools tries to deduce them from the resource
permissions (`res-default` --> `prop-default` and `res-restricted` --> `prop-restricted`). If this is not
successful, a `BaseError` is raised.
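
The deduction rule in the last note can be pictured as a simple lookup. The following sketch is not the dsp-tools implementation: the function name is hypothetical, and a plain `ValueError` stands in for the `BaseError` of dsp-tools.

```python
from typing import Optional

# Mapping taken from the note above: resource permissions -> file permissions
PERMISSION_MAP = {
    "res-default": "prop-default",
    "res-restricted": "prop-restricted",
}


def deduce_file_permissions(
    resource_permissions: Optional[str],
    file_permissions: Optional[str] = None,
) -> str:
    """Return the file permissions, deducing them from the resource permissions if absent."""
    if file_permissions:
        # Explicit file permissions always win
        return file_permissions
    try:
        return PERMISSION_MAP[resource_permissions]
    except KeyError:
        # Stand-in for the BaseError that dsp-tools raises in this situation
        raise ValueError(
            f"cannot deduce file permissions from {resource_permissions!r}"
        ) from None
```

Any resource permission outside the two mapped values, including a missing one, leads to the error branch.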

If your data source is not yet structured according to the DSP specifications, you need a custom Python script for the
data transformation. For this, you might want to import the module `excel2xml` into your Python script, which is
2 changes: 2 additions & 0 deletions docs/index.md
@@ -20,6 +20,8 @@ dsp-tools helps you with the following tasks:
a DSP server and writes it into a JSON file.
- [`dsp-tools xmlupload`](./dsp-tools-usage.md#upload-data-to-a-dsp-server) uploads data from an XML file (bulk
data import) and writes the mapping from internal IDs to IRIs into a local file.
- [`dsp-tools excel2json`](./dsp-tools-usage.md#create-a-json-project-file-from-excel-files) creates an entire JSON
project file from a folder with Excel files in it.
- [`dsp-tools excel2lists`](./dsp-tools-usage.md#create-the-lists-section-of-a-json-project-file-from-excel-files)
creates the "lists" section of a JSON project file from one or several Excel files. The resulting section can be
integrated into a JSON project file and then be uploaded to a DSP server with `dsp-tools create`.
