docs: improve docs and example data for excel2xml (DEV-1370) #233

Merged
merged 35 commits into from Oct 6, 2022
Changes from 19 commits
Commits
307c037
edit
jnussbaum Sep 28, 2022
f4b0a6c
edit
jnussbaum Sep 28, 2022
f13a870
simplify onto
jnussbaum Sep 29, 2022
dfda6e3
update sample onto and sample script
jnussbaum Sep 29, 2022
b2bfa9c
edit
jnussbaum Sep 29, 2022
9747cdc
rename templates
jnussbaum Sep 29, 2022
b276d63
add zip
jnussbaum Sep 29, 2022
167df49
edit
jnussbaum Sep 30, 2022
d908bcb
improve validation of integer and decimal
jnussbaum Sep 30, 2022
af967f3
fix unittest
jnussbaum Sep 30, 2022
d54bbbf
edit
jnussbaum Sep 30, 2022
6596369
install GitPython
jnussbaum Sep 30, 2022
fe89ab9
add explanations to README of 0123-import-scripts
jnussbaum Oct 3, 2022
607bc3e
add explanations to README of 0123-import-scripts
jnussbaum Oct 3, 2022
33d77a4
edit
jnussbaum Oct 3, 2022
58a2f50
Revert "install GitPython"
jnussbaum Oct 3, 2022
833c799
fix e2e test
jnussbaum Oct 3, 2022
28f41dd
add tearDown
jnussbaum Oct 4, 2022
114c850
sort images alphabetically, so that on different platforms, the outpu…
jnussbaum Oct 4, 2022
2325c69
improve file name
jnussbaum Oct 4, 2022
7a1aba2
make alphabetical sorting platform-independent
jnussbaum Oct 4, 2022
c99a005
add proper derandomization of XML document
jnussbaum Oct 4, 2022
ab4a06f
add PDF of readme
jnussbaum Oct 4, 2022
d49fd33
Revert "add PDF of readme"
jnussbaum Oct 4, 2022
27aa197
move recommended extension to .vscode
jnussbaum Oct 4, 2022
75a9a28
improve readme
jnussbaum Oct 4, 2022
01da069
rename folders and files
jnussbaum Oct 5, 2022
979731e
edit
jnussbaum Oct 5, 2022
c94deee
delete import_scripts
jnussbaum Oct 5, 2022
12560dc
add git submodule
jnussbaum Oct 5, 2022
4ca9d4a
make submodule work
jnussbaum Oct 5, 2022
6d7a71c
move submodule
jnussbaum Oct 5, 2022
6825011
adapt e2e test
jnussbaum Oct 5, 2022
3d179d2
import_scripts % git checkout main
jnussbaum Oct 6, 2022
4749d59
initialize the git submodule
jnussbaum Oct 6, 2022
Binary file added docs/assets/0123-import-scripts.zip
Binary file not shown.
142 changes: 142 additions & 0 deletions docs/assets/0123-import-scripts/README.md
@@ -0,0 +1,142 @@
# Welcome to 0123-import-scripts!

This is a template repository that can be used for the archiving process of a big dataset at the end of a research
project's lifetime. Download and unpack this repository from the
[excel2xml documentation page](https://docs.dasch.swiss/latest/DSP-TOOLS/dsp-tools-excel2xml/).

In this README, you will learn how to write a Python script for preparing data for an import into DSP.

Featuring:

- first steps with Visual Studio Code
- the module `excel2xml` of dsp-tools
- the benefits of Version Control with Git
- the benefits of the Debugging Mode
- extras: OpenRefine, Git GUIs, regexr


# First steps with Visual Studio Code (VSC)
[**Visual Studio Code**](https://code.visualstudio.com/download) is a widely used IDE that is free to use and
recommended for your daily work at DaSCH. Recommended extensions to install from within VSC:

- redhat.vscode-xml
- ms-python.python
- ms-python.vscode-pylance
- zainchen.json
- nickdemayo.vscode-json-editor
- visualstudioexptteam.vscodeintellicode
- dotjoshjohnson.xml


## Initialize Git
Open this repository in Visual Studio Code, change to the "Source Control" tab, and click on "Initialize Repository":

![git init](assets/git-init.png)

Stage all changes, write "init" as commit message, and commit all changes:

![git commit](assets/git-commit.png)

You now have the option "Publish Branch". This synchronizes your local work with a GitHub repository on
[https://github.com/dasch-swiss/](https://github.com/dasch-swiss/). For this purpose, replace `0123` by your project's
shortcode, and `import` by your project's shortname. This is especially recommended for big projects that you spend
weeks or months on, when you want to have a backup, or when you want to invite colleagues for collaboration or a code
review.


## Choose a Python interpreter
Open `import-script.py`. You can now choose a Python interpreter by clicking on the version number at the bottom right.
You can either work with the global (system-wide) Python, or you can create a
[virtual environment](https://python.land/virtual-environments) for your project. If you don't know which one to choose,
take the one installed via Homebrew, which is located in `/usr/local/Cellar`. You probably already have a symlink
(`/usr/local/bin/python3` or `/usr/bin/python3`) that redirects to `/usr/local/Cellar`. The only thing you
should not do is select a virtual environment of another project.

![python interpreter](assets/python-interpreter.png)
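
If you are unsure which interpreter is actually running your code, a short check like the following can help. It is only a sketch that uses the standard library; the function name `describe_interpreter` is made up for this illustration and is not part of the template.

```python
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the interpreter that is running this code."""
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Run it from the Terminal of Visual Studio Code or paste it into the Debug Console. If `in_virtualenv` is true and the path in `executable` belongs to another project, choose a different interpreter.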


## The benefits of the debugging mode
To start the debugging process, switch to the "Run and Debug" tab.

1. Set a breakpoint.
2. Click "Run and Debug".
3. Choose "Debug the currently active Python file".
4. The control bar appears, and debugging starts.

![configure debugging](assets/configure-debugging.png)

Code execution pauses at your breakpoint, that is, before the line with the breakpoint is executed. Use
this opportunity to inspect what has been done so far in the "Variables" area on the left, where the current state of
the program is shown.

If one of the dependencies is not installed, install it with `pip install package` in the Terminal of Visual Studio
Code.
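
To find out up front which dependencies are missing, instead of waiting for an `ImportError`, you could use a check along these lines. This is a minimal sketch based on `importlib.util.find_spec` from the standard library. The helper `missing_packages` and the list `required` are hypothetical, so adapt the list to the imports of your own script. Keep in mind that the name you import is not always the name you pass to `pip install`.

```python
import importlib.util


def missing_packages(names: list) -> list:
    """Return the top-level module names that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the modules your script imports.
    required = ["pandas"]
    for name in missing_packages(required):
        print(f"Missing: {name} (try: pip install {name})")
```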

![pause at breakpoint](assets/pause-at-breakpoint.png)

If you want to experiment with different ways to proceed, go to the "Debug Console", where you can execute
code. For example, let's inspect the pandas DataFrame by typing `main_df.info`.
You will see that there are some empty rows at the end which don't contain useful data. The next two lines of code
eliminate them. Click on "Step Over" twice, or set a new breakpoint two lines further down and click on "Continue".
Now, type `main_df.info` in the Debug Console again. You will see that the empty rows are gone.

![inspect dataframe](assets/inspect-dataframe.png)
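
To illustrate what "eliminating empty rows" means, here is a sketch that deliberately uses the standard library `csv` module instead of pandas, so it is not the code from `import-script.py`. The constant `RAW` is a shortened excerpt in the shape of `data-raw.csv`, and the helper `drop_empty_rows` is made up for this illustration.

```python
import csv
import io

# Shortened excerpt in the shape of data-raw.csv: the last two rows hold
# only empty or whitespace-only cells.
RAW = """Object,Title,Weight (kg)
Anubis,Bengal cat,4.8
Meteorite,Gibeon Meteorite,0.3
,,
, ,
"""


def drop_empty_rows(rows: list) -> list:
    """Keep only rows in which at least one cell contains non-whitespace text."""
    return [row for row in rows if any((cell or "").strip() for cell in row.values())]


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = drop_empty_rows(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Note that a cell containing only a space counts as empty here. Whether that is the right rule for your data is a decision you have to make per project.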


You see that the debugging mode is a useful tool to understand code and to inspect it for correctness.

| **Tip** |
|-----------------------------------------------------------------------------------------------------------|
| **Make regular use of the debugging mode to check if your code really does what you think it should do!** |


## The benefits of version control
One of the big benefits of version control is the diff viewer. Visual Studio Code highlights the changes you have introduced
since your last commit.

- Deletions are shown as a red triangle.
- Additions are shown as green bars.
- Changed lines are shown as striped bars.

Click on these visual elements to see a small popup that shows you the difference. In the popup, you can stage the
change, revert it, or jump to the next/previous change.

![diffs](assets/diffs.png)
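
If you are curious what a diff consists of, the following sketch computes one with `difflib` from the standard library. It is only an illustration of the concept, not how Visual Studio Code or Git compute their diffs, and the two code snippets being compared are invented.

```python
import difflib

# Invented "before" and "after" versions of three lines of a script.
before = ["title = row['Title']", "weight = row['Weight']", "print(title)"]
after = ["title = row['Title'].strip()", "weight = row['Weight']", "print(title, weight)"]

diff = list(
    difflib.unified_diff(before, after, fromfile="last commit", tofile="working copy", lineterm="")
)

# Lines starting with "-" were removed, lines starting with "+" were added.
# The "---" and "+++" lines are file headers and are skipped.
removed = [line[1:] for line in diff if line.startswith("-") and not line.startswith("---")]
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]

print("\n".join(diff))
```

A changed line appears as one removal followed by one addition.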

Once you have a bunch of code changes that can be meaningfully grouped together, you should make a commit (and perhaps
push it to a GitHub repo).


| **Tips** |
|-------------------------------------------------------------------------|
| **Test your code (e.g. with the debugging mode) before committing it.** |
| **Make small commits that contain only one new feature.** |


## Some extras
### Data cleaning with OpenRefine
[**OpenRefine**](https://openrefine.org/) is a tool for working with messy data. Once downloaded and installed, it runs
as a local server, accessed by your browser. So, all data remains on your own machine. Installation is quick and
painless:
```
brew install openrefine
```

OpenRefine has two potential uses in the everyday work of the Client Services at DaSCH:
1. Data cleaning (recommended): For this purpose, you can think of OpenRefine as a much better version of Excel. You
can perform operations which would be very tiresome in Excel.
2. Conversion to our DSP-customised XML format for bulk upload (not recommended)

Review comment: What does dps-customised mean?

Reply from the author: Ups, that's a weird wording. In my tutorial on the new homepage, I have "DSP-specific XML format". That's better, I guess.


Read more in [this report](https://docs.google.com/document/d/1Y_hZV8UV-Irw-7PLdhm0BGKnGfXJs8JO).


### Git GUIs
Git can be complicated, so you may appreciate working with one of these GUIs:

- [**SmartGit**](https://www.syntevo.com/smartgit/) has a free edition.
- [**GitHub Desktop**](https://desktop.github.com/)
- [**SourceTree**](https://www.sourcetreeapp.com/)

### Learn, build and test RegEx
[https://regexr.com](https://regexr.com)
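
Once a pattern works on regexr, you can use it in Python with the `re` module. The following sketch assumes you want to turn dates written as `2015_01_01` (as in the Date column of `data-raw.csv`) into `2015-01-01`. The names `UNDERSCORE_DATE` and `normalize_underscore_date` are made up for this illustration, and the other date notations in the sample data would need their own patterns.

```python
import re
from typing import Optional

# Matches dates written as YYYY_MM_DD, e.g. "2015_01_01".
UNDERSCORE_DATE = re.compile(r"^(\d{4})_(\d{2})_(\d{2})$")


def normalize_underscore_date(value: str) -> Optional[str]:
    """Return "YYYY-MM-DD" if `value` is written as "YYYY_MM_DD", else None."""
    match = UNDERSCORE_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day = match.groups()
    return f"{year}-{month}-{day}"


print(normalize_underscore_date("2015_01_01"))
print(normalize_underscore_date("1849/1850"))
```

The pattern only checks the shape of the string. It does not verify that the month and day form a valid calendar date.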
Binary file added docs/assets/0123-import-scripts/assets/diffs.png
7 changes: 7 additions & 0 deletions docs/assets/0123-import-scripts/data-raw.csv
@@ -0,0 +1,7 @@
Object,Title,Description,Category,Public,Color,Date,Time,Weight (kg),Location,URL, ,
Anubis,Bengal cat,An example of a domesticated cat,"Säugetiere, Huumans",yes,#f5f5dc,2015_01_01,2015-01-01T13:45:12Z,4.8,2661604,https://en.wikipedia.org/wiki/Cat,,
Meteorite,Gibeon Meteorite,This is a piece of the so-called Gibeon Meteroite,Physics,1,#808080,"March 5,1908",1908-03-05T12:00:00-05:00,0.3,11821111,https://en.wikipedia.org/wiki/Gibeon_(meteorite),,
BM1888-0601-716,Lekythos,"Attic red-figured Lekythos BM 1888,601.716",Artwörk,TRUE,,1.12.1973 - 6.1.1974,,0.5,351274,https://www.britishmuseum.org/collection/object/G_1888-0601-716,,
Horohoroto,Picture and Poem by Matsuo Bashō,ほろほろと山吹ちるかたきのおと,Kunstwerk,,,1849/1850,,1,,https://en.wikipedia.org/wiki/Haiku#/media/File:Basho_Horohoroto.jpg,,
,,,,,,,,,,,,
, ,,,,,,,,,,,
Binary file added docs/assets/0123-import-scripts/images/Anubis.jpg
218 changes: 218 additions & 0 deletions docs/assets/0123-import-scripts/import-project.json
@@ -0,0 +1,218 @@
{
"$schema": "https://raw.githubusercontent.com/dasch-swiss/dsp-tools/main/knora/dsplib/schemas/ontology.json",
"project": {
"shortcode": "0123",
"shortname": "import",
"longname": "Template project for importing data to DaSCH",
"descriptions": {
"en": "Template project to demonstrate the archiving process of a big dataset at the end of a research project's lifetime."
},
"keywords": ["Data and Service Center for the Humanities (DaSCH)"],
"lists": [
{
"name": "category",
"labels": {"de": "Kategorie", "en": "Category"},
"comments": {"en": "A list containing categories", "de": "Eine Liste mit Kategorien"},
"nodes": [
{
"name": "artwork",
"labels": {"de": "Kunstwerk", "en": "Artwork"}
},
{
"name": "nature",
"labels": {"de": "Natur", "en": "Nature"},
"nodes": [
{
"name": "humans",
"labels": {"de": "Menschen", "en": "Humans"}
},
{
"name": "animals",
"labels": {"de": "Tiere", "en": "Animals"},
"nodes": [
{
"name": "mammals",
"labels": {"de": "Säugetiere", "en": "Mammals"}
},
{
"name": "birds",
"labels": {"de": "Vögel", "en": "Birds"}
},
{
"name": "reptiles",
"labels": {"de": "Reptilien", "en": "Reptiles"}
}
]
},
{
"name": "plants",
"labels": {"de": "Pflanzen", "en": "Plants"}
},
{
"name": "physics",
"labels": {"de": "Physik", "en": "Physics"}
}
]
}
]
}
],
"ontologies": [
{
"name": "import",
"label": "The template ontology for data import",
"properties": [
{
"name": "hasTime",
"super": ["hasValue"],
"object": "TimeValue",
"labels": {"en": "Time"},
"gui_element": "TimeStamp"
},
{
"name": "hasImage",
"super": ["hasLinkTo"],
"object": ":Image2D",
"labels": {"en": "Image"},
"gui_element": "Searchbox"
},
{
"name": "hasDescription",
"super": ["hasValue"],
"object": "TextValue",
"labels": {"en": "Description"},
"gui_element": "Richtext"
},
{
"name": "hasName",
"super": ["hasValue"],
"object": "TextValue",
"labels": {"en": "Name"},
"gui_element": "SimpleText"
},
{
"name": "hasTitle",
"super": ["hasValue"],
"object": "TextValue",
"labels": {"en": "Title"},
"gui_element": "SimpleText"
},
{
"name": "hasDate",
"super": ["hasValue"],
"object": "DateValue",
"labels": {"en": "Dating"},
"gui_element": "Date"
},
{
"name": "hasLocation",
"super": ["hasValue"],
"object": "GeonameValue",
"labels": {"en": "Location"},
"gui_element": "Geonames"
},
{
"name": "hasExternalLink",
"super": ["hasValue"],
"object": "UriValue",
"labels": {"en": "External link"},
"gui_element": "SimpleText"
},
{
"name": "hasCategory",
"super": ["hasValue"],
"object": "ListValue",
"labels": {"en": "Category"},
"gui_element": "List",
"gui_attributes": {"hlist": "category"}
},
{
"name": "hasColor",
"super": ["hasColor"],
"object": "ColorValue",
"labels": {"en": "Colour"},
"gui_element": "Colorpicker"
},
{
"name": "isPublic",
"super": ["hasValue"],
"object": "BooleanValue",
"labels": {"en": "Public"},
"gui_element": "Checkbox"
},
{
"name": "hasWeight",
"super": ["hasValue"],
"object": "DecimalValue",
"labels": {"en": "Weight"},
"gui_element": "SimpleText"
}
],
"resources": [
{
"name": "Image2D",
"labels": {"en": "2D image"},
"super": "StillImageRepresentation",
"cardinalities": [
{
"propname": ":hasTitle",
"cardinality": "1"
}
]
},
{
"name": "Object",
"labels": {"en": "Object"},
"super": "Resource",
"cardinalities": [
{
"propname": ":hasImage",
"cardinality": "0-n"
},
{
"propname": ":hasCategory",
"cardinality": "0-n"
},
{
"propname": ":hasName",
"cardinality": "0-n"
},
{
"propname": ":hasDescription",
"cardinality": "0-n"
},
{
"propname": ":hasWeight",
"cardinality": "0-n"
},
{
"propname": ":hasDate",
"cardinality": "0-n"
},
{
"propname": ":hasLocation",
"cardinality": "0-1"
},
{
"propname": ":isPublic",
"cardinality": "0-1"
},
{
"propname": ":hasColor",
"cardinality": "0-n"
},
{
"propname": ":hasExternalLink",
"cardinality": "0-n"
},
{
"propname": ":hasTime",
"cardinality": "0-n"
}
]
}
]
}
]
}
}
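
A project file like the one above can be checked for internal consistency before it is used. The following sketch checks that every `propname` in a resource's cardinalities is defined as a property in the same ontology. It is a hypothetical helper, not part of dsp-tools, and it assumes that a leading `:` refers to the current ontology, as the file above suggests. `PROJECT` is a shortened excerpt in which `hasColor` is left undefined on purpose, so that the check has something to report. In the full file above, every `propname` has a matching property.

```python
import json

# Shortened excerpt in the shape of import-project.json.
# "hasColor" is deliberately not defined under "properties".
PROJECT = json.loads("""
{
  "project": {
    "ontologies": [
      {
        "name": "import",
        "properties": [
          {"name": "hasTitle", "object": "TextValue"},
          {"name": "hasWeight", "object": "DecimalValue"}
        ],
        "resources": [
          {
            "name": "Object",
            "cardinalities": [
              {"propname": ":hasTitle", "cardinality": "0-n"},
              {"propname": ":hasWeight", "cardinality": "0-n"},
              {"propname": ":hasColor", "cardinality": "0-n"}
            ]
          }
        ]
      }
    ]
  }
}
""")


def undefined_propnames(project: dict) -> list:
    """List (resource, propname) pairs whose property is not defined in the same ontology."""
    problems = []
    for onto in project["project"]["ontologies"]:
        defined = {prop["name"] for prop in onto.get("properties", [])}
        for res in onto.get("resources", []):
            for card in res.get("cardinalities", []):
                propname = card["propname"]
                # Assumption: ":" marks the current ontology; other prefixes are not checked.
                if propname.startswith(":") and propname[1:] not in defined:
                    problems.append((res["name"], propname))
    return problems


print(undefined_propnames(PROJECT))
```

To check a real file, load it with `json.load` and pass the result to `undefined_propnames`.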