This template repository is intended for repositories containing datasets generated by GLAM Workbench notebooks. GW data repositories will be associated with GW code repos (eg: trove-books-data contains datasets created by trove-books), and each individual dataset should be linked to the notebook that created it.
Relationships between datasets, notebooks, and documentation are captured in an ro-crate-metadata.json file, following the RO-Crate standard. This repository contains some useful scripts for updating and managing the RO-Crate metadata and associated Frictionless table schemas describing datasets.
Use the update_crate.py script in the corresponding code repo to generate the initial ro-crate-metadata.json file for this repository.
The RO-Crate file for a data repo differs from the version created for a code repo. In particular:
- ids for notebooks are converted to full urls pointing to the code repo
- ids for datasets are converted from full urls to file names
- only notebooks that generate data files are included
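The three differences above can be sketched in a few lines of Python. This is an illustration only, not the code used by update_crate.py: the code repo URL is made up, and it assumes that notebooks are identified by an .ipynb extension, datasets by a .csv extension, and that a notebook lists the datasets it creates under a "result" property.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical URL prefix for files in the code repo (an assumption for this sketch).
CODE_REPO_URL = "https://github.com/example/trove-books/blob/main/"


def is_url(value):
    """Return True if the value looks like an absolute http(s) URL."""
    return urlparse(value).scheme in ("http", "https")


def file_name(value):
    """Reduce a full URL to its final path segment; leave other values alone."""
    if is_url(value):
        return PurePosixPath(urlparse(value).path).name
    return value


def to_data_crate(graph, code_repo_url=CODE_REPO_URL):
    """Convert a list of code-repo crate entities to their data-repo form."""
    converted = []
    for entity in graph:
        entity = dict(entity)  # shallow copy so the input is not modified
        entity_id = entity["@id"]
        if entity_id.endswith(".ipynb"):
            # Assumption: notebooks list the datasets they create under "result".
            results = entity.get("result", [])
            if not results:
                continue  # only notebooks that generate data files are kept
            if not is_url(entity_id):
                # Notebook ids become full URLs pointing to the code repo.
                entity["@id"] = code_repo_url + entity_id
            entity["result"] = [{"@id": file_name(r["@id"])} for r in results]
        elif is_url(entity_id) and entity_id.endswith(".csv"):
            # Dataset ids are converted from full URLs to file names.
            entity["@id"] = file_name(entity_id)
        converted.append(entity)
    return converted
```

Copying each entity before changing it leaves the code repo's crate untouched, so both versions can be written out side by side.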
The update_data_crate.py script in this template repo updates file details and versions in the ro-crate-metadata.json file, and also generates and links a Frictionless Table Data schema file.
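For readers unfamiliar with the format, a Table Schema descriptor is a JSON object with a "fields" list, where each field has at least a "name" and usually a "type". The sketch below builds a minimal descriptor from CSV text using only the standard library. It is not how update_data_crate.py works (that script presumably relies on the Frictionless tooling); the function names and the simplified type guessing are assumptions made for illustration.

```python
import csv
import io


def guess_type(values):
    """Guess a Table Schema type from a column's non-empty string values."""
    values = [v for v in values if v != ""]
    if not values:
        return "string"
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        pass
    try:
        for value in values:
            float(value)
        return "number"
    except ValueError:
        return "string"


def sketch_schema(csv_text):
    """Build a minimal Table Schema descriptor from the text of a CSV file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = []
    for name in reader.fieldnames or []:
        # Short rows yield None for missing cells, so treat those as empty.
        column = [row[name] or "" for row in rows]
        fields.append({"name": name, "type": guess_type(column)})
    return {"fields": fields}
```

The resulting dictionary can be written out with json.dump and then extended by hand with titles and descriptions.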
The basic steps to creating a new data repository are:
- use this template to create a new GitHub repository
- clone the data repo locally and cd into the directory
- create a new virtualenv: pyenv virtualenv [new code repo]
- activate it: pyenv local [new code repo]
- install pip-tools: pip install pip-tools
- install the requirements: pip-sync requirements.txt
- copy/commit the datasets to the new data repo
- make sure there are links in the notebook metadata to the generated datasets (use full GH urls)
- copy the url of the new code repository
- in the code repository, run the update_crate.py script with the --data-repo parameter set to the url of the new data repo
- copy the ro-crate-metadata.json file created in the data-rocrate directory to the new data repository
- in the data repository, run the scripts/update_data_crate.py script to generate and link Frictionless Table Data schemas for each dataset
- edit the schema file to add names and descriptions to the column values
- generate a new README file from the ro-crate-metadata.json file by running the generate_readme.py script (will overwrite this page)
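The schema editing step is normally done by hand, but when a dataset has many columns it may be easier to apply the names and descriptions in bulk. The following is a minimal sketch, assuming the schema is a standard Table Schema descriptor in which each field can carry a "title" and a "description"; the annotate_schema helper and the example column notes are hypothetical and not part of this repository's scripts.

```python
import json


def annotate_schema(schema, notes):
    """Return a copy of a Table Schema with titles and descriptions added.

    `notes` maps a column name to a (title, description) pair.
    Fields without an entry in `notes` are copied unchanged.
    """
    annotated = dict(schema)
    annotated["fields"] = []
    for field in schema.get("fields", []):
        field = dict(field)  # copy so the original schema is not modified
        if field["name"] in notes:
            title, description = notes[field["name"]]
            field["title"] = title
            field["description"] = description
        annotated["fields"].append(field)
    return annotated


# Example with made-up columns; real schemas come from the generated schema file.
example_schema = {"fields": [{"name": "title", "type": "string"},
                             {"name": "year", "type": "integer"}]}
example_notes = {"year": ("Year", "Year of publication")}
print(json.dumps(annotate_schema(example_schema, example_notes), indent=2))
```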
To create a new version:
- use the --version option with update_data_crate.py to set a version number, eg: v1.0
To update existing datasets:
- copy the new versions into this repository and run update_data_crate.py to update dates and stats
- the script checks the updated datasets against the existing schemas; if columns have been added or removed, this validation will fail. In that case, delete the schema file and run the script again
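To see in advance whether an updated dataset will fail validation, the CSV header can be compared with the field names in the existing schema. The sketch below is a standalone illustration using the standard library, not the validation performed by update_data_crate.py; the compare_columns helper is hypothetical.

```python
import csv
import io


def compare_columns(csv_text, schema):
    """Return (added, removed) column names relative to a Table Schema."""
    # The first row of the CSV is taken to be the header.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    expected = [field["name"] for field in schema["fields"]]
    added = [name for name in header if name not in expected]
    removed = [name for name in expected if name not in header]
    return added, removed
```

If either list is non-empty, the existing schema no longer matches the dataset and would need to be deleted and regenerated as described above.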
To add new datasets:
- follow the steps for creating a new data repository (above) to generate a new ro-crate-metadata.json file