GLAM Workbench Data Repository Template

This template is intended for repositories containing datasets generated by GLAM Workbench (GW) notebooks. GW data repositories are associated with GW code repositories (e.g. trove-books-data contains datasets created by trove-books), and each individual dataset should be linked to the notebook that created it.

Relationships between datasets, notebooks, and documentation are captured in an ro-crate-metadata.json file, following the RO-Crate standard. This repository contains some useful scripts for updating and managing the RO-Crate metadata and associated Frictionless table schemas describing datasets.

Use the update_crate.py script in the corresponding code repo to generate the initial ro-crate-metadata.json file for this repository.

The RO-Crate file for a data repo differs from the version created for a code repo. In particular:

IDs for notebooks are converted to full URLs pointing to the code repo
IDs for datasets are converted from full URLs to file names
only notebooks that generate data files are included
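
The three differences above can be sketched in a few lines of Python. This is only an illustration of the transformation, not the code used by update_crate.py. It assumes that notebooks can be recognised by an `.ipynb` id, that each notebook lists the datasets it creates under a hypothetical `result` property, and that files live on a `main` branch. The function name and the example URLs are made up.

```python
def to_data_crate_entities(entities, code_repo_url, data_repo_url):
    """Convert code-repo crate entities into their data-repo form (illustrative)."""
    # Assumption: files are addressed on the "main" branch of the code repo.
    code_prefix = code_repo_url.rstrip("/") + "/blob/main/"
    data_prefix = data_repo_url.rstrip("/") + "/"
    converted = []
    for entity in entities:
        entity = dict(entity)  # shallow copy so the input is not modified
        if entity["@id"].endswith(".ipynb"):
            # "result" is a hypothetical property linking a notebook to its datasets.
            data_files = [
                ref for ref in entity.get("result", [])
                if ref["@id"].startswith(data_prefix)
            ]
            if not data_files:
                continue  # only notebooks that generate data files are kept
            entity["@id"] = code_prefix + entity["@id"]
            entity["result"] = [
                {"@id": ref["@id"].rsplit("/", 1)[-1]} for ref in data_files
            ]
        elif entity["@id"].startswith(data_prefix):
            # Dataset ids go from full URLs to bare file names.
            entity["@id"] = entity["@id"].rsplit("/", 1)[-1]
        converted.append(entity)
    return converted


entities = [
    {"@id": "harvest.ipynb",
     "result": [{"@id": "https://github.com/example/demo-data/blob/main/books.csv"}]},
    {"@id": "explore.ipynb"},
    {"@id": "https://github.com/example/demo-data/blob/main/books.csv"},
]
converted = to_data_crate_entities(
    entities,
    "https://github.com/example/demo",
    "https://github.com/example/demo-data",
)
```

In this example `explore.ipynb` is dropped because it produces no data files, the remaining notebook gets a full URL, and the dataset is reduced to `books.csv`.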

The update_data_crate.py script in this template repo updates file details and versions in the ro-crate-metadata.json file, and also generates and links a Frictionless Table Data schema file.

The basic steps to creating a new data repository are:

use this template to create a new GitHub repository
clone the data repo locally, cd into the directory
create a new virtualenv – pyenv virtualenv [new code repo]
activate – pyenv local [new code repo]
pip install pip-tools
pip-sync requirements.txt
copy/commit the datasets to the new data repo
make sure the notebook metadata includes links to the generated datasets (use full GitHub URLs)
copy the URL of the new data repository
in the code repository, run the update_crate.py script with the --data-repo parameter set to the URL of the new data repo
copy the ro-crate-metadata.json file created in the data-rocrate directory to the new data repository
in the data repository, run the scripts/update_data_crate.py script to generate and link Frictionless Table Data schemas for each dataset
edit the schema file to add names and descriptions to the column values
generate a new README file from the ro-crate-metadata.json file by running the generate_readme.py script (will overwrite this page)
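
The script-running steps above could be collected into a small Python helper, as sketched below. Only the script names and the --data-repo parameter come from the steps. The location of update_crate.py within the code repo, the location of generate_readme.py under scripts/, and the function names and example URL are assumptions for illustration.

```python
import subprocess


def new_data_repo_commands(data_repo_url):
    """Return the commands to run in each repository, in the order of the steps."""
    return {
        # Assumption: update_crate.py is run from the root of the code repo.
        "code_repo": [
            ["python", "update_crate.py", "--data-repo", data_repo_url],
        ],
        # Assumption: generate_readme.py sits alongside update_data_crate.py.
        "data_repo": [
            ["python", "scripts/update_data_crate.py"],
            ["python", "scripts/generate_readme.py"],
        ],
    }


def run_commands(commands, cwd):
    """Run each command in the given directory, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


commands = new_data_repo_commands("https://github.com/example/demo-data")
```

The manual steps still sit between these commands: the ro-crate-metadata.json file has to be copied across after the code repo command, and the schema file should be edited before the README is regenerated.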

To create a new version:

use the --version option with update_data_crate.py to set a version number, e.g. v1.0
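
As a rough picture of what a version number in the crate might look like, the sketch below sets a `version` property on the root dataset of an RO-Crate. Finding the root dataset through the `about` property of the `ro-crate-metadata.json` descriptor follows the RO-Crate specification. Whether update_data_crate.py records the version in exactly this property is an assumption, and the function name is hypothetical.

```python
def set_crate_version(crate, version):
    """Set a version on the crate's root dataset (illustrative only)."""
    graph = crate["@graph"]
    # The metadata descriptor points at the root dataset via "about".
    descriptor = next(e for e in graph if e["@id"] == "ro-crate-metadata.json")
    root_id = descriptor["about"]["@id"]
    root = next(e for e in graph if e["@id"] == root_id)
    # Assumption: the version is stored in the schema.org "version" property.
    root["version"] = version
    return crate


crate = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "name": "Example data"},
    ]
}
set_crate_version(crate, "v1.0")
```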

To update existing datasets:

copy the new versions into this repository and run update_data_crate.py to update dates and stats
the script checks the updated datasets against the existing schemas; if columns have been added or removed, this validation will fail, in which case delete the schema file and run the script again
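
To see ahead of time whether the columns have changed, a CSV header can be compared with the field names in a Frictionless table schema. The sketch below uses only the standard library and is not the validation performed by update_data_crate.py. The function name and sample data are made up.

```python
import csv
import io


def column_changes(csv_text, schema):
    """Return (added, removed) column names relative to a table schema."""
    header = next(csv.reader(io.StringIO(csv_text)))
    # A Frictionless table schema lists its columns under "fields".
    expected = [field["name"] for field in schema["fields"]]
    added = [name for name in header if name not in expected]
    removed = [name for name in expected if name not in header]
    return added, removed


schema = {"fields": [{"name": "title", "type": "string"},
                     {"name": "year", "type": "integer"}]}
csv_text = "title,year,publisher\nExample book,1900,Example Press\n"
added, removed = column_changes(csv_text, schema)
```

Here `added` contains `publisher` and `removed` is empty, so the existing schema file would need to be deleted and regenerated.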

To add new datasets:

follow the steps above for creating a new data repository to generate a new ro-crate-metadata.json file
