GLAM-Workbench/data-repo-template

GLAM Workbench Data Repository Template

This template is for repositories containing datasets generated by GLAM Workbench (GW) notebooks. Each GW data repository is associated with a GW code repository (e.g. trove-books-data contains datasets created by trove-books), and each individual dataset should be linked to the notebook that created it.

Relationships between datasets, notebooks, and documentation are captured in an ro-crate-metadata.json file, following the RO-Crate standard. This repository contains some useful scripts for updating and managing the RO-Crate metadata and associated Frictionless table schemas describing datasets.

Use the update_crate.py script in the corresponding code repo to generate the initial ro-crate-metadata.json file for this repository.

The RO-Crate file for a data repo differs from the version created for a code repo. In particular:

  • ids for notebooks are converted to full urls pointing to the code repo
  • ids for datasets are converted from full urls to file names
  • only notebooks that generate data files are included
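
The three differences above can be sketched in Python. This is an illustration only, not the code in update_crate.py: the `result` property used to link a notebook to its datasets, the `/blob/master/` URL layout, and the function names `to_data_crate` and `file_name` are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_name(url):
    """Return the last path segment of a URL, e.g. 'books.csv'."""
    return PurePosixPath(urlparse(url).path).name


def to_data_crate(graph, code_repo_url):
    """Convert a list of code-repo crate entities to the data-repo form.

    Assumes (hypothetically) that notebooks have ids ending in '.ipynb'
    and list the datasets they create under a 'result' property.
    """
    base = code_repo_url.rstrip("/") + "/blob/master/"  # assumed URL layout
    # Collect the full URLs of every dataset created by a notebook.
    dataset_urls = {
        r["@id"]
        for e in graph
        if e["@id"].endswith(".ipynb")
        for r in e.get("result", [])
    }
    converted = []
    for entity in graph:
        entity_id = entity["@id"]
        new_entity = dict(entity)
        if entity_id.endswith(".ipynb"):
            results = entity.get("result", [])
            if not results:
                continue  # only notebooks that generate data files are kept
            # Notebook ids become full URLs pointing to the code repo.
            new_entity["@id"] = base + entity_id
            new_entity["result"] = [{"@id": file_name(r["@id"])} for r in results]
        elif entity_id in dataset_urls:
            # Dataset ids become plain file names.
            new_entity["@id"] = file_name(entity_id)
        converted.append(new_entity)
    return converted
```

Given a notebook `a.ipynb` that links to `https://github.com/example/data/blob/main/books.csv` and a notebook `b.ipynb` with no datasets, the result would keep only `a.ipynb` (with a full URL as its id) and the dataset (with `books.csv` as its id).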

The update_data_crate.py script in this template repo updates file details and versions in the ro-crate-metadata.json file, and also generates and links a Frictionless Table Data schema file.

The basic steps for creating a new data repository are:

  • use this template to create a new GitHub repository
  • clone the data repo locally, cd into the directory
  • create a new virtualenv – pyenv virtualenv [new code repo]
  • activate – pyenv local [new code repo]
  • pip install pip-tools
  • pip-sync requirements.txt
  • copy/commit the datasets to the new data repo
  • make sure the notebook metadata includes links to the generated datasets (use full GitHub URLs)
  • copy the URL of the new data repository
  • in the code repository, run the update_crate.py script with the --data-repo parameter set to the url of the new data repo
  • copy the ro-crate-metadata.json file created in the data-rocrate directory to the new data repository
  • in the data repository, run the scripts/update_data_crate.py script to generate and link Frictionless Table Data schemas for each dataset
  • edit the schema file to add names and descriptions for each column
  • generate a new README file from the ro-crate-metadata.json file by running the generate_readme.py script (will overwrite this page)
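
The schema-editing step above can be done by hand, but a small script may help when there are many columns. The following is a minimal sketch, assuming the schema file is a JSON Table Schema with a top-level `fields` list and that "names" refers to the human-readable `title` property of each field. The function name `describe_fields` and the example column are hypothetical.

```python
import json
from pathlib import Path


def describe_fields(schema_path, details):
    """Add properties such as title and description to matching fields.

    `details` maps a column name to the properties to add, e.g.
    {"year": {"title": "Year", "description": "Publication year"}}.
    """
    path = Path(schema_path)
    schema = json.loads(path.read_text(encoding="utf-8"))
    for field in schema.get("fields", []):
        extra = details.get(field["name"])
        if extra:
            field.update(extra)  # columns not listed in `details` are untouched
    path.write_text(json.dumps(schema, indent=2) + "\n", encoding="utf-8")
    return schema
```

The file is rewritten in place, so it is worth committing the generated schema before running anything like this.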

To create a new version:

  • use the --version option with update_data_crate.py to set a version number, e.g. v1.0

To update existing datasets:

  • copy the new versions into this repository and run update_data_crate.py to update dates and stats
  • the script checks the updated datasets against the existing schemas; if columns have been added or removed, validation will fail. In that case, delete the schema file and run the script again
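
To see which columns changed before deleting a schema, something like the following could be used. It is a rough sketch using only the standard library, not the validation performed by update_data_crate.py. It assumes the dataset is a CSV file with a header row and that the schema is a JSON file with a top-level `fields` list; `column_changes` is a hypothetical name.

```python
import csv
import json


def column_changes(csv_path, schema_path):
    """Return (added, removed) column names relative to the schema."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # first row holds the column names
    with open(schema_path, encoding="utf-8") as f:
        expected = [field["name"] for field in json.load(f)["fields"]]
    added = [c for c in header if c not in expected]
    removed = [c for c in expected if c not in header]
    return added, removed
```

If both lists are empty, a validation failure is more likely caused by changed values or types than by changed columns.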

To add new datasets:

  • follow the steps for creating a new data repository above to generate a new ro-crate-metadata.json file
