wlandau/user-conf-2022

Data version control for reproducible analysis pipelines in R

In computationally demanding data analysis pipelines, the targets R package maintains an up-to-date set of results while skipping tasks that do not need to rerun. This saves time and makes the end product more reproducible. However, each run also overwrites old output with new output, so past results disappear by default. To preserve historical output, two major enhancements have arrived in the targets ecosystem.

The first enhancement is version-aware cloud storage. If you opt into Amazon-backed storage formats and supply an Amazon S3 bucket with versioning turned on, then the pipeline metadata automatically records the version ID of each target. If the metadata file is tracked in the same version control repository as the pipeline's source code, the user can roll back to a previous code commit and automatically recover the old data, all without invalidating any targets or cueing the pipeline to rerun.

The second enhancement is gittargets, an alternative, cloud-agnostic data version control system. The gittargets package captures version-controlled snapshots of the local data store, and each snapshot points to the underlying commit of the source code. When the user rolls back the code to a previous branch or commit, gittargets recovers the data contemporaneous with that commit so that all targets remain up to date. With cloud versioning and gittargets, the targets package now combines the virtues of both Airflow-like and Make-like tools.
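As a rough illustration of the cloud versioning setup described above, a `_targets.R` file might look like the sketch below. It is a minimal example under stated assumptions, not the pipeline from the talk: the bucket name, prefix, and target names are hypothetical, the bucket is assumed to already exist with versioning enabled, and AWS credentials are assumed to be configured in the environment. It uses the `repository = "aws"` option available in recent versions of targets.

```r
# _targets.R: a minimal sketch, not the author's actual pipeline.
library(targets)

tar_option_set(
  # Store target output in Amazon S3 instead of the local data store.
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      # Hypothetical bucket name. Versioning must be turned on for this
      # bucket so that the metadata can record a version ID per target.
      bucket = "my-versioned-bucket",
      # Hypothetical key prefix for the objects in the bucket.
      prefix = "user-conf-2022-demo"
    )
  )
)

# Hypothetical targets, for illustration only.
list(
  tar_target(raw_data, data.frame(x = seq_len(10), y = rnorm(10))),
  tar_target(model, lm(y ~ x, data = raw_data))
)
```

After `tar_make()` runs, committing the metadata file (`_targets/meta/meta` in the default data store) alongside the code is what ties each code commit to the recorded object versions. For the gittargets route, the usual sequence is `tar_git_init()` once to create the data repository, `tar_git_snapshot()` after each code commit whose results should be kept, and `tar_git_checkout()` after switching the code to an older commit or branch to restore the matching data snapshot.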