From 5b7aa1c1b2766cd99874e10e5773314bcc904a3f Mon Sep 17 00:00:00 2001 From: gfinak Date: Tue, 3 Jul 2018 10:25:39 -0700 Subject: [PATCH] Update README #24 #21 --- .Rbuildignore | 3 +- R/processData.R | 1 + README.md | 623 +++++++++++------------------ bibliography.bib | 9 + inst/extdata/tests/subsetCars.Rmd | 2 +- inst/extdata/tests/subsetCars.html | 384 ++++++++++++++++++ 6 files changed, 629 insertions(+), 393 deletions(-) create mode 100644 bibliography.bib create mode 100644 inst/extdata/tests/subsetCars.html diff --git a/.Rbuildignore b/.Rbuildignore index 6a93f53..dd48083 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -8,4 +8,5 @@ README.html ^\.travis\.yml$ ^CODE_OF_CONDUCT\.md$ ^appveyor\.yml$ -NEWS.md \ No newline at end of file +NEWS.md +bibliography.bib \ No newline at end of file diff --git a/R/processData.R b/R/processData.R index 2307e7a..3728661 100644 --- a/R/processData.R +++ b/R/processData.R @@ -482,6 +482,7 @@ DataPackageR <- function(arg = NULL) { pkg$set_dep("knitr", "Suggests") pkg$set_dep("rmarkdown", "Suggests") pkg$set("VignetteBuilder" = "knitr") + pkg$write() usethis::use_directory("vignettes") usethis::use_directory("inst/doc") # TODO maybe copy only the files that have both html and Rmd. diff --git a/README.md b/README.md index c3a7c3b..a92a0eb 100644 --- a/README.md +++ b/README.md @@ -1,414 +1,255 @@ + + # DataPackageR -A package to reproducibly process raw data into packaged, analysis-ready data sets. +DataPackageR is used to reproducibly process raw data into packaged, +analysis-ready data sets. 
- [![Build Status](https://travis-ci.org/RGLab/DataPackageR.svg?branch=master)](https://travis-ci.org/RGLab/DataPackageR) - [![Coverage status](https://codecov.io/gh/RGLab/DataPackageR/branch/master/graph/badge.svg)](https://codecov.io/github/RGLab/DataPackageR?branch=master) - [![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/RGLab/DataPackageR?branch=master&svg=true)](https://ci.appveyor.com/project/RGLab/DataPackageR) +[![Build +Status](https://travis-ci.org/RGLab/DataPackageR.svg?branch=master)](https://travis-ci.org/RGLab/DataPackageR) +[![Coverage +status](https://codecov.io/gh/RGLab/DataPackageR/branch/master/graph/badge.svg)](https://codecov.io/github/RGLab/DataPackageR?branch=master) +[![AppVeyor build +status](https://ci.appveyor.com/api/projects/status/github/RGLab/DataPackageR?branch=master&svg=true)](https://ci.appveyor.com/project/RGLab/DataPackageR) [![DOI](https://zenodo.org/badge/29267435.svg)](https://doi.org/10.5281/zenodo.1292095) -## Code of conduct - -Please note that this project is released with a [Contributor Code of Conduct](CODE_OF_CONDUCT.md). - By participating in this project you agree to abide by its terms. - -## Preprint and publication. - -The publication describing the package is now available at [Gates Open Research](https://gatesopenresearch.org/articles/2-31/v1). - -The preprint is on [biorxiv](https://doi.org/10.1101/342907). - - -## Goals - -You have raw data that needs to be tidied and otherwise processed into a standardized analytic data set (a data set that is ready for analysis). -You want to do the processing using best practices for reproducible research. - -### The current state of affairs - -Normally, you'll write some code that does the tidying and outputs a tidy data set. -If you want to distribute your data set, you can put it in an R package. 
-The preferred mechanism is to place your data tidying code in `data-raw` in the package source tree and use the `devtools` package (specifically `devtools::use_data`) to save the data into the `data` directory. The build process will include your data set in the final package. -You'll also have to remember to document the data set in `roxygen`, and write a vignette showing how to use the data. -For version control and easy distribution you might post the package on github. - -### Scaling up - -The process outlined works well for smaller data sets. -It can be a hassle if you have complex data that change frequently (as is often the case in biology, where data trickle in from collaborators and follow-up experiments), or more generally if you have large data sets where raw data can't be distributed as part of the package source due to size restrictions (e.g. FASTQ files for sequencing, FCS files for flow cytometry, or other "omics" data). - -### DataPackageR - -The `DataPackageR` package simplifies bundling of code, data and documentation into a single R package that can be versioned and distributed. -The `datapackage.skeleton()` API lets you point `DataPackageR` at your data processing code (in the form of Rmd and / or R files). These are expected to produce `data objects` to be stored in the final package. The names of these are also passed to `datapackage.skeleton()`. This produces the necessary package structure, and populations a `datapackager.yml` configuration file used by the build process. - -The `package_build()` API runs the processing code specified in the `.yml` files and produces html reports of the processing as **package vignettes**. It also builds boilerplate `roxygen` documentation of the R objects specified in the `.yml`, computes checksums of stored R objects and version tags the entire data set collection. 
- -If raw data changes, the user can rebuild the data sets in the R package with subsequent calls to `package_build()` which will re-run the processing, compare the cheksums of new R objects against those currently stored in the package. -Any changes force an increment of the `DataVersion` string in the package DESCRIPTION file. -When the package is installed, data sets can be accessed via the standard `data()` API, package vignettes describing the data processing can be accessed via `vignette()`, documentation via `?`, and the data version via `data_version(packageName)`. - - -# Installation - -The usual package installation mechanism works: - -``` +## What’s the problem? + +You have diverse raw data sets that you need to preprocess and tidy in +order to: + + - Perform data analysis + - Write a report + - Publish a paper + - Share data with colleagues and collaborators + - Save time in the future when you return to this project but have + forgotten all about what you did. + +### Why package data sets? + + - **Reproducibility.** + + As described [elsewhere](https://github.com/ropensci/rrrpkg), + packaging your data promotes reproducibility. R’s packaging + infrastructure promotes unit testing, documentation, a reproducible + build system, and has many other benefits. Coopting it for packaging + data sets is a natural fit. + + - **Collaboration.** + + A data set packaged in R is easy to distribute and share amongst + collaborators, and is easy to install and use. All the hard work + you’ve put into documenting and standardizing the tidy data set + comes right along with the data package. + + - **Documentation.** + + R’s package system allows us to document data objects. What’s more, + the `roxygen2` package makes this very easy to do with [markup + tags](http://r-pkgs.had.co.nz/data.html). That documentation is the + equivalent of a data dictionary and can be extremely valuable when + returning to a project after a period of time. 
+ + - **Convenience.** + + Data pre-processing can be time consuming, depending on the data + type, and raw data sets may be too large to share conveniently in a + packaged format. Packaging and sharing the small, tidied data saves + the users computing time and time spent waiting for downloads. + +## Challenges. + + - **Package size limits.** + + R packages have a 5MB size limit, at least on CRAN. Bioconductor has + explicit [data + package](https://www.bioconductor.org/developers/package-guidelines/#package-types) + types that can be larger and use git LFS for very large files. + + Sharing large volumes of raw data in an R package format is still + not ideal, and there are public biological data repositories better + suited for raw data: e.g., [GEO](https://www.ncbi.nlm.nih.gov/geo/), + [SRA](https://www.ncbi.nlm.nih.gov/sra), + [ImmPort](http://www.immport.org/immport-open/public/home/home), + [ImmuneSpace](https://immunespace.org/), + [FlowRepository](https://flowrepository.org/). + + Tools like [datastorr](https://github.com/ropenscilabs/datastorr) + can help with this and we hope to integrate them into DataPackageR in + the future. + + - **Manual effort** + + There is still a substantial manual effort to set up the correct + directory structures for an R data package. This can dissuade many + individuals, particularly new users who have never built an R + package, from going this route. + +## DataPackageR + +DataPackageR provides a number of benefits when packaging your data. + + - It aims to automate away much of the tedium of packaging data sets + without getting too much in the way, and keeps your processing + workflow reproducible. + + - It sets up the necessary package structure and files for a data + package. + + - It allows you to keep the large, raw data and only ship the packaged + tidy data, saving space and time consumers of your data set need to + spend downloading and re-processing it. 
+ + - It maintains a reproducible record of the data processing along with + the package. Consumers of the data package can verify how the + processing was done, increasing confidence in your data. + + - It automates construction of the documentation and maintains a data + set version and fingerprint of each data object in the package. If + the data changes and the package is rebuilt, the data version is + automatically updated. + +## Similar work + +There are a number of tools out there that address similar and +complementary problems. + + - **datastorr** [github + repo](https://github.com/ropenscilabs/datastorr) Simple data + retrieval and versioning using GitHub to store data. + + - Caches downloads and uses github releases to version data. + - Deals consistently with translating the file stored online into a + loaded data object + - Accesses multiple versions of the data at once + + `datastorr` could be used with DataPackageR to store / access + remote raw data sets, or to remotely store / access tidied data that are + too large to fit in the package itself. + + - **fst** [github repo](https://github.com/fstpackage/fst) + + `fst` provides lightning fast serialization of data frames. + + - **The modern data package** + [pdf](https://github.com/noamross/2018-04-18-rstats-nyc/blob/master/Noam_Ross_ModernDataPkg_rstatsnyc_2018-04-20.pdf) + + A presentation from @noamross touching on modern tools for open + science and reproducibility. Discusses `datastorr` and `fst` as well + as standardized metadata and documentation. + + - **rrrpkg** [github repo](https://github.com/ropensci/rrrpkg) + + A document from ropensci describing using an R package as a research + compendium. Based on ideas originally introduced by Robert Gentleman + and Duncan Temple Lang (Gentleman and Lang (2004)). + + - **template** [github repo](https://github.com/ropensci/rrrpkg) + + An R package template for data packages. 
+ +## Installation + +You can install the latest version of DataPackageR from +[github](https://www.github.com/RGLab/DataPackageR) with: + +``` r library(devtools) -devtools::install_github("RGLab/DataPackageR", auth_token=NULL) +devtools::install_github("RGLab/DataPackageR") ``` -# Usage - -Set up a new data package. - -We'll set up a new data package that processes the `cars` data by subsetting it to include only measurements of stopping distances of cars at speeds greater than 20 mph. It is processed using an Rmd file located in `inst/extdata/tests/subsetCars.Rmd` that produces a new object called `cars_over_20`. The package will be called `Test`. The work will be done in the system `/tmp` directory. - - -```r -library(data.tree) +``` r library(DataPackageR) -tmp = normalizePath(tempdir()) -processing_code = system.file("extdata","tests","subsetCars.Rmd",package="DataPackageR") -print(processing_code) -[1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/DataPackageR/extdata/tests/subsetCars.Rmd" -setwd(tmp) -DataPackageR::datapackage.skeleton("Test", - force=TRUE, - code_files = processing_code, - r_object_names = "cars_over_20") # cars_over_20 is an R object -Creating directories ... -Creating DESCRIPTION ... -Creating NAMESPACE ... -Creating Read-and-delete-me ... -Saving functions and data ... -Making help files ... -Done. -Further steps are described in './Test/Read-and-delete-me'. -Adding DataVersion string to DESCRIPTION -Creating data and data-raw directories -configuring yaml file - # created in the Rmd file. -``` -### Package skeleton structure - -This has created a directory, "Test" with the skeleton of a data package. - -The `DESCRIPTION` file should be filled out to describe your package. It contains a new `DataVersion` string, and the -revision is automatically incremented if the packaged data changes. - -`Read-and-delete-me` has some helpful instructions on how to proceed. 
- -The `data-raw` directory is where the data cleaning code (`Rmd`) files reside. -The contents of this directory are: - - -``` - levelName -1 Test -2 ¦--datapackager.yml -3 ¦--Rprofile-devtools -4 ¦--rs-graphics-320f9b2b-0bab-4f25-ba9e-66eb667018e7 -5 ¦ ¦--empty.png -6 ¦ °--INDEX -7 ¦--Test_1.0.tar.gz -8 °--Test -9 ¦--data-raw -10 ¦ ¦--documentation.R -11 ¦ ¦--subsetCars.knit.md -12 ¦ ¦--subsetCars.Rmd -13 ¦ °--subsetCars.utf8.md -14 ¦--DATADIGEST -15 ¦--datapackager.yml -16 ¦--DESCRIPTION -17 ¦--inst -18 ¦ ¦--doc -19 ¦ ¦ ¦--subsetCars.html -20 ¦ ¦ °--subsetCars.Rmd -21 ¦ °--extdata -22 ¦ °--Logfiles -23 ¦ ¦--processing.log -24 ¦ °--subsetCars.html -25 ¦--Read-and-delete-me -26 °--vignettes -27 °--subsetCars.Rmd -``` - -`datapackager.yml` can be edited as necessary to include additional processing scripts (which should be placed in `data-raw`), and raw data should be located under under `/inst/extdata`. It should be copied into that path and the data munging scripts edited to read from there. - -### Yaml configuration - -Here are the contents of `datapackager.yml`: - - -``` -configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '185367' +# Let's reproducibly package up +# the cars in the mtcars dataset +# with speed > 20. +# Our dataset will be called cars_over_20. + +# Get the code file that turns the raw data +# to our packaged and processed analysis-ready dataset. +processing_code <- system.file( + "extdata", "tests", "subsetCars.Rmd", package = "DataPackageR" +) + +# Create the package framework. +DataPackageR::datapackage_skeleton( + "mtcars20", force = TRUE, code_files = processing_code, r_object_names = "cars_over_20", path = tempdir()) +#> Creating directories ... +#> Creating DESCRIPTION ... +#> Creating NAMESPACE ... +#> Creating Read-and-delete-me ... +#> Saving functions and data ... +#> Making help files ... +#> Done. 
+#> Further steps are described in '/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpjSiNpS/mtcars20/Read-and-delete-me'. +#> Adding DataVersion string to DESCRIPTION +#> Creating data and data-raw directories +#> configuring yaml file + +# Run the preprocessing code to build cars_over_20 +# and reproducibly enclose it in a package. +DataPackageR:::package_build(file.path(tempdir(),"mtcars20")) +#> +#> +#> processing file: subsetCars.Rmd +#> output file: subsetCars.knit.md +#> +#> Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpjSiNpS/mtcars20/inst/extdata/Logfiles/subsetCars.html +#> First time using roxygen2. Upgrading automatically... +#> Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpjSiNpS/mtcars20/DESCRIPTION +#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \ +#> --no-environ --no-save --no-restore --quiet CMD build \ +#> '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpjSiNpS/mtcars20' \ +#> --no-resave-data --no-manual --no-build-vignettes +#> + +# Let's use the package we just created. +install.packages(file.path(tempdir(),"mtcars20_1.0.tar.gz"), type = "source", repos = NULL) +library(mtcars20) +data("cars_over_20") # load the data +cars_over_20 # Now we can use it. +?cars_over_20 # See the documentation you wrote in data-raw/documentation.R. + +# We have our dataset! +# Since we preprocessed it, +# it is clean and under the 5 MB limit for data in packages. +cars_over_20 + +# We can easily check the version of the data +DataPackageR::data_version("mtcars20") + +# You can use an assert to check the data version in reports and +# analyses that use the packaged data. +assert_data_version(data_package_name = "mtcars20", + version_string = "0.1.0", + acceptable = "equal") ``` -It includes a `files` property that has an entry for each script, with the `name:` and `enabled:` keys for each file. 
The `objects` property lists the data objects produced by the scripts. - -The `render_root` property specifies the directory where the Rmd files are rendered. If temporary objects are produced during the processing, they will appear in this directory without polluting the package source tree. A temporary directory is used by default. - -### Build your package. - -Once your scripts are in place and the data objects are documented, you build the package. - -To run the build process: - - -```r -# Within the package directory -setwd(tmp) -DataPackageR:::package_build("Test") -INFO [2018-06-26 07:53:13] Logging to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test/inst/extdata/Logfiles/processing.log -INFO [2018-06-26 07:53:13] Processing data -INFO [2018-06-26 07:53:13] Reading yaml configuration -INFO [2018-06-26 07:53:13] Found /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test/data-raw/subsetCars.Rmd -INFO [2018-06-26 07:53:13] Processing 1 of 1: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test/data-raw/subsetCars.Rmd - - -processing file: subsetCars.Rmd - - | - | | 0% - | - |......... | 14% - ordinary text without R code - - - | - |................... | 29% -label: setup (with options) -List of 1 - $ include: logi FALSE - - - | - |............................ | 43% - ordinary text without R code - - - | - |..................................... | 57% -label: cars - - | - |.............................................. | 71% - ordinary text without R code - - - | - |........................................................ 
| 86% -label: unnamed-chunk-10 - - | - |.................................................................| 100% - ordinary text without R code -output file: subsetCars.knit.md -/usr/local/bin/pandoc +RTS -K512m -RTS subsetCars.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test/inst/extdata/Logfiles/subsetCars.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpfaLm8c/rmarkdown-str1032a54936bd4.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' - -Output created: inst/extdata/Logfiles/subsetCars.html -INFO [2018-06-26 07:53:13] 1 required data objects created by subsetCars.Rmd -INFO [2018-06-26 07:53:13] Processed data sets match existing data sets at version 0.1.0 -INFO [2018-06-26 07:53:13] Saving to data -INFO [2018-06-26 07:53:13] Copied documentation to R/Test.R -* Adding `inst/doc` to ./.gitignore -INFO [2018-06-26 07:53:13] Done -INFO [2018-06-26 07:53:13] Building documentation -First time using roxygen2. Upgrading automatically... 
-Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test/DESCRIPTION -Writing NAMESPACE -Writing Test.Rd -Writing cars_over_20.Rd -INFO [2018-06-26 07:53:13] Building package -'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \ - --no-environ --no-save --no-restore --quiet CMD build \ - '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test' \ - --no-resave-data --no-manual --no-build-vignettes - -[1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpfaLm8c/Test_1.0.tar.gz" -``` - -### Logging the build process - -DataPackageR uses the `futile.logger` pagckage to log progress. If there are errors in the processing, the script will notify you via logging to console and to `/private/tmp/Test/inst/extdata/Logfiles/processing.log`. Errors should be corrected and the build repeated. - -If everything goes smoothly, you will have a new package built in the parent directory. In this case we have a new package -`Test_1.0.tar.gz`. When the package is installed, it will contain a vignette `subsetCars` that can be loaded using the `vignette()` API. The vignette will detail the processing performed by the `subsetCars.Rmd` processing script. - -### The package source directory after building - - -``` - levelName -1 Test -2 ¦--data-raw -3 ¦ ¦--documentation.R -4 ¦ ¦--subsetCars.knit.md -5 ¦ ¦--subsetCars.Rmd -6 ¦ °--subsetCars.utf8.md -7 ¦--data -8 ¦ °--cars_over_20.rda -9 ¦--DATADIGEST -10 ¦--datapackager.yml -11 ¦--DESCRIPTION -12 ¦--inst -13 ¦ ¦--doc -14 ¦ ¦ ¦--subsetCars.html -15 ¦ ¦ °--subsetCars.Rmd -16 ¦ °--extdata -17 ¦ °--Logfiles -18 ¦ ¦--processing.log -19 ¦ °--subsetCars.html -20 ¦--man -21 ¦ ¦--cars_over_20.Rd -22 ¦ °--Test.Rd -23 ¦--NAMESPACE -24 ¦--R -25 ¦ °--Test.R -26 ¦--Read-and-delete-me -27 °--vignettes -28 °--subsetCars.Rmd -``` - -#### Details - -A number of things have changed. 
The subsetCars processing script now appears under `/vignettes` and `inst/doc` as a processed html report so that it will be available to view via `vignette()` once the package is installed. -`inst/extdata/Logfiles` contains a log file of the entire build process as well as intermediate files created while parsing the R / Rmd code. Documentation Rd files appear in `/man`, these should be edite to provide further details on the data objects in the package. The data objects are stored under `/data` where we see `cars_over_20.rda`, the object we initially specified in `datapackager.yml`. - - -## Versioning data objects - -The DataPackageR package calculates an md5 checksum of each data object it stores, and keeps track of them in a file -called `DATADIGEST`. - -- Each time the package is rebuilt, the md5 sums of the new data objects are compared against the DATADIGEST. -- If they don't match, the build process checks that the `DataVersion` string has been incremented in the `DESCRIPTION` file. -- If it has not the build process will exit and produce an error message. - -### DATADIGEST - - -The `DATADIGEST` file contains the following: - - -``` -DataVersion: 0.1.0 -cars_over_20: 3ccb5b0aaa74fe7cfc0d3ca6ab0b5cf3 -``` - - -### DESCRIPTION - -The description file has the new `DataVersion` string. - - -``` -Package: Test -Type: Package -Title: What the package does (short line) -Version: 1.0 -Date: 2018-53-26 -Author: Who wrote it -Maintainer: Who to complain to -Description: More about what it does (maybe more than one line) -License: What license is it under? 
-DataVersion: 0.1.0 -Suggests: knitr, - rmarkdown -VignetteBuilder: knitr -RoxygenNote: 6.0.1 -``` - -### Next steps - -Your downstream data analysis can depend on a specific version of your data package (for example by tesing the `packageVersion()` string); - -```r{} -if(DataPackageR::packageVersion("MyNewStudy") != "1.0.0") - stop("The expected version of MyNewStudy is 1.0.0, but ",packageVersion("MyNewStudy")," is installed! Analysis results may differ!") -``` - -The DataPackageR packge also provides `datasetVersion()` to extract the data set version information. - -You should also place the data package source directory under `git` version control. -This allows you to version control your data processing code. - -### Why not use R CMD build? - -If the processing script is time consuming or the data set is particularly large, then `R CMD build` would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. In such cases, DataPackageR provides a mechanism to decouple data processing from package building/installation for downstream users of the data. - - -## Partial builds and migrating old data packages. - -Version 1.12.0 has moved away from controlling the build process using `datasets.R` and an additional `masterfile` argument. The build process is now controlled via a `datapackager.yml` configuration file located in the package root directory. - -You can migrate an old package by constructing such a config file using the `construct_yml_config()` API. - - -```r -#assume I have file1.Rmd and file2.R located in /data-raw, and these create 'object1' and 'object2' respectively. 
- -config = construct_yml_config(code = c("file1.Rmd","file2.R"), data = c("object1","object2")) -cat(as.yaml(config)) -configuration: - files: - file1.Rmd: - name: file1.Rmd - enabled: yes - file2.R: - name: file2.R - enabled: yes - objects: - - object1 - - object2 - render_root: - tmp: '339581' -``` +## Preprint and publication. -`config` is a newly constructed yaml configuration object. It can be written to the package directory: +The publication describing the package is now available at [Gates Open +Research](https://gatesopenresearch.org/articles/2-31/v1). +The preprint is on [biorxiv](https://doi.org/10.1101/342907). -```r -path_to_package = tempdir() #pretend this is the root of our package -yml_write(config,path = path_to_package) -``` +## Code of conduct -Now the package at `path_to_package` will build with version 1.12.0 or greater. - -We can also perform partial builds of a subset of files in a package by toggling the `enabled` key in the config file. This can be done with the following API: - - -```r -config = yml_disable_compile(config,filenames = "file2.R") -cat(as.yaml(config)) -configuration: - files: - file1.Rmd: - name: file1.Rmd - enabled: yes - file2.R: - name: file2.R - enabled: no - objects: - - object1 - - object2 - render_root: - tmp: '339581' -``` +Please note that this project is released with a [Contributor Code of +Conduct](CODE_OF_CONDUCT.md). By participating in this project you agree +to abide by its terms. -Where `config` is a configuration read from a data package root directory. The `config` object needs to be written back to the package root in order for the changes to take effect. The consequence of toggling a file to `enable: no` is that it will be skipped when the package is built, but the data will be retained, and the documentation will not be altered. +# References +
+
+Gentleman, Robert, and Duncan Temple Lang. 2004. “Statistical Analyses +and Reproducible Research.” *Bioconductor Project Working Papers*, +Bioconductor project working papers. bepress. +
+
diff --git a/bibliography.bib b/bibliography.bib new file mode 100644 index 0000000..9d93763 --- /dev/null +++ b/bibliography.bib @@ -0,0 +1,9 @@ + +@ARTICLE{Gentleman2004-oj, + title = "Statistical Analyses and Reproducible Research", + author = "Gentleman, Robert and Lang, Duncan Temple", + journal = "Bioconductor Project Working Papers", + publisher = "bepress", + series = "Bioconductor Project Working Papers", + year = 2004 +} diff --git a/inst/extdata/tests/subsetCars.Rmd b/inst/extdata/tests/subsetCars.Rmd index 3fe4563..ad72ff5 100644 --- a/inst/extdata/tests/subsetCars.Rmd +++ b/inst/extdata/tests/subsetCars.Rmd @@ -4,7 +4,7 @@ author: "Greg Finak" output: html_document --- -```{r setup, include=FALSE} +```{r include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` diff --git a/inst/extdata/tests/subsetCars.html b/inst/extdata/tests/subsetCars.html new file mode 100644 index 0000000..f84b4d4 --- /dev/null +++ b/inst/extdata/tests/subsetCars.html @@ -0,0 +1,384 @@ + + + + + + + + + + + + + + +A Test Document for DataPackageR + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

This is a simple Rmd file that demonstrates how DataPackageR processes Rmarkdown files and creates data sets that are then stored in an R data package.

+

In the config.yml for this example, this file is listed first, and therefore processed first.

+

This particular document simply subsets the cars data set:

+
summary(cars)
+
##      speed           dist       
+##  Min.   : 4.0   Min.   :  2.00  
+##  1st Qu.:12.0   1st Qu.: 26.00  
+##  Median :15.0   Median : 36.00  
+##  Mean   :15.4   Mean   : 42.98  
+##  3rd Qu.:19.0   3rd Qu.: 56.00  
+##  Max.   :25.0   Max.   :120.00
+
dim(cars)
+
## [1] 50  2
+

cars is a data frame of 50 rows and two columns. The ?cars documentation specifies that it records the speeds and stopping distances of cars.

+

Let’s say, for some reason, we are only interested in the stopping distances of cars traveling faster than 20 miles per hour.

+
cars_over_20 = subset(cars, speed > 20)
+

The data frame cars_over_20 now holds this information.

+
+

Storing data set objects and making them accessible to other processing scripts.

+

When DataPackageR processes this file, it creates this cars_over_20 object. After processing the file it does several things:

+
+
  1. It compares the objects in the rmarkdown render environment of subsetCars.Rmd against the objects listed in the objects property of the config.yml file.
  2. It finds cars_over_20 is listed there, so it stores it in a new environment.
  3. That environment is passed to subsequent R and Rmd files. Specifically, when the extra.rmd file is processed, it has access to an environment object that holds all the objects (defined in the yaml config) that have already been created and processed. This environment is passed into subsequent scripts at the render() call.
+

All of the above is done automatically. The user only needs to list the objects to be stored and passed to other scripts in the config.yml file.

+

The datapackager_object_read() API can be used to retrieve these objects from the environment.
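
The hand-off described above can be sketched in a few lines of base R. This is only a conceptual illustration and not DataPackageR's internal code: wanted_objects, render_env, shared_env and later_script are hypothetical names, and evalq() stands in for rendering the Rmd file.

```r
# Conceptual sketch only; all names below are illustrative assumptions.

# Objects that the (hypothetical) config asks us to keep.
wanted_objects <- c("cars_over_20")

# Step 1: run the "processing script" in its own environment.
render_env <- new.env()
evalq({
  cars_over_20 <- subset(cars, speed > 20)
  scratch <- nrow(cars)  # a temporary object that is not listed in the config
}, envir = render_env)

# Step 2: keep only the objects that were both created and listed.
found <- intersect(ls(render_env), wanted_objects)
shared_env <- new.env()
for (nm in found) {
  assign(nm, get(nm, envir = render_env), envir = shared_env)
}

# Step 3: a later script receives shared_env and reads from it.
later_script <- function(env) {
  nrow(get("cars_over_20", envir = env))
}
later_script(shared_env)
```

Inside a real build, the lookup in step 3 would go through datapackager_object_read() rather than a direct get() on the environment.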

+
+

Storing objects in the data package

+

In addition to being passed to subsequent scripts in an environment, the cars_over_20 object is stored in the data package /data directory as an rda file.

+

Note that this is all done automatically. The user does not need to explicitly save anything; they only need to list the objects to be stored in the config.yml.

+

This object is then accessible in the resulting package via the data() API, and its documentation is accessible via ?cars_over_20.
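
What storing an object as an rda file amounts to can be illustrated with base R's save() and load(). This is a sketch under the assumption that a temporary directory stands in for the package /data directory; data_dir, rda_file and restored are names chosen for this example, and the build performs the real storage step on its own.

```r
# Sketch only: a temporary directory plays the role of the package's data/ folder.
cars_over_20 <- subset(cars, speed > 20)

data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Write the object to an rda file named after the object.
rda_file <- file.path(data_dir, "cars_over_20.rda")
save(cars_over_20, file = rda_file)

# Load it back into a clean environment, much as data() does for an
# installed package, and compare it with the original.
restored <- new.env()
load(rda_file, envir = restored)
all.equal(restored$cars_over_20, cars_over_20)
```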

+
+
+

Data object documentation

+

The documentation for the cars_over_20 object is created in a subsetCars.R file in the /R directory of the data package.

+

While the data object documentation stub is created automatically, it must be edited by the user to provide additional details about the data object.
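
As an illustration of the kind of detail worth adding, an edited stub might look like the following roxygen2 block. The title, description and field wording are assumptions made for this example (the units are taken from ?cars), and the stub that DataPackageR generates may be laid out differently.

```r
# Hypothetical edited documentation for the cars_over_20 data object.
# The roxygen2 comments are turned into an Rd help page when the package
# documentation is built; the string on the last line names the object.

#' Stopping distances of cars traveling faster than 20 mph
#'
#' A subset of the `cars` data set, restricted to observations with a
#' speed greater than 20 mph.
#'
#' @format A data frame with 7 rows and 2 variables:
#' \describe{
#'   \item{speed}{numeric, speed in mph}
#'   \item{dist}{numeric, stopping distance in ft}
#' }
#' @source Derived from \code{datasets::cars} by subsetCars.Rmd.
"cars_over_20"
```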

+
+