From d6370fdab5c0eb09d63fee5f58278a731c337608 Mon Sep 17 00:00:00 2001 From: gfinak Date: Mon, 9 Jul 2018 09:28:54 -0700 Subject: [PATCH] Address point 2 in #32 - move YAML_CONFIG to vignettes. Issue #24 - Add definition of data package to README. Issue #25 - Move "R CMD build" to section after package_build is introduced. - Extend the "Purpose" section a bit. - Extended "Next Steps" and made it a sub-section. - Referenced "Happy Git and Github for the useR" and Hadley's book on R packages. - Fix typo mtcars2 to mtcars20 --- .Rbuildignore | 2 - DESCRIPTION | 2 +- README.Rmd | 4 +- README.md | 13 +- YAML_CONFIG.md | 326 ----------- vignettes/YAML_CONFIG.R | 69 +++ YAML_CONFIG.Rmd => vignettes/YAML_CONFIG.Rmd | 16 +- vignettes/YAML_CONFIG.html | 572 +++++++++++++++++++ vignettes/YAML_CONFIG.md | 366 ++++++++++++ vignettes/usingDataPackageR.Rmd | 47 +- vignettes/usingDataPackageR.html | 102 ++-- vignettes/usingDataPackageR.md | 102 ++-- 12 files changed, 1195 insertions(+), 426 deletions(-) delete mode 100644 YAML_CONFIG.md create mode 100644 vignettes/YAML_CONFIG.R rename YAML_CONFIG.Rmd => vignettes/YAML_CONFIG.Rmd (93%) create mode 100644 vignettes/YAML_CONFIG.html create mode 100644 vignettes/YAML_CONFIG.md diff --git a/.Rbuildignore b/.Rbuildignore index b02d547..260f51f 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -10,5 +10,3 @@ README.html ^appveyor\.yml$ NEWS.md bibliography.bib -YAML_CONFIG.Rmd -YAML_CONFIG.md \ No newline at end of file diff --git a/DESCRIPTION b/DESCRIPTION index a5f9598..1704fca 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -4,7 +4,7 @@ Title: Construct Reproducible Analytic Data Sets as R Packages Authors@R: c(person(given = "Greg Finak", role=c("aut","cre","cph"), email="gfinak@fredhutch.org"), person(given = "Paul Obrecht", role=c("ctb"))) -Version: 0.14.0 +Version: 0.14.1 Description: Construct reproducible analytic data sets as R packages. 
License: MIT + file LICENSE Depends: R (>= 3.5.0) diff --git a/README.Rmd b/README.Rmd index ebacd3e..2f16754 100644 --- a/README.Rmd +++ b/README.Rmd @@ -25,7 +25,7 @@ DataPackageR is used to reproducibly process raw data into packaged, analysis-re [![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/RGLab/DataPackageR?branch=master&svg=true)](https://ci.appveyor.com/project/RGLab/DataPackageR) [![DOI](https://zenodo.org/badge/29267435.svg)](https://doi.org/10.5281/zenodo.1292095) -- [yaml configuration guide](YAML_CONFIG.md) +- [yaml configuration guide](vignettes/YAML_CONFIG.md) ## What problems does DataPackageR tackle? @@ -40,6 +40,8 @@ You have diverse raw data sets that you need to preprocess and tidy in order to: ### Why package data sets? +**Definition:** A *data package* is a formal R package whose sole purpose is to contain, access, and / or document data sets. + - **Reproducibility.** As described [elsewhere](https://github.com/ropensci/rrrpkg), packaging your data promotes reproducibility. diff --git a/README.md b/README.md index 0a593b3..8cb0acc 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ status](https://codecov.io/gh/RGLab/DataPackageR/branch/master/graph/badge.svg)] status](https://ci.appveyor.com/api/projects/status/github/RGLab/DataPackageR?branch=master&svg=true)](https://ci.appveyor.com/project/RGLab/DataPackageR) [![DOI](https://zenodo.org/badge/29267435.svg)](https://doi.org/10.5281/zenodo.1292095) - - [yaml configuration guide](YAML_CONFIG.md) + - [yaml configuration guide](vignettes/YAML_CONFIG.md) ## What problems does DataPackageR tackle? @@ -30,6 +30,9 @@ order to: ### Why package data sets? +**Definition:** A *data package* is a formal R package whose sole +purpose is to contain, access, and / or document data sets. + - **Reproducibility.** As described [elsewhere](https://github.com/ropensci/rrrpkg), @@ -198,7 +201,7 @@ DataPackageR::datapackage_skeleton( #> Saving functions and data ... 
#> Making help files ... #> Done. -#> Further steps are described in '/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmpwy0gLc/mtcars20/Read-and-delete-me'. +#> Further steps are described in '/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpNPoc2o/mtcars20/Read-and-delete-me'. #> Adding DataVersion string to DESCRIPTION #> Creating data and data-raw directories #> configuring yaml file @@ -211,12 +214,12 @@ DataPackageR:::package_build(file.path(tempdir(),"mtcars20")) #> processing file: subsetCars.Rmd #> output file: subsetCars.knit.md #> -#> Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmpwy0gLc/mtcars20/inst/extdata/Logfiles/subsetCars.html +#> Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpNPoc2o/mtcars20/inst/extdata/Logfiles/subsetCars.html #> First time using roxygen2. Upgrading automatically... -#> Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmpwy0gLc/mtcars20/DESCRIPTION +#> Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpNPoc2o/mtcars20/DESCRIPTION #> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \ #> --no-environ --no-save --no-restore --quiet CMD build \ -#> '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmpwy0gLc/mtcars20' \ +#> '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpNPoc2o/mtcars20' \ #> --no-resave-data --no-manual --no-build-vignettes #> diff --git a/YAML_CONFIG.md b/YAML_CONFIG.md deleted file mode 100644 index cb9acf4..0000000 --- a/YAML_CONFIG.md +++ /dev/null @@ -1,326 +0,0 @@ - -# Configuring and controlling DataPackageR builds. - -Data package builds are controlled using the `config.yml` file. - -This file is created in the package source tree when the user creates a -package using `datapackage_skeleton()`. - -It is automatically populated with the names of the `code_files` and -`data_objects` the passed in to datapackage\_skeleton. 
- -## The `config.yml` file. - -The structure of a correctly formatted `config.yml` file is shown below: - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '584762' - -## YAML config file properties. - -The main section of the file is the `configuration:` section. - -It has three properties: - - - `files:` - - The files (`R` or `Rmd`) to be processed by DataPackageR. They are - processed in the order shown. Users running multi-script workflows - with dependencies between the scripts need to ensure the files are - processed in the correct order. - - Here `subsetCars.Rmd` is the only file to process. - - Each file itself has several properties: - - - `name:` The name of the file. This is transformed to an absolute - path within the package. - - - `enabled:` A logical `yes`, `no` flag indicating whether the - file should be rendered during the build, or whether it should - be skipped. This is useful for ‘turning off’ long running - processing tasks if they have not changed. Disabling processing - of a file will not overwrite existing documentation or data - objecs created during previous builds. - - - `objects:` - - The names of the data objects created by the processing files, to be - stored in the package. These names are compared against the objects - created in the render environment by each file. They names must - match. - - - `render_root:` - - The directory where the `Rmd` or `R` files will be rendered. - Defaults to a randomly named subdirectory of `tempdir()`. Allows - workflows that use multiple scripts and create file system artifacts - to function correctly by simply writing to and reading from the - working directory. - -## Editing the YAML config file. - -The structure of the YAML is simple enough to understand but complex -enough that it can be a pain to edit by hand. - -DataPackageR provides a number of API calls to construct, read, modify, -and write the yaml config file. 
- -### API calls - -#### `construct_yml_config` - -Make an r object representing a YAML config file. - -##### Example - -The YAML config shown above was created by: - -``` r -# Note this is done by the datapackage_skeleton. -# The user doesn't usually need to call -# construct_yml_config() -yml <- DataPackageR::construct_yml_config( - code = "subsetCars.Rmd", - data = "cars_over_20" - ) -``` - -#### `yml_find` - -Read a yaml config file from a package path into an r object. \#\#\#\#\# -Example Read the YAML config file from the `mtcars20` example. - -``` r -# returns an r object representation of -# the config file. -mtcars20_config <- yml_find( - file.path(tempdir(),"mtcars20") - ) -``` - -#### `yml_list_objects` - -List the `objects` in a config read by `yml_find`. - -##### Example - -``` r - yml_list_objects(yml) -``` - - cars_over_20 - -#### `yml_list_files` - -List the `files` in a config read by `yml_find`. - -##### Example - -``` r - yml_list_files(yml) -``` - - subsetCars.Rmd - -#### `yml_disable_compile` - -Disable compilation of named files in a config read by `yml_find`. - -##### Example - -``` r -yml_disabled <- yml_disable_compile( - yml, - filenames = "subsetCars.Rmd") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: no - objects: cars_over_20 - render_root: - tmp: '799346' - -#### `yml_enable_compile` - -Enable compilation of named files in a config read by `yml_find`. - -##### Example - -``` r -yml_enabled <- yml_enable_compile( - yml, - filenames = "subsetCars.Rmd") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '799346' - -#### `yml_add_files` - -Add named files to a config read by `yml_find`. 
- -##### Example - -``` r -yml_twofiles <- yml_add_files( - yml, - filenames = "anotherFile.Rmd") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - anotherFile.Rmd: - name: anotherFile.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '799346' - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - anotherFile.Rmd: - name: anotherFile.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '799346' - -#### `yml_add_objects` - -Add named objects to a config read by `yml_find`. - -##### Example - -``` r -yml_twoobj <- yml_add_objects( - yml_twofiles, - objects = "another_object") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - anotherFile.Rmd: - name: anotherFile.Rmd - enabled: yes - objects: - - cars_over_20 - - another_object - render_root: - tmp: '799346' - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - anotherFile.Rmd: - name: anotherFile.Rmd - enabled: yes - objects: - - cars_over_20 - - another_object - render_root: - tmp: '799346' - -#### `yml_remove_files` - -Remove named files from a config read by `yml_find`. - -##### Example - -``` r -yml_twoobj <- yml_remove_files( - yml_twoobj, - filenames = "anotherFile.Rmd") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: - - cars_over_20 - - another_object - render_root: - tmp: '799346' - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: - - cars_over_20 - - another_object - render_root: - tmp: '799346' - -#### `yml_remove_objects` - -Remove named objects from a config read by `yml_find`. 
- -##### Example - -``` r -yml_oneobj <- yml_remove_objects( - yml_twoobj, - objects = "another_object") -``` - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '799346' - - configuration: - files: - subsetCars.Rmd: - name: subsetCars.Rmd - enabled: yes - objects: cars_over_20 - render_root: - tmp: '799346' - -#### `yml_write` - -Write a modified config to its package path. - -##### Example - -``` r -yml_write(yml_oneobj, path = "path_to_package") -``` - -The `yml_oneobj` read by `yml_find()` carries an attribute that is the -path to the package. The user doesn’t need to pass a `path` to -`yml_write` if the config has been read by `yml_find`. diff --git a/vignettes/YAML_CONFIG.R b/vignettes/YAML_CONFIG.R new file mode 100644 index 0000000..9056056 --- /dev/null +++ b/vignettes/YAML_CONFIG.R @@ -0,0 +1,69 @@ +## ---- echo = FALSE, results = 'hide'------------------------------------- +library(DataPackageR) +library(yaml) +yml <- DataPackageR::construct_yml_config(code = "subsetCars.Rmd", data = "cars_over_20") + +## ---- echo = FALSE, comment=""------------------------------------------- +cat(yaml::as.yaml(yml)) + +## ------------------------------------------------------------------------ +# Note this is done by the datapackage_skeleton. +# The user doesn't usually need to call +# construct_yml_config() +yml <- DataPackageR::construct_yml_config( + code = "subsetCars.Rmd", + data = "cars_over_20" + ) + +## ----eval=FALSE---------------------------------------------------------- +# # returns an r object representation of +# # the config file. 
+# mtcars20_config <- yml_find( +# file.path(tempdir(),"mtcars20") +# ) + +## ---- comment=""--------------------------------------------------------- + yml_list_objects(yml) + +## ---- comment=""--------------------------------------------------------- + yml_list_files(yml) + +## ---- comment="", echo = 1----------------------------------------------- +yml_disabled <- yml_disable_compile( + yml, + filenames = "subsetCars.Rmd") +cat(as.yaml(yml_disabled)) + +## ---- comment="", echo = 1----------------------------------------------- +yml_enabled <- yml_enable_compile( + yml, + filenames = "subsetCars.Rmd") +cat(as.yaml(yml_enabled)) + +## ---- comment="", echo = 1----------------------------------------------- +yml_twofiles <- yml_add_files( + yml, + filenames = "anotherFile.Rmd") +cat(as.yaml(yml_twofiles)) + +## ---- comment="", echo = 1----------------------------------------------- +yml_twoobj <- yml_add_objects( + yml_twofiles, + objects = "another_object") +cat(as.yaml(yml_twoobj)) + +## ---- comment="", echo = 1----------------------------------------------- +yml_twoobj <- yml_remove_files( + yml_twoobj, + filenames = "anotherFile.Rmd") +cat(as.yaml(yml_twoobj)) + +## ---- comment="", echo = 1----------------------------------------------- +yml_oneobj <- yml_remove_objects( + yml_twoobj, + objects = "another_object") +cat(as.yaml(yml_oneobj)) + +## ---- eval = FALSE------------------------------------------------------- +# yml_write(yml_oneobj, path = "path_to_package") + diff --git a/YAML_CONFIG.Rmd b/vignettes/YAML_CONFIG.Rmd similarity index 93% rename from YAML_CONFIG.Rmd rename to vignettes/YAML_CONFIG.Rmd index 74ea9cc..a82c3c0 100644 --- a/YAML_CONFIG.Rmd +++ b/vignettes/YAML_CONFIG.Rmd @@ -1,6 +1,17 @@ --- -output: github_document -bibliography: bibliography.bib +title: "The DataPackageR YAML configuration file." 
+author: "Greg Finak " +date: "`r Sys.Date()`" +output: + rmarkdown::html_vignette: + keep_md: TRUE + toc: yes + bibliography: bibliography.bib +vignette: > + %\VignetteIndexEntry{DataPackageR YAML configuration.} + %\VignetteEngine{knitr::rmarkdown} + \usepackage[utf8]{inputenc} + \usepackage{graphicx} editor_options: chunk_output_type: inline --- @@ -83,6 +94,7 @@ yml <- DataPackageR::construct_yml_config( #### `yml_find` Read a yaml config file from a package path into an r object. + ##### Example Read the YAML config file from the `mtcars20` example. diff --git a/vignettes/YAML_CONFIG.html b/vignettes/YAML_CONFIG.html new file mode 100644 index 0000000..9410305 --- /dev/null +++ b/vignettes/YAML_CONFIG.html @@ -0,0 +1,572 @@ + + + + + + + + + + + + + + + + +The DataPackageR YAML configuration file. + + + + + + + + + + + + + + + + + +

The DataPackageR YAML configuration file.

+

Greg Finak gfinak@fredhutch.org

+

2018-07-09

+ + + + +
+

Configuring and controlling DataPackageR builds.

+

Data package builds are controlled using the datapackager.yml file.

+

This file is created in the package source tree when the user creates a package using datapackage_skeleton().

+

It is automatically populated with the names of the code_files and data_objects passed in to datapackage_skeleton().

+
+

The datapackager.yml file.

+

The structure of a correctly formatted datapackager.yml file is shown below:

+
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+  objects: cars_over_20
+  render_root:
+    tmp: '738856'
+
+
+

YAML config file properties.

+

The main section of the file is the configuration: section.

+

It has three properties:

+
    +
  • files:

    +

    The files (R or Rmd) to be processed by DataPackageR. They are processed in the order shown. Users running multi-script workflows with dependencies between the scripts need to ensure the files are processed in the correct order.

    +

    Here subsetCars.Rmd is the only file to process.

    +

    Each file itself has several properties:

    +
      +
    • name: The name of the file. This is transformed to an absolute path within the package.

    • +
    • enabled: A logical (yes / no) flag indicating whether the file should be rendered during the build or skipped. This is useful for ‘turning off’ long-running processing tasks if they have not changed. Disabling processing of a file will not overwrite existing documentation or data objects created during previous builds.

    • +
  • +
  • objects:

    +The names of the data objects created by the processing files, to be stored in the package. These names are compared against the objects created in the render environment by each file. The names must match.
  • +
  • render_root:

    +

    The directory where the Rmd or R files will be rendered. Defaults to a randomly named subdirectory of tempdir(). Allows workflows that use multiple scripts and create file system artifacts to function correctly by simply writing to and reading from the working directory.

  • +
+
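The properties described above can be inspected by parsing the example config with the `yaml` package. This is a minimal sketch, assuming `yaml` is installed; the config text is copied from the example in this vignette, and the variable names `config_text` and `cfg` are illustrative only.

``` r
library(yaml)

# Config text copied from the example shown in this vignette.
config_text <- "
configuration:
  files:
    subsetCars.Rmd:
      name: subsetCars.Rmd
      enabled: yes
  objects: cars_over_20
  render_root:
    tmp: '738856'
"

# Parse the text and keep the main `configuration:` section.
cfg <- yaml::yaml.load(config_text)$configuration

# The top-level properties: files, objects, render_root.
names(cfg)

# Files to process, in the order they are listed.
names(cfg$files)

# Per-file `enabled` flags; the yaml package reads yes / no as logicals.
vapply(cfg$files, function(f) isTRUE(f$enabled), logical(1))

# Data objects the build expects the scripts to create.
cfg$objects
```

Reading the file this way is only for inspection; edits are better made through the DataPackageR API calls described in this vignette.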
+
+

Editing the YAML config file.

+

The structure of the YAML is simple enough to understand but complex enough that it can be a pain to edit by hand.

+

DataPackageR provides a number of API calls to construct, read, modify, and write the yaml config file.

+
+

API calls

+
+

construct_yml_config

+

Make an R object representing a YAML config file.

+ +
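A minimal sketch of this call, mirroring the chunk in `vignettes/YAML_CONFIG.R`. It assumes the DataPackageR and yaml packages are installed; the `requireNamespace()` guard is an addition for illustration and is not part of the vignette.

``` r
# Assumption: DataPackageR and yaml are installed. The guard skips the
# example when either package is missing.
if (requireNamespace("DataPackageR", quietly = TRUE) &&
    requireNamespace("yaml", quietly = TRUE)) {
  # datapackage_skeleton() normally does this for you.
  yml <- DataPackageR::construct_yml_config(
    code = "subsetCars.Rmd",
    data = "cars_over_20"
  )
  # Print the config as YAML text to inspect it.
  cat(yaml::as.yaml(yml))
}
```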
+
+

yml_find

+

Read a YAML config file from a package path into an R object.

+ +
+
+

yml_list_objects

+

List the objects in a config read by yml_find.

+
+
Example
+ +
cars_over_20
+
+
+
+

yml_list_files

+

List the files in a config read by yml_find.

+
+
Example
+ +
subsetCars.Rmd
+
+
+
+

yml_disable_compile

+

Disable compilation of named files in a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: no
+  objects: cars_over_20
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_enable_compile

+

Enable compilation of named files in a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+  objects: cars_over_20
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_add_files

+

Add named files to a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+    anotherFile.Rmd:
+      name: anotherFile.Rmd
+      enabled: yes
+  objects: cars_over_20
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_add_objects

+

Add named objects to a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+    anotherFile.Rmd:
+      name: anotherFile.Rmd
+      enabled: yes
+  objects:
+  - cars_over_20
+  - another_object
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_remove_files

+

Remove named files from a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+  objects:
+  - cars_over_20
+  - another_object
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_remove_objects

+

Remove named objects from a config read by yml_find.

+
+
Example
+ +
configuration:
+  files:
+    subsetCars.Rmd:
+      name: subsetCars.Rmd
+      enabled: yes
+  objects: cars_over_20
+  render_root:
+    tmp: '180706'
+
+
+
+

yml_write

+

Write a modified config to its package path.

+
+
Example
+ +

A config read by yml_find() carries an attribute that is the path to the package, so the user doesn’t need to pass a path to yml_write for such a config.
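A read, modify, write round trip might look like the sketch below. It uses only the calls documented in this vignette and assumes that DataPackageR is installed and that the `mtcars20` example package already exists under `tempdir()`. The path and the guards are illustrative assumptions; nothing runs if either is missing.

``` r
# Hypothetical location: assumes the mtcars20 example package was
# created here earlier with datapackage_skeleton().
pkg_path <- file.path(tempdir(), "mtcars20")

if (requireNamespace("DataPackageR", quietly = TRUE) &&
    dir.exists(pkg_path)) {
  # Read the config from the package source tree.
  cfg <- DataPackageR::yml_find(pkg_path)

  # Skip the long-running script on the next build.
  cfg <- DataPackageR::yml_disable_compile(
    cfg,
    filenames = "subsetCars.Rmd"
  )

  # No `path` argument: the config was read by yml_find().
  DataPackageR::yml_write(cfg)
}
```

Re-enabling the file later follows the same pattern with `yml_enable_compile()`.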

+
+
+
+
+
+ + + + + + + + diff --git a/vignettes/YAML_CONFIG.md b/vignettes/YAML_CONFIG.md new file mode 100644 index 0000000..17e0a59 --- /dev/null +++ b/vignettes/YAML_CONFIG.md @@ -0,0 +1,366 @@ +--- +title: "The DataPackageR YAML configuration file." +author: "Greg Finak " +date: "2018-07-09" +output: + rmarkdown::html_vignette: + keep_md: TRUE + toc: yes + bibliography: bibliography.bib +vignette: > + %\VignetteIndexEntry{DataPackageR YAML configuration.} + %\VignetteEngine{knitr::rmarkdown} + \usepackage[utf8]{inputenc} + \usepackage{graphicx} +editor_options: + chunk_output_type: inline +--- + +# Configuring and controlling DataPackageR builds. + +Data package builds are controlled using the `datapackager.yml` file. + +This file is created in the package source tree when the user creates a package using `datapackage_skeleton()`. + +It is automatically populated with the names of the `code_files` and `data_objects` the passed in to datapackage_skeleton. + +## The `datapackager.yml` file. + +The structure of a correctly formatted `datapackager.yml` file is shown below: + + + + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '738856' +``` + +## YAML config file properties. + +The main section of the file is the `configuration:` section. + +It has three properties: + +- `files:` + + The files (`R` or `Rmd`) to be processed by DataPackageR. They are processed in the order shown. Users running multi-script workflows with dependencies between the scripts need to ensure the files are processed in the correct order. + + Here `subsetCars.Rmd` is the only file to process. + + Each file itself has several properties: + + - `name:` + The name of the file. This is transformed to an absolute path within the package. + + - `enabled:` + A logical `yes`, `no` flag indicating whether the file should be rendered during the build, or whether it should be skipped. 
+ This is useful for 'turning off' long running processing tasks if they have not changed. Disabling processing of a file will not overwrite existing documentation or data objecs created during previous builds. + +- `objects:` + + The names of the data objects created by the processing files, to be stored in the package. These names are compared against the objects created in the render environment by each file. They names must match. +- `render_root:` + + The directory where the `Rmd` or `R` files will be rendered. Defaults to a randomly named subdirectory of `tempdir()`. Allows workflows that use multiple scripts and create file system artifacts to function correctly by simply writing to and reading from the working directory. + +## Editing the YAML config file. + +The structure of the YAML is simple enough to understand but complex enough that it can be a pain to edit by hand. + +DataPackageR provides a number of API calls to construct, read, modify, and write the yaml config file. + +### API calls + +#### `construct_yml_config` + + Make an r object representing a YAML config file. + +##### Example + The YAML config shown above was created by: + +```r +# Note this is done by the datapackage_skeleton. +# The user doesn't usually need to call +# construct_yml_config() +yml <- DataPackageR::construct_yml_config( + code = "subsetCars.Rmd", + data = "cars_over_20" + ) +``` + + +#### `yml_find` + + Read a yaml config file from a package path into an r object. + +##### Example + Read the YAML config file from the `mtcars20` example. + + +```r +# returns an r object representation of +# the config file. +mtcars20_config <- yml_find( + file.path(tempdir(),"mtcars20") + ) +``` + +#### `yml_list_objects` + + List the `objects` in a config read by `yml_find`. + + +##### Example + + +```r + yml_list_objects(yml) +``` + +``` +cars_over_20 +``` + +#### `yml_list_files` + + List the `files` in a config read by `yml_find`. 
+ +##### Example + + +```r + yml_list_files(yml) +``` + +``` +subsetCars.Rmd +``` + +#### `yml_disable_compile` + + Disable compilation of named files in a config read by `yml_find`. + +##### Example + + +```r +yml_disabled <- yml_disable_compile( + yml, + filenames = "subsetCars.Rmd") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: no + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +#### `yml_enable_compile` + + Enable compilation of named files in a config read by `yml_find`. + +##### Example + + +```r +yml_enabled <- yml_enable_compile( + yml, + filenames = "subsetCars.Rmd") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +#### `yml_add_files` + + Add named files to a config read by `yml_find`. + +##### Example + + +```r +yml_twofiles <- yml_add_files( + yml, + filenames = "anotherFile.Rmd") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + anotherFile.Rmd: + name: anotherFile.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + anotherFile.Rmd: + name: anotherFile.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +#### `yml_add_objects` + + Add named objects to a config read by `yml_find`. 
+ +##### Example + + +```r +yml_twoobj <- yml_add_objects( + yml_twofiles, + objects = "another_object") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + anotherFile.Rmd: + name: anotherFile.Rmd + enabled: yes + objects: + - cars_over_20 + - another_object + render_root: + tmp: '180706' +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + anotherFile.Rmd: + name: anotherFile.Rmd + enabled: yes + objects: + - cars_over_20 + - another_object + render_root: + tmp: '180706' +``` + +#### `yml_remove_files` + + Remove named files from a config read by `yml_find`. + +##### Example + + +```r +yml_twoobj <- yml_remove_files( + yml_twoobj, + filenames = "anotherFile.Rmd") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: + - cars_over_20 + - another_object + render_root: + tmp: '180706' +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: + - cars_over_20 + - another_object + render_root: + tmp: '180706' +``` + +#### `yml_remove_objects` + + Remove named objects from a config read by `yml_find`. + +##### Example + + +```r +yml_oneobj <- yml_remove_objects( + yml_twoobj, + objects = "another_object") +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +``` +configuration: + files: + subsetCars.Rmd: + name: subsetCars.Rmd + enabled: yes + objects: cars_over_20 + render_root: + tmp: '180706' +``` + +#### `yml_write` + + Write a modified config to its package path. + +##### Example + + +```r +yml_write(yml_oneobj, path = "path_to_package") +``` + +The `yml_oneobj` read by `yml_find()` carries an attribute +that is the path to the package. The user doesn't need to pass a `path` to `yml_write` if the config has been read by `yml_find`. 
diff --git a/vignettes/usingDataPackageR.Rmd b/vignettes/usingDataPackageR.Rmd index c810e5c..9bfe472 100644 --- a/vignettes/usingDataPackageR.Rmd +++ b/vignettes/usingDataPackageR.Rmd @@ -23,7 +23,15 @@ knitr::opts_chunk$set( ## Purpose -This vignette demonstrates how to use DataPackageR to build a datapackage from the `mtcars` data set. +This vignette demonstrates how to use DataPackageR to build a data package. + +DataPackageR aims to simplify data package construction. + +It provides mechanisms for reproducibly preprocessing and tidying raw data into documented, versioned, and packaged analysis-ready data sets. + +Long-running or computationally intensive data processing can be decoupled from the usual `R CMD build` process while maintaining [data lineage](https://en.wikipedia.org/wiki/Data_lineage). + +In this vignette we will subset and package the `mtcars` data set. ## Set up a new data package. @@ -64,7 +72,7 @@ DataPackageR::datapackage_skeleton( ### What's in the package skeleton structure? -This has created a datapackage source tree named "mtcars2" (in a temporary directory). +This has created a datapackage source tree named "mtcars20" (in a temporary directory). For a real use case you would pick a `path` on your filesystem where you could then initialize a new github repository for the package. The contents of `mtcars20` are: @@ -107,7 +115,7 @@ The objects must be listed in the yaml configuration file. `datapackage_skeleton DataPackageR provides an API for modifying this file, so it does not need to be done by hand. -Further information on the contents of the YAML configuration file, and the API are in the [YAML Configuration Details](https://github.com/RGLab/DataPackageR/blob/master/YAML_CONFIG.md) +Further information on the contents of the YAML configuration file, and the API are in the [YAML Configuration Details](YAML_CONFIG.html) ### Where do I put raw data? 
@@ -136,6 +144,11 @@ Once the skeleton framework is set up, DataPackageR:::package_build(file.path(tempdir(),"mtcars20")) ``` + +### Why not just use R CMD build? + +If the processing script is time consuming or the data set is particularly large, then `R CMD build` would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. DataPackageR decouples data processing from package building/installation for data consumers. + ### A log of the build process DataPackageR uses the `futile.logger` pagckage to log progress. @@ -219,16 +232,12 @@ assert_data_version(data_package_name = "mtcars20", #and provides an informative error. ``` -# Next steps - -You should place the data package source directory under `git` version control. -This allows you to version control your data processing code. # Partial builds and migrating old data packages. Version 1.12.0 has moved away from controlling the build process using `datasets.R` and an additional `masterfile` argument. -The build process is now controlled via a `datapackager.yml` configuration file located in the package root directory. (see [YAML Configuration Details](https://github.com/RGLab/DataPackageR/blob/master/YAML_CONFIG.md)) +The build process is now controlled via a `datapackager.yml` configuration file located in the package root directory. (see [YAML Configuration Details](YAML_CONFIG.html)) You can migrate an old package by constructing such a config file using the `construct_yml_config()` API. @@ -294,6 +303,24 @@ Passing of data objects amongst scripts can be turned off via: `package_build(deps = FALSE)` +# Next steps + +We recommend the following once your package is created. + +## Place your package under source control + +You now have a data package source tree. + +- **Place your package under version control** + 1. 
Call `git init` in the package source root to initialize a new git repository. + 2. [Create a new repository for your data package on github](https://help.github.com/articles/create-a-repo/). + 3. Push your local package repository to `github`. [see step 7](https://help.github.com/articles/adding-an-existing-project-to-github-using-the-command-line/) + + +This will let you version control your data processing code, and provide a mechanism for sharing your package with others. + + +For more details on using git and github with R, there is an excellent guide provided by Jenny Bryan: [Happy Git and GitHub for the useR](http://happygitwithr.com/) and Hadley Wickham's [book on R packages](http://r-pkgs.had.co.nz/). # Additional Details @@ -326,10 +353,6 @@ The description file has the new `DataVersion` string. cat(readLines(file.path(tempdir(),"mtcars20","DESCRIPTION")),sep="\n") ``` -## Why not use R CMD build? - -If the processing script is time consuming or the data set is particularly large, then `R CMD build` would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. In such cases, DataPackageR provides a mechanism to decouple data processing from package building/installation for downstream users of the data. - diff --git a/vignettes/usingDataPackageR.html b/vignettes/usingDataPackageR.html index 7e9cdcb..a350757 100644 --- a/vignettes/usingDataPackageR.html +++ b/vignettes/usingDataPackageR.html @@ -12,7 +12,7 @@ - + Using DataPackageR @@ -280,7 +280,7 @@

Using DataPackageR

Greg Finak gfinak@fredhutch.org

-

2018-07-05

+

2018-07-09

Purpose

-

This vignette demonstrates how to use DataPackageR to build a datapackage from the mtcars data set.

+

This vignette demonstrates how to use DataPackageR to build a data package.

+

DataPackageR aims to simplify data package construction.

+

It provides mechanisms for reproducibly preprocessing and tidying raw data into documented, versioned, and packaged analysis-ready data sets.

+

Long-running or computationally intensive data processing can be decoupled from the usual R CMD build process while maintaining data lineage.

+

In this vignette we will subset and package the mtcars data set.

What’s in the package skeleton structure?

-

This has created a datapackage source tree named “mtcars2” (in a temporary directory). For a real use case you would pick a path on your filesystem where you could then initialize a new github repository for the package.

+

This has created a datapackage source tree named “mtcars20” (in a temporary directory). For a real use case you would pick a path on your filesystem where you could then initialize a new github repository for the package.

The contents of mtcars20 are:

                levelName
 1  mtcars20              
@@ -389,12 +395,12 @@ 

A few words about the YAML config file

enabled: yes objects: cars_over_20 render_root: - tmp: '95288'
+ tmp: '694862'

The two main pieces of information in the configuration are a list of the files to be processed and the data sets the package will store.

This example packages an R data set named cars_over_20 (the name was passed in to datapackage_skeleton()). It is created by the subsetCars.Rmd file.

The objects must be listed in the yaml configuration file. datapackage_skeleton() ensures this is done for you automatically.

DataPackageR provides an API for modifying this file, so it does not need to be done by hand.

-

Further information on the contents of the YAML configuration file, and the API are in the YAML Configuration Details

+

Further information on the contents of the YAML configuration file, and the API are in the YAML Configuration Details
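
As a rough illustration of that API, the sketch below builds a configuration in memory and edits its object list. It assumes DataPackageR is installed and that `construct_yml_config()`, `yml_add_objects()`, `yml_remove_objects()` and `yml_list_objects()` behave as described in the YAML configuration guide; the object name `cars_under_20` is hypothetical and used only for illustration.

```r
# Sketch only: assumes DataPackageR is installed. The configuration is built
# in memory, so nothing on disk is touched.
library(DataPackageR)

# A configuration with one processing script and one data object,
# mirroring the one datapackage_skeleton() generated for mtcars20.
config <- construct_yml_config(code = "subsetCars.Rmd", data = "cars_over_20")

# Register a second object; "cars_under_20" is a hypothetical name.
config <- yml_add_objects(config, objects = "cars_under_20")
yml_list_objects(config)

# Remove it again so the configuration matches what subsetCars.Rmd produces.
config <- yml_remove_objects(config, objects = "cars_under_20")
yml_list_objects(config)
```

For an existing package, the same edits would normally start from `yml_find()` on the package root and end with `yml_write()` so that the change reaches `datapackager.yml`.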

Where do I put raw data?

@@ -404,8 +410,8 @@

Where do I put raw data?

An API to locate data sets within an R or Rmd file.

To locate the data to read from the filesystem:

    -
  • DataPackageR::project_extdata_path() to get the path to inst/extdata from inside an Rmd or R file. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/mtcars20/inst/extdata)

  • -
  • DataPackageR::project_path() to get the path to the datapackage root. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/mtcars20)

  • +
  • DataPackageR::project_extdata_path() to get the path to inst/extdata from inside an Rmd or R file. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/mtcars20/inst/extdata)

  • +
  • DataPackageR::project_path() to get the path to the datapackage root. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/mtcars20)

Raw data stored externally can be retrieved relative to these paths.
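
A minimal sketch of how a processing script might use these paths is shown below. The helper `read_raw_csv()` and the file name `raw_measurements.csv` are hypothetical; only `DataPackageR::project_extdata_path()` comes from the package, and it is expected to resolve correctly while `package_build()` is running the script.

```r
# Hypothetical helper for a processing script in data-raw/ (e.g. subsetCars.Rmd).
# It is meant to be called while package_build() is running the script.
read_raw_csv <- function(filename) {
  # Resolve the file relative to the package's inst/extdata directory.
  raw_file <- file.path(DataPackageR::project_extdata_path(), filename)
  if (!file.exists(raw_file)) {
    stop("Raw data file not found: ", raw_file)
  }
  utils::read.csv(raw_file, stringsAsFactors = FALSE)
}

# Usage inside the script; "raw_measurements.csv" is a placeholder file name.
# raw <- read_raw_csv("raw_measurements.csv")
```

Building the path from `project_extdata_path()` rather than hard-coding it keeps the script portable across machines and temporary build directories.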

@@ -417,36 +423,40 @@

Build the data package.

# Run the preprocessing code to build cars_over_20
 # and reproducibly enclose it in a package.
 DataPackageR:::package_build(file.path(tempdir(),"mtcars20"))
-INFO [2018-07-05 11:41:30] Logging to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/processing.log
-INFO [2018-07-05 11:41:30] Processing data
-INFO [2018-07-05 11:41:30] Reading yaml configuration
-INFO [2018-07-05 11:41:30] Found /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/data-raw/subsetCars.Rmd
-INFO [2018-07-05 11:41:30] Processing 1 of 1: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/data-raw/subsetCars.Rmd
+INFO [2018-07-09 09:24:58] Logging to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/processing.log
+INFO [2018-07-09 09:24:58] Processing data
+INFO [2018-07-09 09:24:58] Reading yaml configuration
+INFO [2018-07-09 09:24:58] Found /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/data-raw/subsetCars.Rmd
+INFO [2018-07-09 09:24:58] Processing 1 of 1: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/data-raw/subsetCars.Rmd
 processing file: subsetCars.Rmd
 output file: subsetCars.knit.md
-/usr/local/bin/pandoc +RTS -K512m -RTS subsetCars.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/subsetCars.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/rmarkdown-str1c6861088f2d.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' 
+/usr/local/bin/pandoc +RTS -K512m -RTS subsetCars.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/subsetCars.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/rmarkdown-straf102acc05f9.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' 
 
-Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/subsetCars.html
-INFO [2018-07-05 11:41:30] 1 required data objects created by subsetCars.Rmd
-INFO [2018-07-05 11:41:30] Saving to data
-INFO [2018-07-05 11:41:30] Copied documentation to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/R/mtcars20.R
+Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/subsetCars.html
+INFO [2018-07-09 09:24:59] 1 required data objects created by subsetCars.Rmd
+INFO [2018-07-09 09:24:59] Saving to data
+INFO [2018-07-09 09:24:59] Copied documentation to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/R/mtcars20.R
 ✔ Creating 'vignettes/'
 ✔ Creating 'inst/doc/'
-INFO [2018-07-05 11:41:30] Done
-INFO [2018-07-05 11:41:30] DataPackageR succeeded
-INFO [2018-07-05 11:41:30] Building documentation
+INFO [2018-07-09 09:24:59] Done
+INFO [2018-07-09 09:24:59] DataPackageR succeeded
+INFO [2018-07-09 09:24:59] Building documentation
 First time using roxygen2. Upgrading automatically...
-Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/DESCRIPTION
+Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/DESCRIPTION
 Writing NAMESPACE
 Writing mtcars20.Rd
 Writing cars_over_20.Rd
-INFO [2018-07-05 11:41:30] Building package
+INFO [2018-07-09 09:24:59] Building package
 '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
   --no-environ --no-save --no-restore --quiet CMD build  \
-  '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20'  \
+  '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20'  \
   --no-resave-data --no-manual --no-build-vignettes 
 
-[1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20_1.0.tar.gz"
+[1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20_1.0.tar.gz" +
+

Why not just use R CMD build?

+

If the processing script is time consuming or the data set is particularly large, then R CMD build would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. DataPackageR decouples data processing from package building/installation for data consumers.

+

A log of the build process

DataPackageR uses the futile.logger package to log progress.

@@ -546,14 +556,10 @@

Using the DataVersion

#and provides an informative error.
-
-

Next steps

-

You should place the data package source directory under git version control. This allows you to version control your data processing code.

-

Partial builds and migrating old data packages.

Version 1.12.0 has moved away from controlling the build process using datasets.R and an additional masterfile argument.

-

The build process is now controlled via a datapackager.yml configuration file located in the package root directory. (see YAML Configuration Details)

+

The build process is now controlled via a datapackager.yml configuration file located in the package root directory. (see YAML Configuration Details)

You can migrate an old package by constructing such a config file using the construct_yml_config() API.

+ tmp: '141841'

config is a newly constructed yaml configuration object. It can be written to the package directory:

@@ -592,7 +598,7 @@

Partial builds

- object1 - object2 render_root: - tmp: '288022' + tmp: '141841'

Note that the modified configuration needs to be written back to the package source directory in order for the changes to take effect.

The consequence of toggling a file to enable: no is that it will be skipped when the package is rebuilt, but the data will still be retained in the package, and the documentation will not be altered.

This is useful in situations where we have multiple data sets, and want to re-run one script to update a specific data set, but not the other scripts because they may be too time consuming, for example.
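
The toggle can be sketched in memory as follows. This assumes DataPackageR is installed; the script and object names are hypothetical, and the calls used are `construct_yml_config()`, `yml_disable_compile()` and `yml_enable_compile()` as documented in the YAML configuration guide.

```r
# Sketch only: assumes DataPackageR is installed. Script and object names
# are hypothetical placeholders for a package with two processing scripts.
library(DataPackageR)

config <- construct_yml_config(
  code = c("script1.Rmd", "script2.Rmd"),
  data = c("script1_dataset", "script2_dataset")
)

# Mark the slow script as disabled so a rebuild would skip it.
disabled <- yml_disable_compile(config, filenames = "script1.Rmd")

# Turn it back on once its output needs refreshing again.
reenabled <- yml_enable_compile(disabled, filenames = "script1.Rmd")
```

In a real package the modified configuration would then be written back to the package root (for example with `yml_write()`) before calling `package_build()` again.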

@@ -614,6 +620,26 @@

File system artifacts

Passing data objects to subsequent scripts.

A script (e.g., script2.Rmd) running after script1.Rmd can access a stored data object named script1_dataset created by script1.Rmd by calling

DataPackageR::datapackager_object_read("script1_dataset").

+

Passing of data objects amongst scripts can be turned off via:

+

package_build(deps = FALSE)
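
The inline call above omits the package path. A fuller form might look like the sketch below, which assumes the `mtcars20` source tree from this vignette; the wrapper name `build_isolated()` is hypothetical and only `package_build()` and its `deps` argument come from DataPackageR.

```r
# Hypothetical wrapper: rebuild a data package without passing data objects
# from earlier processing scripts to later ones.
build_isolated <- function(pkg_dir) {
  # Fail early if the package source directory does not exist.
  stopifnot(dir.exists(pkg_dir))
  DataPackageR::package_build(pkg_dir, deps = FALSE)
}

# Usage with the package built in this vignette:
# build_isolated(file.path(tempdir(), "mtcars20"))
```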

+ + +
+

Next steps

+

We recommend the following once your package is created.

+
+

Place your package under source control

+

You now have a data package source tree.

+ +

This will let you version control your data processing code, and provide a mechanism for sharing your package with others.

+

For more details on using git and github with R, there is an excellent guide provided by Jenny Bryan: Happy Git and GitHub for the useR and Hadley Wickham’s book on R packages.

@@ -640,7 +666,7 @@

DESCRIPTION

Type: Package Title: What the package does (short line) Version: 1.0 -Date: 2018-07-05 +Date: 2018-07-09 Author: Who wrote it Maintainer: Who to complain to <yourfault@somewhere.net> Description: More about what it does (maybe more than one line) @@ -653,10 +679,6 @@

DESCRIPTION

RoxygenNote: 6.0.1
-
-

Why not use R CMD build?

-

If the processing script is time consuming or the data set is particularly large, then R CMD build would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. In such cases, DataPackageR provides a mechanism to decouple data processing from package building/installation for downstream users of the data.

-
diff --git a/vignettes/usingDataPackageR.md b/vignettes/usingDataPackageR.md index 96f09d4..5add8a9 100644 --- a/vignettes/usingDataPackageR.md +++ b/vignettes/usingDataPackageR.md @@ -1,7 +1,7 @@ --- title: "Using DataPackageR" author: "Greg Finak " -date: "2018-07-05" +date: "2018-07-09" output: rmarkdown::html_vignette: keep_md: TRUE @@ -17,7 +17,15 @@ vignette: > ## Purpose -This vignette demonstrates how to use DataPackageR to build a datapackage from the `mtcars` data set. +This vignette demonstrates how to use DataPackageR to build a data package. + +DataPackageR aims to simplify data package construction. + +It provides mechanisms for reproducibly preprocessing and tidying raw data into documented, versioned, and packaged analysis-ready data sets. + +Long-running or computationally intensive data processing can be decoupled from the usual `R CMD build` process while maintaining [data lineage](https://en.wikipedia.org/wiki/Data_lineage). + +In this vignette we will subset and package the `mtcars` data set. ## Set up a new data package. @@ -62,7 +70,7 @@ Creating Read-and-delete-me ... Saving functions and data ... Making help files ... Done. -Further steps are described in '/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/mtcars20/Read-and-delete-me'. +Further steps are described in '/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/mtcars20/Read-and-delete-me'. Adding DataVersion string to DESCRIPTION Creating data and data-raw directories configuring yaml file @@ -70,7 +78,7 @@ configuring yaml file ### What's in the package skeleton structure? -This has created a datapackage source tree named "mtcars2" (in a temporary directory). +This has created a datapackage source tree named "mtcars20" (in a temporary directory). For a real use case you would pick a `path` on your filesystem where you could then initialize a new github repository for the package. 
The contents of `mtcars20` are: @@ -111,7 +119,7 @@ configuration: enabled: yes objects: cars_over_20 render_root: - tmp: '95288' + tmp: '694862' ``` The two main pieces of information in the configuration are a list of the files to be processed and the data sets the package will store. @@ -124,7 +132,7 @@ The objects must be listed in the yaml configuration file. `datapackage_skeleton DataPackageR provides an API for modifying this file, so it does not need to be done by hand. -Further information on the contents of the YAML configuration file, and the API are in the [YAML Configuration Details](https://github.com/RGLab/DataPackageR/blob/master/YAML_CONFIG.md) +Further information on the contents of the YAML configuration file, and the API are in the [YAML Configuration Details](YAML_CONFIG.html) ### Where do I put raw data? @@ -136,9 +144,9 @@ In this example we are reading from `data(mtcars)` rather than from the file sys To locate the data to read from the filesystem: -- `DataPackageR::project_extdata_path()` to get the path to `inst/extdata` from inside an `Rmd` or `R` file. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/mtcars20/inst/extdata) +- `DataPackageR::project_extdata_path()` to get the path to `inst/extdata` from inside an `Rmd` or `R` file. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/mtcars20/inst/extdata) -- `DataPackageR::project_path()` to get the path to the datapackage root. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/mtcars20) +- `DataPackageR::project_path()` to get the path to the datapackage root. (e.g., /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/mtcars20) Raw data stored externally can be retrieved relative to these paths. @@ -152,38 +160,43 @@ Once the skeleton framework is set up, # Run the preprocessing code to build cars_over_20 # and reproducibly enclose it in a package. 
DataPackageR:::package_build(file.path(tempdir(),"mtcars20")) -INFO [2018-07-05 11:41:30] Logging to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/processing.log -INFO [2018-07-05 11:41:30] Processing data -INFO [2018-07-05 11:41:30] Reading yaml configuration -INFO [2018-07-05 11:41:30] Found /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/data-raw/subsetCars.Rmd -INFO [2018-07-05 11:41:30] Processing 1 of 1: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/data-raw/subsetCars.Rmd +INFO [2018-07-09 09:24:58] Logging to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/processing.log +INFO [2018-07-09 09:24:58] Processing data +INFO [2018-07-09 09:24:58] Reading yaml configuration +INFO [2018-07-09 09:24:58] Found /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/data-raw/subsetCars.Rmd +INFO [2018-07-09 09:24:58] Processing 1 of 1: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/data-raw/subsetCars.Rmd processing file: subsetCars.Rmd output file: subsetCars.knit.md -/usr/local/bin/pandoc +RTS -K512m -RTS subsetCars.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/subsetCars.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//Rtmp3EWJ9k/rmarkdown-str1c6861088f2d.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' +/usr/local/bin/pandoc +RTS -K512m -RTS subsetCars.utf8.md 
--to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/subsetCars.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T//RtmpXVZFpW/rmarkdown-straf102acc05f9.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' -Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/inst/extdata/Logfiles/subsetCars.html -INFO [2018-07-05 11:41:30] 1 required data objects created by subsetCars.Rmd -INFO [2018-07-05 11:41:30] Saving to data -INFO [2018-07-05 11:41:30] Copied documentation to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/R/mtcars20.R +Output created: /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/inst/extdata/Logfiles/subsetCars.html +INFO [2018-07-09 09:24:59] 1 required data objects created by subsetCars.Rmd +INFO [2018-07-09 09:24:59] Saving to data +INFO [2018-07-09 09:24:59] Copied documentation to /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/R/mtcars20.R ✔ Creating 'vignettes/' ✔ Creating 'inst/doc/' -INFO [2018-07-05 11:41:30] Done -INFO [2018-07-05 11:41:30] DataPackageR succeeded -INFO [2018-07-05 11:41:30] Building documentation +INFO [2018-07-09 09:24:59] Done +INFO [2018-07-09 09:24:59] DataPackageR succeeded +INFO [2018-07-09 09:24:59] Building documentation First time using roxygen2. Upgrading automatically... 
-Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20/DESCRIPTION +Updating roxygen version in /private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20/DESCRIPTION Writing NAMESPACE Writing mtcars20.Rd Writing cars_over_20.Rd -INFO [2018-07-05 11:41:30] Building package +INFO [2018-07-09 09:24:59] Building package '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \ --no-environ --no-save --no-restore --quiet CMD build \ - '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20' \ + '/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20' \ --no-resave-data --no-manual --no-build-vignettes -[1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/Rtmp3EWJ9k/mtcars20_1.0.tar.gz" +[1] "/private/var/folders/jh/x0h3v3pd4dd497g3gtzsm8500000gn/T/RtmpXVZFpW/mtcars20_1.0.tar.gz" ``` + +### Why not just use R CMD build? + +If the processing script is time consuming or the data set is particularly large, then `R CMD build` would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. DataPackageR decouples data processing from package building/installation for data consumers. + ### A log of the build process DataPackageR uses the `futile.logger` package to log progress. @@ -306,16 +319,12 @@ assert_data_version(data_package_name = "mtcars20", #and provides an informative error. ``` -# Next steps - -You should place the data package source directory under `git` version control. -This allows you to version control your data processing code. # Partial builds and migrating old data packages. Version 1.12.0 has moved away from controlling the build process using `datasets.R` and an additional `masterfile` argument. 
-The build process is now controlled via a `datapackager.yml` configuration file located in the package root directory. (see [YAML Configuration Details](https://github.com/RGLab/DataPackageR/blob/master/YAML_CONFIG.md)) +The build process is now controlled via a `datapackager.yml` configuration file located in the package root directory. (see [YAML Configuration Details](YAML_CONFIG.html)) You can migrate an old package by constructing such a config file using the `construct_yml_config()` API. @@ -335,7 +344,7 @@ configuration: - object1 - object2 render_root: - tmp: '288022' + tmp: '141841' ``` `config` is a newly constructed yaml configuration object. It can be written to the package directory: @@ -370,7 +379,7 @@ configuration: - object1 - object2 render_root: - tmp: '288022' + tmp: '141841' ``` Note that the modified configuration needs to be written back to the package source directory in order for the @@ -401,6 +410,29 @@ A script (e.g., `script2.Rmd`) running after `script1.Rmd` can access a stored d `DataPackageR::datapackager_object_read("script1_dataset")`. +Passing of data objects amongst scripts can be turned off via: + +`package_build(deps = FALSE)` + +# Next steps + +We recommend the following once your package is created. + +## Place your package under source control + +You now have a data package source tree. + +- **Place your package under version control** + 1. Call `git init` in the package source root to initialize a new git repository. + 2. [Create a new repository for your data package on github](https://help.github.com/articles/create-a-repo/). + 3. Push your local package repository to `github`. [see step 7](https://help.github.com/articles/adding-an-existing-project-to-github-using-the-command-line/) + + +This will let you version control your data processing code, and provide a mechanism for sharing your package with others. 
+ + +For more details on using git and github with R, there is an excellent guide provided by Jenny Bryan: [Happy Git and GitHub for the useR](http://happygitwithr.com/) and Hadley Wickham's [book on R packages](http://r-pkgs.had.co.nz/). + # Additional Details We provide some additional details for the interested. @@ -436,7 +468,7 @@ Package: mtcars20 Type: Package Title: What the package does (short line) Version: 1.0 -Date: 2018-07-05 +Date: 2018-07-09 Author: Who wrote it Maintainer: Who to complain to Description: More about what it does (maybe more than one line) @@ -449,10 +481,6 @@ VignetteBuilder: knitr RoxygenNote: 6.0.1 ``` -## Why not use R CMD build? - -If the processing script is time consuming or the data set is particularly large, then `R CMD build` would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. In such cases, DataPackageR provides a mechanism to decouple data processing from package building/installation for downstream users of the data. -