R in Jupyter with Binder

An example of how to use R in Jupyter notebooks and then make a Binder environment to run them interactively on the web. This repo was inspired by a Tweet in a discussion about Episode 7 of The Bayes Factor podcast.

Disclaimer: I am a physicist and primarily a Python and C++ programmer, and I don't use/know R. This repo is just what I know from being able to read code and from understanding how Jupyter works.

(Badges: License, DOI, Build Status, Binder)

Check it out first

Before learning how to set up R in Jupyter, first go check out how cool it is in Binder! Just click the "launch Binder" badge above.

Table of contents

  • Requirements
  • Setup and Installation
  • Using papermill with Jupyter
  • Testing Jupyter notebooks with pytest
  • Automating testing alongside development with CI
  • Setting up a Binder environment
  • Preservation and DOI with Zenodo
  • R Markdown in Jupyter with jupytext
  • Further Reading and Resources
  • Acknowledgements

Requirements

Before you can begin you need to make sure that you have the following in your working environment:

Setup and Installation

Enable package dependency management with packrat

The first step in any project should be making sure that the library dependencies are specified and reproducible. This can be done in R with the packrat library.

First install packrat

R -e 'install.packages("packrat")'

Then initialize packrat from your project directory

R -e 'packrat::init()'

which will determine the R libraries used in your project and then build the list of dependencies for you. This effectively creates an isolated computing environment (known as a "virtual environment" in other programming languages).

Running packrat::init() results in the directory packrat being created with the files init.R, packrat.lock, and packrat.opts inside of it. It will also create or edit a project .Rprofile and edit any existing .gitignore file. The following files should be kept under version control for your project to be reproducible anywhere:

  • .Rprofile
  • packrat/init.R
  • packrat/packrat.lock
  • packrat/packrat.opts

As you work on your project and use more libraries, you can update the dependencies managed by packrat with (inside the R environment)

packrat::snapshot()

which updates packrat/packrat.lock and packrat/packrat.opts. You can also check if you have introduced new dependencies with

packrat::status()

If you stop using libraries that were managed by packrat you can check this with

packrat::status()

and then remove them from packrat with

packrat::clean()

to ensure that only the minimal environment necessary for reproducibility is kept. Checking the status should now show

packrat::status()
#> Up to date.

If you have a packrat.lock file describing an environment that doesn't yet exist on your machine, you can build that environment by running (from the command line)

R -e 'packrat::restore()'

This is one way in which you could set up the same packrat environment on a different machine from the Git repository.

Using papermill with Jupyter

Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.

This means that you can use papermill to externally run, manipulate, and test Jupyter notebooks. This allows you to use Jupyter notebooks as components of an automated data analysis pipeline or for procedurally testing variations.

A toy example of how to use papermill is demonstrated in the example Jupyter notebook.
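As a sketch of what that looks like from Python (the parameter name alpha is hypothetical and stands in for whatever your notebook's "parameters" cell defines; "ir" is the name the IRkernel registers with Jupyter):

import papermill as pm

# Execute the example notebook under the R kernel and write the
# executed copy to a new file, injecting a parameter along the way.
# "alpha" is a hypothetical parameter name for illustration only.
pm.execute_notebook(
    "R-in-Jupyter-Example.ipynb",
    "R-in-Jupyter-Example-output.ipynb",
    kernel_name="ir",
    parameters={"alpha": 0.5},
)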

Testing Jupyter notebooks with pytest

To provide testing for Jupyter notebooks we can use pytest in combination with papermill.

Once you have installed pytest and done some minimal reading of the docs, create a tests directory and write your test files in Python inside of it.

An example of some very simple tests using papermill is provided in tests/test_notebooks.py. Once you read through and understand what the testing file is doing, execute the tests with pytest in the top-level directory of the repo by running

pytest

To see the output that the individual testing functions would normally print to stdout, run with the -s flag

pytest -s
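For orientation, here is a minimal sketch of such a test (a simplified stand-in, not the actual contents of tests/test_notebooks.py):

import papermill as pm

def test_notebook_executes(tmp_path):
    # papermill raises an exception if any cell in the notebook errors,
    # which in turn fails this test; tmp_path is a built-in pytest fixture
    pm.execute_notebook(
        "R-in-Jupyter-Example.ipynb",
        str(tmp_path / "output.ipynb"),
        kernel_name="ir",
    )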

Why write tests?

There are numerous reasons to test your code, but as a scientist an obvious one is ensuring the reproducibility of experimental results. If your analysis code has unit tests and the analysis itself exists in an automatically testable pipeline, then you and your colleagues should have more confidence in your analysis. Additionally, your analysis becomes (by necessity) a well-documented and transparent process.

Want to learn more? Check out the Test and Code podcast hosted by Brian Okken.

Why test with pytest?

pytest is the most comprehensive and scalable testing framework that I know of. I am biased, but I continue to be impressed with how nimble, powerful, and easy it is to work with. It makes me want to write tests. For the purposes of this demo repository it is also important as it allows for writing tests that use papermill (papermill's execute_notebook is only accessible through the Python API).

There are testing frameworks in R, most notably testthat, which I assume are good, so I would encourage you to explore those as well.

Automating testing alongside development with CI

Assuming that you're using Git to develop your analysis code then you can have a continuous integration service (such as Travis CI or CircleCI) automatically test your code in a fresh environment every time you push to GitHub. Testing with CI is a great way to know that your analysis code is working exactly as expected in a reproducible environment from installation all the way through execution as you develop, revise, and improve it. To see the output of the build/install and testing of this repo in Travis, click on the build status badge at the top of the README (also here: Build Status).

To start off with, I would recommend using Travis CI (it is the easiest to get up and running).
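For orientation, here is a minimal .travis.yml sketch for a setup like this one (the actual file in this repository may differ, and the apt and install steps below are assumptions):

language: python
python:
  - "3.6"
before_install:
  # assumes a Travis image where R can be installed via apt
  - sudo apt-get update && sudo apt-get install -y r-base
install:
  - pip install jupyter papermill pytest
  # register the R kernel with Jupyter so notebooks can run under R
  - R -e 'install.packages("IRkernel", repos = "https://cloud.r-project.org"); IRkernel::installspec()'
script:
  - pytest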

Access restrictions and hosting

There may be instances where you want to have your Git repository be private until work is complete or other information is made publicly available, and you still want to be able to use CI services.

Travis CI (currently) only works with GitHub and is free only for public repositories. CircleCI works with any Git web hosting service (e.g., GitHub, GitLab, Bitbucket) and allows for free use with public and private repositories up to a monthly use time budget. Additionally, GitLab offers their own CI service that is integrated into the GitLab platform. If your organization self-hosts an instance of GitLab (GitLab is open core) then you can use those CI tools with your private GitLab hosted repositories. If your organization has access to the enterprise version of GitLab then you can even run GitLab CI on GitHub hosted repositories.

Setting up a Binder environment

Binder turns your GitHub repository into a fully interactive computational environment (as you hopefully have already seen from the demo notebook). It then allows people to run any code that exists in the repository from their web browser without having to install anything, and is a great tool for collaboration and sharing results.

The Binder team has done amazing work to make "Binderizing" a GitHub repository as simple as possible. In the case of getting an R computing environment, often all that you need (in addition to a DESCRIPTION file and maybe an install.R) is a runtime.txt file that dictates which daily snapshot of MRAN to use. See the binder directory for an example of what is needed to get this repository to run in Binder.
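For example, a binder/runtime.txt consisting of the single line below (the date here is an arbitrary illustration) tells Binder to build an R environment with packages installed from that day's MRAN snapshot:

r-2019-01-15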

You'll note that the "launch Binder" badge at the top of the README automatically launches into the R-in-Jupyter-Example.ipynb notebook. This was configured to do so, but the default Binder behavior is to launch the Jupyter server and then show the directory structure of the repository.

To see that behavior, launch Binder from here: Binder

Once the server loads click on any file to open it in an editor or as a Jupyter notebook.

Preservation and DOI with Zenodo

To further make your analysis code more robust you can preserve it and make it citable by getting a DOI for the project repository with Zenodo. Activating version tracking on your GitHub repository with Zenodo will allow it to automatically freeze a version of the repository with each new version tag and then archive it. Additionally, Zenodo will create a DOI for your project and versioned DOIs for the project releases which can be added as a DOI badge. This makes it trivial for others to cite your work and allows you to indicate what version of your code was used in any publications.

R Markdown in Jupyter with jupytext

R Markdown is a very popular way to present beautifully rendered R alongside Markdown in different forms of documents. However, it is source only and not dynamically interactive, as the R and Markdown need to be rendered together with Pandoc (Pandoc is awesome).

jupytext is a utility to open and run R Markdown notebooks in Jupyter and save Jupyter notebooks as R Markdown.

Once you have installed jupytext, create a Jupyter config with

jupyter notebook --generate-config

which creates the config file at

.jupyter/jupyter_notebook_config.py

Add the following line to the Jupyter config

c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"

If you now launch a Jupyter notebook server and open a .Rmd file, the R Markdown should be rendered in the interactive environment of Jupyter!
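jupytext also has a command line interface for one-off conversions. For example (these invocations assume jupytext's documented --to formats):

jupytext --to notebook Example_Rmd.Rmd

jupytext --to rmarkdown R-in-Jupyter-Example.ipynb

the first of which creates Example_Rmd.ipynb and the second R-in-Jupyter-Example.Rmd.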

R Markdown in Jupyter in Binder

To get R Markdown working in Binder simply create a requirements.txt file in the binder directory and add jupytext to it (a one-line example is shown below). Binder should take care of the rest!

  • Here's a minimal example using the Example_Rmd.Rmd file from this repository: Binder
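For a repository like this one, that requirements.txt can be a single line:

jupytext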

Further Reading and Resources

Acknowledgements