br3ndonland/R-proteomics-Nrf1
Molecular biology experiments, mass spectrometry-based proteomics, and reproducible data analysis in R

Brendon Smith

br3ndonland

Launch in Google Colaboratory

Launch in Binder container

License

Provided on GitHub with a CC-BY-4.0 license, which is commonly used for open-access scientific publications. I encourage you to use the materials in this repository for your own work. If you use this material, please attribute me and explain what you changed.

Table of Contents

  • Reproducibility
  • Scientific background
  • Supplementary data
  • Data analysis
  • Results

Reproducibility

Comments

Science is an incredible tool for learning about the world. We use theory and experiment to generate new knowledge. In science, reproducibility occurs when different scientists do the same experiment and get results that agree. Our current scientific practices do not promote or reward reproducibility. As a result, the scientific community is experiencing a reproducibility crisis, in which the discoveries we publish can't be reproduced by other labs, or, in some instances, even repeated within the same lab by the same person. This is not authentic knowledge.

The reproducibility crisis is troubling. During my postdoc in a large molecular biology lab, I saw the reproducibility crisis unfold, both in the scientific literature and among my colleagues. Even more striking than the crisis itself was the lack of an insightful solution.

Documentation is the sine qua non of reproducibility. How can we hope to reproduce experiments if we don't know how they were done? Documentation must start at the beginning, with reproducible data analysis being preceded by reproducible experimental practices. No statistical adjustment can make up for lack of detailed metadata collected at the time the experiment is performed. Clear, annotated raw data should be provided, and data analyses should clearly describe each action taken from raw data to final analysis.

This repository is a practical example of reproducible scientific data analysis. I have attempted to provide, to the greatest extent possible, complete documentation of the methods that led to the results presented. It's not perfect. The information is not complete, as I worked with others who don't care about documenting their work. Some aspects of the experiment didn't work well, but that's part of the point. Experiments don't usually turn out exactly according to plan. By carefully documenting the experiment and sharing the results openly, I can understand what went wrong and how to move forward in the most efficient way. That's how science should be.

It might seem strange to my scientific colleagues, who are mostly focused on career advancement and personal aggrandizement, that I would take so much time to analyze preliminary data from a pilot study like this. It's not just about the end result. If we want to address the reproducibility crisis, we need to focus on the process.

Resources

Tools

  • Binder turns GitHub repositories into reproducible computing environments. It uses code and dependency files to create Docker images that run in web browsers. Binder is a potentially great tool, but in my experience so far it has been extremely slow and has not properly loaded additional R packages.
  • Gigantum: Research project management and collaboration system. It version-controls your research materials, allows them to be easily shared and published, and bundles everything to run reproducibly in the cloud.
  • Greene Integrative Genomics Laboratory at Penn
  • Hypothesis: Open annotations on the web.
  • Project Jupyter
  • Open Science Framework: Research project management and collaboration system. Integrates many other software tools and forms of data.
  • Protocols.io: Open access repository for creation and sharing of scientific protocols.
  • ScienceFair: Decentralized p2p science literature client. See the eLife Labs blog post about ScienceFair. So far, it can only access eLife articles, and even that doesn't really work.
  • sciNote: Free electronic lab notebook.
  • Stencila: Open document suite that can be used to write and run code in a computationally reproducible way. I recently attended an eLife webinar about Stencila. eLife is considering Stencila as part of a "Reproducible Document Stack" to generate their manuscripts.
  • We-Sci: Tool to ensure proper attribution for scientific work.
  • Whole Tale: Research project management system.
  • Zenodo: Repository for digital materials to be permanently archived and stored with DOI versioning. Figshare is similar.

Workshops

Practices

To promote reproducible scientific work:

  • Comprehensively document experiments and analyses.
  • Format code files as computational narratives mixing prose and code with a tool like Jupyter Notebook.
  • Version control code with Git and share code on a website like GitHub.
  • Create a reproducible cloud computing environment using a tool like Binder.

(Back to top)

Scientific background

This is a summary report of an experiment I performed during my postdoc. The goal of this experiment was to identify a molecular complex associated with Nrf1, a protein our research group was studying. Nrf1 is also abbreviated NFE2L1, and should not be confused with Nuclear Respiratory Factor 1.

We began studying Nrf1 because it resides in a cellular organelle called the endoplasmic reticulum (ER). We study the ER and its roles in metabolism. We found that Nrf1 mediates the cellular response to cholesterol, and that it appeared to do this independently of its known function as a transcription factor in the nucleus. Cholesterol metabolism occurs at the ER and is especially important in the liver, where cholesterol is metabolized and prepared for excretion.

We hypothesize that a group of other proteins interacts with Nrf1 to mediate its response to cholesterol at the ER. We tested this hypothesis with proteomics, which identifies the proteins present in a sample using a technique called mass spectrometry.

I incorporated practices for reproducible scientific experimentation and data analysis throughout the project.

(Back to top)

Supplementary data

Supplementary data files, including the electronic lab notebook, protocols, datasheets and information on materials used, raw data, other data analyses, slides, and images, are available in the data-supplementary sub-directory of this repository.

Git-LFS was used to manage supplementary files.

# Install and initialize Git LFS (macOS, via Homebrew)
❯ brew install git-lfs
❯ git lfs install
❯ cd path/to/repo
# Track binary file types; the patterns are written to .gitattributes
❯ touch .gitattributes
❯ git lfs track "*.docx" "*.pdf" "*.pptx" "*.xlsx" "*.zip"
Tracking "*.docx"
Tracking "*.pdf"
Tracking "*.pptx"
Tracking "*.xlsx"
Tracking "*.zip"
# Commit the tracking configuration, then the files themselves
❯ git add .gitattributes
❯ git commit -m "Initialize Git LFS"
❯ git add --all
❯ git commit
# Push the commits and the LFS-tracked objects
❯ git push origin master
❯ git lfs push origin master

Data analysis

Notebook formats

  • Data analysis was performed with the R computing language, and is provided in R Markdown and Jupyter Notebook formats.
  • Jupyter Notebook combines prose and code to promote construction of reproducible computational narratives that configure the computing environment and precisely describe each step in the data analysis. When code from a reproducible computational narrative is run on another computer, there is a high probability that the same result will be obtained. Reproducibility.

R Markdown

If you haven't used R or R Markdown before, see my R guide.

  • R Markdown is a document creation package based on Markdown (a lightweight syntax that converts easily to formatted HTML), knitr (a report generation package), and pandoc (a universal document converter).
  • An R Markdown file contains three types of content: a YAML front-matter header at the top of the file that specifies output formats, Markdown-formatted text, and functional code chunks.
  • I use RStudio to work with R Markdown.
  • I created an RStudio project, which is required for version control and package management.
  • The R Markdown file uses the GitHub document output format, which generates standard Markdown alongside HTML for compatibility with GitHub.
  • renv was used to manage R packages for the project.
    • renv helps avoid problems caused by different package versions and installations by giving each project its own isolated package library.
    • renv is separate from general package managers like Homebrew used to install R and RStudio.
    • This project previously used Packrat, a predecessor to renv that is now "soft-deprecated."
      • Migration from Packrat to renv is simple. Run the following command in the console: renv::migrate("~/path/to/repo").
      • Then, if packages aren't already installed, run renv::restore(), or in the Packages pane, navigate to renv -> Restore Library. A brief sketch of the renv workflow follows this list.
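As a rough sketch of the renv workflow described above, the following commands could be run from the R console in the project directory. The path is a placeholder, and the exact sequence depends on whether the project already contains an renv.lock lockfile.

    # Sketch of the renv workflow, run from the R console in the project directory.
    # The path below is a placeholder.
    renv::migrate("~/path/to/repo")  # one-time migration from Packrat to renv
    renv::restore()                  # install the packages recorded in renv.lock
    renv::snapshot()                 # record newly installed or updated packages in renv.lock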

Virtual environments for local Jupyter notebooks

JupyterLab can be used to run Jupyter Notebook files.

If running the Jupyter notebook file locally, I would suggest using JupyterLab within a virtual environment. Here are some setup instructions:

  • I install Python on macOS via Homebrew, and then install JupyterLab inside a virtual environment. Once installation is complete, navigate to your project's directory, install dependencies, and run JupyterLab.

  • Here are the necessary commands:

    # Install Python 3 with Homebrew (macOS)
    ❯ brew install python3
    # Optional: Homebrew can also install Jupyter system-wide, but the
    # virtual environment below provides its own JupyterLab
    ❯ brew install jupyter
    # Create and activate a virtual environment in the project directory
    ❯ cd path/where/you/want/jupyterlab
    ❯ python3 -m venv .venv
    ❯ source .venv/bin/activate
    # Install JupyterLab inside the virtual environment
    .venv ❯ pip install jupyterlab
    # Install any JupyterLab extensions at this point
    .venv ❯ jupyter labextension install @jupyterlab/toc
    # Run JupyterLab
    .venv ❯ jupyter lab
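The analysis notebook itself is written in R, so a local JupyterLab also needs an R kernel to run it. This is a minimal sketch, assuming the IRkernel package is used to register R with Jupyter; run it from an R console while the virtual environment's Jupyter is on the PATH.

    # Register an R kernel with Jupyter so the R notebook can run in JupyterLab.
    install.packages("IRkernel")  # install IRkernel from CRAN
    IRkernel::installspec()       # register the kernel for the current user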

Cloud Jupyter Notebooks

  • Binder can run Jupyter Notebooks in the cloud by creating Docker containers. Building a container takes a long time, and although Binder works well with Python, I found that it did not properly load additional packages when running R (see the configuration sketch after this list).
  • Google provides a cloud-based Jupyter Notebook environment called Colaboratory. It originally supported only Python, but can now run R. Unfortunately, it requires a Google login to run code.
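To show what the Binder R configuration involves, here is a rough sketch based on my understanding of Binder's build tool (repo2docker): a runtime.txt file in the repository pins the R version and package snapshot date, and an install.R script installs R packages while the image is built. The package names below are placeholders, not the exact dependencies of this project's notebook.

    # install.R — Binder (via repo2docker) runs this script at image build time,
    # so the packages listed here are installed into the container.
    # Package names are placeholders for whatever the notebook actually needs.
    install.packages(c("tidyverse", "ggrepel"))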
Launch in Google Colaboratory

Launch in Binder container

Results

Volcano plot from R analysis comparing proteins in cholesterol-fed vs chow-fed liver

Complement C1q proteins A, B, C (green dots in the plot above) were identified as potentially interacting with Nrf1 in the setting of liver cholesterol accumulation. The experiment did have notable limitations, which prompted us to refine our methods and continue with further experiments.
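For readers unfamiliar with volcano plots, here is a minimal ggplot2 sketch using randomly generated dummy data. It is not the analysis or the dataset behind the plot above; it only illustrates the general form of the figure, with effect size on the x-axis and statistical significance on the y-axis.

    # Volcano plot sketch with ggplot2, using dummy data (not this experiment's data).
    library(ggplot2)

    set.seed(1)
    dummy <- data.frame(
      log2_fc = rnorm(100),  # dummy log2 fold changes (cholesterol-fed vs. chow-fed)
      p_value = runif(100)   # dummy p-values
    )

    ggplot(dummy, aes(x = log2_fc, y = -log10(p_value))) +
      geom_point() +
      labs(x = "log2 fold change", y = "-log10(p-value)")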

(Back to top)