repo2docker study of NIPS 2017 papers

Presented at Learning to Be Reproducible ICML 2018

Data as part of Reproducible Research Environments with repo2docker

We collect data from the NIPS 2017 schedule to demonstrate the relationship between the presence of configuration files used by repo2docker and GitHub engagement. These results are reported in Section 4 of the paper.
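As a rough illustration of what "presence of configuration files" means here, the sketch below checks a list of repository paths for filenames that repo2docker recognizes. It is not the code used in the study: the function name `find_r2d_config_files` is hypothetical, the filename set is an illustrative subset, and the lookup rule (repo root or a top-level `binder/` directory) is a simplification of repo2docker's actual behavior.

```python
# Illustrative subset of filenames that repo2docker treats as configuration.
# Consult the repo2docker documentation for the authoritative list.
R2D_CONFIG_FILES = {
    "environment.yml",
    "requirements.txt",
    "Dockerfile",
    "postBuild",
    "apt.txt",
    "runtime.txt",
    "install.R",
    "REQUIRE",
    "setup.py",
}


def find_r2d_config_files(repo_files):
    """Return the sorted config filenames found in the root or in binder/.

    `repo_files` is an iterable of paths relative to the repo root,
    using "/" as the separator (hypothetical input format).
    """
    found = set()
    for path in repo_files:
        parts = path.split("/")
        in_root = len(parts) == 1
        in_binder_dir = len(parts) == 2 and parts[0] == "binder"
        if (in_root or in_binder_dir) and parts[-1] in R2D_CONFIG_FILES:
            found.add(parts[-1])
    return sorted(found)


if __name__ == "__main__":
    files = ["README.md", "environment.yml", "src/requirements.txt", "binder/postBuild"]
    print(find_r2d_config_files(files))
```

Files nested deeper in the tree, such as `src/requirements.txt` above, are ignored in this sketch because repo2docker does not look there.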

We use repo2docker to publish a live version of the repo on Binder.

To encourage reproducibility, we include a link to a Binder version of this repo so that our analysis can be run in the browser. Click the button below to launch a live version of the repo on Binder:

Local Installation

To install with conda:

conda env create -f environment.yml

We recommend exploring the repository with JupyterLab.

source activate r2d-study
jupyter lab

If you are running a recent version of Jupyter Notebook or Binder, you can switch to JupyterLab by replacing /tree in your URL with /lab.

File Descriptions

The majority of analysis occurs in get_data.ipynb. Helper functions used in the notebook are in collect_data.py.

Note: because we use GitHub's GraphQL API to collect GitHub metadata, you need to create a personal access token and insert it in the appropriate code block to recollect all the data. For simplicity, the relevant collected files are included as CSVs, as described in the notebook, and the notebook handles exceptions so that data collection can be skipped if necessary.
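For readers unfamiliar with token-authenticated GraphQL calls, here is a minimal sketch of such a request using only the Python standard library. It is not taken from collect_data.py: the function names `build_request` and `fetch_repo_metadata` and the choice of queried fields (stars, watchers, forks) are assumptions made for illustration.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query; the fields the study actually collects may differ.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    stargazers { totalCount }
    watchers { totalCount }
    forkCount
  }
}
"""


def build_request(owner, name, token):
    """Build (but do not send) an authenticated GraphQL POST request."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"owner": owner, "name": name}}
    ).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        headers={
            # The personal access token goes in the Authorization header.
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_repo_metadata(owner, name, token):
    """Send the request and return the `repository` part of the response."""
    with urllib.request.urlopen(build_request(owner, name, token)) as resp:
        return json.load(resp)["data"]["repository"]
```

Reading the token from an environment variable rather than pasting it into a notebook cell helps keep it out of version control.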

NIPS 2017 Datasets

Data scraped from the NIPS 2017 schedule is reformatted as a CSV in all_papers.csv.

The dataset with all GitHub repo metadata is gh_metadata_w_labeled.csv. The _w_labeled suffix indicates that the dataset includes URLs to GitHub research repos that were found through manual inspection.

The dataset with all config file information is r2d_w_labeled.csv. Similarly, some URLs were found through manual inspection.

Interim datasets concatenated to make gh_metadata_w_labeled.csv are called gh_metadata.csv and gh_labeled.csv. Interim datasets concatenated to make r2d_w_labeled.csv are called gh_r2d_data.csv and r2d_labeled.csv. The _labeled suffix signifies that the URL for the repo was found through manual inspection.
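The concatenation described above can be sketched as follows. This is a standard-library illustration and not the notebook's code; the helper names `read_rows` and `concat_rows` are hypothetical, and the strict column check is a design choice of this sketch.

```python
import csv


def read_rows(path):
    """Load a CSV file as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def concat_rows(*tables):
    """Stack tables row-wise, requiring every row to share the same columns."""
    combined = []
    columns = None
    for rows in tables:
        for row in rows:
            if columns is None:
                columns = set(row)
            elif set(row) != columns:
                raise ValueError("tables have different columns")
            combined.append(row)
    return combined


if __name__ == "__main__":
    # File names come from this README; run from the repository root.
    merged = concat_rows(read_rows("gh_metadata.csv"), read_rows("gh_labeled.csv"))
    print(len(merged))
```

Raising on mismatched columns surfaces schema drift between the automatically collected and manually labeled files early, instead of silently producing rows with missing fields.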

Two types of papers required manual inspection: papers that changed their GitHub repo name, and papers that had errors in their URL. Papers that had errors in their URL are listed with their labeled URLs in validate_url_w_labels.csv. Papers that changed their repo name are listed with their new repo in change_reponame_labeled.csv.

Libraries that were part of larger repositories were excluded from our analysis and are listed in larger_libraries.csv. Similarly, repositories that did not include any lines of programming code were excluded; they are listed in no_code.csv.
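A minimal sketch of how such exclusions could be applied, assuming each table is a list of dicts and that repositories are identified by a column named `url`. Both the helper name `exclude_repos` and the `url` column are assumptions; the actual CSV schema may differ.

```python
def exclude_repos(rows, *exclusion_lists, key="url"):
    """Drop rows whose `key` value appears in any of the exclusion lists.

    `key` defaults to "url", a hypothetical column name.
    """
    excluded = {row[key] for table in exclusion_lists for row in table}
    return [row for row in rows if row[key] not in excluded]


if __name__ == "__main__":
    repos = [{"url": "a"}, {"url": "b"}, {"url": "c"}]
    larger_libraries = [{"url": "b"}]  # stand-in for larger_libraries.csv
    no_code = [{"url": "c"}]           # stand-in for no_code.csv
    print(exclude_repos(repos, larger_libraries, no_code))
```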

