Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration
This repository contains all of the source files, data collection scripts, codebook, final datasets, and the data analysis reports for the study titled, "Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration." A free preprint of the paper is available here.
Zamith, R. (2023). Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration. Digital Journalism. Advance Online Publication. https://doi.org/10.1080/21670811.2023.2202873
A BibTeX citation is available here.
- The `supplemental` directory contains two appendices to the study:
  - The list of organizations and accounts studied (`supplemental/list_of_organizations.md`).
  - The codebook used for the manual content analysis (`supplemental/codebook.md`).
- The `data` directory contains all of the processed data used in the study.
  - These files are effectively slices of the GitHub API data, converted to CSV format to make analysis easier.
  - The directory also includes the human-coded data (`data/analysis_data/coded_repos.csv`) and a CSV version of the list of organizations and accounts that were analyzed (`data/list_of_orgs.csv`).
  - The JSON files containing the raw API responses from GitHub (at the time of data collection) are available upon request.
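  For instance, the processed CSV files can be inspected with standard tools. A minimal sketch, assuming pandas is installed and the working directory is the repository root:

  ```python
  # Minimal sketch: inspect the processed (CSV) slices of the GitHub API data.
  import pandas as pd

  coded = pd.read_csv("data/analysis_data/coded_repos.csv")
  orgs = pd.read_csv("data/list_of_orgs.csv")

  print(coded.shape)    # number of human-coded repositories and variables
  print(coded.columns)  # variable names, as defined in the codebook
  print(orgs.head())    # first few organizations in the sampling frame
  ```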
- The `method` directory contains all of the data collection scripts used in the study.
  - Only a partial replication is possible because some repositories or users may have been deleted from GitHub since the study data were collected in August 2022. In other words, the resulting files/results are likely to differ from those reported in the study.
- The `analysis` directory contains the R Notebook files necessary to replicate the analysis with the data provided in this repository.
  - The file `analysis/data_analysis.html` contains a rendered version of the main data analysis notebook. This is the file you're most likely to be interested in if you want to see every step of the analysis.
If you wish to replicate the analysis or conduct a follow-up study, follow the steps below. Note that performing these steps will overwrite the data that came with your download of this repository.
- Download the latest release and unzip the file. Alternatively, you can simply clone this repository if you already have `git` installed.
- Create a CSV file in the `data` directory called `list_of_orgs.csv`. That file should contain the following variables (a sketch for generating the file programmatically follows below):
  - `Organization Name`: The name of the news organization.
  - `Parent Organization`: If you wish to relate one organization to another (e.g., a subsidiary), enter the `Organization Name` of the parent organization.
  - `Has GitHub Account`: Enter "Y" if the organization has a GitHub account.
  - `Accounts`: List all of the accounts associated with that organization, separated by a comma and a space (`, `).
  - You may include additional fields; these should be ignored by the data collection and analysis pipeline by default.
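  If you prefer to generate the file with a script, here is a minimal sketch using Python's standard `csv` module; the organization and account names are placeholders, not study data, and the path assumes you run it from the repository root:

  ```python
  # Minimal sketch: create data/list_of_orgs.csv with the variables described above.
  import csv

  rows = [
      {
          "Organization Name": "Example News Co.",   # placeholder
          "Parent Organization": "",
          "Has GitHub Account": "Y",
          "Accounts": "examplenews, examplenews-labs",  # placeholder accounts
      },
  ]

  with open("data/list_of_orgs.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
      writer.writeheader()
      writer.writerows(rows)
  ```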
- Download Python (version 3.0 or higher) if it is not already installed on your computer. (Python 3.9.13 was used to collect the data.)
- Get a personal access token from GitHub. Then, edit the `config.yml` file in the `method` directory, replacing the `github_token` value with your token.
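  For reference, here is a minimal sketch of how a token stored in `config.yml` can be attached to GitHub API requests. It assumes PyYAML and requests are installed; the actual scripts' internals may differ:

  ```python
  # Minimal sketch: read the GitHub token from config.yml and make an
  # authenticated API call (run from the method/ directory).
  import requests
  import yaml

  with open("config.yml") as f:
      config = yaml.safe_load(f)

  headers = {
      "Authorization": f"token {config['github_token']}",
      "Accept": "application/vnd.github+json",
  }

  # Authenticated requests get a much higher rate limit than anonymous ones.
  resp = requests.get("https://api.github.com/rate_limit", headers=headers)
  print(resp.json()["resources"]["core"])
  ```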
- Open your Terminal (macOS/Linux) or Command Prompt (Windows) and go into the `method` directory (`cd method/`).
- Get the list of GitHub repositories (and general repository data) from the desired organizations by executing `python3 get_repo_info.py` in your terminal. This will create a file called `list_of_repos.csv` in your `data` directory. It will also create a JSON file with the raw API response for each repository queried and place that file in the `data/repo` subdirectory.
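  As a rough illustration of the underlying API call (this is not the script itself), listing an account's repositories means paging through GitHub's `/orgs/{org}/repos` endpoint; `headers` is the authenticated headers dict from the earlier sketch and `"examplenews"` is a placeholder:

  ```python
  # Minimal sketch: page through an organization's public repositories.
  import requests

  def list_org_repos(org, headers):
      repos, page = [], 1
      while True:
          resp = requests.get(
              f"https://api.github.com/orgs/{org}/repos",
              headers=headers,
              params={"per_page": 100, "page": page},  # 100 is the API maximum
          )
          resp.raise_for_status()
          batch = resp.json()
          if not batch:  # an empty page means we have seen every repository
              break
          repos.extend(batch)
          page += 1
      return repos

  repos = list_org_repos("examplenews", headers)
  print(len(repos), "repositories found")
  ```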
- Get detailed repository information by executing `python3 get_repo_details.py`. This will create several new subdirectories in your `data` directory containing the JSON files with raw API responses from different API endpoints for each of the repositories listed in the `list_of_repos.csv` file. (This will take a few days to fully execute due to API limitations and will produce a large number of files.)
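  The multi-day runtime stems from GitHub's API rate limits. A hedged sketch of the kind of throttling logic involved, using GitHub's standard `X-RateLimit-*` response headers (the actual script may handle this differently):

  ```python
  # Minimal sketch: pause when the GitHub API rate limit is nearly exhausted.
  import time
  import requests

  def get_with_rate_limit(url, headers):
      resp = requests.get(url, headers=headers)
      remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
      if remaining == 0:
          reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
          wait = max(reset - time.time(), 0) + 1
          time.sleep(wait)  # sleep until the rate-limit window resets
          resp = requests.get(url, headers=headers)
      return resp
  ```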
- Create consolidated CSV files containing subsets of the JSON files generated in the previous step by executing `python3 create_csv_for_analysis.py`. This will create a new subdirectory of `data` called `analysis_data`, which will contain the CSV files that will be used by the data analysis R Notebooks (see the sketch below for a rough illustration of this consolidation).
  - The JSON files are not required for the subsequent data analysis steps. Only the `data/list_of_orgs.csv` file and the files in the `data/analysis_data/` directory are needed for the subsequent step.
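  As a rough illustration only: the field names below are examples of repository-level variables from the GitHub API, not the actual selection, which is defined in `create_csv_for_analysis.py`. Paths assume the sketch is run from `method/`:

  ```python
  # Minimal sketch: flatten raw JSON API responses into a single CSV.
  import csv
  import json
  from pathlib import Path

  rows = []
  for path in Path("../data/repo").glob("*.json"):
      with open(path, encoding="utf-8") as f:
          repo = json.load(f)
      rows.append({
          "full_name": repo.get("full_name"),
          "created_at": repo.get("created_at"),
          "stargazers_count": repo.get("stargazers_count"),
          "forks_count": repo.get("forks_count"),
      })

  if rows:  # only write a file if at least one JSON response was found
      with open("../data/analysis_data/example_slice.csv", "w",
                newline="", encoding="utf-8") as f:
          writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
          writer.writeheader()
          writer.writerows(rows)
  ```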
- Using RStudio (which depends on R), create a project with a working directory rooted in the `analysis/` directory (`File` --> `New Project` --> `Existing Directory` --> the `analysis` subdirectory of the directory containing the project files). Alternatively, you may open each .Rmd file separately within RStudio.
  - The `analysis/generate_samples.Rmd` file contains the R code used to generate the samples for the intercoder reliability assessment and the eventual human content analysis.
  - The `analysis/intercoder_reliability.Rmd` file contains the R code used to perform the intercoder reliability assessment.
  - The `analysis/data_analysis.Rmd` file contains the R code used to perform all of the data analysis reported in the paper and to generate the associated figure images.
  - To execute the R Notebooks, be sure to install all of the required packages first. You may do so by executing the following code in the RStudio console:

    ```r
    install.packages(c("jsonlite", "tidyverse", "lubridate", "scales", "gridExtra", "irr"))
    ```
All of the data collection scripts, codebook, and data analysis reports are made available through a Creative Commons Attribution 4.0 International license. For more information, see the LICENSE file. Data collected from the GitHub API may be subject to different copyright restrictions.