Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration
This repository contains all of the source files, data collection scripts, codebook, final datasets, and the data analysis reports for the study titled, "Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration." A free preprint of the paper is available here.
Zamith, R. (2023). Open-Source Repositories as Trust-Building Journalism Infrastructure: Examining the Use of GitHub by News Outlets to Promote Transparency, Innovation, and Collaboration. Digital Journalism. Advance Online Publication. https://doi.org/10.1080/21670811.2023.2202873
A BibTeX citation is available here.
- The `supplemental` directory contains two appendices to the study:
  - The list of organizations and accounts studied (`supplemental/list_of_organizations.md`).
  - The codebook used for the manual content analysis (`supplemental/codebook.md`).
- The `data` directory contains all of the processed data used in the study.
  - These files are effectively slices of the GitHub API data, converted to CSV format to make analysis easier.
  - The directory also includes the human-coded data (`data/analysis_data/coded_repos.csv`) and a CSV version of the list of organizations and accounts that were analyzed (`data/list_of_orgs.csv`).
  - The JSON files containing the raw API responses from GitHub (at the time of data collection) are available upon request.
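  For instance, the processed CSV files can be inspected with standard tools. A minimal sketch, assuming pandas is installed and the working directory is the repository root:

  ```python
  # Minimal sketch: inspect the processed (CSV) slices of the GitHub API data.
  import pandas as pd

  coded = pd.read_csv("data/analysis_data/coded_repos.csv")
  orgs = pd.read_csv("data/list_of_orgs.csv")

  print(coded.shape)    # number of human-coded repositories and variables
  print(coded.columns)  # variable names, as defined in the codebook
  print(orgs.head())    # first few organizations in the sampling frame
  ```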
- The `method` directory contains all of the data collection scripts used in the study.
  - Only a partial replication is possible because some repositories or users may have been deleted from GitHub since the study data were collected in August 2022. In other words, the resulting files/results are likely to differ from those reported in the study.
- The `analysis` directory contains the R Notebook files necessary to replicate the analysis with the data provided in this repository.
  - The file `analysis/data_analysis.html` contains a rendered version of the main data analysis notebook. This is the file you're most likely to be interested in if you want to see every step of the analysis.
If you wish to replicate the analysis or conduct a follow-up study, follow the steps below. Note that performing these steps will overwrite the data that came with your download of this repository.
- Download the latest release and unzip the file. Alternatively, you can simply clone this repository if you already have `git` installed.
- Create a CSV file in the `data` directory called `list_of_orgs.csv`. That file should contain the following variables (a sketch for generating the file programmatically follows below):
  - `Organization Name`: The name of the news organization.
  - `Parent Organization`: If you wish to relate one organization to another (e.g., a subsidiary), enter the `Organization Name` of the parent organization.
  - `Has GitHub Account`: Enter "Y" if the organization has a GitHub account.
  - `Accounts`: List all of the accounts associated with that organization, separated by a comma and a space (`, `).
  - You may include additional fields; these should be ignored by the data collection and analysis pipeline by default.
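  If you prefer to generate the file with a script, here is a minimal sketch using Python's standard `csv` module; the organization and account names are placeholders, not study data, and the path assumes you run it from the repository root:

  ```python
  # Minimal sketch: create data/list_of_orgs.csv with the variables described above.
  import csv

  rows = [
      {
          "Organization Name": "Example News Co.",   # placeholder
          "Parent Organization": "",
          "Has GitHub Account": "Y",
          "Accounts": "examplenews, examplenews-labs",  # placeholder accounts
      },
  ]

  with open("data/list_of_orgs.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
      writer.writeheader()
      writer.writerows(rows)
  ```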
- Download Python (version 3.0 or higher) if it is not already installed on your computer. (Python 3.9.13 was used to collect the data.)
- Get a personal access token from GitHub. Then, edit the `config.yml` file in the `method` directory, replacing the `github_token` value with your token.
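  For reference, here is a minimal sketch of how a token stored in `config.yml` can be attached to GitHub API requests. It assumes PyYAML and requests are installed; the actual scripts' internals may differ:

  ```python
  # Minimal sketch: read the GitHub token from config.yml and make an
  # authenticated API call (run from the method/ directory).
  import requests
  import yaml

  with open("config.yml") as f:
      config = yaml.safe_load(f)

  headers = {
      "Authorization": f"token {config['github_token']}",
      "Accept": "application/vnd.github+json",
  }

  # Authenticated requests get a much higher rate limit than anonymous ones.
  resp = requests.get("https://api.github.com/rate_limit", headers=headers)
  print(resp.json()["resources"]["core"])
  ```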
- Open your Terminal (macOS/Linux) or Command Prompt (Windows) and go into the `method` directory (`cd method/`).
- Get the list of GitHub repositories (and general repository data) from the desired organizations by executing `python3 get_repo_info.py` in your terminal. This will create a file called `list_of_repos.csv` in your `data` directory. It will also create a JSON file with the raw API response for each repository queried and place that file in the `data/repo` subdirectory.
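  As a rough illustration of the underlying API call (this is not the script itself), listing an account's repositories means paging through GitHub's `/orgs/{org}/repos` endpoint; `headers` is the authenticated headers dict from the earlier sketch and `"examplenews"` is a placeholder:

  ```python
  # Minimal sketch: page through an organization's public repositories.
  import requests

  def list_org_repos(org, headers):
      repos, page = [], 1
      while True:
          resp = requests.get(
              f"https://api.github.com/orgs/{org}/repos",
              headers=headers,
              params={"per_page": 100, "page": page},  # 100 is the API maximum
          )
          resp.raise_for_status()
          batch = resp.json()
          if not batch:  # an empty page means we have seen every repository
              break
          repos.extend(batch)
          page += 1
      return repos

  repos = list_org_repos("examplenews", headers)
  print(len(repos), "repositories found")
  ```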
- Get detailed repository information by executing `python3 get_repo_details.py`. This will create several new subdirectories in your `data` directory containing the JSON files with raw API responses from different API endpoints for each of the repositories listed in the `list_of_repos.csv` file. (This will take a few days to fully execute due to API limitations and will produce a large number of files.)
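  The multi-day runtime stems from GitHub's API rate limits. A hedged sketch of the kind of throttling logic involved, using GitHub's standard `X-RateLimit-*` response headers (the actual script may handle this differently):

  ```python
  # Minimal sketch: pause when the GitHub API rate limit is nearly exhausted.
  import time
  import requests

  def get_with_rate_limit(url, headers):
      resp = requests.get(url, headers=headers)
      remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
      if remaining == 0:
          reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
          wait = max(reset - time.time(), 0) + 1
          time.sleep(wait)  # sleep until the rate-limit window resets
          resp = requests.get(url, headers=headers)
      return resp
  ```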
- Create consolidated CSV files containing subsets of the JSON files generated in the previous step by executing `python3 create_csv_for_analysis.py`. This will create a new subdirectory of `data` called `analysis_data`, which will contain the CSV files that will be used by the data analysis R Notebooks (see the sketch below for a rough illustration of this consolidation).
  - The JSON files are not required for the subsequent data analysis steps. Only the `data/list_of_orgs.csv` file and the files in the `data/analysis_data/` directory are needed for the subsequent step.
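  As a rough illustration only: the field names below are examples of repository-level variables from the GitHub API, not the actual selection, which is defined in `create_csv_for_analysis.py`. Paths assume the sketch is run from `method/`:

  ```python
  # Minimal sketch: flatten raw JSON API responses into a single CSV.
  import csv
  import json
  from pathlib import Path

  rows = []
  for path in Path("../data/repo").glob("*.json"):
      with open(path, encoding="utf-8") as f:
          repo = json.load(f)
      rows.append({
          "full_name": repo.get("full_name"),
          "created_at": repo.get("created_at"),
          "stargazers_count": repo.get("stargazers_count"),
          "forks_count": repo.get("forks_count"),
      })

  if rows:  # only write a file if at least one JSON response was found
      with open("../data/analysis_data/example_slice.csv", "w",
                newline="", encoding="utf-8") as f:
          writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
          writer.writeheader()
          writer.writerows(rows)
  ```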
- Using RStudio (which depends on R), create a project with a working directory rooted in the `analysis/` directory (`File` --> `New Project` --> `Existing Directory` --> the `analysis` subdirectory of the directory containing the project files). Alternatively, you may open each .Rmd file separately within RStudio.
  - The `analysis/generate_samples.Rmd` file contains the R code used to generate the samples for the intercoder reliability assessment and the eventual human content analysis.
  - The `analysis/intercoder_reliability.Rmd` file contains the R code used to perform the intercoder reliability assessment.
  - The `analysis/data_analysis.Rmd` file contains the R code used to perform all of the data analysis reported in the paper and to generate the associated figure images.
  - To execute the R Notebooks, be sure to install all of the required packages first. You may do so by executing the following code in the RStudio console:

    ```r
    install.packages(c("jsonlite", "tidyverse", "lubridate", "scales", "gridExtra", "irr"))
    ```
All of the data collection scripts, codebook, and data analysis reports are made available through a Creative Commons Attribution 4.0 International license. For more information, see the LICENSE file. Data collected from the GitHub API may be subject to different copyright restrictions.