
Github Ninja

External Links

All generated output files are hosted on Dropbox.

A web interface for viewing the time series graphs of repository networks is available in my Ninja-viewer repository.

Objectives

To produce a tool that queries githubarchive.org and the Github API and generates longitudinal social network data and other time series for specified Github repositories. Queries are executed through a command line interface.

Tools

  1. Ruby 1.9+ - the programming language used to develop the tool
  2. BigQuery - allows us to query all of the information stored by githubarchive.org. The bq command line tool is used for querying.
  3. Github API - allows us to retrieve who made each commit, as extended commit data is not available on BigQuery/githubarchive.org.
  4. igraph - uses the igraph gem, which requires the igraph C library. The newest library version does not work; use 0.5.4 and install the gem with: gem install igraph -- --with-igraph-include=/usr/local/include/igraph --with-igraph-lib=/usr/local/lib
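
After installing, a quick way to confirm the gem compiled against the C library is to load it and list its GraphML-related methods. This is only a smoke test; method names vary between gem versions, so it inspects rather than assumes.

    # Smoke test: confirm the igraph gem loads and exposes GraphML support.
    require 'igraph'

    # Method names differ between gem versions, so list them instead of assuming.
    puts IGraph.instance_methods.map(&:to_s).grep(/graphml/i)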

Output Format

The output needs to be readable by the R packages igraph and RSiena. The most complete format the two share is GraphML, which the igraph ruby gem can write. Snapshots are written one per file; for example, scanning the last 12 months of rubinius produces rubinius_rubinius_0...rubinius_rubinius_12. One file per snapshot is the format recommended for RSiena.
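
A minimal sketch of the one-file-per-snapshot output. It assumes the igraph gem exposes an IGraph.new(edge_array, directed) constructor and a write_graph_graphml writer; the exact method names should be checked against the installed gem version, and the snapshot data below is made up for illustration.

    # Sketch: write each monthly snapshot to its own GraphML file.
    # IGraph.new(edges, directed) and write_graph_graphml are assumptions about
    # the igraph gem's interface; verify against the installed version.
    require 'igraph'

    owner, repo = 'rubinius', 'rubinius'

    # One flat edge array per snapshot (vertex names in pairs), e.g. two months:
    snapshots = [
      ['alice', 'bob', 'bob', 'carol'],   # month 0: alice--bob, bob--carol
      ['alice', 'carol']                  # month 1: alice--carol
    ]

    snapshots.each_with_index do |edges, i|
      graph = IGraph.new(edges, false)    # false => undirected
      File.open("#{owner}_#{repo}_#{i}.graphml", 'w') do |f|
        graph.write_graph_graphml(f)
      end
    end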

Context to Query

The aggregator queries three main contexts:

  1. Commits
  2. Pull Requests
  3. Issues

Edges are formed between developers based on their interactions within these contexts. The following edges are created (summarised in the sketch after this list):

  1. Commenting on a commit - Committer -- Commenter
  2. Commenting on a pull request - Pull Submitter -- Commenter
  3. Closing a pull request - Pull Submitter -- Closer
  4. Closing an issue - Issue Submitter -- Closer
  5. Commenting on an issue - Issue Submitter -- Commenter
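
These rules can be summarised as a lookup from event type to the pair of roles an edge connects. The role names below are purely descriptive and are not identifiers used by the tool.

    # Illustrative only: which roles get connected for each event type.
    EDGE_RULES = {
      'CommitCommentEvent'            => [:committer,       :commenter],
      'PullRequestReviewCommentEvent' => [:pull_submitter,  :commenter],
      'PullRequestEvent'              => [:pull_submitter,  :closer],
      'IssueEvent'                    => [:issue_submitter, :closer],
      'IssueCommentEvent'             => [:issue_submitter, :commenter]
    }.freeze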

Data Sources

We will utilize the following data sources:

  1. Github API
  2. Githubarchive.org data on BigQuery

BigQuery is the primary data source, and most data will be pulled from there. The Github API is used to retrieve information on commits, primarily the user who made each commit, since extended commit data is not available on githubarchive.org.

Current Draft

For simplicity, the initial draft will use an undirected graph and treat all edges the same, without differentiating by event type. If there are multiple connections between two nodes, the edge's weight simply increases with each connection.
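
A minimal sketch of that accumulation, using a plain Hash keyed on the sorted pair of developer names so the edge is undirected; the names are made up for illustration.

    # Sketch: undirected, weighted edges as a Hash with a default weight of 0.
    weights = Hash.new(0)

    # Sorting the pair makes alice--bob and bob--alice the same edge.
    interactions = [%w[alice bob], %w[bob alice], %w[alice carol]]
    interactions.each { |a, b| weights[[a, b].sort] += 1 }

    weights.each { |(a, b), w| puts "#{a} -- #{b} (weight #{w})" }
    # alice -- bob (weight 2)
    # alice -- carol (weight 1)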

The initial draft will also only have developers as nodes of the graph. If necessary, this can be changed at a later stage, allowing artifacts such as files, pull requests, issues, and commits to be treated as nodes, at which point edges will also be created for developers submitting any of those artifacts.

Querying

To save time and money on data processed, we first pull the top 100 repositories (the number can vary), then pull only the columns we need for just those repositories and store the result as its own dataset in BigQuery. This dataset is only about 140 MB, which dramatically reduces costs: BigQuery charges $0.12 per GB of storage per month and $0.035 per GB processed by queries. Updating the dataset processes roughly 16 GB of data.
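
A sketch of materializing the reduced dataset by shelling out to the bq tool. The dataset and table names (ninja.top_repos, ninja.top_repo_events) and the query text are illustrative assumptions, not the tool's actual names.

    # Sketch: materialize a reduced working table with the bq CLI.
    # Dataset/table names and the SQL are illustrative, not the tool's own.
    require 'shellwords'

    destination = 'ninja.top_repo_events'   # hypothetical dataset.table

    query = <<-SQL
      SELECT actor, payload_action, type, payload_commit, payload_number,
             url, repository_name, repository_owner
      FROM [githubarchive:github.timeline]
      WHERE repository_name IN (SELECT repository_name FROM [ninja.top_repos])
    SQL

    system("bq query --destination_table=#{destination} " \
           "--allow_large_results #{Shellwords.escape(query)}")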

BigQuery Data

When we query the BigQuery data, we want to limit our requests to specific events, and we only need information on the following fields:

| Field | Description | Used For |
|-------|-------------|----------|
| actor | The user involved in this event | Gives the name of a node |
| payload_action | Specifies what action was performed during this event | Used to identify opened/closed on issues and pull requests |
| type | What the Github event was | Differentiates types of events so we can handle them differently |
| payload_commit | sha of the commit for this comment | Used with the Github API to retrieve the commit for this comment |
| payload_number | The number that identifies this PR or issue | Used to match opened/closed PRs and issues |
| url | The URL for this event | Used to retrieve the payload_number for pull request comments, which don't have it listed |
| repository_name | Name of the repository | Necessary because we retrieve information on repositories one at a time |
| repository_owner | Owner of the repository | Same as above |
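
A sketch of pulling those fields for a single repository from the reduced dataset (the table name carries over from the hypothetical example above) and reading the result as CSV.

    # Sketch: query the reduced table for one repository and parse the CSV.
    # The table name is the hypothetical one from the previous example.
    require 'csv'
    require 'shellwords'

    owner, name = 'rubinius', 'rubinius'

    sql = <<-SQL
      SELECT actor, payload_action, type, payload_commit, payload_number, url
      FROM [ninja.top_repo_events]
      WHERE repository_owner = '#{owner}' AND repository_name = '#{name}'
    SQL

    csv_out = `bq query --format=csv --max_rows=100000 #{Shellwords.escape(sql)}`
    events  = CSV.parse(csv_out, headers: true)
    puts "#{events.size} events for #{owner}/#{name}"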

Processing each event:

Below I will outline the steps necessary for processing each event.

CommitCommentEvent:

Go through all of them and collect the commit shas (payload_commit). Make a unique list of these, then fetch them all using the github_api gem and group them by sha (this prevents us from fetching the same commit multiple times).

Now go through all of them again and, for each one, make an edge between the “actor” and the commit owner (which we retrieved from the API).
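
A sketch of that pass. The repos.commits.get call and its response fields are assumptions about the github_api gem's interface and should be verified against the gem version in use; commit_comments (the BigQuery rows) and add_edge (the graph-building helper) are hypothetical placeholders.

    # Sketch of the CommitCommentEvent pass. commit_comments and add_edge are
    # hypothetical placeholders; repos.commits.get and the response fields are
    # assumptions about the github_api gem's interface.
    require 'github_api'

    github = Github.new
    owner, repo = 'rubinius', 'rubinius'

    # Dedupe shas so each commit is fetched from the API only once.
    shas = commit_comments.map { |e| e['payload_commit'] }.uniq

    committer_by_sha = {}
    shas.each do |sha|
      commit = github.repos.commits.get(user: owner, repo: repo, sha: sha)
      committer_by_sha[sha] = commit.author && commit.author.login
    end

    # Edge: commenter ("actor") -- commit owner.
    commit_comments.each do |e|
      committer = committer_by_sha[e['payload_commit']]
      add_edge(e['actor'], committer) if committer
    end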

IssueEvent & PullRequestEvent:

Group them by opened/closed. For every closed event, generate an edge between its actor and the actor of the ‘opened’ event with the matching payload_number.
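
A sketch of that matching, keyed on payload_number. `events` (the BigQuery rows for one event type) and `add_edge` are hypothetical placeholders.

    # Sketch: match closed issues/PRs back to whoever opened them.
    # `events` and `add_edge` are hypothetical placeholders.
    opened = events.select { |e| e['payload_action'] == 'opened' }
    closed = events.select { |e| e['payload_action'] == 'closed' }

    # payload_number -> actor who opened the issue/PR (reused for comment events).
    opener_by_number = {}
    opened.each { |e| opener_by_number[e['payload_number']] = e['actor'] }

    closed.each do |e|
      opener = opener_by_number[e['payload_number']]
      add_edge(e['actor'], opener) if opener
    end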

IssueCommentEvent:

For each one, generate an edge between the “actor” and the issue/PR owner (a hashmap of owners is built in the step above, using the “actor” of the opened events).

PullRequestReviewCommentEvent:

Pull the PR number out of the URL. Use the information we already collected from PullRequestEvent to find the ‘opened’ event with that PR number, then generate an edge between the two actors.
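
A sketch of extracting the PR number, assuming comment URLs contain a /pull/<number> or /pulls/<number> segment; the exact URL shape in the archive data should be verified.

    # Sketch: pull the PR number out of a review comment URL.
    # Assumes a /pull/<n> or /pulls/<n> segment; verify against the real data.
    def pr_number_from_url(url)
      m = url.match(%r{/pulls?/(\d+)})
      m && m[1].to_i
    end

    pr_number_from_url('https://github.com/rubinius/rubinius/pull/2145#discussion_r99')
    # => 2145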

Plan of priorities

  1. Derive a graph for the rubinius/rubinius repository, going as far back as possible. Only “strength of interactions” needs to be measured, as a summary measure. One snapshot per month. DONE
  2. Be able to find the 100/250/500 largest repositories by forks at any given point in time, and then derive “strength of interactions” graphs for each of those repos at monthly snapshots. DONE
  3. Extract event streams with timestamps for the 100/250/500 repos selected in bullet 2. DONE
  4. Extract monthly time series of forks, total community members, “pull requesters”, and committers for all the repositories selected in bullet 2
  5. Add extra detail to edges - separate edges measuring the strength of commit-, pull-request-, and issue-based interactions - while still retaining a summary measure
  6. Extract directed/non-directed networks
  7. Implement all the flags below
  8. Implement artifacts as nodes (creating artifact-actor networks)

Flags

The scraper needs to be able to take the following flags (a hypothetical parsing sketch follows the list):

  1. Static/dynamic network - on/off switch for whether a single network should be generated, or several snapshots over time
  2. Repositories to query (e.g. rubinius/rubinius)
  3. Time period to query (e.g. 2011-2012)
  4. Time unit for snapshots (days, months, quarters, half-years, or years)
  5. Level of granularity of relationships (either 1 relationship per context, or reduced to “interaction”, i.e. a single relationship for all interactions regardless of context)
  6. Directed or non-directed graph (i.e., either relationships are non-directed, as in “we are working on the same bug”, or they are directed, as in “A commented on B’s commit” or “A merged B’s pull request”)
  7. Strength of relationship on/off switch - either all relationships of the same type are equally valued, or they are evaluated by strength. I.e., if A and B interact frequently, their relationship will be strong (e.g. +1 for each interaction).
  8. Nodes - determines what counts as a node: artifacts, developers, or artifacts/developers
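
None of these flags exist yet (see priority 7 above). The OptionParser sketch below only illustrates one way they could be exposed; all flag names and defaults are hypothetical.

    # Hypothetical flag parsing with Ruby's OptionParser; nothing here is
    # implemented yet, and all flag names and defaults are illustrative.
    require 'optparse'

    options = { dynamic: true, unit: 'months', directed: false,
                weighted: true, granularity: 'interaction', nodes: 'developers' }

    OptionParser.new do |opts|
      opts.banner = 'Usage: ninja [options] owner/repo [owner/repo ...]'
      opts.on('--static')            { options[:dynamic] = false }
      opts.on('--period RANGE')      { |v| options[:period] = v }      # e.g. 2011-2012
      opts.on('--unit UNIT')         { |v| options[:unit] = v }        # days..years
      opts.on('--granularity LEVEL') { |v| options[:granularity] = v } # per-context or interaction
      opts.on('--directed')          { options[:directed] = true }
      opts.on('--[no-]weighted')     { |v| options[:weighted] = v }
      opts.on('--nodes KIND')        { |v| options[:nodes] = v }       # developers/artifacts/both
    end.parse!(ARGV)

    repositories = ARGV   # e.g. ['rubinius/rubinius']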
