An unpolished R implementation of Glickman and Hennessy's "A Stochastic Rank Ordered Logit Model for Rating Multi-Competitor Games and Sports."
For a side project, I was trying to develop a way to measure the strength of NCAA cross-country runners, a sport where times are difficult to compare because of varying race courses and conditions. I found Glickman and Hennessy's paper and implemented the algorithm they describe on my own dataset: a collection of over 6,500 NCAA cross-country race results comprising over one million individual performances. The size of the data required me to optimize and parallelize the algorithm and run the code on a high-compute Google Cloud server.
At first, the algorithm looked promising on a small test dataset of cross-country data. Unfortunately, when I scaled it to the complete dataset, the results were nearly meaningless. I suspect this is because of the extreme variability of cross-country racing: athletes vary widely in strength, and courses vary in type and length. Glickman and Hennessy developed the algorithm for Olympic-level downhill skiing, where results are much more consistent.
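At the heart of the paper is a rank-ordered logit (Plackett-Luce) likelihood: given latent strengths theta (one per runner), the probability of an observed finishing order is a product of successive softmax terms. A minimal sketch of that likelihood, with illustrative names not taken from this codebase:

```r
# Log-likelihood of one observed finishing order under a rank-ordered
# (Plackett-Luce) logit model. theta: latent strengths, one per competitor.
# finish_order: competitor indices from first place to last.
plackett_luce_loglik <- function(theta, finish_order) {
  strengths <- theta[finish_order]
  ll <- 0
  for (i in seq_along(strengths)) {
    # The competitor in position i "wins" against everyone still remaining
    remaining <- strengths[i:length(strengths)]
    ll <- ll + strengths[i] - log(sum(exp(remaining)))
  }
  ll
}

# Toy example: three runners with strengths 1.0, 0.5, -0.5 finishing 1, 2, 3
plackett_luce_loglik(c(1.0, 0.5, -0.5), c(1, 2, 3))
```

Exponentiating and summing this log-likelihood over all possible finishing orders gives 1, which is a quick sanity check that the successive-softmax terms are right.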
In the future, I will clean up the codebase and create an R package. If you are trying to implement the algorithm yourself, please reach out.
For reading, exploring, manipulating, and transforming the data I used the packages readr, ggplot2, dplyr, and Matrix. For testing and optimizing the code I used RUnit and lineprof. For parallelization I used parallel, doMC, and foreach.
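The Matrix package matters here because race results are naturally sparse: each of the million-plus performances touches only one runner column. A minimal sketch (the data frame and column names are illustrative, not from the project) of building that kind of indicator matrix:

```r
library(Matrix)

# Hypothetical tidy performance table: one row per (race, runner) result
performances <- data.frame(
  race   = c(1, 1, 1, 2, 2),
  runner = c(1, 2, 3, 1, 3)
)

# Sparse 0/1 matrix: one row per performance, one column per runner.
# Stored in compressed form, so a million rows stays cheap in memory.
X <- sparseMatrix(
  i = seq_len(nrow(performances)),
  j = performances$runner,
  x = 1,
  dims = c(nrow(performances), max(performances$runner))
)
```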
Here is an overview of the files above in the order that I developed them:
Step One - Understanding the algorithm and getting something working
- small-test.R : Prototype of the algorithm on a small dataset in order to understand the paper and test its viability
Step Two - Attempting to scale the algorithm to larger dataset
- large-model.R : The controller of the process. Feed in data, preprocess, run algorithm, save results.
- preprocess-data.R : First pass at a program that transforms the raw data into the matrix format detailed in the paper
- newton-raphson.R : First pass at the algorithm detailed in Section A of the paper, which finds the posterior mode of theta
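The Newton-Raphson step in the paper iterates theta toward the posterior mode using the gradient and Hessian of the log posterior. A generic sketch of that iteration; `grad_fn` and `hess_fn` are placeholders, not the project's actual derivatives:

```r
# Generic Newton-Raphson mode finding: repeatedly take the Newton step
# H^{-1} g until the step size falls below tol.
newton_raphson <- function(theta0, grad_fn, hess_fn, tol = 1e-8, max_iter = 100) {
  theta <- theta0
  for (iter in seq_len(max_iter)) {
    g <- grad_fn(theta)
    H <- hess_fn(theta)
    step <- solve(H, g)   # solve H * step = g instead of inverting H
    theta <- theta - step
    if (max(abs(step)) < tol) break
  }
  theta
}

# Toy usage: the mode of a spherical Gaussian log-density centered at (1, 2)
mu <- c(1, 2)
grad_fn <- function(theta) -(theta - mu)
hess_fn <- function(theta) -diag(length(theta))
theta_hat <- newton_raphson(c(0, 0), grad_fn, hess_fn)
```

For a quadratic log-density like this, the iteration lands on the mode in a single step; for the rank-ordered logit posterior it takes several.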
Step Three - Optimizing and parallelizing the algorithm
- large-model-optimize.R : Optimized/parallelized version of the controller
- preprocess-data-optimize.R : Optimized/parallelized version of preprocess
- newton-raphson-optimize.R : Optimized/parallelized version of Newton Raphson
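The parallelization pattern in the -optimize.R files boils down to splitting independent per-race work across cores. A dependency-free sketch using base R's parallel package (the real code also uses doMC and foreach); `process_race` is a stand-in for the actual per-race computation:

```r
library(parallel)

# Hypothetical per-race work; the real version would compute per-race
# likelihood contributions or matrix blocks.
process_race <- function(race_id) race_id^2

race_ids <- 1:8

# mclapply forks workers on Unix-alikes; fall back to one core on Windows
cores <- if (.Platform$OS.type == "windows") 1 else 2
results <- mclapply(race_ids, process_race, mc.cores = cores)
```

Because each race's contribution is independent, results can be combined afterward without locking, which is what makes this workload embarrassingly parallel.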
Step Four - Unit testing the algorithm and understanding the results
- unit-test.R : Test results compared to hand-derived results
- unit-test-2.R : Test algorithm on dummy data
- double-check-preprocess.R : I don't remember what this is :s
- parallel-test.R : Used to test/understand parallelization
- pop-explor.R : Used to understand results
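The hand-derived checks in unit-test.R amount to comparing model output against quantities you can compute by hand. The real tests use RUnit's checkEquals; this sketch uses plain stopifnot to stay dependency-free, and the numbers are illustrative:

```r
# For a two-runner race under a rank-ordered logit model,
#   P(A beats B) = exp(theta_A) / (exp(theta_A) + exp(theta_B)),
# which depends only on the difference theta_A - theta_B. With a
# difference of 1 this should equal the logistic function at 1.
p <- exp(0.5) / (exp(0.5) + exp(-0.5))
stopifnot(abs(p - plogis(1)) < 1e-12)
```

Small closed-form cases like this are about the only way to catch sign or indexing bugs before scaling to a million performances.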
For questions or help, reach out to Julian on Twitter @jdegrootlutzner. The project is released under the MIT license.