Kojak

Optimizing MLB Manager's Challenge

For my capstone project I decided to tackle the question of when to optimally use a manager's challenge in baseball. Starting in 2014, Major League Baseball instituted a new rule giving a manager the right to "challenge" certain types of umpire calls. The call is then reviewed by a crew of umpires in New York City using multiple camera angles. If the manager is correct, the call is overturned. If the umpire is correct or the video is inconclusive, the manager loses the ability to challenge for the remainder of the game. So, my goal was to model when a manager should optimally challenge a possibly mistaken call. Much of the structure of my project is inspired by [this article](http://www.hardballtimes.com/when-should-managers-challenge/), but I approached many parts differently. The notebook containing the code is [here](MLB Challengessept8.ipynb), and the corresponding slide presentation is linked here.

Data Sources

I scraped replay data from MLB's Baseball Savant website and combined it with the game files available on Retrosheet. With the exception of about 50 out of roughly 3,000 replays, the data was fairly consistent.

Units of Measurement

First, I had to determine how exactly I'd capture what "success" or "failure" meant. To do this, I turned to the Win Expectancy (WE) statistic developed by Tom Tango. This is essentially a matrix that gives the probability a team will win a game given the game state (i.e. score differential, inning, runners on base, etc.). So, I use the difference in WE to capture how "good" or "bad" a certain event is.

Formula

To figure out whether to challenge, there are three things to take into account:

  1. The potential upside if successful.
  2. The downside if unsuccessful.
  3. The probability of success.

If the probability of success times the upside is greater than the probability of failure times the downside, then a manager should challenge; otherwise he shouldn't. Below, I go through my calculations for each of the three factors.
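As a minimal sketch of this decision rule (the function and argument names are mine, not the notebook's):

```python
def should_challenge(p_overturn, we_gain, we_loss):
    """Challenge when the expected WE gain exceeds the expected WE cost.

    p_overturn -- estimated probability the call is overturned
    we_gain    -- WE gained if the call is overturned (upside)
    we_loss    -- WE cost of losing the right to challenge (downside)
    """
    return p_overturn * we_gain > (1 - p_overturn) * we_loss
```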

1. Upside

To calculate this, I simply use [Tom Tango's WE Matrix](BigTable-Table 1.csv) to look up the current WE given the game state. Then, I plug in the game state as it would be if the call were overturned. The increase in probability is the upside.
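A simplified sketch of that lookup, assuming the WE matrix has been loaded into a pandas DataFrame (the column names and lookup keys here are illustrative, not necessarily those of the actual CSV):

```python
import pandas as pd

we_table = pd.read_csv("BigTable-Table 1.csv")  # Tom Tango's WE matrix

def win_expectancy(inning, half, outs, bases, score_diff):
    """Look up the win expectancy for a given game state (illustrative column names)."""
    row = we_table[(we_table["Inning"] == inning) &
                   (we_table["Half"] == half) &
                   (we_table["Outs"] == outs) &
                   (we_table["Bases"] == bases) &
                   (we_table["ScoreDiff"] == score_diff)]
    return row["WE"].iloc[0]

# Upside: WE if the call is overturned minus WE with the call as it stands,
# e.g. a runner ruled out at first (1 out, bases empty) vs. safe (0 outs, runner on 1B).
upside = (win_expectancy(7, "bottom", 0, "1B", -1) -
          win_expectancy(7, "bottom", 1, "empty", -1))
```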

2. Downside

Essentially, I'm trying to capture the "opportunity cost" of losing the ability to challenge for the remainder of the game. This is further complicated by the fact that, even if a manager loses a challenge, he can still ask the umpires to initiate a subsequent review of their own accord. So, I tried a few approaches. First, I tried to match each replay with Fangraphs' game logs, which capture the WE after every at bat. I would select all lost challenges and find the differences between the WE after the lost challenge and the final WE (i.e. either 0 or 1). I assumed that, if there was in fact an opportunity cost, there would be a small net loss in the average change. However, the data wasn't clean enough to match up easily, so this approach wasn't feasible.

Instead, I took the average win percentage of games after a team had just lost a challenge (the win/loss result was obtained via the mlbgame Python package). The catch is that this "swing" below .500 is not totally attributable to a lost challenge, since I'm already biasing the sample toward plays that were just challenged (which always follow an adverse call). So, to offset this, I calculated the net win percentage gain above .500 for successful challenges. I assumed that the extent to which the dip below .500 for a lost challenge exceeds the gain above .500 for a successful challenge is attributable to "opportunity cost." Furthermore, I assumed that this number decreases linearly with the number of outs remaining, since the lost "opportunities" are proportional to the number of outs. However, the number I got seemed too high, as a team losing a challenge at the beginning of a game would lose about 9% WE.

So, I tried a third approach. I did the same as the second, except I limited my sample to the 1st inning and did not offset it against the positive "swing" from an overturned call (thus "rounding up"). I divided by the average number of outs remaining, which led to my estimated figure: a 0.03% decrease in WE per out remaining.
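A minimal sketch of how that per-out figure could be applied, assuming a regulation nine-inning game and the 0.03% estimate above (the out-counting convention is my own, not the notebook's):

```python
COST_PER_OUT = 0.0003  # ~0.03% of WE per remaining out (1st-inning estimate above)

def challenge_downside(inning, half, outs):
    """Estimate the WE cost of losing the challenge for the rest of the game,
    assuming the cost scales linearly with the outs left in a 9-inning game."""
    completed_half_innings = (inning - 1) * 2 + (1 if half == "bottom" else 0)
    outs_remaining = 54 - (completed_half_innings * 3 + outs)
    return COST_PER_OUT * outs_remaining

# e.g. a challenge lost in the top of the 1st with nobody out costs ~1.6% of WE
print(challenge_downside(1, "top", 0))  # 0.0162
```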

Either way you slice it, there were a lot of shifts from inning to inning in how a lost challenge affected the chances of winning, which I do not have an intuitive explanation for other than to attribute them to small sample sizes. But, either way, I think it's safe to assume that the "cost" is very low. In fact, I got this quote directly from Tom Tango himself via email: "There's not much cost to consider."

3. Probability of Call Being Overturned

This part proved to be the most complex. Essentially, I wanted to use machine learning classification algorithms to generate the probability of a call being overturned (as opposed to a straightforward binary classification). To do this, I used Logistic Regression and Support Vector Machines and optimized for log-loss. I chose 5 different features to train my model on, all of which seemed fairly predictive: the team challenging, the umpire challenged, the type of play, the inning, and the day of the week. I figured that with some extra thought I could try engineering a few more, but that would probably only increase the accuracy of the model by small margins. Furthermore, since there are three possible results for each challenge (overturned, stands, and confirmed), I encoded each feature as three values. Instead of creating dummy variables, I simply used the percentage of the time that each feature value led to each result as inputs (for example, for the play-type category "tag play", instead of a dummy variable "tag play", I used the values .550, .300, and .150, representing the percentage of the time such plays are overturned, stand, or are confirmed, respectively).
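A sketch of that encoding, assuming the replay data sits in a pandas DataFrame with a categorical result column (the column and function names here are mine):

```python
import pandas as pd

def rate_encode(df, feature, outcome_col="result"):
    """Replace a categorical feature with the empirical rates at which each
    of its categories is overturned, stands, or is confirmed."""
    rates = (df.groupby(feature)[outcome_col]
               .value_counts(normalize=True)
               .unstack(fill_value=0)
               .rename(columns=lambda c: f"{feature}_{c}_rate"))
    return df.join(rates, on=feature)

# Toy example: every "tag play" row gains columns such as
# play_type_overturned_rate, play_type_stands_rate, play_type_confirmed_rate.
replays = pd.DataFrame({"play_type": ["tag play", "tag play", "force play"],
                        "result": ["overturned", "stands", "confirmed"]})
encoded = rate_encode(replays, "play_type")
```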

However, I was thrown a conceptual "curveball" trying to navigate around a major design issue. Imagine a set "A" containing all plays that occur in an MLB game. Now, imagine that "A" has a subset "B" containing all "close" plays (for simplicity, we'll assume there is broad consensus on what belongs in set "B"). Set "B" is in turn a superset of set "O", the set of all incorrect umpire calls (those that would be overturned by replay). It is also a superset of "C", the set of all plays that are actually challenged.

The whole goal of the model is to take plays belonging to set "B" and figure out the chances that they are also contained in "O." However, the model is only trained on items from set "C," since we only record the accuracy of calls that are challenged. By definition, there's no way to definitively know whether an unchallenged call belongs in set "O" or not. So, the model must find a way to approximate the full sizes of the sets "O" and "B".

The most basic way I chose to address this is by assuming that set "B" is approximately equal in size to the set "C" of the most aggressively challenging team plus the average number of umpire-initiated challenges per team per game. To capture these sets numerically, I calculated the size of each team's "C" in units of challenges used per game and added the umpire-initiated challenges per game. Next, I assumed that the total size of "O" equaled the overturned calls per game of the team that most frequently has calls overturned, plus umpire-initiated overturns per game. Then, I calculated the ratio of this approximated "O"-to-"B" rate to the mean "O"-to-"C" rate and used that to convert any output of the model. I also removed the feature containing the challenging team, since it would no longer be relevant, as "B" and "O" should theoretically be consistent across all teams.
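A rough numerical sketch of that conversion (all rates below are placeholders, not the values computed from the data):

```python
# Placeholder rates, in events per game -- the real numbers come from the replay data.
most_aggressive_team_C = 0.65     # challenges/game for the most aggressive team
umpire_initiated_reviews = 0.20   # umpire-initiated reviews/game
approx_B = most_aggressive_team_C + umpire_initiated_reviews  # ~ close plays/game

most_overturned_team_O = 0.40     # overturns/game for the most-overturned team
umpire_initiated_overturns = 0.08
approx_O = most_overturned_team_O + umpire_initiated_overturns

mean_C = 0.50                     # league-average challenges/game
mean_O = 0.30                     # league-average overturns/game

# The model is trained on challenged plays (set C), so its outputs reflect the
# O/C base rate; rescale them toward the approximated O/B base rate.
scale = (approx_O / approx_B) / (mean_O / mean_C)

def adjusted_probability(model_prob):
    return min(1.0, model_prob * scale)
```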

Additionally, I started a second approach to come up with a better approximation of "O" and "B", but have not yet implemented it. I observed that, predictably, the more frequently a team challenged, the more frequently they successfully overturned calls:

However, from basic data visualization, this seemed to occur at a diminishing rate as seen by its effect on overall accuracy:

So, presumably, if one were to fit a curve to these points, it would approach some horizontal asymptote. By taking the value of that asymptote, together with the approximate number of challenges per game needed to come close to reaching it, one can get a good approximation of sets "O" and "B." However, further work is needed to refine the implementation.
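As a hypothetical sketch of how that curve fit might look (this is not implemented in the notebook; the saturating-exponential form and all data points below are placeholders):

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, o_max, k):
    """Overturns per game as a function of challenges per game,
    approaching a horizontal asymptote o_max."""
    return o_max * (1 - np.exp(-k * x))

# Placeholder team-level points: (challenges/game, overturns/game)
challenges_per_game = np.array([0.30, 0.40, 0.50, 0.60, 0.70])
overturns_per_game = np.array([0.18, 0.23, 0.27, 0.29, 0.31])

(o_max, k), _ = curve_fit(saturating, challenges_per_game, overturns_per_game,
                          p0=[0.35, 3.0])
# o_max approximates the size of set "O" (overturnable calls per game); the
# challenge rate needed to get near o_max approximates the size of set "B".
```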

Either way, I felt that removing the challenging team as a feature would fail to leverage the good judgment of teams that have generally been better at overturning calls. So, I decided it would make the most sense to allow a mix of two models: the first is the initial, unadjusted model that takes the challenging team into account, and the second is the model with the adjustment for set "B," which eliminates the team feature. A team can choose how to weight them based on how "similar" a given close call is to its own set "C."
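A minimal sketch of that weighting (the weight here is a subjective knob, not something fit from data):

```python
def blended_probability(p_team_model, p_adjusted_model, weight=0.5):
    """Mix the team-aware model with the set-B-adjusted model.

    weight -- how "similar" the manager judges this close call to be to the
              team's own history of challenged plays (1.0 = team model only).
    """
    return weight * p_team_model + (1 - weight) * p_adjusted_model
```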

Conclusions and Future Considerations

When trying different combinations of scenarios, this model seems to encourage the manager to be very aggressive in challenging close calls. However, much of this depends on the assumption that the "opportunity cost" of a lost challenge is very low. Ideally, to get a better measure of opportunity cost, one could retrace Tom Tango's steps in constructing the Win Expectancy tables, but incorporate the opportunity to challenge into the game state.

Additionally, to better leverage the judgment of a team's replay official in the algorithm, a team could implement a system where, after a call is challenged, the replay official rates how "confident" he is on a scale. Later on, that scale could be used to retrain a new model that takes the replay official's confidence level into account.