Impact at Home Plate

By: Kevin Smith 08/18/2022

Whether you are a Major League Baseball coach making lineup decisions, or a general manager making free agency decisions, there is a lot to consider when putting your team together. Analytics have been great in part because they aschewed tradtional baseball wisdom and allowed teams to find new paths to success. The impact metric just might be something that can bridge the gap between analytics and "old school baseball."

Project Goal

The goal of this project is to build a model to predict a players offensive wins above replacement (owar) utilizing machine learning and a metric that i invented called "impact."

being able to utilize impact in both free agent and lineup decisions simplifies the process by utilizing both analytics and tradition baseball wisdom.

We will be walking through a basic data science pipeline to reach our goal today. The steps in the pipeline are:

Planning
Acqusition
Prep
Exploration
Clustering
Feature Engineering
Modeling
Delivery

Steps For Reproduction

Clone my repo (including the , acquire.py. explore.py , stats.csv, value.csv, model.py & wrangle.pyfiles.
The libraries used are pandas, numpy, scipy, matplotlib, seaborn, and sklearn.
You should now be able to run the zillow_final_cluster.ipynb file.

Planning

Their are two essential parts to any good plan. Identify your Goals, and the necessary Steps to get there.

Goals:

Identify variables driving owar.
Develop a model to make value predicitons based on those variable.
Deliver actionable takeaways

Steps:

Initial hypothesis
Acquire and cache the dataset
Clean, prep, and split the data to prevent data leakage
Do some preliminary exploration of the data (including visualiztions and statistical analyses)
Create meaningful clusters
Trim dataset of variables that are not statistically significant
Determine which machine learning model perfoms the best
Utilize the best model on the test dataset
Create a final report notebook with streamlined code optimized for a technical audience

Data Library

Variable Name	Explanation	Values
last	the players last name	object
first	the players first name	object
id	unique identification code	Numeric value
year	the year of the season of the observation	Numeric value
name	The players full name	object
team	Code representing the team played for during observation	object
owar	offensive wins above replacement	Numeric value
age	age of player during season in years	Numeric value
ops	on-base + slugging-pct	Numeric value
tb	total bases	Numeric vaue
pitches_faced	number of puitches faced by batter during season	Numeric value
pull	the percentage of balls hit to the 'pull' side	Numeric value
center	the percetage of balls hit 'up the middle'	Numeric value
oppo	percentage of balls hit to the 'opposite' side	Numeric value
batted	number of balls the hitter made contact with during the season	Numeric value
raa	runs above average	Numeric value
waa	wins above average	Numeric value
impact	measure of players average impact per plate appearance	Numeric value
ppa	number of pitches seen in average plate appearance	Numeric value

Initial Hypothesis

The initial hypothesis can be based on a gut instinct or the first question that comes to mind when encountering a dataset.

Initial hypothesis number	hypothesis
Initial hypothesis 1	pull, center, and oppo have a non-linear relationship with each other
Initial hypothesis 2	There is a linear relationship between raa and waa, but a non-linear relationship between each of them and ops

Acquire and Prep

Utitlize the functions imported from the wrangle.py to create a DataFrame with pandas.

These functions will also cache the data to reduce execution time in the future should we need to create the DataFrame again.

In this step we will utilize the functions in the wrangle.py file to get our data ready for exploration.

This means that we will be looking for columns that may be dropped because they are duplicates, and either dropping or filling any rows that contain blanks depending on the number of blank rows there are.

This also means that we will be splitting the data into 3 separate DataFrames in order to prevent data leakage corrupting our exploration and modeling phases.

Exploration

This is the fun part! this is where we get to ask questions, form hypotheses based on the answers to those questions and use our skills as data scientist to evaluate those hypotheses!

For example, in the zillow dataset I asked "Does square footage drive up value?" and unsurprisingly the answer was generally yes. This lead me to the hypothesis that square footage would have a dependent relationship with tax value, which hypothesis testing confirmed. However I was able to find another variable that did a better job of predicting value.

Clustering

I created 2 clusters and utilized them as features for the models.

Cluster	Elements
tendency	pull, center, oppo
production	raa, waa, ops

Feature Engineering

I didnt try to reinvent the wheel here. I used Sklearn's Kbest function to find the most important variables for our model.

The tendency cluster was used in the models, but the production cluster proved less useful.

Modeling

Here we determine the best model to use for predicting value. I ran the data through 3 different regression algorithms to determine which would perform the best

The 2 degree polynomial model performed the best against baseline when predicting owar. Predicting owar on average 132% better than baseline.

Delivery

So, you may be wondering how you will be able to use this model to make decisions.

By keeping track of a players impact at the plate over the course of the season, managers can make lineup decisions based on a players recent impact level knowing that it is a key driver in how many wins above or below that player is likely to be on that given day.
When making free agency decisions the general manager can take into account the average impact that player has had at the plate over the course of previous seasons, and decide if they would rather have a player with a lower average impact who is more consistent, or a player who has short bursts of dominance at the plate.

Conclusion

In conclusion, with more time, and more robust data sources I would be able to track a players impact at the plate and identify potential free agents that would fit the identity of the team because of the metrics reliance on both analytics and tradition baseball wisdom.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.ipynb_checkpoints		.ipynb_checkpoints
2023_daily_stats.ipynb		2023_daily_stats.ipynb
acquire.py		acquire.py
explore.ipynb		explore.ipynb
explore.py		explore.py
final_player_valuation.ipynb		final_player_valuation.ipynb
model.ipynb		model.ipynb
model.py		model.py
package.json		package.json
player_offense_valuation.ipynb		player_offense_valuation.ipynb
readme.md		readme.md
stats.csv		stats.csv
value.csv		value.csv
wrangle.ipynb		wrangle.ipynb
wrangle.py		wrangle.py

kevin-mal-smith/player_valuation

Folders and files

Latest commit

History

Repository files navigation

Impact at Home Plate

By: Kevin Smith 08/18/2022

Project Goal

Steps For Reproduction

Planning

Goals:

Steps:

Data Library

Initial Hypothesis

Acquire and Prep

Exploration

Clustering

Feature Engineering

Modeling

Delivery

Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Languages