# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
import warnings
# machine learning
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
# baseball data
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
Lab 05: Building a Pitch Classifier
Introduction and Data
Goal: The goal of this lab is to create a pitch classifier that can be used as a part of an automatic system. It should predict the “pitch type” based on the pitch’s velocity and spin rate.
What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:
As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.
Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:
That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:
But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:
The long story short is:
- Have advanced tracking technology that can instantly record speed and spin for each pitch.
- Have a trained classifier for pitch type based on speed and spin.
- In real time, make predictions of pitch type as soon as the speed and spin are recorded.
- Display the result in the stadium and on the broadcast!
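The train-once, predict-per-pitch idea in the steps above can be sketched roughly as follows. This is only an illustration: the data is synthetic (made-up speeds and spins, not Statcast), and the helper name classify_pitch is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic training data (made-up values, not Statcast): [speed in mph, spin in rpm]
rng = np.random.default_rng(0)
fastballs = np.column_stack([rng.normal(95, 1, 50), rng.normal(2450, 60, 50)])
changeups = np.column_stack([rng.normal(85, 1, 50), rng.normal(1800, 60, 50)])
X_train = np.vstack([fastballs, changeups])
y_train = np.array(["4-Seam Fastball"] * 50 + ["Changeup"] * 50)

# train the classifier once, ahead of time
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)


# hypothetical helper: called each time the tracking system records a new pitch
def classify_pitch(speed_mph, spin_rpm):
    return clf.predict(np.array([[speed_mph, spin_rpm]]))[0]


print(classify_pitch(95.2, 2470.0))
print(classify_pitch(84.7, 1790.0))
```

A real system would replace the synthetic arrays with tracked pitches and send the returned label to the broadcast display.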
To do this, you’ll need to import the following:
You are free to import additional packages and modules as you see fit, but this lab can be completed with only the above.
We will use the pybaseball package to obtain Statcast data. Statcast is a term used for data produced by MLB Advanced Media and published on MLB’s Baseball Savant. One method to obtain data from this service is to use the so-called Statcast Search. However, the pybaseball package provides the ability to directly obtain data from within Python.
Documentation for this data can be found here: Statcast Search CSV Documentation
The following code is a bit of necessary setup before we move to obtaining data for two particular pitchers.
# deal with pandas versus seaborn issues
warnings.simplefilter(action="ignore", category=FutureWarning)
# define the dates of the regular season
# https://en.wikipedia.org/wiki/2023_Major_League_Baseball_season
start_date_2023 = "2023-03-30"
end_date_2023 = "2023-10-01"
# define required columns for (basic) pitch classifier
pitch_cols = ["pitch_name", "release_speed", "release_spin_rate"]
Justin Verlander
Justin Verlander is a pitcher for the 2023 Houston Astros.
All Verlander data will be prefixed with a v, such as vX_train.
# access and process Verlander train and test data
# playerid_lookup("Verlander", "Justin")
verlander_key_mlbam = 434378
verlander_pitches = statcast_pitcher(start_date_2023, end_date_2023, verlander_key_mlbam)
verlander_pitches = verlander_pitches[pitch_cols]
verlander_pitches = verlander_pitches.dropna()
vX = verlander_pitches[["release_speed", "release_spin_rate"]]
vy = verlander_pitches["pitch_name"].astype("category")
vX_train, vX_test, vy_train, vy_test = train_test_split(vX, vy, test_size=0.2, random_state=42)
Gathering Player Data
# gather verlander train data into a df
verlander_train_df = pd.concat([pd.DataFrame(vX_train), pd.DataFrame(vy_train)], axis=1)
verlander_train_df
| | release_speed | release_spin_rate | pitch_name |
|---|---|---|---|
| 25 | 84.9 | 1740.0 | Changeup |
| 1733 | 94.8 | 2483.0 | 4-Seam Fastball |
| 1721 | 94.3 | 2490.0 | 4-Seam Fastball |
| 1693 | 84.5 | 2088.0 | Changeup |
| 1583 | 86.6 | 2715.0 | Slider |
| ... | ... | ... | ... |
| 1639 | 94.4 | 2451.0 | 4-Seam Fastball |
| 1095 | 94.6 | 2449.0 | 4-Seam Fastball |
| 1130 | 93.6 | 2447.0 | 4-Seam Fastball |
| 1294 | 87.8 | 2573.0 | Slider |
| 860 | 94.6 | 2519.0 | 4-Seam Fastball |
2067 rows × 3 columns
# plot verlander data
verlander_plot = sns.jointplot(
    data=verlander_train_df, x="release_speed", y="release_spin_rate", hue="pitch_name", space=0
)
verlander_plot.set_axis_labels("Velocity (mph)", "Spin (rpm)")
verlander_plot.fig.suptitle("2023 Justin Verlander: Velocity vs Spin", y=1.01)
verlander_plot.ax_joint.legend(title="Pitch Type", loc="lower right")
verlander_plot.ax_joint.grid(True)
verlander_plot.fig.set_size_inches(8, 8)
Zack Wheeler
Zack Wheeler is a pitcher for the 2023 Philadelphia Phillies.
All Wheeler data will be prefixed with a w, such as wX_train.
# access and process Wheeler train and test data
# playerid_lookup("Wheeler", "Zack")
# NOTE: changeups are removed because very few were thrown, and are possibly errors or wild pitches misclassified
wheeler_key_mlbam = 554430
wheeler_pitches = statcast_pitcher(start_date_2023, end_date_2023, wheeler_key_mlbam)
wheeler_pitches = wheeler_pitches[pitch_cols]
wheeler_pitches = wheeler_pitches.dropna()
wheeler_pitches = wheeler_pitches[wheeler_pitches["pitch_name"] != "Changeup"]
wX = wheeler_pitches[["release_speed", "release_spin_rate"]]
wy = wheeler_pitches["pitch_name"].astype("category")
wX_train, wX_test, wy_train, wy_test = train_test_split(wX, wy, test_size=0.2, random_state=42)
Gathering Player Data
# gather wheeler train data into a df
wheeler_train_df = pd.concat([pd.DataFrame(wX_train), pd.DataFrame(wy_train)], axis=1)
wheeler_train_df
| | release_speed | release_spin_rate | pitch_name |
|---|---|---|---|
| 548 | 96.6 | 2486.0 | 4-Seam Fastball |
| 1374 | 95.1 | 2543.0 | 4-Seam Fastball |
| 2255 | 97.3 | 2608.0 | 4-Seam Fastball |
| 2377 | 93.8 | 2411.0 | 4-Seam Fastball |
| 1096 | 95.8 | 2365.0 | Sinker |
| ... | ... | ... | ... |
| 3114 | 94.1 | 2539.0 | 4-Seam Fastball |
| 1102 | 84.7 | 2681.0 | Sweeper |
| 1137 | 95.7 | 2548.0 | 4-Seam Fastball |
| 1302 | 95.6 | 2618.0 | 4-Seam Fastball |
| 865 | 94.4 | 2357.0 | Sinker |
2506 rows × 3 columns
# plot wheeler data
wheeler_plot = sns.jointplot(
    data=wheeler_train_df, x="release_speed", y="release_spin_rate", hue="pitch_name", space=0
)
wheeler_plot.set_axis_labels("Velocity (mph)", "Spin (rpm)")
wheeler_plot.fig.suptitle("2023 Zack Wheeler: Velocity vs Spin", y=1.01)
wheeler_plot.ax_joint.legend(title="Pitch Type", loc="lower left")
wheeler_plot.ax_joint.grid(True)
wheeler_plot.fig.set_size_inches(8, 8)
Summary Statistics (Graded Work)
Before modeling, you need to look at the data! To do so, you should calculate several summary statistics for the training data for the two pitchers. To assist, we have provided each pitcher’s training data as a pandas data frame that contains both the X and y information.
verlander_train_df
wheeler_train_df
What summary statistics should be calculated? See the relevant assignment on PrairieLearn!
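As a rough illustration of the mechanics only (the required statistics are the ones listed on PrairieLearn, not necessarily these), here is a sketch of per-pitch-type summaries in pandas. The name toy_train_df is hypothetical, and its six rows are copied from the Verlander output shown earlier so the example runs without downloading data.

```python
import pandas as pd

# toy stand-in for a training data frame: six rows copied from the Verlander output above
toy_train_df = pd.DataFrame(
    {
        "release_speed": [94.8, 94.3, 84.9, 84.5, 86.6, 87.8],
        "release_spin_rate": [2483.0, 2490.0, 1740.0, 2088.0, 2715.0, 2573.0],
        "pitch_name": [
            "4-Seam Fastball",
            "4-Seam Fastball",
            "Changeup",
            "Changeup",
            "Slider",
            "Slider",
        ],
    }
)

# proportion of each pitch type in the data
pitch_mix = toy_train_df["pitch_name"].value_counts(normalize=True)

# mean and standard deviation of speed and spin within each pitch type
by_pitch = toy_train_df.groupby("pitch_name")[["release_speed", "release_spin_rate"]].agg(["mean", "std"])

print(pitch_mix)
print(by_pitch)
```

The same two calls should work on either pitcher's training data frame, since it has the same three columns.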
Model Training (Graded Work)
For this lab, you may train models however you’d like! You should train a model for each of the two pitchers, Verlander and Wheeler.
The only rules are:
- Models must start from the given training data, unmodified.
- You must use cross-validation and tune across at least one tuning parameter.
You will submit your chosen models to an autograder for checking. It will calculate each model’s performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against test data multiple times. But you know this is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you’ll only need to submit once!
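One possible way to satisfy the cross-validation rule is a pipeline tuned with GridSearchCV. The sketch below is not the expected solution: it uses synthetic data (made-up values, not Statcast), an arbitrary grid, and StandardScaler, which is an addition not in the lab's import list.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a pitcher's training data (made-up values, not Statcast)
rng = np.random.default_rng(42)
X_demo = np.vstack(
    [
        np.column_stack([rng.normal(95, 1, 60), rng.normal(2450, 60, 60)]),
        np.column_stack([rng.normal(86, 1, 60), rng.normal(2600, 60, 60)]),
    ]
)
y_demo = np.array(["4-Seam Fastball"] * 60 + ["Slider"] * 60)

# scale inside the pipeline, since speed (mph) and spin (rpm) are on very different scales
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]
)

# tune the number of neighbors with 5-fold cross-validation (grid values are arbitrary)
param_grid = {"knn__n_neighbors": [1, 5, 15, 25]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

With the default refit=True, search.best_estimator_ is the pipeline refit on all of the data passed to fit, so it can be used directly for prediction.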
# use this cell to train models for verlander
# use this cell to train models for wheeler
To submit your models to the autograder, you will need to serialize them. In the following cell, replace ______ with the model you have found for Verlander and Wheeler, respectively.
dump(______, "v_model.joblib")
dump(______, "w_model.joblib")
After you run this cell, two files will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the relevant lab on PrairieLearn.
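If you would like to check that serialization works before submitting, a round trip with dump and load might look like the sketch below. The model and the file name demo_model.joblib are placeholders for illustration, and the file is written to a temporary folder so that no submission file is overwritten.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# a tiny placeholder model (hypothetical; not a tuned lab model)
demo_model = DecisionTreeClassifier(random_state=0).fit(
    [[95.0, 2450.0], [85.0, 1800.0]], ["4-Seam Fastball", "Changeup"]
)

# write to a temporary folder, then read the model back in
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(demo_model, path)
    restored = load(path)

# the restored model should make the same predictions as the original
new_pitch = [[94.0, 2400.0]]
same_predictions = list(restored.predict(new_pitch)) == list(demo_model.predict(new_pitch))
print(same_predictions)
```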
Discussion
# use this cell to create and print any supporting statistics
# use this cell to create any supporting graphics
# if you would like to include additional graphics,
# please add additional cells and create one graphic per cell
Graded discussion: Would you put the Verlander model in production? Justify your answer. Would you put the Wheeler model in production? Justify your answer.
Un-graded discussion: Do you think these classifiers predict fast enough to work in real-time? How could you improve these classifiers? (Hint: You might need more data.) Why did we model the two pitchers with different models? Why didn’t we use the same model for both pitchers?
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Be sure that you have added your name at the top of this notebook.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.