Lab 05: Building a Pitch Classifier

Author

Your Name Here

Published

October 13, 2023

Introduction and Data

Goal: The goal of this lab is to create a pitch classifier that could be used as part of an automated system. It should predict the “pitch type” based on the pitch’s velocity and spin rate.

What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:

As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

Long story short:

  • Have advanced tracking technology that can instantly record speed and spin for each pitch.
  • Have a trained classifier for pitch type based on speed and spin.
  • In real time, make predictions of pitch type as soon as the speed and spin are recorded.
  • Display the result in the stadium and on the broadcast!
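The steps above can be sketched in a few lines. This is only a toy illustration: the training pitches are randomly generated (not Statcast data), and the choice of a k-nearest neighbors classifier with `n_neighbors=5` is an assumption made for the example, not what broadcasts actually use.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up training pitches: columns are [speed in mph, spin in rpm]
rng = np.random.default_rng(0)
fastballs = np.column_stack([rng.normal(95, 1.0, 50), rng.normal(2450, 60, 50)])
changeups = np.column_stack([rng.normal(85, 1.0, 50), rng.normal(1750, 60, 50)])
X = np.vstack([fastballs, changeups])
y = np.array(["4-Seam Fastball"] * 50 + ["Changeup"] * 50)

# a classifier trained ahead of time on speed and spin
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# a new pitch is tracked, so predict its type right away
new_pitch = np.array([[94.6, 2500.0]])
predicted_type = clf.predict(new_pitch)[0]
print(predicted_type)
```

The expensive part (training) happens before the game. At pitch time, only a single call to `predict` is needed.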

To do this, you’ll need to import the following:

# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
import warnings

# machine learning
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# baseball data
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup

You are free to import additional packages and modules as you see fit, but this lab can be completed with only the above.

We will use the pybaseball package to obtain Statcast data. Statcast is a term used for data produced by MLB Advanced Media and published on MLB’s Baseball Savant. One method to obtain data from this service is to use the so-called Statcast Search. However, the pybaseball package provides the ability to directly obtain data from within Python.

Documentation for this data can be found here: Statcast Search CSV Documentation

The following code is a bit of necessary setup before we move to obtaining data for two particular pitchers.

# deal with pandas versus seaborn issues
warnings.simplefilter(action="ignore", category=FutureWarning)

# define the dates of the regular season
# https://en.wikipedia.org/wiki/2023_Major_League_Baseball_season
start_date_2023 = "2023-03-30"
end_date_2023 = "2023-10-01"

# define required columns for (basic) pitch classifier
pitch_cols = ["pitch_name", "release_speed", "release_spin_rate"]
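To see what selecting `pitch_cols` and dropping missing values will do to the pitcher data, here is a minimal sketch on a made-up data frame. The rows are invented for illustration, and `zone` simply stands in for the many other columns that Statcast returns.

```python
import numpy as np
import pandas as pd

pitch_cols = ["pitch_name", "release_speed", "release_spin_rate"]

# made-up rows that mimic a few Statcast columns
raw = pd.DataFrame(
    {
        "pitch_name": ["Slider", "4-Seam Fastball", None, "Changeup"],
        "release_speed": [86.6, 94.8, 93.1, 84.9],
        "release_spin_rate": [2715.0, np.nan, 2450.0, 1740.0],
        "zone": [5, 2, 7, 9],  # extra column, not needed by the classifier
    }
)

# keep only the required columns, then drop rows with any missing value
clean = raw[pitch_cols].dropna()
print(clean)
```

Only the rows with a pitch name, a speed, and a spin rate survive, and the original row index is preserved.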

Justin Verlander

Justin Verlander is a pitcher for the 2023 Houston Astros.

All Verlander data will be prefixed with a v, such as vX_train.

# access and process Verlander train and test data
# playerid_lookup("Verlander", "Justin")
verlander_key_mlbam = 434378
verlander_pitches = statcast_pitcher(start_date_2023, end_date_2023, verlander_key_mlbam)
verlander_pitches = verlander_pitches[pitch_cols]
verlander_pitches = verlander_pitches.dropna()
vX = verlander_pitches[["release_speed", "release_spin_rate"]]
vy = verlander_pitches["pitch_name"].astype("category")
vX_train, vX_test, vy_train, vy_test = train_test_split(vX, vy, test_size=0.2, random_state=42)
Gathering Player Data
# gather verlander train data into a df
verlander_train_df = pd.concat([pd.DataFrame(vX_train), pd.DataFrame(vy_train)], axis=1)
verlander_train_df
      release_speed  release_spin_rate       pitch_name
25             84.9             1740.0         Changeup
1733           94.8             2483.0  4-Seam Fastball
1721           94.3             2490.0  4-Seam Fastball
1693           84.5             2088.0         Changeup
1583           86.6             2715.0           Slider
...             ...                ...              ...
1639           94.4             2451.0  4-Seam Fastball
1095           94.6             2449.0  4-Seam Fastball
1130           93.6             2447.0  4-Seam Fastball
1294           87.8             2573.0           Slider
860            94.6             2519.0  4-Seam Fastball

2067 rows × 3 columns

# plot verlander data
verlander_plot = sns.jointplot(
    data=verlander_train_df, x="release_speed", y="release_spin_rate", hue="pitch_name", space=0
)
verlander_plot.set_axis_labels("Velocity (mph)", "Spin (rpm)")
verlander_plot.fig.suptitle("2023 Justin Verlander: Velocity vs Spin", y=1.01)
verlander_plot.ax_joint.legend(title="Pitch Type", loc="lower right")
verlander_plot.ax_joint.grid(True)
verlander_plot.fig.set_size_inches(8, 8)

Zack Wheeler

Zack Wheeler is a pitcher for the 2023 Philadelphia Phillies.

All Wheeler data will be prefixed with a w, such as wX_train.

# access and process Wheeler train and test data
# playerid_lookup("Wheeler", "Zack")
# NOTE: changeups are removed because very few were thrown, and are possibly errors or wild pitches misclassified
wheeler_key_mlbam = 554430
wheeler_pitches = statcast_pitcher(start_date_2023, end_date_2023, wheeler_key_mlbam)
wheeler_pitches = wheeler_pitches[pitch_cols]
wheeler_pitches = wheeler_pitches.dropna()
wheeler_pitches = wheeler_pitches[wheeler_pitches["pitch_name"] != "Changeup"]
wX = wheeler_pitches[["release_speed", "release_spin_rate"]]
wy = wheeler_pitches["pitch_name"].astype("category")
wX_train, wX_test, wy_train, wy_test = train_test_split(wX, wy, test_size=0.2, random_state=42)
Gathering Player Data
# gather wheeler train data into a df
wheeler_train_df = pd.concat([pd.DataFrame(wX_train), pd.DataFrame(wy_train)], axis=1)
wheeler_train_df
      release_speed  release_spin_rate       pitch_name
548            96.6             2486.0  4-Seam Fastball
1374           95.1             2543.0  4-Seam Fastball
2255           97.3             2608.0  4-Seam Fastball
2377           93.8             2411.0  4-Seam Fastball
1096           95.8             2365.0           Sinker
...             ...                ...              ...
3114           94.1             2539.0  4-Seam Fastball
1102           84.7             2681.0          Sweeper
1137           95.7             2548.0  4-Seam Fastball
1302           95.6             2618.0  4-Seam Fastball
865            94.4             2357.0           Sinker

2506 rows × 3 columns

# plot wheeler data
wheeler_plot = sns.jointplot(
    data=wheeler_train_df, x="release_speed", y="release_spin_rate", hue="pitch_name", space=0
)
wheeler_plot.set_axis_labels("Velocity (mph)", "Spin (rpm)")
wheeler_plot.fig.suptitle("2023 Zack Wheeler: Velocity vs Spin", y=1.01)
wheeler_plot.ax_joint.legend(title="Pitch Type", loc="lower left")
wheeler_plot.ax_joint.grid(True)
wheeler_plot.fig.set_size_inches(8, 8)

Summary Statistics (Graded Work)

Before modeling, you need to look at the data! To do so, you should calculate several summary statistics for the training data for the two pitchers. To assist, we have provided each pitcher’s training data as a pandas data frame that contains both the X and y information.

  • verlander_train_df
  • wheeler_train_df

What summary statistics should be calculated? See the relevant assignment on PrairieLearn!
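As a starting point, here is a hedged sketch of the kind of per-pitch-type summaries that pandas makes easy. It uses a tiny made-up frame, `toy_train_df`, built from a few of the rows displayed earlier. The particular statistics shown (counts, proportions, mean, standard deviation) are assumptions chosen for illustration and are not necessarily the ones the assignment asks for.

```python
import pandas as pd

# a tiny stand-in with the same columns as the provided training frames
toy_train_df = pd.DataFrame(
    {
        "release_speed": [94.8, 94.3, 84.9, 84.5, 86.6, 87.8],
        "release_spin_rate": [2483.0, 2490.0, 1740.0, 2088.0, 2715.0, 2573.0],
        "pitch_name": [
            "4-Seam Fastball", "4-Seam Fastball",
            "Changeup", "Changeup",
            "Slider", "Slider",
        ],
    }
)

# how often is each pitch type thrown?
counts = toy_train_df["pitch_name"].value_counts()
proportions = toy_train_df["pitch_name"].value_counts(normalize=True)

# mean and standard deviation of speed and spin, by pitch type
summary = toy_train_df.groupby("pitch_name")[
    ["release_speed", "release_spin_rate"]
].agg(["mean", "std"])
print(summary)
```

The same calls should work unchanged on the provided training frames, since they share these three column names.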

Model Training (Graded Work)

For this lab, you may train models however you’d like! You should train a model for each of the two pitchers, Verlander and Wheeler.

The only rules are:

  • Models must start from the given training data, unmodified.
  • You must use cross-validation and tune across at least one tuning parameter.

You will submit your chosen models to an autograder for checking. It will calculate your models’ performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against test data multiple times. But you know this is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you’ll only need to submit once!
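One way to satisfy the cross-validation rule is a pipeline tuned with `GridSearchCV`. The sketch below runs on synthetic data, not the pitcher data. The `StandardScaler` step (which is not among the lab's imports), the k-nearest neighbors model, the grid of `n_neighbors` values, and `cv=5` are all assumptions picked for illustration. You may well find better choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for training data: [speed in mph, spin in rpm]
rng = np.random.default_rng(42)
n = 100
X_train = np.vstack(
    [
        np.column_stack([rng.normal(95, 1.0, n), rng.normal(2450, 60, n)]),
        np.column_stack([rng.normal(85, 1.0, n), rng.normal(1750, 60, n)]),
        np.column_stack([rng.normal(87, 1.0, n), rng.normal(2600, 60, n)]),
    ]
)
y_train = np.repeat(["4-Seam Fastball", "Changeup", "Slider"], n)

# scale first, since speed and spin are on very different scales
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# tune one parameter with 5-fold cross-validation
param_grid = {"knn__n_neighbors": [1, 5, 15, 25]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search` refits the best pipeline on all of the training data by default, so the fitted search object (or `search.best_estimator_`) can be used to predict directly. The test data is never touched during tuning.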

# use this cell to train models for verlander
# use this cell to train models for wheeler

To submit your models to the autograder, you will need to serialize them. In the following cell, replace each ______ with the model you have found for Verlander and Wheeler, respectively.

dump(______, "v_model.joblib")
dump(______, "w_model.joblib")

After you run this cell, two files will be written to the same folder as this notebook. Submit both files to the autograder. See the relevant question in the relevant lab on PrairieLearn.
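If you want to convince yourself that serialization preserves a model, a round trip with `dump` and `load` is a quick check. This is a sketch with a throwaway decision tree fit to four made-up pitches, written to a temporary directory so that it does not interfere with the files you will submit. The name `toy_model.joblib` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# four made-up pitches: [speed in mph, spin in rpm]
X = np.array([[95.0, 2450.0], [94.0, 2500.0], [85.0, 1750.0], [84.5, 1800.0]])
y = np.array(["4-Seam Fastball", "4-Seam Fastball", "Changeup", "Changeup"])
toy_model = DecisionTreeClassifier(random_state=42).fit(X, y)

# write the model to disk, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_model.joblib"
    dump(toy_model, path)
    restored = load(path)

# the restored model should make the same predictions as the original
same_predictions = bool((restored.predict(X) == toy_model.predict(X)).all())
print(same_predictions)
```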

Discussion

# use this cell to create and print any supporting statistics
# use this cell to create any supporting graphics
# if you would like to include additional graphics,
# please add additional cells and create one graphic per cell

Graded discussion: Would you put the Verlander model in production? Justify your answer. Would you put the Wheeler model in production? Justify your answer.
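When justifying a production decision, overall accuracy can hide weak performance on a rarely thrown pitch. The sketch below uses hand-written true and predicted labels (not real model output) to show one way of looking at per-pitch-type recall alongside accuracy. `confusion_matrix` is an extra import beyond those listed at the top of the lab.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["4-Seam Fastball", "Changeup", "Slider"]

# hand-written labels for illustration only
y_true = ["4-Seam Fastball"] * 4 + ["Changeup"] * 2 + ["Slider"] * 4
y_pred = ["4-Seam Fastball"] * 4 + ["Changeup", "Slider"] + ["Slider"] * 4

accuracy = accuracy_score(y_true, y_pred)

# rows are true pitch types, columns are predicted pitch types
cm = confusion_matrix(y_true, y_pred, labels=labels)

# recall per pitch type: correct predictions divided by true count
recall_by_type = dict(zip(labels, cm.diagonal() / cm.sum(axis=1)))

print(accuracy)
print(recall_by_type)
```

In this toy case the overall accuracy looks good, yet only half of the changeups are identified correctly.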

Ungraded discussion: Do you think these classifiers predict fast enough to work in real time? How could you improve these classifiers? (Hint: You might need more data.) Why did we model the two pitchers with different models? Why didn’t we use the same model for both pitchers?
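To get a rough feel for prediction speed, you could time single-pitch predictions. The sketch below does so for a k-nearest neighbors model fit to 2000 synthetic pitches. The model, the data, and the number of repeats are all assumptions, and the timing will vary from machine to machine, so treat the result as a ballpark figure only.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# synthetic pitches: [speed in mph, spin in rpm], with an arbitrary labeling rule
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(90, 5, 2000), rng.normal(2300, 300, 2000)])
y = np.where(X[:, 0] > 90, "4-Seam Fastball", "Changeup")
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# time repeated predictions for a single new pitch
new_pitch = np.array([[94.6, 2500.0]])
n_repeats = 200
start = time.perf_counter()
for _ in range(n_repeats):
    model.predict(new_pitch)
elapsed = time.perf_counter() - start

ms_per_prediction = 1000 * elapsed / n_repeats
print(f"{ms_per_prediction:.3f} ms per prediction")
```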

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Be sure that you have added your name at the top of this notebook.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.