Lab 08: MLB Swing Probability Model
Introduction and Data
Goal: The goal of this lab is to develop a model to estimate the probability that an MLB batter swings at a pitch thrown by a particular pitcher.
To do this, you’ll need to import, at least, the following:
# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
You are free to import additional packages and modules as you see fit, and you will almost certainly need to.
For this lab, we will focus on a model for a specific pitcher, Zac Gallen.
Note that we are not providing a test dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)
In this lab, the train-test split is done within the 2023 MLB season. That is, the training data is Zac Gallen pitches that occurred between opening day (2023-03-30) and the trade deadline (2023-08-31). The test data (which you could but should not obtain) covers the remainder of the season, from September 1 (2023-09-01) to the final day of the World Series (2023-11-01).
We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the day of the trade deadline, then possibly used to make baseball decisions for the remainder of the season.
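Since the autograder's split is by date, a validation split inside the training data can follow the same idea: fit on earlier games and validate on later ones. The sketch below is only an illustration on a tiny made-up frame; `split_by_date`, `pitches`, and the cutoff date are hypothetical names and values, not part of the lab. Note that `game_date` is present in the raw `gallen_pitches_train` frame but is not among the columns kept in `pitch_cols`.

```python
import pandas as pd

# hypothetical stand-in for the raw training frame: one row per pitch
pitches = pd.DataFrame(
    {
        "game_date": pd.to_datetime(
            [
                "2023-04-01",
                "2023-04-01",
                "2023-05-10",
                "2023-06-15",
                "2023-07-20",
                "2023-08-28",
            ]
        ),
        "swing": [1, 0, 1, 0, 1, 0],
    }
)


def split_by_date(df, cutoff):
    """Split into pitches before the cutoff date and pitches on or after it."""
    cutoff = pd.Timestamp(cutoff)
    earlier = df[df["game_date"] < cutoff]
    later = df[df["game_date"] >= cutoff]
    return earlier, later


# the cutoff here is arbitrary and chosen only for illustration
fit_part, val_part = split_by_date(pitches, "2023-07-01")
```

A model fit on `fit_part` and checked on `val_part` is evaluated on pitches that come strictly later in time, which mimics how the test data relates to the training data.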
Because Statcast data can change at any moment (it is constantly being revised and improved), we provide a snapshot of the data for use in this lab.
gallen_pitches_train = pd.read_csv("https://cs307.org/lab/lab-08/data/gallen_pitches_train.csv")
gallen_pitches_train
 | pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | fld_score | post_away_score | post_home_score | post_bat_score | post_fld_score | if_fielding_alignment | of_fielding_alignment | spin_axis | delta_home_win_exp | delta_run_exp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FC | 2023-08-28 | 92.6 | -2.76 | 5.81 | Gallen, Zac | 683737 | 668678 | single | hit_into_play | ... | 4 | 4 | 6 | 6 | 4 | NaN | Standard | 195.0 | 0.013 | 0.159 |
1 | CH | 2023-08-28 | 86.3 | -2.87 | 5.66 | Gallen, Zac | 683737 | 668678 | NaN | ball | ... | 4 | 4 | 6 | 6 | 4 | NaN | Standard | 226.0 | 0.000 | 0.077 |
2 | CH | 2023-08-28 | 87.9 | -2.83 | 5.68 | Gallen, Zac | 683737 | 668678 | NaN | ball | ... | 4 | 4 | 6 | 6 | 4 | NaN | Standard | 224.0 | 0.000 | 0.034 |
3 | KC | 2023-08-28 | 82.4 | -2.70 | 5.78 | Gallen, Zac | 683737 | 668678 | NaN | swinging_strike | ... | 4 | 4 | 6 | 6 | 4 | Standard | Standard | 32.0 | 0.000 | -0.031 |
4 | FC | 2023-08-28 | 91.0 | -2.64 | 5.81 | Gallen, Zac | 683737 | 668678 | NaN | ball | ... | 4 | 4 | 6 | 6 | 4 | NaN | Standard | 189.0 | 0.000 | 0.025 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2659 | FF | 2023-03-30 | 93.4 | -2.64 | 5.99 | Gallen, Zac | 518692 | 668678 | NaN | called_strike | ... | 1 | 1 | 0 | 0 | 1 | Standard | Standard | 206.0 | 0.000 | -0.026 |
2660 | FC | 2023-03-30 | 87.3 | -2.91 | 5.86 | Gallen, Zac | 605141 | 668678 | strikeout | swinging_strike | ... | 1 | 1 | 0 | 0 | 1 | Standard | Standard | 113.0 | -0.023 | -0.173 |
2661 | KC | 2023-03-30 | 84.7 | -2.91 | 5.87 | Gallen, Zac | 605141 | 668678 | NaN | swinging_strike | ... | 1 | 1 | 0 | 0 | 1 | Standard | Standard | 35.0 | 0.000 | -0.055 |
2662 | FF | 2023-03-30 | 94.3 | -2.67 | 6.04 | Gallen, Zac | 605141 | 668678 | NaN | ball | ... | 1 | 1 | 0 | 0 | 1 | Standard | Standard | 201.0 | 0.000 | 0.028 |
2663 | FF | 2023-03-30 | 94.4 | -2.86 | 6.02 | Gallen, Zac | 605141 | 668678 | NaN | called_strike | ... | 1 | 1 | 0 | 0 | 1 | Standard | Standard | 209.0 | 0.000 | -0.038 |
2664 rows × 92 columns
We’ll subset to a list of variables that we believe are relevant for our purposes.
# define required columns for a swing probability model
pitch_cols = [
    # fully pitcher controlled
    "pitch_name",
    # mostly pitcher controlled
    "release_extension",
    "release_pos_x",
    "release_pos_y",
    "release_pos_z",
    # somewhat pitcher controlled
    "release_speed",
    "release_spin_rate",
    "spin_axis",
    "plate_x",
    "plate_z",
    # downstream from pitcher controlled
    "pfx_x",
    "pfx_z",
    # situational information
    "balls",
    "strikes",
    "on_3b",
    "on_2b",
    "on_1b",
    "outs_when_up",
    # fixed batter information
    "stand",
    "sz_top",
    "sz_bot",
    # pitch outcome (to be engineered to swing or not)
    "description",
]
While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing.
Fully Pitcher Controlled
This variable is fully controlled by the pitcher.
pitch_name
: The name of the pitch type to be thrown.
Mostly Pitcher Controlled
These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc.
release_extension
: Release extension of pitch in feet as tracked by Statcast.
release_pos_x
: Horizontal Release Position of the ball measured in feet from the catcher’s perspective.
release_pos_y
: Release position of pitch measured in feet from the catcher’s perspective.
release_pos_z
: Vertical Release Position of the ball measured in feet from the catcher’s perspective.
Somewhat Pitcher Controlled
These variables are in some sense controlled by the pitcher, but less so than the previous. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance.
release_speed
: Velocity of the pitch thrown.
release_spin_rate
: Spin rate of pitch tracked by Statcast.
spin_axis
: The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.
plate_x
: Horizontal position of the ball when it crosses home plate from the catcher’s perspective.
plate_z
: Vertical position of the ball when it crosses home plate from the catcher’s perspective.
Downstream Pitcher Controlled
These variables are pitch characteristics, and may be somewhat controlled by the pitcher, but are largely functions of the previous variables.
pfx_x
: Horizontal movement in feet from the catcher’s perspective.
pfx_z
: Vertical movement in feet from the catcher’s perspective.
Situational Information
These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here, like score and inning, just for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, and are thus much more likely to swing.
balls
: Pre-pitch number of balls in count.
strikes
: Pre-pitch number of strikes in count.
on_3b
: Pre-pitch MLB Player Id of Runner on 3B.
on_2b
: Pre-pitch MLB Player Id of Runner on 2B.
on_1b
: Pre-pitch MLB Player Id of Runner on 1B.
outs_when_up
: Pre-pitch number of outs.
Fixed Batter Information
These variables give some information about the batter facing the pitch. In particular, whether they are a righty or lefty, and the size of their strike zone, which is a function of their height.
stand
: Side of the plate batter is standing.
sz_top
: Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.
sz_bot
: Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.
Pitch Outcome
This variable contains the “outcome” of each pitch, which we will engineer to be whether or not the batter swung.
description
: Description of the resulting pitch. A “swing” includes all outcomes other than ball, blocked_ball, called_strike, and hit_by_pitch.
The following two functions are used to pre-process the data. These were also applied to the test data. Comments within the functions describe their functionality.
def process_pitches(df, cols):
    # subset to relevant columns
    df = df[cols]
    # remove sinkers due to extremely low data
    # this is arbitrary, but also done by Statcast
    # (copy so the assignments below modify an independent data frame)
    df = df[df["pitch_name"] != "Sinker"].copy()
    # change baserunner variables to be 0-1 indicators
    # currently they contain NaN for no runner or an MLBAM key to indicate a specific player on base
    df[["on_1b", "on_2b", "on_3b"]] = df[["on_1b", "on_2b", "on_3b"]].notnull().astype(int)
    # engineer the swing variable from the description variable
    df["swing"] = df["description"].apply(
        lambda x: 0 if x in ["ball", "blocked_ball", "called_strike", "hit_by_pitch"] else 1
    )
    return df
def get_X_y(df):
    # create the X feature data frame
    X = df.drop(columns=["swing", "description"])
    # create the y target series
    y = df["swing"]
    return X, y
gallen_pitches_train_processed = process_pitches(gallen_pitches_train, pitch_cols)
X_train, y_train = get_X_y(gallen_pitches_train_processed)
X_train
 | pitch_name | release_extension | release_pos_x | release_pos_y | release_pos_z | release_speed | release_spin_rate | spin_axis | plate_x | plate_z | ... | pfx_z | balls | strikes | on_3b | on_2b | on_1b | outs_when_up | stand | sz_top | sz_bot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cutter | 6.6 | -2.76 | 53.86 | 5.81 | 92.6 | 2376.0 | 195.0 | -0.09 | 2.79 | ... | 0.97 | 3 | 1 | 0 | 0 | 0 | 1 | L | 3.15 | 1.52 |
1 | Changeup | 6.8 | -2.87 | 53.74 | 5.66 | 86.3 | 1511.0 | 226.0 | -1.47 | 1.84 | ... | 0.40 | 2 | 1 | 0 | 0 | 0 | 1 | L | 3.13 | 1.56 |
2 | Changeup | 6.7 | -2.83 | 53.82 | 5.68 | 87.9 | 1570.0 | 224.0 | -1.52 | 2.38 | ... | 0.84 | 1 | 1 | 0 | 0 | 0 | 1 | L | 3.12 | 1.51 |
3 | Knuckle Curve | 6.7 | -2.70 | 53.78 | 5.78 | 82.4 | 2398.0 | 32.0 | 0.20 | 1.04 | ... | -0.91 | 1 | 0 | 0 | 0 | 0 | 1 | L | 3.15 | 1.52 |
4 | Cutter | 6.7 | -2.64 | 53.83 | 5.81 | 91.0 | 2427.0 | 189.0 | 0.89 | 1.65 | ... | 1.00 | 0 | 0 | 0 | 0 | 0 | 1 | L | 3.12 | 1.51 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2659 | 4-Seam Fastball | 6.8 | -2.64 | 53.75 | 5.99 | 93.4 | 2411.0 | 206.0 | 0.59 | 2.91 | ... | 1.45 | 0 | 0 | 0 | 0 | 0 | 1 | L | 3.50 | 1.81 |
2660 | Cutter | 6.3 | -2.91 | 54.19 | 5.86 | 87.3 | 2541.0 | 113.0 | 1.38 | 1.73 | ... | -0.01 | 1 | 2 | 0 | 0 | 0 | 0 | R | 3.19 | 1.48 |
2661 | Knuckle Curve | 6.4 | -2.91 | 54.13 | 5.87 | 84.7 | 2539.0 | 35.0 | 0.81 | 0.11 | ... | -0.97 | 1 | 1 | 0 | 0 | 0 | 0 | R | 3.19 | 1.48 |
2662 | 4-Seam Fastball | 6.4 | -2.67 | 54.13 | 6.04 | 94.3 | 2531.0 | 201.0 | 1.05 | 1.79 | ... | 1.52 | 0 | 1 | 0 | 0 | 0 | 0 | R | 3.03 | 1.48 |
2663 | 4-Seam Fastball | 6.3 | -2.86 | 54.19 | 6.02 | 94.4 | 2504.0 | 209.0 | 0.21 | 2.02 | ... | 1.58 | 0 | 0 | 0 | 0 | 0 | 0 | R | 3.19 | 1.48 |
2653 rows × 21 columns
Summary Statistics (Graded Work)
What summary statistics should be calculated? See the relevant assignment on PrairieLearn! You will want to work with the gallen_pitches_train_processed data frame to do so.
Model Training (Graded Work)
For this lab, you will need to train two separate but related models.
Probability Model: Train a supervised model (likely a classifier) to predict swing from the other variables. However, this model will not be evaluated on its ability to classify swing versus no swing. Instead, we will directly assess its ability to estimate the probability of a swing. Thus, you need a well-calibrated model.
The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier. We also provide a function to produce a calibration plot for a given model.
def plot_calibration_plot(y_true, y_pred_prob):
    # generate "data" for calibration plot
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_pred_prob, n_bins=15, pos_label=1
    )
    # plot the calibration curve
    plt.plot(
        mean_predicted_value,
        fraction_of_positives,
        "s-",
        label="Learned Classifier",
        color="#1E3877",
    )
    # plot the diagonal "perfect" line
    plt.plot([0, 1], [0, 1], "--", label="Perfectly calibrated", color="#F5821E")
    # set the plot title and axis labels
    plt.title("Calibration Plot")
    plt.xlabel("Mean Predicted Value")
    plt.ylabel("Fraction of Positives")
    # add a grid
    plt.grid(True, which="both", color="grey", linewidth=0.5)
    # show the legend
    plt.legend()
    # show the plot
    plt.show()
In the autograder, we will use two metrics to assess your submitted model:
- Expected Calibration Error (ECE): This is essentially how far on average the points on the plot are from the “perfect” line.
- Maximum Calibration Error (MCE): This is essentially the furthest any point on the plot is from the “perfect” line.
We do not recommend worrying about calculating these. Instead, use calibration plots to get a rough sense of them before submitting to the autograder.
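For readers who want to see what these two metrics look like in code, here is a minimal sketch using one common definition: predictions are grouped into equal-width probability bins, ECE is the bin-size-weighted average gap between observed swing rate and mean predicted probability, and MCE is the largest such gap. The function name `calibration_errors`, the use of 15 bins, and the binning scheme are assumptions; the autograder's exact computation is not specified here and may differ.

```python
import numpy as np


def calibration_errors(y_true, y_prob, n_bins=15):
    """Return (ECE, MCE) using equal-width bins on [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # digitizing against the interior edges yields bin ids 0, ..., n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # gap between observed frequency and mean predicted probability
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight by the share of observations in the bin
        mce = max(mce, gap)
    return ece, mce


# toy example: two bins, each off from the diagonal by 0.1
ece, mce = calibration_errors([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
```

In the toy example both bins miss the diagonal by 0.1, so both metrics are approximately 0.1.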
For further reference:
Novelty Detector: The second model you will train is an unsupervised novelty detector. It should be fit to the training features only. We will test how many observations it flags as novel (outliers) in the test data. Your detector should detect at least one novel observation in the test data, but flag no more than 5% of the observations. Use 1 for inliers and -1 for outliers, as is the default in sklearn.
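One way such a detector could be assembled is sketched below, using IsolationForest inside a pipeline that passes through a few numeric columns and drops the rest. The data is synthetic, and `make_pitches`, `X_fit`, `X_new`, the column selection, and `contamination=0.01` are all assumptions made for illustration. Other sklearn detectors that return 1 and -1 from predict would fit the same pattern.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
num_cols = ["release_speed", "plate_x", "plate_z"]


def make_pitches(n):
    """Generate a synthetic frame (NOT the lab data) with a few X_train-like columns."""
    return pd.DataFrame(
        {
            "pitch_name": rng.choice(["Cutter", "Changeup"], size=n),
            "release_speed": rng.normal(90.0, 3.0, size=n),
            "plate_x": rng.normal(0.0, 0.8, size=n),
            "plate_z": rng.normal(2.5, 0.8, size=n),
        }
    )


X_fit = make_pitches(500)  # plays the role of the training features
X_new = make_pitches(200)  # plays the role of unseen data

detector = Pipeline(
    [
        # keep only the numeric columns; the pipeline still starts from the full frame
        ("select", ColumnTransformer([("num", "passthrough", num_cols)], remainder="drop")),
        # contamination sets the share of training points treated as outliers
        ("forest", IsolationForest(contamination=0.01, random_state=0)),
    ]
)
detector.fit(X_fit)

flags = detector.predict(X_new)  # 1 for inliers, -1 for outliers
flagged_fraction = np.mean(flags == -1)
```

Checking `flagged_fraction` on a held-out portion of the training data gives a rough sense of whether the detector is likely to stay under the 5% limit, although the test data may behave differently.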
For this lab, you may train models however you’d like! The only rules are:
Probability Model:
- Models must start from the given training data, unmodified.
  - Importantly, the type and shape of X_train and y_train should not be changed.
  - The number of features can and should be modified via a pipeline, but the pipeline must start from the given X_train.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your serialized probability model must be less than 5MB.
  - Be aware: some models use more disk space than others and CalibratedClassifierCV will increase the size of your models!
Novelty Detector:
- Models must start from the given training data, unmodified.
  - Importantly, the type and shape of X_train should not be changed.
  - The number of features can be modified via a pipeline, but the pipeline must start from the given X_train.
- Your model must have a predict method.
- Your serialized model must be less than 5MB.
  - Be aware: some models use more disk space than others!
Also, the size of the probability model plus the size of the novelty detector must be less than 5MB.
You will submit your chosen models to an autograder for checking. It will calculate your models’ performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against test data multiple times. But you know this is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you’ll only need to submit once!
# use this cell to train probability models
# use this cell to train novelty detectors
To submit your models to the autograder, you will need to serialize them. In the following cell, replace ______ with the model you have found. Notice that we are using compress=9 to help reduce the size of your models. In practice, this makes reading and writing to disk slower; however, we get the benefit of smaller serialized models.
dump(______, "swing_probability.joblib", compress=9)
dump(______, "swing_novelty.joblib", compress=9)
After you run this cell, two files will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the relevant lab on PrairieLearn.
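Because of the size limits above, it can help to check the serialized file size and confirm the model reloads before submitting. The sketch below does this for a small placeholder model written to a temporary directory. The file name `demo_model.joblib` and the toy data are hypothetical, and treating 1 MB as 10^6 bytes is an assumption about how the autograder measures size.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# small placeholder model fit on toy data (NOT the lab data)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")  # hypothetical file name
    dump(model, path, compress=9)
    # size on disk in megabytes, assuming 1 MB = 10**6 bytes
    size_mb = os.path.getsize(path) / 1e6
    # reload to confirm the file round-trips
    reloaded = load(path)

same_predictions = np.array_equal(model.predict(X_toy), reloaded.predict(X_toy))
```

The same check applied to swing_probability.joblib and swing_novelty.joblib shows whether each file, and their combined size, stays under the limits.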
Discussion
# use this cell to create and print any supporting statistics
Graded discussion: You likely needed to transform some of the X variables. If you did, which and why? How did you consider balls and strikes? As numeric or categorical? Why? If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why? Or, can you think of a different use for this model? In the context of this setup, why might the novelty detector be useful?
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Be sure that you have added your name at the top of this notebook.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.