Lab 08: MLB Swing Probability Model

Author

Your Name Here

Published

November 3, 2023

Introduction and Data

Goal: The goal of this lab is to develop a model to estimate the probability that an MLB batter swings at a pitch thrown by a particular pitcher.

To do this, you’ll need to import, at least, the following:

# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

You are free to import additional packages and modules as you see fit, and you will almost certainly need to.

For this lab, we will focus on a model for a specific pitcher, Zac Gallen.

Note that we are not providing a test dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)

In this lab, the train-test split is done within the 2023 MLB Season. That is, the training data is Zac Gallen pitches that occurred between opening day (2023-03-30) and the trade deadline (2023-08-31). The test data (which you could, but should not, obtain) covers the remainder of the season, from September 1 (2023-09-01) to the final day of the World Series (2023-11-01).

We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the day of the trade deadline, then possibly used to make baseball decisions for the remainder of the season.

Because Statcast data can change at any moment (it is constantly being corrected and improved), we provide a snapshot of the data for use in this lab.
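Since the test period is held out by the autograder, one way to build a validation set is to mimic this forward-in-time split inside the training window. A minimal sketch, using a toy data frame standing in for gallen_pitches_train (the cutoff date here is an arbitrary, illustrative choice):

```python
import pandas as pd

# toy stand-in for gallen_pitches_train: the real data has a game_date column
toy = pd.DataFrame({
    "game_date": ["2023-03-30", "2023-05-15", "2023-07-04", "2023-08-20", "2023-08-28"],
    "release_speed": [93.4, 92.1, 88.7, 86.3, 92.6],
})
toy["game_date"] = pd.to_datetime(toy["game_date"])

# hold out the final weeks of the training window as a validation set,
# mirroring the forward-in-time train/test split used by the autograder
cutoff = pd.Timestamp("2023-08-01")
fit_part = toy[toy["game_date"] < cutoff]
val_part = toy[toy["game_date"] >= cutoff]

print(len(fit_part), len(val_part))  # 3 2
```

Validating on the most recent pitches better matches how the model will actually be tested than a random split would.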

gallen_pitches_train = pd.read_csv("https://cs307.org/lab/lab-08/data/gallen_pitches_train.csv")
gallen_pitches_train
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events description ... fld_score post_away_score post_home_score post_bat_score post_fld_score if_fielding_alignment of_fielding_alignment spin_axis delta_home_win_exp delta_run_exp
0 FC 2023-08-28 92.6 -2.76 5.81 Gallen, Zac 683737 668678 single hit_into_play ... 4 4 6 6 4 NaN Standard 195.0 0.013 0.159
1 CH 2023-08-28 86.3 -2.87 5.66 Gallen, Zac 683737 668678 NaN ball ... 4 4 6 6 4 NaN Standard 226.0 0.000 0.077
2 CH 2023-08-28 87.9 -2.83 5.68 Gallen, Zac 683737 668678 NaN ball ... 4 4 6 6 4 NaN Standard 224.0 0.000 0.034
3 KC 2023-08-28 82.4 -2.70 5.78 Gallen, Zac 683737 668678 NaN swinging_strike ... 4 4 6 6 4 Standard Standard 32.0 0.000 -0.031
4 FC 2023-08-28 91.0 -2.64 5.81 Gallen, Zac 683737 668678 NaN ball ... 4 4 6 6 4 NaN Standard 189.0 0.000 0.025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2659 FF 2023-03-30 93.4 -2.64 5.99 Gallen, Zac 518692 668678 NaN called_strike ... 1 1 0 0 1 Standard Standard 206.0 0.000 -0.026
2660 FC 2023-03-30 87.3 -2.91 5.86 Gallen, Zac 605141 668678 strikeout swinging_strike ... 1 1 0 0 1 Standard Standard 113.0 -0.023 -0.173
2661 KC 2023-03-30 84.7 -2.91 5.87 Gallen, Zac 605141 668678 NaN swinging_strike ... 1 1 0 0 1 Standard Standard 35.0 0.000 -0.055
2662 FF 2023-03-30 94.3 -2.67 6.04 Gallen, Zac 605141 668678 NaN ball ... 1 1 0 0 1 Standard Standard 201.0 0.000 0.028
2663 FF 2023-03-30 94.4 -2.86 6.02 Gallen, Zac 605141 668678 NaN called_strike ... 1 1 0 0 1 Standard Standard 209.0 0.000 -0.038

2664 rows × 92 columns

We’ll subset to a list of variables that we believe are relevant for our purposes.

# define required columns for a swing probability model
pitch_cols = [
    # full pitcher controlled
    "pitch_name",
    # mostly pitcher controlled
    "release_extension",
    "release_pos_x",
    "release_pos_y",
    "release_pos_z",
    # somewhat pitcher controlled
    "release_speed",
    "release_spin_rate",
    "spin_axis",
    "plate_x",
    "plate_z",
    # downstream from pitcher controlled
    "pfx_x",
    "pfx_z",
    # situational information
    "balls",
    "strikes",
    "on_3b",
    "on_2b",
    "on_1b",
    "outs_when_up",
    # fixed batter information
    "stand",
    "sz_top",
    "sz_bot",
    # pitch outcome (to be engineered to swing or not)
    "description",
]

While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing.

Fully Pitcher Controlled

This variable is fully controlled by the pitcher.

  • pitch_name: The name of the pitch type to be thrown.

Mostly Pitcher Controlled

These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc.

  • release_extension: Release extension of pitch in feet as tracked by Statcast.
  • release_pos_x: Horizontal Release Position of the ball measured in feet from the catcher’s perspective.
  • release_pos_y: Release position of pitch measured in feet from the catcher’s perspective.
  • release_pos_z: Vertical Release Position of the ball measured in feet from the catcher’s perspective.

Somewhat Pitcher Controlled

These variables are in some sense controlled by the pitcher, but less so than the previous. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance.

  • release_speed: Velocity of the pitch thrown.
  • release_spin_rate: Spin rate of pitch tracked by Statcast.
  • spin_axis: The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.
  • plate_x: Horizontal position of the ball when it crosses home plate from the catcher’s perspective.
  • plate_z: Vertical position of the ball when it crosses home plate from the catcher’s perspective.

Downstream Pitcher Controlled

These variables are pitch characteristics that may be somewhat controlled by the pitcher, but are largely functions of the previous variables.

  • pfx_x: Horizontal movement in feet from the catcher’s perspective.
  • pfx_z: Vertical movement in feet from the catcher’s perspective.

Situational Information

These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here like score and inning, just for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, and thus are much more likely to swing.

  • balls: Pre-pitch number of balls in count.
  • strikes: Pre-pitch number of strikes in count.
  • on_3b: Pre-pitch MLB Player Id of Runner on 3B.
  • on_2b: Pre-pitch MLB Player Id of Runner on 2B.
  • on_1b: Pre-pitch MLB Player Id of Runner on 1B.
  • outs_when_up: Pre-pitch number of outs.

Fixed Batter Information

These variables give some information about the batter facing the pitch: in particular, whether they bat right- or left-handed, and the size of their strike zone, which is a function of their height.

  • stand: Side of the plate the batter stands on.
  • sz_top: Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.
  • sz_bot: Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.

Pitch Outcome

This variable contains the “outcome” of each pitch, which we will engineer to be whether or not the batter swung.

  • description: Description of the resulting pitch. A “swing” includes all outcomes other than: ball, blocked_ball, called_strike, hit_by_pitch.

The following two functions are used to pre-process the data. These were also applied to the test data. Comments within the functions describe their functionality.

def process_pitches(df, cols):
    # subset to relevant columns
    df = df[cols]
    # remove sinkers due to extremely low data
    # this is arbitrary, but also done by Statcast
    # copy to avoid a pandas SettingWithCopyWarning on the later assignments
    df = df[df["pitch_name"] != "Sinker"].copy()
    # change baserunner variables to be 0-1 indicators
    # currently they contain NaN for no runner or an MLBAM key to indicate a specific player on base
    df[["on_1b", "on_2b", "on_3b"]] = df[["on_1b", "on_2b", "on_3b"]].notnull().astype(int)
    # engineer the swing variable from the description variable
    df["swing"] = df["description"].apply(
        lambda x: 0 if x in ["ball", "blocked_ball", "called_strike", "hit_by_pitch"] else 1
    )
    return df


def get_X_y(df):
    # create the X feature data frame
    X = df.drop(columns=["swing", "description"])
    # create the y target series
    y = df["swing"]
    return X, y
gallen_pitches_train_processed = process_pitches(gallen_pitches_train, pitch_cols)
X_train, y_train = get_X_y(gallen_pitches_train_processed)
X_train
pitch_name release_extension release_pos_x release_pos_y release_pos_z release_speed release_spin_rate spin_axis plate_x plate_z ... pfx_z balls strikes on_3b on_2b on_1b outs_when_up stand sz_top sz_bot
0 Cutter 6.6 -2.76 53.86 5.81 92.6 2376.0 195.0 -0.09 2.79 ... 0.97 3 1 0 0 0 1 L 3.15 1.52
1 Changeup 6.8 -2.87 53.74 5.66 86.3 1511.0 226.0 -1.47 1.84 ... 0.40 2 1 0 0 0 1 L 3.13 1.56
2 Changeup 6.7 -2.83 53.82 5.68 87.9 1570.0 224.0 -1.52 2.38 ... 0.84 1 1 0 0 0 1 L 3.12 1.51
3 Knuckle Curve 6.7 -2.70 53.78 5.78 82.4 2398.0 32.0 0.20 1.04 ... -0.91 1 0 0 0 0 1 L 3.15 1.52
4 Cutter 6.7 -2.64 53.83 5.81 91.0 2427.0 189.0 0.89 1.65 ... 1.00 0 0 0 0 0 1 L 3.12 1.51
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2659 4-Seam Fastball 6.8 -2.64 53.75 5.99 93.4 2411.0 206.0 0.59 2.91 ... 1.45 0 0 0 0 0 1 L 3.50 1.81
2660 Cutter 6.3 -2.91 54.19 5.86 87.3 2541.0 113.0 1.38 1.73 ... -0.01 1 2 0 0 0 0 R 3.19 1.48
2661 Knuckle Curve 6.4 -2.91 54.13 5.87 84.7 2539.0 35.0 0.81 0.11 ... -0.97 1 1 0 0 0 0 R 3.19 1.48
2662 4-Seam Fastball 6.4 -2.67 54.13 6.04 94.3 2531.0 201.0 1.05 1.79 ... 1.52 0 1 0 0 0 0 R 3.03 1.48
2663 4-Seam Fastball 6.3 -2.86 54.19 6.02 94.4 2504.0 209.0 0.21 2.02 ... 1.58 0 0 0 0 0 0 R 3.19 1.48

2653 rows × 21 columns
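Notice that X_train mixes categorical columns (pitch_name, stand) with numeric columns, some of which contain missing values, so most classifiers will need some preprocessing. One possible sketch, using a toy frame in place of the real X_train (the column groupings and transformer choices here are illustrative, not required):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy stand-in for X_train with one categorical and two numeric features
X_toy = pd.DataFrame({
    "pitch_name": ["Cutter", "Changeup", "Cutter", "Knuckle Curve"],
    "release_speed": [92.6, 86.3, np.nan, 82.4],
    "plate_z": [2.79, 1.84, 1.65, 1.04],
})

categorical = ["pitch_name"]
numeric = ["release_speed", "plate_z"]

# one-hot encode categoricals; impute then scale numerics
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
])

Xt = preprocess.fit_transform(X_toy)
print(Xt.shape)  # (4, 5): 3 one-hot columns + 2 numeric columns
```

A classifier appended after this ColumnTransformer in a Pipeline would satisfy the rule that models start from the given X_train unmodified.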

Summary Statistics (Graded Work)

What summary statistics should be calculated? See the relevant assignment on PrairieLearn! You will want to work with the gallen_pitches_train_processed data frame to do so.

Model Training (Graded Work)

For this lab, you will need to train two separate but related models.

Probability Model: Train a supervised model (likely a classifier) to predict swing from the other variables. However, this model will not be evaluated on its ability to classify a pitch as swing or not. Instead, we will directly assess its ability to estimate the probability of a swing. Thus, you need a well-calibrated model.

The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier. We also provide a function to produce a calibration plot for a given model.

def plot_calibration_plot(y_true, y_pred_prob):
    # generate "data" for calibration plot
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_pred_prob, n_bins=15, pos_label=1
    )

    # plot the calibration curve
    plt.plot(
        mean_predicted_value,
        fraction_of_positives,
        "s-",
        label="Learned Classifier",
        color="#1E3877",
    )

    # plot the diagonal "perfect" line
    plt.plot([0, 1], [0, 1], "--", label="Perfectly calibrated", color="#F5821E")

    # set the plot title and axis labels
    plt.title("Calibration Plot")
    plt.xlabel("Mean Predicted Value")
    plt.ylabel("Fraction of Positives")

    # add a grid
    plt.grid(True, which="both", color="grey", linewidth=0.5)

    # show the legend
    plt.legend()

    # show the plot
    plt.show()
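As a rough sketch of how CalibratedClassifierCV might be used (on synthetic data, with an arbitrary base classifier; your real pipeline must start from the given X_train and y_train):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data; the real model would start from X_train, y_train
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# wrap a base classifier so its scores are mapped to calibrated probabilities
base = RandomForestClassifier(n_estimators=50, random_state=42)
model = CalibratedClassifierCV(base, method="isotonic", cv=5)
model.fit(X, y)

# calibrated probability of the positive class
proba = model.predict_proba(X)[:, 1]
print(proba.min() >= 0.0 and proba.max() <= 1.0)  # True
```

Note that method="sigmoid" is an alternative to "isotonic" that is less flexible but less prone to overfitting with limited data.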

In the autograder, we will use two metrics to assess your submitted model:

  • Expected Calibration Error (ECE): This is essentially how far, on average, the points on the plot are from the “perfect” line.
  • Maximum Calibration Error (MCE): This is essentially the furthest any point on the plot is from the “perfect” line.

We do not recommend worrying about calculating these. Instead, use calibration plots to get a rough sense of them before submitting to the autograder.
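That said, if you want a rough numeric check before submitting, a simple unweighted approximation of ECE and MCE can be read off the calibration curve. (The true ECE weights bins by their counts, so treat this only as a sanity check.)

```python
import numpy as np
from sklearn.calibration import calibration_curve

# toy labels and predicted probabilities, roughly calibrated by construction
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)
y_true = (rng.uniform(size=1000) < y_prob).astype(int)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=15)

# rough ECE/MCE: average and max gap between the curve and the diagonal
gaps = np.abs(frac_pos - mean_pred)
ece_approx = gaps.mean()
mce_approx = gaps.max()
print(ece_approx <= mce_approx)  # True
```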


Novelty Detector: The second model you will train is an unsupervised novelty detector. It should be fit to the training features only. We will test how many observations it flags as novel (outliers) in the test data. Your detector should detect at least one novel observation in the test data, but flag no more than 5% of the observations. Use 1 for inliers and -1 for outliers, as is the default in sklearn.
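One possible sketch of such a detector, using IsolationForest on synthetic data (the choice of detector and the contamination value are illustrative assumptions; alternatives include OneClassSVM or LocalOutlierFactor with novelty=True, and the real model would likely need a pipeline that encodes the categorical columns first):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic stand-in for the training features
rng = np.random.default_rng(1)
X_fit = rng.normal(size=(500, 5))

# fit the detector to training features only
detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(X_fit)

# score new observations: 1 = inlier, -1 = novel/outlier
X_new = np.vstack([rng.normal(size=(99, 5)), np.full((1, 5), 8.0)])
flags = detector.predict(X_new)
print(np.sum(flags == -1) >= 1)  # True: the extreme point is flagged
```

The contamination parameter roughly controls the fraction of observations flagged, which is useful for staying under the 5% requirement.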

For this lab, you may train models however you’d like! The only rules are:

Probability Model:

  • Models must start from the given training data, unmodified.
    • Importantly, the type and shape of X_train and y_train should not be changed.
    • The number of features can and should be modified via a pipeline, but the pipeline must start from the given X_train.
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
  • Your serialized probability model must be less than 5MB.
    • Be aware: some models use more disk space than others and CalibratedClassifierCV will increase the size of your models!

Novelty Detector:

  • Models must start from the given training data, unmodified.
    • Importantly, the type and shape of X_train should not be changed.
    • The number of features can be modified via a pipeline, but the pipeline must start from the given X_train.
  • Your model must have a predict method.
  • Your serialized model must be less than 5MB.
    • Be aware: some models use more disk space than others!

Also, the size of the probability model plus the size of the novelty detector must be less than 5MB.


You will submit your chosen models to an autograder for checking. It will calculate your models' performance on the test data. Notice that, because you have unlimited attempts, this somewhat encourages checking against the test data multiple times, which you know is bad practice. Also, if you use cross-validation to find a good model before submitting, hopefully you'll only need to submit once!

# use this cell to train probability models
# use this cell to train novelty detectors

To submit your models to the autograder, you will need to serialize them. In the following cell, replace _____ with the model you have found. Notice that we are using compress=9 to help reduce the size of your models. In practice, this makes reading and writing to disk slower, but we get the benefit of smaller serialized models.

dump(______, "swing_probability.joblib", compress=9)
dump(______, "swing_novelty.joblib", compress=9)

After you run this cell, two files will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the lab on PrairieLearn.
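Before uploading, it can be worth verifying that the serialized files are under the size limit and load back cleanly. A sketch using a toy model in place of your real ones (the filename here is hypothetical):

```python
import os

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# toy model standing in for the real swing probability model
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

dump(model, "toy_model.joblib", compress=9)

# sanity checks: under the 5MB limit, and loads back intact
size_mb = os.path.getsize("toy_model.joblib") / 1e6
reloaded = load("toy_model.joblib")
print(size_mb < 5.0)  # True
os.remove("toy_model.joblib")
```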

Discussion

# use this cell to create and print any supporting statistics

Graded discussion: You likely needed to transform some of the X variables. If you did, which and why? How did you consider balls and strikes? As numeric or categorical? Why? If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why? Or, can you think of a different use for this model? In the context of this setup, why might the novelty detector be useful?

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Be sure that you have added your name at the top of this notebook.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.