import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Lab 00: System Setup and MLB Pitching Data
System Setup
Welcome to the first CS 307 lab! This lab is not part of your grade, but instead an opportunity to test your computer setup, and practice the process of completing and submitting labs.
Before you begin, please read the Lab Policy document. This document contains directions for completing and submitting your lab, the grading rubric, and deadline information.
One of the goals of this lab is to verify that your system is set up to tackle the course. If you encounter issues along the way, please let us know during discussion! We have reserved time at the end of the first discussion to help work through individual issues.
If you’ve taken STAT 107 and STAT 207, the above cell should run without issue. When you run the cell, you may be asked to select a Python interpreter. Select Python 3.11 if available; otherwise, select the most recent version you see.
First let’s check that you are using a recent enough version of Python. Later in the course, as we begin using more packages, we may need to use a particular version of Python, but for now, we won’t be picky.
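As a minimal sketch of this kind of check, the helper below reports the version it found instead of failing with a bare AssertionError. The name `python_is_recent_enough` and its `min_minor` parameter are hypothetical and not part of the lab; the lab's own check is the pair of assertions in the next cell.

```python
import sys

def python_is_recent_enough(version_info=sys.version_info, min_minor=9):
    """Return True for Python 3 with a minor version of at least min_minor."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor >= min_minor

# Report the running interpreter's version in either case.
found = f"{sys.version_info.major}.{sys.version_info.minor}"
if python_is_recent_enough():
    print(f"Python {found} is recent enough for this lab.")
else:
    print(f"Python {found} is too old; this lab expects Python 3.9 or newer.")
```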
assert sys.version_info.major == 3
assert sys.version_info.minor >= 9
f"If you see this message, you're good to go!"
"If you see this message, you're good to go!"
You’ll need to be able to install additional packages throughout CS 307. In this lab, we’ll use the pybaseball package to obtain some data. To install this package, first open a terminal in VSCode, then run the following:
python3 -m pip install pybaseball
If this fails, consider trying the following, in order:
pip3 install pybaseball
pip install pybaseball
python -m pip install pybaseball
py -m pip install pybaseball
However, if the initial command does not work, please alert the course staff. We may be able to improve your Python installation.
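A common cause of trouble is that pip installed the package into a different Python than the one the notebook is running. The sketch below, which uses only the standard library, prints the active interpreter and checks whether a package is visible to it. The helper name `is_installed` is hypothetical.

```python
import importlib.util
import sys

def is_installed(package_name):
    """Return True if the running interpreter can find the named package."""
    return importlib.util.find_spec(package_name) is not None

# The interpreter running this cell; pip must install into this same one.
print("Interpreter:", sys.executable)

for name in ["pandas", "pybaseball"]:
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package shows as missing here after a successful install, the install most likely went to another interpreter, which is worth mentioning when you ask the course staff for help.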
Once you’ve completed this setup, you should be able to run the following cell.
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
If you’re running a recent version of Python, can install packages, and can run code cells of a Jupyter Notebook in VSCode, you’re all set! There is one wrinkle, which we’ll tackle a bit later.
Tutorial
To further test your setup, we’ll acquire some baseball data and produce a plot.
The playerid_lookup function from the pybaseball package was imported above. We can use this function to find the “key” for a particular player. In this tutorial and lab, we’ll focus on pitchers. To demonstrate, we’ll start with Lance Lynn.
The function returns a Pandas data frame with many columns, including keys for other systems, but we are only interested in key_mlbam, which is a unique identifier used by MLB Advanced Media. We promptly discard all columns except for the name columns and the relevant key.
Note that the function performs a “fuzzy” search and will return multiple rows if there is not a unique or exact match. This may be useful for you later.
playerid_lookup("lynn", "lance")[['name_last', 'name_first', 'key_mlbam']]
Gathering player lookup table. This may take a moment.
| | name_last | name_first | key_mlbam |
|---|---|---|---|
| 0 | lynn | lance | 458681 |
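To get a feel for what a “fuzzy” search means, here is a small sketch using difflib from the standard library. It is an illustration of the general idea only, not how pybaseball implements its lookup, and the list of names is typed in by hand for the example.

```python
import difflib

# A small hand-typed list of names, for illustration only.
known_names = ["lance lynn", "lance mccullers", "lance berkman"]

def closest_names(query, candidates, n=3):
    """Return up to n candidates ordered from most to least similar."""
    return difflib.get_close_matches(query.lower(), candidates, n=n, cutoff=0.0)

# A misspelled query still ranks the intended name first.
matches = closest_names("Lance Lin", known_names)
print(matches)
```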
So, the key that we need is 458681. To download Lance Lynn’s pitching data, we will use the statcast_pitcher function, also from the pybaseball package.
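Rather than copying the key by hand, you can pull it out of the lookup result. The sketch below uses a stand-in data frame so that it runs without a network connection. The first row copies the table above, the second row is made up to mimic a non-unique match, and the names `lookup`, `exact`, and `lynn_key` are illustrative.

```python
import pandas as pd

# Stand-in for a playerid_lookup result: the first row matches the table
# above; the second row is invented to mimic a non-unique match.
lookup = pd.DataFrame({
    "name_last": ["lynn", "lynn"],
    "name_first": ["lance", "example"],
    "key_mlbam": [458681, 0],
})

# Keep only the exact match, then pull the key out as a plain integer.
exact = lookup[(lookup["name_last"] == "lynn") & (lookup["name_first"] == "lance")]
lynn_key = int(exact["key_mlbam"].iloc[0])
print(lynn_key)
```

Filtering before taking the first row matters when the lookup returns several players, which can happen for common names.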
Statcast is a term used for data produced by MLB Advanced Media and published on MLB’s Baseball Savant. One method to obtain data from this service is to use the so-called Statcast Search. However, the pybaseball package provides the ability to directly obtain data from within Python.
Documentation for this data can be found here: Statcast Search CSV Documentation
In baseball, a pitcher throws the ball and a batter attempts to hit it. In that sense, a pitch is the smallest unit of play. As such, a key functionality of pybaseball is the ability to get information about each pitch. There are many ways to do so, but for our purposes, we’ll use the statcast_pitcher function to obtain every pitch thrown by Lance Lynn between March 30, 2023 and August 23, 2023.
lynn_pitches = statcast_pitcher("2023-03-30", "2023-08-23", 458681)
Gathering Player Data
Conveniently, this function returns a Pandas data frame, so we can inspect it using methods you already know!
lynn_pitches.shape
(2491, 92)
This data contains 2491 rows (pitches) and 92 columns! That’s 92 measurements or categorizations of each pitch! To see a list of the column names, use lynn_pitches.columns. For our purposes, we will only be interested in:
- game_date: The date of the game in which the pitch was thrown.
- pitch_type: The short code for the type of pitch thrown.
- pitch_name: The full name for the type of pitch thrown.
- release_speed: The velocity of the pitch, in miles per hour.
- release_spin_rate: The rate of rotation of the pitch, in revolutions per minute.
pitch_cols = [
    "game_date",
    "pitch_type",
    "pitch_name",
    "release_speed",
    "release_spin_rate"
]
lynn_pitches = lynn_pitches[pitch_cols]
Now that we’ve created a more reasonable subset, let’s take a look.
lynn_pitches
| | game_date | pitch_type | pitch_name | release_speed | release_spin_rate |
|---|---|---|---|---|---|
| 0 | 2023-08-17 | FC | Cutter | 89.1 | 2452.0 |
| 1 | 2023-08-17 | FF | 4-Seam Fastball | 90.9 | 2462.0 |
| 2 | 2023-08-17 | CU | Curveball | 81.2 | 2536.0 |
| 3 | 2023-08-17 | CU | Curveball | 81.6 | 2484.0 |
| 4 | 2023-08-17 | CU | Curveball | 81.8 | 2515.0 |
| ... | ... | ... | ... | ... | ... |
| 2486 | 2023-03-31 | SI | Sinker | 92.6 | 2240.0 |
| 2487 | 2023-03-31 | FC | Cutter | 87.4 | 2634.0 |
| 2488 | 2023-03-31 | CU | Curveball | 81.4 | 2479.0 |
| 2489 | 2023-03-31 | SI | Sinker | 91.0 | 2248.0 |
| 2490 | 2023-03-31 | FF | 4-Seam Fastball | 92.3 | 2412.0 |
2491 rows × 5 columns
What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:
As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.
Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:
That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:
But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:
Long story short:
- Have advanced tracking technology that can instantly record speed and spin for each pitch.
- Have a trained classifier for pitch type based on speed and spin.
- In real time, make predictions of pitch type as soon as the speed and spin are recorded.
- Display the result in the stadium and on the broadcast!
This is something we’ll do later in this course! (Well, the model training and predictions anyway. We don’t have high speed cameras and tracking devices. At least not yet…) Today, we’ll simply make a plot of the relevant data that will likely convince you that this sort of prediction is possible.
Looking at some summary statistics, you might already notice some strong differences between pitches. You might also notice that Lance Lynn loves fastballs.
agg_cols = {
    'pitch_type': 'count',
    'release_speed': 'mean',
    'release_spin_rate': 'mean'
}
summary_col_names = {
    'release_speed': 'Average Velocity (mph)',
    'release_spin_rate': 'Average Spin (rpm)',
    'pitch_type': 'Count'
}
name_col = {
    'pitch_name': 'Pitch Name'
}
lynn_summary = lynn_pitches.groupby('pitch_name').agg(agg_cols)
lynn_summary['release_speed'] = lynn_summary['release_speed'].round(1)
lynn_summary['release_spin_rate'] = lynn_summary['release_spin_rate'].round(0)
lynn_summary = lynn_summary.rename(columns=summary_col_names)
lynn_summary = lynn_summary.reset_index()
lynn_summary = lynn_summary.rename(columns=name_col)
lynn_summary
| | Pitch Name | Count | Average Velocity (mph) | Average Spin (rpm) |
|---|---|---|---|---|
| 0 | 4-Seam Fastball | 1065 | 92.5 | 2430.0 |
| 1 | Changeup | 174 | 84.9 | 1814.0 |
| 2 | Curveball | 212 | 80.8 | 2507.0 |
| 3 | Cutter | 592 | 88.5 | 2586.0 |
| 4 | Sinker | 310 | 91.2 | 2260.0 |
| 5 | Slider | 133 | 83.1 | 2462.0 |
| 6 | Sweeper | 5 | 81.3 | 2510.0 |
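To put a number on how much Lance Lynn favors the fastball, the counts in the summary table can be turned into usage percentages. This sketch copies the counts by hand so that it runs on its own; with the real data you would compute them from lynn_pitches instead. The names `pitch_counts` and `usage` are illustrative.

```python
# Pitch counts copied from the summary table above.
pitch_counts = {
    "4-Seam Fastball": 1065,
    "Changeup": 174,
    "Curveball": 212,
    "Cutter": 592,
    "Sinker": 310,
    "Slider": 133,
    "Sweeper": 5,
}

# Share of all pitches, as a percentage rounded to one decimal place.
total = sum(pitch_counts.values())
usage = {name: round(100 * count / total, 1) for name, count in pitch_counts.items()}

# Print from most used to least used.
for name, pct in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<16} {pct:5.1f}%")
```

The 4-seam fastball alone accounts for roughly 43% of the 2491 pitches.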
To make these even easier to see, we’ll plot this data.
lynn_plot = sns.jointplot(
    data=lynn_pitches,
    x='release_speed',
    y='release_spin_rate',
    hue='pitch_name'
)
lynn_plot.set_axis_labels('Velocity (mph)', 'Spin (rpm)')
lynn_plot.fig.suptitle('2023 Lance Lynn: Velocity vs Spin', y=1.05)
lynn_plot.ax_joint.legend(title='Pitch Type')
lynn_plot.fig.set_size_inches(8, 8)
Look at that! With only speed and spin, we can already see reasonably strong separation of the pitch types. What else do you notice?
- Do you see the slider that is probably actually a changeup?
- Notice how sliders and curveballs seem similar?
- Notice how cutters and sinkers are close to fastballs, with quite a lot of overlap between sinker and fastball?
In a future lab, we’ll return to this data to make an end-to-end pitch classifier!
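As a rough preview, here is a toy classifier that assigns a pitch to whichever pitch type has the closest average speed and spin. It is only a sketch: it is not the method used on broadcasts and not necessarily what the future lab will use. The averages are copied from the summary table, while the two scale constants and the name `classify_pitch` are assumptions made up for this example.

```python
import math

# Average (velocity in mph, spin in rpm) per pitch, copied from the summary table.
centroids = {
    "4-Seam Fastball": (92.5, 2430.0),
    "Changeup": (84.9, 1814.0),
    "Curveball": (80.8, 2507.0),
    "Cutter": (88.5, 2586.0),
    "Sinker": (91.2, 2260.0),
    "Slider": (83.1, 2462.0),
    "Sweeper": (81.3, 2510.0),
}

# Hand-picked scales (an assumption) so that mph and rpm contribute comparably.
SPEED_SCALE = 5.0
SPIN_SCALE = 250.0

def classify_pitch(speed, spin):
    """Return the pitch name whose average speed and spin are closest."""
    def distance(center):
        return math.hypot((speed - center[0]) / SPEED_SCALE,
                          (spin - center[1]) / SPIN_SCALE)
    return min(centroids, key=lambda name: distance(centroids[name]))

# The first pitch in the table above (89.1 mph, 2452 rpm) was recorded as a Cutter.
print(classify_pitch(89.1, 2452.0))
```

Pitch types with nearly identical averages, such as the curveball and sweeper here, are easy to confuse under a rule this simple, which matches the overlap visible in the plot.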
Graded Work
Repeat the above tutorial to produce a similar plot, but for your favorite pitcher. If you are not an MLB fan, and don’t have a favorite pitcher, we’ll suggest Shohei Ohtani.
Before creating the requested plot, create a Pandas data frame that contains all pitches from the 2023 season for your favorite pitcher. Note that the 2023 MLB season began on March 30. Also, as the season is in progress, you are only responsible for pitches that occurred on or before August 23.
Name this data frame pitches. Include the following columns in this order:
- game_date
- pitch_type
- pitch_name
- release_speed
- release_spin_rate
To ensure that you will produce an interesting graphic, you must choose a pitcher that has thrown at least 1000 pitches during the date range described above.
Similarly, in the tutorial above, the plot for Lance Lynn was named lynn_plot. Name your plot pitches_plot.
# delete these comments and place your code here
# consider adding additional cells to make your work more human readable
# keep any new cells you create in the Graded Work section of this notebook
# hint: use a few cells to obtain and wrangle the data
# hint: use one final cell to produce the requested plot
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.
For Staff Use Only
The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!
# params for testing
col_names = [
    'game_date',
    'pitch_type',
    'pitch_name',
    'release_speed',
    'release_spin_rate'
]
green_check = '\u2705'
std_cols = list(pitches.columns.values)

# testing
assert pitches.shape[0] > 1000, "Your pitcher did not pitch enough!"
assert pitches.shape[1] == 5, "You have not included the correct number of columns."
assert std_cols == col_names, "You have not included the correct columns."
assert pitches_plot is not None, "We cannot seem to find your plot!"
# success
print(f"{green_check} Everything looks good! {green_check}")
NameError: name 'pitches' is not defined