Lab 01: Predicting the Weather
Introduction and Data
For this lab, we will utilize weather data acquired from Open-Meteo. In particular, we will look at historical mean daily temperature data for two locations:
- Champaign, IL
- San Diego, CA
The data for both locations has been pre-processed and can be accessed as a .csv file from the web.
For this lab, we’ll need to import some packages and modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
Use the following code to import these datasets as pandas data frames:
cu_wx = pd.read_csv("https://cs307.org/lab/lab-01/data/cu-wx.csv")
sd_wx = pd.read_csv("https://cs307.org/lab/lab-01/data/sd-wx.csv")
cu_wx.head()
| | date | daily_temperature_2m_mean | yday |
|---|---|---|---|
| 0 | 2020-01-01 | 2.3 | 1 |
| 1 | 2020-01-02 | 6.9 | 2 |
| 2 | 2020-01-03 | 4.6 | 3 |
| 3 | 2020-01-04 | 0.6 | 4 |
| 4 | 2020-01-05 | 1.7 | 5 |
cu_wx.shape, sd_wx.shape
((1096, 3), (1096, 3))
Both datasets have the same shape and contain the same columns. Those columns are:
- date: The date of the temperature measurement formatted as YYYY-MM-DD.
- daily_temperature_2m_mean: The average daily temperature at the specified location in Celsius.
- yday: The “day of the year”, that is, an integer that represents the day of the year, from 1, which is January 1, to 365, which is December 31. (Technically there might be slight issues with leap years, but for simplicity, we will ignore this, as it will have minimal if any effect on our models.)
Both datasets include each day from January 1, 2020 to December 31, 2022.
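As a quick sanity check on the description above, the following sketch rebuilds the date range with pandas and derives a day-of-year column. It does not read the lab's CSV files. The names `dates` and `yday` are illustrative, and it assumes `yday` was computed the way `dt.dayofyear` computes it, which the lab does not confirm.

```python
import pandas as pd

# every calendar day from January 1, 2020 to December 31, 2022
dates = pd.Series(pd.date_range("2020-01-01", "2022-12-31", freq="D"))

# day of the year as pandas computes it (assumed, not confirmed, to match the yday column)
yday = dates.dt.dayofyear

n_days = len(dates)          # 366 + 365 + 365 days
max_yday = int(yday.max())   # 2020 is a leap year, so its last day is numbered 366
print(n_days, max_yday)
```

Under that assumption, the row count matches the 1096 rows reported by `shape`, and December 31, 2020 is the one place where the day of the year exceeds 365, which is the leap-year wrinkle mentioned above.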
Goal: Use k-nearest neighbors to create a model that predicts the daily mean temperature given the day of the year for both locations.
For reasons that we will describe at a later time in the course, we will not perform any meaningful exploratory data analysis until after we have performed a test-train split of the data. So, before we can make any exploratory graphics, we’ll first need to move from pandas data frames to numpy arrays, then do some data splitting. We’ll provide the code to do so in this lab, but in future labs, you will be expected to write similar code.
Because we’ll be working with two datasets at the same time, we will prefix Champaign data with cu_ and San Diego data with sd_.
# create X numpy array for Champaign data
cu_X = cu_wx["yday"].to_numpy()
cu_X = np.reshape(cu_X, (cu_X.shape[0], 1))
# create y numpy array for Champaign data
cu_y = cu_wx["daily_temperature_2m_mean"].to_numpy()
cu_y = np.reshape(cu_y, (cu_y.shape[0], 1))
# create X numpy array for San Diego data
sd_X = sd_wx["yday"].to_numpy()
sd_X = np.reshape(sd_X, (sd_X.shape[0], 1))
# create y numpy array for San Diego data
sd_y = sd_wx["daily_temperature_2m_mean"].to_numpy()
sd_y = np.reshape(sd_y, (sd_y.shape[0], 1))
# create train and test datasets for Champaign data
cu_X_train, cu_X_test, cu_y_train, cu_y_test = train_test_split(
    cu_X, cu_y, test_size=0.20, random_state=42
)
# create validation-train and validation datasets for Champaign data
cu_X_vtrain, cu_X_val, cu_y_vtrain, cu_y_val = train_test_split(
    cu_X_train, cu_y_train, test_size=0.20, random_state=42
)
# create train and test datasets for San Diego data
sd_X_train, sd_X_test, sd_y_train, sd_y_test = train_test_split(
    sd_X, sd_y, test_size=0.20, random_state=42
)
# create validation-train and validation datasets for San Diego data
sd_X_vtrain, sd_X_val, sd_y_vtrain, sd_y_val = train_test_split(
    sd_X_train, sd_y_train, test_size=0.20, random_state=42
)
Whew! That is a lot of boilerplate! As a reminder:
- _train indicates a full train dataset
- _vtrain is a validation-train dataset that we will use to fit models
- _val is a validation dataset that we will use to select models, in this case, select a value of k, based on models that were fit to a validation-train dataset
- _test is a test dataset that we will use to report an estimate of generalization error, after first refitting a chosen model to a full train dataset
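To make the sizes of these four pieces concrete, here is a small sketch that repeats the two-stage split on stand-in arrays with the same number of rows as the lab data. The arrays `X` and `y` hold synthetic values, and the unprefixed variable names are illustrative rather than the lab's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in arrays with the same number of rows as the lab data (values are synthetic)
n = 1096
X = np.arange(n).reshape(-1, 1)
y = np.zeros((n, 1))

# first split: full train versus test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# second split: validation-train versus validation, taken from the full train data only
X_vtrain, X_val, y_vtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

sizes = {
    "train": len(X_train),
    "test": len(X_test),
    "vtrain": len(X_vtrain),
    "val": len(X_val),
}
print(sizes)
```

Because the second split is applied to the full train data, the validation-train and validation sets together make up the full train set. Overall this works out to about 64% validation-train, 16% validation, and 20% test.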
With that all out of the way, let’s finally look at the data! We’ll skip summary statistics, and move right to graphics.
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)
# determine axis limits
ymin = np.min(np.concatenate((cu_y_vtrain, sd_y_vtrain))) - 1
ymax = np.max(np.concatenate((cu_y_vtrain, sd_y_vtrain))) + 1
# add overall title
fig.suptitle('Daily Mean Temperature: 2020 - 2022')
# create subplot for Champaign
ax1.set_title("Champaign")
ax1.scatter(cu_X_vtrain, cu_y_vtrain, color="dodgerblue")
ax1.set_ylim(ymin, ymax)
ax1.set_xlabel("Day of the Year")
ax1.set_ylabel("Average Daily Temperature (Celsius)")
ax1.grid(True, linestyle='--', color='lightgrey')
# create subplot for San Diego
ax2.set_title("San Diego")
ax2.scatter(sd_X_vtrain, sd_y_vtrain, color="dodgerblue")
ax2.set_ylim(ymin, ymax)
ax2.set_xlabel("Day of the Year")
ax2.set_ylabel("Average Daily Temperature (Celsius)")
ax2.grid(True, linestyle='--', color='lightgrey')
# show plot
plt.show()
Notice that we took the time to make sure both subplots had the same y-axis limits. (Yes, taking the time to manually label everything is extra work, but it is a very good habit to establish.) Had we not, the San Diego weather would have looked more variable than it truly is. Also, if you didn’t know already, San Diego has nicer weather than Champaign…
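As an aside, matplotlib can also link the y-axes for you. The sketch below is an alternative to computing `ymin` and `ymax` by hand, not the approach this lab uses. The two series are synthetic stand-ins for a location with a large seasonal swing and one with a small swing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# synthetic temperatures: one series with a large seasonal swing, one with a small swing
rng = np.random.default_rng(0)
x = np.arange(1, 366)
wide = 10 + 15 * np.sin(2 * np.pi * (x - 100) / 365) + rng.normal(0, 4, x.size)
narrow = 17 + 4 * np.sin(2 * np.pi * (x - 100) / 365) + rng.normal(0, 1, x.size)

# sharey=True links the y-axes, so both panels use one common range
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.scatter(x, wide, color="dodgerblue")
ax2.scatter(x, narrow, color="dodgerblue")

shared_limits = (ax1.get_ylim(), ax2.get_ylim())
```

With `sharey=True`, the right-hand panel hides its y tick labels by default, which may or may not be what you want in a report.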
Model Training
Our goal essentially reduces to finding a good value of k for each dataset. First, let’s define and restrict ourselves to certain values of k that we will consider.
k_values = [1, 5, 10, 25, 50, 100, 150, 200, 300, 365]
Next, let’s calculate the validation RMSE for each possible value of k for the Champaign data. To do so, replace each instance of _____ in the following cell with the appropriate code. (The length of _____ does not indicate the length of the code to replace.)
cu_val_rmse = []
for k in k_values:  # loop through potential values of k
    knn = KNeighborsRegressor(n_neighbors=_____)  # define model based on current k
    knn.fit(_____, _____)  # fit model to the validation-train data
    pred = knn.predict(_____)  # make predictions with validation data
    rmse = np.sqrt(mean_squared_error(_____, _____))  # calculate validation RMSE
    cu_val_rmse.append(rmse)  # store RMSE
Two things to note here:
- Eventually, we will expect you to be able to write loops like this from scratch.
- In practice, we will eventually see a more systematic way to do this with sklearn, but this approach is more instructive at the moment.
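For intuition about the two ingredients in the loop above, here is a hand-rolled sketch of k-nearest neighbors regression with one feature, plus RMSE, using only numpy. The helpers `knn_predict` and `rmse` are hypothetical, written for illustration, and are not part of sklearn. The toy data is made up.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Average the targets of the k training points closest to x_new (hypothetical helper)."""
    distances = np.abs(X_train[:, 0] - x_new)
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(np.mean(y_train[nearest]))

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average squared difference."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# toy data, not the lab data
X_toy = np.array([[1], [2], [3], [10], [11]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 22.0])

pred_k1 = knn_predict(X_toy, y_toy, x_new=9, k=1)  # only x=10 is used
pred_k3 = knn_predict(X_toy, y_toy, x_new=9, k=3)  # x=10, x=11, and the distant x=3 are used
toy_rmse = rmse([20.0, 6.0], [pred_k1, 7.0])
```

With k = 1 the prediction at x = 9 is the target of the single closest point. With k = 3 the far-away point at x = 3 is averaged in, which pulls the prediction down. That trade-off between following nearby points and averaging over many is what the choice of k controls.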
If you’ve done the above correctly, then the following cell will not raise an error. Additionally, cu_k will store the “best” value of k, that is, the value of k that obtains the smallest validation RMSE.
cu_k = k_values[np.argmin(cu_val_rmse)]
assert cu_k == 5
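The selection step relies on `np.argmin` returning the position of the smallest value, which is then used to index the list of candidates. A minimal sketch with made-up numbers (the names `candidate_k` and `made_up_rmse` are illustrative, and the RMSE values are not results from the lab):

```python
import numpy as np

candidate_k = [1, 5, 10, 25]          # hypothetical candidate values of k
made_up_rmse = [3.2, 2.1, 2.1, 2.8]   # made-up validation RMSEs, including a tie

best_index = int(np.argmin(made_up_rmse))  # position of the first smallest value
best_k = candidate_k[best_index]
print(best_index, best_k)
```

Note that when two candidates tie, `np.argmin` returns the first one, so with an increasing list of candidates the smaller k wins the tie.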
In the following cell, repeat the training procedure for the San Diego data. Instead of cu_val_rmse, collect the results in sd_val_rmse.
# delete this comment and place your code here
Again, if you’ve done the above correctly, then the following cell will not raise an error. Additionally, sd_k will store the “best” value of k, that is, the value of k that obtains the smallest validation RMSE.
sd_k = k_values[np.argmin(sd_val_rmse)]
assert sd_k == 10
Let’s plot validation RMSE against k values for both Champaign and San Diego.
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)
# determine axis limits
ymin = np.min(np.concatenate((cu_val_rmse, sd_val_rmse))) - 1
ymax = np.max(np.concatenate((cu_val_rmse, sd_val_rmse))) + 1
# add overall title
fig.suptitle('Validation Results')
# create subplot for Champaign
ax1.set_title("Champaign")
ax1.scatter(x=k_values, y=cu_val_rmse, color="dodgerblue")
ax1.plot(k_values, cu_val_rmse, color="dodgerblue")
ax1.set_ylim(ymin, ymax)
ax1.set_xlabel("k (Number of Neighbors)")
ax1.set_ylabel("RMSE")
ax1.grid(True, linestyle='--', color='lightgrey')
# create subplot for San Diego
ax2.set_title("San Diego")
ax2.scatter(x=k_values, y=sd_val_rmse, color="dodgerblue")
ax2.plot(k_values, sd_val_rmse, color="dodgerblue")
ax2.set_ylim(ymin, ymax)
ax2.set_xlabel("k (Number of Neighbors)")
ax2.set_ylabel("RMSE")
ax2.grid(True, linestyle='--', color='lightgrey')
# show plot
plt.show()
Neither is a perfect “U” shape, but notice for both, RMSE goes down, then back up. Much, much, much more on that soon!
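One way to see why both ends of the k range tend to do poorly is to look at the two extremes on a small example. The sketch below uses made-up toy data, not the lab data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data, not the lab data
X_toy = np.arange(1, 11).reshape(-1, 1)
y_toy = np.array([2.0, 3.5, 5.0, 8.0, 12.0, 13.0, 9.0, 6.0, 4.0, 2.5])

# k = 1: each training point is its own nearest neighbor,
# so predictions on the training data reproduce the training targets
knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)
train_pred_k1 = knn_1.predict(X_toy)

# k = n: every prediction averages all training targets,
# so the model predicts the same value everywhere
knn_n = KNeighborsRegressor(n_neighbors=len(X_toy)).fit(X_toy, y_toy)
pred_kn = knn_n.predict(np.array([[1], [5], [10]]))
```

At k = 1 the model follows every point, noise included. At k equal to the number of training points it ignores x entirely and returns the overall mean. Values in between trade off these two behaviors.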
Fill in the blanks (_____) in the following cell to make predictions on the test set, after refitting the selected value of k for the Champaign model to the full train data.
cu_knn_final = KNeighborsRegressor(n_neighbors=_____)
cu_knn_final.fit(_____, _____)
cu_y_test_pred = cu_knn_final.predict(_____)
Fill in the blanks (_____) in the following cell to make predictions on the test set, after refitting the selected value of k for the San Diego model to the full train data.
sd_knn_final = KNeighborsRegressor(n_neighbors=_____)
sd_knn_final.fit(_____, _____)
sd_y_test_pred = sd_knn_final.predict(_____)
Fill in the blanks (_____) in the following cell to calculate test RMSE for both the Champaign and San Diego models.
cu_test_rmse = np.sqrt(mean_squared_error(_____, _____))
sd_test_rmse = np.sqrt(mean_squared_error(_____, _____))
Recreate the initial exploratory graphic, but this time, for the test data. Add a “line” or more specifically “a wiggly step function line” to each subplot (one for Champaign, one for San Diego) that represents the predicted value of the selected models for each possible “x” value. Use the cell below and follow the structure indicated by the comments.
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)
# determine axis limits
## delete this comment and place code here
# add overall title
## delete this comment and place code here
# create subplot for Champaign
## delete this comment and place code here
# create subplot for San Diego
## delete this comment and place code here
# add selected models
x = np.linspace(1, 365, 1000)
## delete this comment and place code here
# show plot
plt.show()
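The “wiggly step function” description reflects how k-nearest neighbors predicts: the prediction only changes when the set of nearest neighbors changes, so over a dense grid of x values it takes a limited number of distinct levels. A minimal sketch on made-up toy data (not the lab data, with illustrative variable names):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data, not the lab data
X_toy = np.array([[1], [2], [3], [4], [5]])
y_toy = np.array([10.0, 20.0, 15.0, 30.0, 25.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)

# predict on a dense grid; sklearn expects a 2D array, hence the reshape
x_grid = np.linspace(1, 5, 401).reshape(-1, 1)
grid_pred = knn.predict(x_grid)

# with k = 2 on a line, the neighbors are always two adjacent training points,
# so the predictions can only be averages of adjacent target pairs
distinct_levels = np.unique(grid_pred)
print(distinct_levels)
```

Plotting `grid_pred` against the grid would show flat segments with jumps between them rather than a smooth curve.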
For both selected models, create a “Predicted versus Actual” plot. Use the cell below and follow the structure indicated by the comments. Be sure to add a line with intercept 0 and slope 1.
# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)
# determine axis limits
## delete this comment and place code here
# add overall title
## delete this comment and place code here
# create subplot for Champaign
## delete this comment and place code here
# create subplot for San Diego
## delete this comment and place code here
# show plot
plt.show()
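For reference, the general shape of a “Predicted versus Actual” plot with a reference line of intercept 0 and slope 1 can be sketched on made-up numbers. This is a single-panel illustration under assumed names (`actual`, `predicted`, `lo`, `hi`), not the two-panel figure the lab asks for, and the values are not model output.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

actual = np.array([1.0, 4.0, 6.5, 9.0, 12.0])     # made-up values
predicted = np.array([1.5, 3.0, 7.0, 9.5, 11.0])  # made-up values

# common limits so the reference line spans the whole plot
lo = min(actual.min(), predicted.min()) - 1
hi = max(actual.max(), predicted.max()) + 1

fig, ax = plt.subplots()
ax.scatter(actual, predicted, color="dodgerblue")
ax.plot([lo, hi], [lo, hi], color="black", linestyle="--")  # line with intercept 0, slope 1
ax.set_xlim(lo, hi)
ax.set_ylim(lo, hi)
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
```

Points on the dashed line are predicted exactly. Points above it are over-predicted and points below it are under-predicted.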
Use the following markdown cell to answer the following prompts:
- Given day of the year, which city is more predictable? (Why?)
- Would you feel comfortable using either of these models in practice? (Why?)
- Either way, can you highlight any issues with these models?
- Can you suggest any improvements? Hint: Think about time.
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.
For Staff Use Only
The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!
# params for testing
green_check = '\u2705'
cu_test_rmse_check = abs(cu_test_rmse - 4.6501718443320120) < 0.0001
sd_test_rmse_check = abs(sd_test_rmse - 1.6440436567970287) < 0.0001
# testing
assert cu_k == 5, "You selected the wrong k for Champaign."
assert sd_k == 10, "You selected the wrong k for San Diego"
assert len(cu_val_rmse), "Wrong number of Champaign validation RMSEs."
assert len(sd_val_rmse), "Wrong number of San Diego validation RMSEs."
assert cu_test_rmse_check, "Incorrect Champaign test RMSE."
assert sd_test_rmse_check, "Incorrect San Diego test RMSE."
# success
print(f"{green_check} Everything looks good! {green_check}")