Lab 01: Predicting the Weather

Author

Your Name Here

Published

September 1, 2023

Introduction and Data

For this lab, we will use weather data from Open-Meteo. In particular, we will look at historical mean daily temperature data for two locations:

  • Champaign, IL
  • San Diego, CA

The data for both locations has been pre-processed and can be accessed as a .csv file from the web.

For this lab, we’ll need to import some packages and modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

Use the following code to import these datasets as pandas data frames:

cu_wx = pd.read_csv("https://cs307.org/lab/lab-01/data/cu-wx.csv")
sd_wx = pd.read_csv("https://cs307.org/lab/lab-01/data/sd-wx.csv")
cu_wx.head()
         date  daily_temperature_2m_mean  yday
0  2020-01-01                        2.3     1
1  2020-01-02                        6.9     2
2  2020-01-03                        4.6     3
3  2020-01-04                        0.6     4
4  2020-01-05                        1.7     5
cu_wx.shape, sd_wx.shape
((1096, 3), (1096, 3))

Both datasets have the same shape and contain the same columns. Those columns are:

  • date: The date of the temperature measurement formatted as YYYY-MM-DD.
  • daily_temperature_2m_mean: The average daily temperature at the specified location in Celsius.
  • yday: The “day of the year”, that is, an integer that represents the day of the year, from 1, which is January 1, to 365, which is December 31. (Technically there might be slight issues with leap years, but for simplicity, we will ignore this, as it will have minimal if any effect on our models.)

Both datasets include each day from January 1, 2020 to December 31, 2022.
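
As a quick sanity check on the row count and the leap-year caveat, the sketch below rebuilds the same calendar range with pandas. It does not touch the lab data, and the names `dates` and `yday` are used only for illustration.

```python
import pandas as pd

# every calendar day from January 1, 2020 to December 31, 2022
dates = pd.date_range("2020-01-01", "2022-12-31", freq="D")

# day of the year for each date (1 = January 1)
yday = dates.dayofyear

print(len(dates))  # number of days in the range
print(yday.max())  # largest day-of-year value in the range
```

Because 2020 is a leap year, the range contains 366 + 365 + 365 = 1096 days, which matches the number of rows in each dataset. If yday in the lab data was computed in the same way (an assumption), it will reach 366 on December 31, 2020.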

Goal: Use k-nearest neighbors to create a model that predicts the daily mean temperature given the day of the year for both locations.
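
To make the idea behind k-nearest neighbors regression concrete, here is a minimal sketch on made-up toy data (not the weather data). The names `X_toy`, `y_toy`, and `knn_toy` are hypothetical. With the default settings, the prediction at a point is the average of the y values of its k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: five x values with made-up y values
X_toy = np.array([[1], [2], [3], [4], [5]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# fit a 3-nearest neighbors regressor
knn_toy = KNeighborsRegressor(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# the three nearest neighbors of x = 2 are x = 1, 2, and 3
pred = knn_toy.predict(np.array([[2]]))

# average the y values of those three neighbors by hand
by_hand = np.mean([10.0, 20.0, 30.0])

print(pred[0], by_hand)
```

Both printed values should agree, since the model simply averages the three nearest y values.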

For reasons that we will describe at a later time in the course, we will not perform any meaningful exploratory data analysis until after we have performed a test-train split of the data. So, before we can make any exploratory graphics, we’ll first need to move from pandas data frames to numpy arrays, then do some data splitting. We’ll provide the code to do so in this lab, but in future labs, you will be expected to write similar code.

Because we’ll be working with two datasets at the same time, we will preface Champaign data with cu_ and San Diego data with sd_.

# create X numpy array for Champaign data
cu_X = cu_wx["yday"].to_numpy()
cu_X = np.reshape(cu_X, (cu_X.shape[0], 1))

# create y numpy array for Champaign data
cu_y = cu_wx["daily_temperature_2m_mean"].to_numpy()
cu_y = np.reshape(cu_y, (cu_y.shape[0], 1))
# create X numpy array for San Diego data
sd_X = sd_wx["yday"].to_numpy()
sd_X = np.reshape(sd_X, (sd_X.shape[0], 1))

# create y numpy array for San Diego data
sd_y = sd_wx["daily_temperature_2m_mean"].to_numpy()
sd_y = np.reshape(sd_y, (sd_y.shape[0], 1))
# create train and test datasets for Champaign data
cu_X_train, cu_X_test, cu_y_train, cu_y_test = train_test_split(
    cu_X, cu_y, test_size=0.20, random_state=42
)
# create validation-train and validation datasets for Champaign data
cu_X_vtrain, cu_X_val, cu_y_vtrain, cu_y_val = train_test_split(
    cu_X_train, cu_y_train, test_size=0.20, random_state=42
)
# create train and test datasets for San Diego data
sd_X_train, sd_X_test, sd_y_train, sd_y_test = train_test_split(
    sd_X, sd_y, test_size=0.20, random_state=42
)
# create validation-train and validation datasets for San Diego data
sd_X_vtrain, sd_X_val, sd_y_vtrain, sd_y_val = train_test_split(
    sd_X_train, sd_y_train, test_size=0.20, random_state=42
)

Whew! That is a lot of boilerplate! As a reminder:

  • _train indicates a full train dataset
  • _vtrain is a validation-train dataset that we will use to fit models
  • _val is a validation dataset that we will use to select a model (in this case, a value of k) from among the models that were fit to a validation-train dataset
  • _test is a test dataset that we will use to report an estimate of generalization error, after first refitting a chosen model to a full train dataset
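
To see how the two-stage split divides the rows, the sketch below repeats the same pair of train_test_split calls on placeholder arrays. The arrays `X_demo` and `y_demo` are stand-ins with the same number of rows as the lab data, not the weather data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder arrays with the same number of rows as the lab data
X_demo = np.arange(1096).reshape(-1, 1)
y_demo = np.zeros(1096)

# first split: full train versus test
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# second split: validation-train versus validation
X_vtrain, X_val, y_vtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

print(len(X_train), len(X_test))
print(len(X_vtrain), len(X_val))
```

Roughly 80% of all rows end up in the full train data, and roughly 64% (80% of 80%) are used for fitting during validation. No row appears in more than one of the validation-train, validation, and test sets.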

With that all out of the way, let’s finally look at the data! We’ll skip summary statistics, and move right to graphics.

# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)

# determine axis limits
ymin = np.min(np.concatenate((cu_y_vtrain, sd_y_vtrain))) - 1
ymax = np.max(np.concatenate((cu_y_vtrain, sd_y_vtrain))) + 1

# add overall title
fig.suptitle('Daily Mean Temperature: 2020 - 2022')

# create subplot for Champaign
ax1.set_title("Champaign")
ax1.scatter(cu_X_vtrain, cu_y_vtrain, color="dodgerblue")
ax1.set_ylim(ymin, ymax)
ax1.set_xlabel("Day of the Year")
ax1.set_ylabel("Average Daily Temperature (Celsius)")
ax1.grid(True, linestyle='--', color='lightgrey')

# create subplot for San Diego
ax2.set_title("San Diego")
ax2.scatter(sd_X_vtrain, sd_y_vtrain, color="dodgerblue")
ax2.set_ylim(ymin, ymax)
ax2.set_xlabel("Day of the Year")
ax2.set_ylabel("Average Daily Temperature (Celsius)")
ax2.grid(True, linestyle='--', color='lightgrey')

# show plot
plt.show()

Notice that we took the time to make sure both subplots have the same y-axis limits. Otherwise, the San Diego weather would have looked more variable than it truly is. (Yes, taking the time to manually set limits and label everything is extra work, but it is a very good habit to establish.) Also, if you didn’t know already, San Diego has nicer weather than Champaign…
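
As an aside, matplotlib can also enforce common limits through the sharey argument of plt.subplots. The sketch below is one possible alternative to setting the limits manually, not the approach used in this lab, and it uses made-up data with hypothetical names (`wide`, `narrow`).

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up values with very different amounts of spread
rng = np.random.default_rng(0)
x_demo = np.arange(1, 101)
wide = rng.normal(loc=10, scale=10, size=100)
narrow = rng.normal(loc=18, scale=2, size=100)

# sharey=True forces both panels onto the same y-axis limits
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.scatter(x_demo, wide, color="dodgerblue")
ax2.scatter(x_demo, narrow, color="dodgerblue")
ax1.set_title("Wide spread")
ax2.set_title("Narrow spread")
```

Calling plt.show() afterwards displays the figure. Setting the limits manually, as done above, gives more control, for example when adding a margin of one degree on each side.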

Model Training

Our goal essentially reduces to finding a good value of k for each dataset. First, let’s define and restrict ourselves to certain values of k that we will consider.

k_values = [1, 5, 10, 25, 50, 100, 150, 200, 300, 365]

Next, let’s calculate the validation RMSE for each possible value of k for the Champaign data. To do so, replace each instance of _____ in the following cell with the appropriate code. (The length of _____ does not indicate the length of the code to replace.)

cu_val_rmse = []
for k in k_values: # loop through potential values of k
    knn = KNeighborsRegressor(n_neighbors=_____) # define model based on current k
    knn.fit(_____, _____) # fit model to the validation-train data
    pred = knn.predict(_____) # make predictions with validation data
    rmse = np.sqrt(mean_squared_error(_____, _____)) # calculate validation RMSE
    cu_val_rmse.append(rmse) # store RMSE

Two things to note here:

  1. Eventually, we will expect you to be able to write loops like this from scratch.
  2. In practice, we will eventually see a more systematic way to do this with sklearn, but this approach is more instructive at the moment.
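
For reference, the RMSE calculation used in the loop above can be checked by hand. The sketch below uses made-up actual and predicted values (`y_actual` and `y_predicted` are hypothetical names) and compares the sklearn-based calculation with the formula written out directly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 9.0, 9.0])

# RMSE as the square root of sklearn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_actual, y_predicted))

# the same quantity written out directly
rmse_by_hand = np.sqrt(np.mean((y_actual - y_predicted) ** 2))

print(rmse_sklearn, rmse_by_hand)
```

The errors here are 1, 0, -2, and 0, so the mean squared error is 1.25 and both calculations give its square root. RMSE is in the same units as the response, which for this lab is degrees Celsius.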

If you’ve done the above correctly, then the following cell will not raise an error. Additionally, cu_k will store the “best” value of k, that is, the value of k that obtains the smallest validation RMSE.

cu_k = k_values[np.argmin(cu_val_rmse)]
assert cu_k == 5
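
To see how np.argmin picks out the selected value, here is a small sketch with hypothetical candidate values and made-up RMSEs. These are not results from the lab data.

```python
import numpy as np

# hypothetical candidate values of k and made-up validation RMSEs
k_demo = [1, 5, 10]
rmse_demo = [3.2, 2.1, 2.4]

# np.argmin returns the position of the smallest RMSE
best_position = np.argmin(rmse_demo)

# use that position to look up the corresponding value of k
best_k = k_demo[best_position]

print(best_position, best_k)
```

If two values of k were to tie for the smallest RMSE, np.argmin would return the position of the first one.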

In the following cell, repeat the training procedure for the San Diego data. Instead of cu_val_rmse, collect the results in sd_val_rmse.

# delete this comment and place your code here

Again, if you’ve done the above correctly, then the following cell will not raise an error. Additionally, sd_k will store the “best” value of k, that is, the value of k that obtains the smallest validation RMSE.

sd_k = k_values[np.argmin(sd_val_rmse)]
assert sd_k == 10

Let’s plot validation RMSE against k values for both Champaign and San Diego.

# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)

# determine axis limits
ymin = np.min(np.concatenate((cu_val_rmse, sd_val_rmse))) - 1
ymax = np.max(np.concatenate((cu_val_rmse, sd_val_rmse))) + 1

# add overall title
fig.suptitle('Validation Results')

# create subplot for Champaign
ax1.set_title("Champaign")
ax1.scatter(x=k_values, y=cu_val_rmse, color="dodgerblue")
ax1.plot(k_values, cu_val_rmse, color="dodgerblue")
ax1.set_ylim(ymin, ymax)
ax1.set_xlabel("k (Number of Neighbors)")
ax1.set_ylabel("RMSE")
ax1.grid(True, linestyle='--', color='lightgrey')

# create subplot for San Diego
ax2.set_title("San Diego")
ax2.scatter(x=k_values, y=sd_val_rmse, color="dodgerblue")
ax2.plot(k_values, sd_val_rmse, color="dodgerblue")
ax2.set_ylim(ymin, ymax)
ax2.set_xlabel("k (Number of Neighbors)")
ax2.set_ylabel("RMSE")
ax2.grid(True, linestyle='--', color='lightgrey')

# show plot
plt.show()

Neither is a perfect “U” shape, but notice that for both, RMSE goes down, then back up. Much, much, much more on that soon!

Fill in the blanks (_____) in the following cell to make predictions on the test set, after refitting the selected value of k for the Champaign model to the full train data.

cu_knn_final = KNeighborsRegressor(n_neighbors=_____)
cu_knn_final.fit(_____, _____)
cu_y_test_pred = cu_knn_final.predict(_____)

Fill in the blanks (_____) in the following cell to make predictions on the test set, after refitting the selected value of k for the San Diego model to the full train data.

sd_knn_final = KNeighborsRegressor(n_neighbors=_____)
sd_knn_final.fit(_____, _____)
sd_y_test_pred = sd_knn_final.predict(_____)

Fill in the blanks (_____) in the following cell to calculate test RMSE for both the Champaign and San Diego models.

cu_test_rmse = np.sqrt(mean_squared_error(_____, _____))
sd_test_rmse = np.sqrt(mean_squared_error(_____, _____))

Recreate the initial exploratory graphic, but this time, for the test data. Add a “line” or more specifically “a wiggly step function line” to each subplot (one for Champaign, one for San Diego) that represents the predicted value of the selected models for each possible “x” value. Use the cell below and follow the structure indicated by the comments.

# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)

# determine axis limits
## delete this comment and place code here

# add overall title
## delete this comment and place code here

# create subplot for Champaign
## delete this comment and place code here

# create subplot for San Diego
## delete this comment and place code here

# add selected models
x = np.linspace(1, 365, 1000)
## delete this comment and place code here

# show plot
plt.show()
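
To illustrate why the predictions form a step function, the sketch below fits a 1-nearest neighbor model to made-up toy data and predicts over a fine grid. The names `X_toy`, `y_toy`, `x_grid`, and `grid_pred` are hypothetical. Note that np.linspace returns a 1D array, while predict expects a 2D array with one column per feature.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: three x values with made-up y values
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 10.0, 20.0])
knn_toy = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)

# reshape the 1D grid to shape (n_samples, 1) before predicting
x_grid = np.linspace(1, 3, 9)
grid_pred = knn_toy.predict(x_grid.reshape(-1, 1))

# only the observed y values appear, so the curve is piecewise constant
print(np.unique(grid_pred))
```

With larger k, each prediction is an average of k observed values rather than a single one, but it still changes only when the set of nearest neighbors changes, hence the “wiggly step function”.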

For both selected models, create a “Predicted versus Actual” plot. Use the cell below and follow the structure indicated by the comments. Be sure to add a line with intercept 0 and slope 1.

# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)

# determine axis limits
## delete this comment and place code here

# add overall title
## delete this comment and place code here

# create subplot for Champaign
## delete this comment and place code here

# create subplot for San Diego
## delete this comment and place code here

# show plot
plt.show()
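
For the reference line with intercept 0 and slope 1, one option among several is the axline method of a matplotlib axes. The sketch below is a minimal example on made-up values (`actual` and `predicted` are hypothetical names) and is not the required approach.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up actual and predicted values
actual = np.array([1.0, 4.0, 6.0, 9.0])
predicted = np.array([2.0, 3.5, 6.5, 8.0])

fig, ax = plt.subplots()
ax.scatter(actual, predicted, color="dodgerblue")

# reference line through (0, 0) with slope 1
ax.axline((0, 0), slope=1, color="black", linestyle="--")
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
```

Points on the line are predicted perfectly. Points above the line are over-predicted, and points below it are under-predicted.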

Use the markdown cell below to answer the following prompts:

  • Given day of the year, which city is more predictable? (Why?)
  • Would you feel comfortable using either of these models in practice? (Why?)
  • Either way, can you highlight any issues with these models?
  • Can you suggest any improvements? Hint: Think about time.

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.

For Staff Use Only

The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!

# params for testing
green_check = '\u2705'
cu_test_rmse_check = abs(cu_test_rmse - 4.6501718443320120) < 0.0001
sd_test_rmse_check = abs(sd_test_rmse - 1.6440436567970287) < 0.0001

# testing
assert cu_k == 5, "You selected the wrong k for Champaign."
assert sd_k == 10, "You selected the wrong k for San Diego."
assert len(cu_val_rmse) == len(k_values), "Wrong number of Champaign validation RMSEs."
assert len(sd_val_rmse) == len(k_values), "Wrong number of San Diego validation RMSEs."
assert cu_test_rmse_check, "Incorrect Champaign test RMSE."
assert sd_test_rmse_check, "Incorrect San Diego test RMSE."

# success
print(f"{green_check} Everything looks good! {green_check}")