Lab 03: Bias-Variance Tradeoff Simulation

Author

Your Name Here

Published

September 15, 2023

Introduction and Data

For this lab, you will perform a simulation study to help understand the bias-variance tradeoff.

Goal: Perform a simulation study that shows how the n_neighbors parameter for k-nearest neighbors affects the bias and variance of predictions made by the model. Do the same for the max_depth parameter for decision trees.

We’ll import the usual packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

We define the following data generating process that will be used throughout this lab:

\[\begin{align*} X &\sim \text{Unif}(-2\pi, 2\pi) \\ \epsilon &\sim N(\mu = 0, \sigma = 0.5) \\ Y &= \sin(X) + \epsilon \end{align*}\]

That is:

  • Generate \(X\) according to a uniform distribution on the interval from \(-2\pi\) to \(2\pi\).
  • Generate noise, \(\epsilon\) according to a mean-zero normal distribution with a standard deviation of \(0.5\).
  • Generate the signal using \(\sin(X)\), which is also called the true regression function.
    • This is what we are trying to learn.
  • Generate \(Y\) by adding together the signal and the noise.

The following code generates \(200\) samples from this process.

np.random.seed(42)
n = 200
X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n,1))
y = np.sin(X) + np.random.normal(loc=0, scale=0.50, size=(n,1))

We also plot the data to remind ourselves that, for the purposes of this lab, we know the true regression function, even though in practice we would not.

# setup figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(10, 5)
fig.set_dpi(100)

# add overall title
fig.suptitle('Simulated Sine Wave Data')

# x values to make predictions at for plotting purposes
x_plot = np.linspace(-2*np.pi, 2*np.pi, 1000).reshape((1000, 1))

# create subplot for "simulation study"
ax1.set_title("Simulation Study")
ax1.scatter(X, y, color="dodgerblue")
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.grid(True, linestyle='--', color='lightgrey')
# add true regression function, the "signal" that we want to learn
ax1.plot(x_plot, np.sin(x_plot), color='black')

# create subplot for "reality"
ax2.set_title("Reality")
ax2.scatter(X, y, color="dodgerblue")
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.grid(True, linestyle='--', color='lightgrey')

# show plot
plt.show()

Because we will need to repeatedly generate datasets from this process, we write a function to do so.

def simulate_data(n_samples=100):
    X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n_samples, 1))
    y = np.sin(X) + np.random.normal(loc=0, scale=0.5, size=(n_samples, 1))
    return (X, y)

Notice that we have not included any code to set a seed within this function. This is intentional: the simulation study requires repeatedly generating new datasets, so each call to this function should produce a fresh draw from the process.
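To see this in action, two back-to-back calls produce different draws from the same process:

X1, y1 = simulate_data()
X2, y2 = simulate_data()
print(np.allclose(X1, X2))  # False: each call generates a fresh dataset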

For the purpose of this lab, we will be interested in the bias and variance of predictions of the quantity \(f(X) = \sin(X)\) when \(X = \frac{\pi}{2}\).

Recall the definition of the bias of an estimator.

\[ \text{bias}(\hat{\theta}) \triangleq \mathbb{E}\left[\hat{\theta}\right] - \theta \]

Also recall the definition of the variance of an estimator.

\[ \mathbb{V}(\hat{\theta}) = \text{var}(\hat{\theta}) \triangleq \mathbb{E}\left [ ( \hat{\theta} -\mathbb{E}\left[\hat{\theta}\right] )^2 \right] \]

In particular, we will train models (k-nearest neighbors and decision trees) and use predictions from these models to estimate \(\sin\left(\frac{\pi}{2}\right)\). We will call these predictions \(\hat{f}\left(\frac{\pi}{2}\right)\).

Lastly, recall that the mean squared error of estimating \(f(x)\) with some \(\hat{f}(x)\) is given by:

\[ \text{MSE}\left(f(x), \hat{f}(x)\right) = \mathbb{E}_{\mathcal{D}} \left[ \left(f(x) - \hat{f}(x) \right)^2 \right] = \underbrace{\left(f(x) - \mathbb{E} \left[ \hat{f}(x) \right] \right)^2}_{\text{bias}^2 \left(\hat{f}(x) \right)} + \underbrace{\mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E} \left[ \hat{f}(x) \right] \right)^2 \right]}_{\text{var} \left(\hat{f}(x) \right)} \]

So the mean squared error is the bias squared plus the variance.
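One way to verify this decomposition is to add and subtract \(\mathbb{E}\left[\hat{f}(x)\right]\) inside the square and expand; the cross term vanishes because \(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\) is a constant and \(\mathbb{E}\left[\mathbb{E}\left[\hat{f}(x)\right] - \hat{f}(x)\right] = 0\):

\[\begin{align*} \mathbb{E}\left[ \left(f(x) - \hat{f}(x)\right)^2 \right] &= \mathbb{E}\left[ \left( \left(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\right) + \left(\mathbb{E}\left[\hat{f}(x)\right] - \hat{f}(x)\right) \right)^2 \right] \\ &= \left(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2 + \mathbb{E}\left[ \left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2 \right] \end{align*}\]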

Graded Work

Directly calculating bias and variance requires… math. To sidestep this issue, we will instead use simulation.

Consider a specific model, for example KNN with \(k = 5\). How would we use simulation to estimate the bias and variance of using this fitted model’s prediction, \(\hat{f}\left(\frac{\pi}{2}\right)\), as an estimate of the true value \(\sin\left(\frac{\pi}{2}\right)\)? The basic procedure is as follows:

Repeat the following a large number of times, which we will pick to be 10000:

  • Generate a new dataset, X and y. (Hint: Use the function we wrote.)
  • Instantiate and fit the model to this data.
  • Make and store the relevant prediction using the trained model, that is, use this model to predict when \(x = \frac{\pi}{2}\).

After doing this, you will have 10000 predictions from 10000 models, each fit to a different dataset, but from the same data generating process.

To estimate the bias, subtract the true value from the average of these predictions, matching the definition \(\mathbb{E}\left[\hat{\theta}\right] - \theta\).

To estimate the variance, simply take the variance of these predictions.
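As a concrete illustration, here is a minimal sketch of this procedure for a single model, KNN with \(k = 5\). The names n_sims, x_pred, predictions, bias_hat, and var_hat are our own choices, not required by the lab:

n_sims = 10000
x_pred = np.array([[np.pi / 2]])  # the point at which we evaluate predictions
predictions = np.zeros(n_sims)

for i in range(n_sims):
    # generate a fresh dataset from the data generating process
    X_sim, y_sim = simulate_data()
    # instantiate and fit the model to this dataset
    knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
    knn.fit(X_sim, y_sim.ravel())
    # store the prediction at x = pi / 2
    predictions[i] = knn.predict(x_pred)[0]

# bias: average prediction minus the true value
bias_hat = np.mean(predictions) - np.sin(np.pi / 2)
# variance: variance of the stored predictions
var_hat = np.var(predictions)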

Your goal is to perform this process for a k-nearest neighbors model with each of the \(k\) values in the list k_values:

k_values = [1, 10, 25, 50, 100]
k_bias = np.zeros(len(k_values))
k_variance = np.zeros(len(k_values))

Similarly, you will perform this process for a decision tree model with each of the max_depth values in the list depth_values:

depth_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
depth_bias = np.zeros(len(depth_values))
depth_variance = np.zeros(len(depth_values))

For the k-nearest neighbors models, store your estimated biases in the k_bias array and your estimated variances in the k_variance array, in the same order as k_values.

For the decision tree models, store your estimated biases in the depth_bias array and your estimated variances in the depth_variance array, in the same order as depth_values.

When creating k-nearest neighbors models, include algorithm='brute' in any calls to KNeighborsRegressor. Similarly, when creating decision tree models, include random_state=1 to control randomization that can arise when fitting trees.

Hint: For k-nearest neighbors, you’ll want to write a nested for loop, that is, a for loop within a for loop. The outer loop will iterate through the possible values of \(k\), and the inner loop will perform the 10000 simulations for each value of \(k\). Do the same for decision trees.
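For example, the k-nearest neighbors version might be organized as the following skeleton, reusing n_sims and x_pred from the sketch above. Treat it as a starting point to adapt, and mirror the same structure for decision trees with DecisionTreeRegressor(max_depth=depth, random_state=1):

for j, k in enumerate(k_values):
    predictions = np.zeros(n_sims)
    for i in range(n_sims):
        # fresh dataset, fit, predict at x = pi / 2
        X_sim, y_sim = simulate_data()
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        knn.fit(X_sim, y_sim.ravel())
        predictions[i] = knn.predict(x_pred)[0]
    # store the estimates in the same order as k_values
    k_bias[j] = np.mean(predictions) - np.sin(np.pi / 2)
    k_variance[j] = np.var(predictions)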

# delete this comment and use this cell to update k_bias and k_variance
# delete this comment and use this cell to update depth_bias and depth_variance

Plot your results. For both the k-nearest neighbors models and the decision tree models, create a side-by-side plot (a sketch is given after this list):

  • LHS: Plot both the bias squared and the variance against the relevant tuning parameter. Order the tuning parameter from least flexible model to most flexible model.
  • RHS: Plot the mean squared error against the relevant tuning parameter. Order the tuning parameter from least flexible model to most flexible model.
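Here is a minimal sketch of such a side-by-side plot for the k-nearest neighbors results; note that smaller \(k\) means a more flexible model, so the x-axis is inverted to run from least to most flexible:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
fig.suptitle('KNN: Bias-Variance Tradeoff')

# LHS: squared bias and variance versus k
ax1.plot(k_values, k_bias ** 2, marker='o', label='bias$^2$')
ax1.plot(k_values, k_variance, marker='o', label='variance')
ax1.invert_xaxis()  # larger k = less flexible, so flip to run least -> most flexible
ax1.set_xlabel('k (n_neighbors)')
ax1.legend()
ax1.grid(True, linestyle='--', color='lightgrey')

# RHS: mean squared error (bias squared plus variance) versus k
ax2.plot(k_values, k_bias ** 2 + k_variance, marker='o', color='black')
ax2.invert_xaxis()
ax2.set_xlabel('k (n_neighbors)')
ax2.set_ylabel('MSE')
ax2.grid(True, linestyle='--', color='lightgrey')

plt.show()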
# delete this comment and create the plots for k-nearest neighbors models here
# delete this comment and create the plots for decision tree models here

Discussion

Graded discussion: Do the plots match your expectations? Are there any unexplained deviations from your expectations?

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Be sure that you have added your name at the top of this notebook.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.

For Staff Use Only

The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!

# params for testing
green_check = '\u2705'

# testing: result arrays must align with the tuning parameter lists
assert len(k_bias) == len(k_values)
assert len(k_variance) == len(k_values)
assert len(depth_bias) == len(depth_values)
assert len(depth_variance) == len(depth_values)
# squared bias should (roughly) shrink as model flexibility increases
assert np.all(np.diff(np.flip(k_bias) ** 2) < 0.01)
assert np.all(np.diff(depth_bias ** 2) < 0.01)
# variance estimates must be strictly positive
assert np.all(k_variance > 0)
assert np.all(depth_variance > 0)

# success
print(f"{green_check} Everything looks good! {green_check}")