Lab 09: Predicting House Prices

Author

Your Name Here

Published

December 14, 2023

Introduction and Data

Goal: The goal of this lab is to develop a model to predict the selling price of homes in Ames, Iowa, a regression task.

This lab is a part of the final project.

Unlike previous labs, there will be no template provided, and you are not required to calculate any summary statistics. You certainly can and should calculate summary statistics on the training data, but for your own benefit, not for grading. Additionally, you will only submit a model, not a notebook.

The data (and task) for this lab originally comes from Kaggle.

You should not use the data hosted on Kaggle. Instead, use the data provided below. However, the descriptions of the variables found on Kaggle will be useful.

After training a model, you will submit to the autograder found on PrairieLearn.

Additional submission details are given below.

The following code can be used to import the data as X_train and y_train. The y_train data contains the SalePrice variable that we wish to predict.

import pandas as pd

X_train = pd.read_csv("https://cs307.org/lab/lab-09/data/X_train.csv")
y_train = pd.read_csv("https://cs307.org/lab/lab-09/data/y_train.csv")
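Because pd.read_csv returns a DataFrame, y_train will be a one-column table rather than a one-dimensional vector. A minimal sketch of pulling out the target as a vector is shown here. It uses a small in-memory CSV as a stand-in for the real file, and it assumes the column is named SalePrice as described above.

```python
import io

import pandas as pd

# Stand-in for y_train.csv; the values are made up for illustration.
csv_text = "SalePrice\n150000\n210000\n140000\n"

# read_csv returns a DataFrame, even when the file has a single column.
y_frame = pd.read_csv(io.StringIO(csv_text))

# Select the column and convert it to a 1-D array for model fitting.
y_vector = y_frame["SalePrice"].to_numpy()
```

Most scikit-learn regressors will accept either form, but a one-dimensional target usually avoids shape warnings and keeps predictions one-dimensional.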

You are free to import packages and modules as you see fit.

Model Training (Graded Work)

In the autograder, we will use MAPE (mean absolute percentage error) to assess your submitted model.

\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|} \]
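As a quick check of the formula, here is a small sketch that computes MAPE directly with NumPy. The prices and predictions are made up for illustration. The function mean_absolute_percentage_error in sklearn.metrics computes the same quantity, although whether the autograder uses that exact function is an assumption.

```python
import numpy as np

# Made-up true prices and predictions, for illustration only.
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred = np.array([110_000.0, 180_000.0, 400_000.0])

# Absolute percentage error for each home: |y_i - y_hat_i| / |y_i|.
ape = np.abs(y_true - y_pred) / np.abs(y_true)

# MAPE is the average of those per-home errors, reported as a proportion.
mape = ape.mean()
```

Here the per-home errors are 0.10, 0.10, and 0.00, so the MAPE is about 0.067. Notice that a 10,000 dollar miss on a 100,000 dollar home counts the same as a 20,000 dollar miss on a 200,000 dollar home.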

The model you create must accept as input X_train, completely unmodified. Thus, if you want to exclude columns, you will need to do so within a pipeline. To do so, use a ColumnTransformer. More generally, your “model” should be a pipeline that starts from X_train.
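One possible shape for such a pipeline is sketched here. This is only an illustration, not a recommended model. The toy data frame stands in for X_train, the column names (Id, GrLivArea, LotArea, Neighborhood) are assumed to match the Kaggle variable descriptions, and the choices of imputation strategy and LinearRegression are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for X_train and y_train; names and values are assumptions.
X_toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "GrLivArea": [1500.0, 2000.0, np.nan, 1200.0, 1800.0, 2400.0],
    "LotArea": [8000.0, 9500.0, 7000.0, np.nan, 10000.0, 12000.0],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", np.nan, "CollgCr", "OldTown"],
})
y_toy = pd.Series([150000.0, 210000.0, 140000.0, 120000.0, 190000.0, 230000.0])

numeric_features = ["GrLivArea", "LotArea"]
categorical_features = ["Neighborhood"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill missing values, then one-hot encode.
# handle_unknown="ignore" keeps prediction from failing on unseen categories.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Columns not listed here (such as Id) are dropped inside the pipeline.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ],
    remainder="drop",
)

model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", LinearRegression()),
])

# The pipeline is fit on, and predicts from, the unmodified data frame.
model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
```

Because the column selection and imputation happen inside the pipeline, the fitted object can be handed a data frame with extra columns, missing values, or unseen categories and still return predictions.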

To submit your model to the autograder, you will need to serialize it. In the following cell, replace ______ with the model you have found. Notice that we are using compress=9 to help reduce the size of your model. This makes reading from and writing to disk slower, but in exchange the serialized model is smaller.

from joblib import dump
dump(______, "home_prices.joblib", compress=9)

As always, your submitted model must be less than 5MB.
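Before submitting, it can be worth checking the file size and confirming that the serialized model loads and predicts. The sketch below does this with a tiny stand-in model written to a temporary directory. The threshold of 5,000,000 bytes is an assumption about how the 5MB limit is measured, and stand_in_model is a hypothetical placeholder for your own pipeline.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for your fitted pipeline (fits y = 2x + 1 exactly).
stand_in_model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]),
    np.array([1.0, 3.0, 5.0]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "home_prices.joblib")

    # Serialize with the same compression level used for submission.
    dump(stand_in_model, path, compress=9)

    # Size on disk, in bytes.
    size_bytes = os.path.getsize(path)

    # Round trip: the reloaded object should behave like the original.
    reloaded = load(path)

# Assumed interpretation of the 5MB limit.
size_limit_bytes = 5_000_000
fits_limit = size_bytes < size_limit_bytes
reloaded_prediction = reloaded.predict(np.array([[3.0]]))
```

If the file is too large, simpler models or fewer one-hot encoded columns tend to reduce the size.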

Notes

Completing this lab will require you to deal with several practical issues. Modeling might be the easiest part of this lab!

  • What variables should be included?
  • What variables are categorical? What variables are numeric?
  • What processing should be done to categorical variables? Numeric?
  • Are there variables with missing data? If so, how should they be dealt with?
    • Hint: There will surely be missing data in the test data.
  • Are there variables with a huge number of categories? If so, do they cause problems?
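A quick way to start answering these questions is to tabulate the type, missing count, and number of distinct values for each column. The sketch below does this on a toy data frame. The column names are assumed to match the Kaggle descriptions and the values are made up. You would run the same summary on X_train.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; names and values are assumptions.
X_toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 7000.0, 12000.0],
    "MSSubClass": [20, 60, 20, 120],
    "Neighborhood": ["NAmes", "CollgCr", None, "NAmes"],
})

# One row per column: storage type, missing count, distinct non-missing values.
summary = pd.DataFrame({
    "dtype": X_toy.dtypes.astype(str),
    "n_missing": X_toy.isna().sum(),
    "n_unique": X_toy.nunique(dropna=True),
})

# A first pass at splitting columns by storage type.
numeric_columns = X_toy.select_dtypes(include="number").columns.tolist()
categorical_columns = X_toy.select_dtypes(exclude="number").columns.tolist()
```

Storage type is only a starting point. For example, the Kaggle description presents MSSubClass as a code for the type of dwelling, so it may be better treated as categorical even though it is stored as a number. A column with zero missing values in the training data can still have missing values in the test data.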

Discussion

If you were submitting a notebook, we would ask questions like:

  • Are there any variables that obviously should not be included?
  • Would you put this model into practice today? (We hope not!)
    • Would you put this model into practice in 2011? (Maybe!)
    • Where would it be appropriate to use this model? (Ames, Iowa!)
  • Even though we used MAPE, and even considering percentage error, does your model perform equally well for low-cost and high-cost homes?

Just some things to think about!
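One way to explore the last question is to compute MAPE separately for lower-priced and higher-priced homes. The sketch below splits at the median true price, which is an arbitrary choice, and uses made-up prices and predictions. In practice you would use held-out data and your own model's predictions.

```python
import numpy as np
import pandas as pd

# Made-up true prices and predictions, for illustration only.
y_true = pd.Series([80000.0, 100000.0, 120000.0, 300000.0, 400000.0, 500000.0])
y_pred = pd.Series([100000.0, 110000.0, 126000.0, 285000.0, 380000.0, 500000.0])

# Absolute percentage error for each home.
ape = (y_true - y_pred).abs() / y_true.abs()

# Label each home as low or high priced relative to the median true price.
price_group = np.where(y_true <= y_true.median(), "low", "high")

# MAPE within each group.
mape_by_group = ape.groupby(price_group).mean()
```

In this toy example the low-priced group has a MAPE of about 0.13 and the high-priced group about 0.03, even though the overall MAPE would hide that difference.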