Lab 02: Creating a Credit Rating

Author

Your Name Here

Published

September 8, 2023

Introduction and Data

For this lab, we will use the Credit card data from the ISLP package. To access this data, you will first need to install the ISLP package by running the following:

  • %pip install ISLP

Do not include installation code in any cell of this document, as installation only needs to be run once. Simply delete any temporary cells you create while installing packages.

In addition to ISLP, we’ll need to import several other packages and modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from ISLP import load_data

To access the relevant data as a pandas data frame, we will use the following:

Credit = load_data("Credit")
Credit.head()
ID Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
0 1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
1 2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
2 3 104.593 7075 514 4 71 11 Male No No Asian 580
3 4 148.924 9504 681 3 36 11 Female No No Asian 964
4 5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331

A data dictionary can be found here: ISLP: Credit Card Balance Data

Importantly, note that this is not real data. It is simulated.

The ID variable serves no purpose, so we’ll remove it before additional preprocessing. We will also remove the Limit variable because it would not make sense to have access to this variable for the purpose we will describe shortly.

Credit = Credit.drop(columns=['ID'])
Credit = Credit.drop(columns=['Limit'])
Credit.dtypes
Income       float64
Rating         int64
Cards          int64
Age            int64
Education      int64
Gender        object
Student       object
Married       object
Ethnicity     object
Balance        int64
dtype: object

We have mostly numeric data, except for the Gender, Student, Married, and Ethnicity variables. While it isn’t technically needed in the setup for this lab, we will convert these variables to categorical.

Credit['Gender'] = Credit['Gender'].astype('category')
Credit['Student'] = Credit['Student'].astype('category')
Credit['Married'] = Credit['Married'].astype('category')
Credit['Ethnicity'] = Credit['Ethnicity'].astype('category')
Credit.head()
Income Rating Cards Age Education Gender Student Married Ethnicity Balance
0 14.891 283 2 34 11 Male No Yes Caucasian 333
1 106.025 483 3 82 15 Female Yes Yes Asian 903
2 104.593 514 4 71 11 Male No No Asian 580
3 148.924 681 3 36 11 Female No No Asian 964
4 55.882 357 2 68 16 Male No Yes Caucasian 331

Next, we’ll need to convert all of the categorical variables to numeric through the use of dummy variables.

Credit = pd.get_dummies(Credit)
Credit.head()
Income Rating Cards Age Education Balance Gender_ Male Gender_Female Student_No Student_Yes Married_No Married_Yes Ethnicity_African American Ethnicity_Asian Ethnicity_Caucasian
0 14.891 283 2 34 11 333 1 0 1 0 0 1 0 0 1
1 106.025 483 3 82 15 903 0 1 0 1 0 1 0 1 0
2 104.593 514 4 71 11 580 1 0 1 0 1 0 0 1 0
3 148.924 681 3 36 11 964 0 1 1 0 1 0 0 1 0
4 55.882 357 2 68 16 331 1 0 1 0 0 1 0 0 1

In doing so, each categorical variable is now represented by at least two 0 or 1 numeric variables.

Credit.shape
(400, 15)

We have a total of 400 observations of 15 variables, one of which will be the target.

Setup: Suppose you work for a small local bank, perhaps a credit union, that has a credit card product offering. For years, you have relied on credit agencies to provide a rating of your customers’ credit; however, this costs your bank money. One day, you realize that it might be possible to reverse engineer your customers’ (and thus potential customers’) credit ratings based on the credit ratings that you have already purchased, as well as the demographic and credit card information that you already have, such as age, education level, and income.

Goal: Use k-nearest neighbors and decision trees to predict credit ratings based on features such as income, number of credit cards, and demographic information.

Let’s define the target variable and the feature variables.

target = ['Rating']
features = [
    'Income',
    'Cards',
    'Age',
    'Education',
    'Balance',
    'Gender_ Male',
    'Gender_Female',
    'Student_No',
    'Student_Yes',
    'Married_No',
    'Married_Yes',
    'Ethnicity_African American',
    'Ethnicity_Asian',
    'Ethnicity_Caucasian'
]

Next, we will move from pandas to numpy for use with sklearn.

X = Credit[features].to_numpy()
y = Credit[target].to_numpy()

Then, before any modeling, we will split the data into the relevant train, validation, and test datasets.

# create train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# create validation-train and validation datasets
X_vtrain, X_val, y_vtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

As a reminder:

  • _train indicates the full train dataset
  • _vtrain is the validation-train dataset that we will use to fit models
  • _val is the validation dataset that we will use to select models, in this case, to select a value of \(k\) or of min_samples_split, based on models that were fit to the validation-train dataset
  • _test is the test dataset that we will use to report an estimate of generalization error, after first refitting the chosen model to the full train dataset

Model Training

For this lab, you will train both a k-nearest neighbors and a decision tree model. For both, use all available features when training. That is, do not modify any of the X_ arrays.

For training a k-nearest neighbors model, consider each of the following \(k\) values.

k_values = [1, 5, 10, 25, 50, 100, 150, 200]

For training a decision tree, try each of the following values for min_samples_split as defined within the DecisionTreeRegressor.

min_split_values = [2, 5, 10, 25, 50, 100, 150, 200]

When creating k-nearest neighbors models, include algorithm='brute' in any calls to KNeighborsRegressor. Similarly, when creating decision tree models, include random_state=1 to control randomization that can arise when fitting trees.

Create two lists; a sketch of one possible approach follows the placeholder cells below:

  • knn_val_rmse: The validation RMSE for each of the k-nearest neighbors models as defined by k_values.
  • tree_val_rmse: The validation RMSE for each of the decision tree models as defined by min_split_values.
# delete this comment and use this cell to create knn_val_rmse
# delete this comment and use this cell to create tree_val_rmse
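The exact implementation is up to you, but here is a minimal sketch of one possible approach, assuming the X_vtrain and X_val arrays created above; RMSE is computed as the square root of mean_squared_error so the sketch does not depend on any particular scikit-learn version.

# fit each candidate k-nearest neighbors model on the validation-train
# data, then score it on the validation data
knn_val_rmse = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    knn.fit(X_vtrain, y_vtrain)
    knn_val_rmse.append(np.sqrt(mean_squared_error(y_val, knn.predict(X_val))))

# repeat the process for each candidate decision tree model
tree_val_rmse = []
for m in min_split_values:
    tree = DecisionTreeRegressor(min_samples_split=m, random_state=1)
    tree.fit(X_vtrain, y_vtrain)
    tree_val_rmse.append(np.sqrt(mean_squared_error(y_val, tree.predict(X_val))))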

After training these models, determine which is “best” of each type. To do so, store the chosen value of \(k\) in a variable named k_best and the chosen value of min_samples_split in a variable named m_best.

# delete this comment and use this cell to define `k_best` and `m_best`.
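One possible sketch, assuming the knn_val_rmse and tree_val_rmse lists were created as above: np.argmin recovers the position of the smallest validation RMSE in each list.

# select the candidate value whose model achieved the lowest validation RMSE
k_best = k_values[int(np.argmin(knn_val_rmse))]
m_best = min_split_values[int(np.argmin(tree_val_rmse))]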

While in practice we would select whichever model is “best” across both types, we will somewhat arbitrarily focus on the decision tree for instructive purposes.

Fit your selected decision tree model to the full training dataset. Call this model dt.

# delete this comment and create `dt` here
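A minimal sketch, assuming m_best was defined above; note that the refit uses the full train dataset, not the validation-train dataset.

# refit the chosen decision tree to the full train dataset
dt = DecisionTreeRegressor(min_samples_split=m_best, random_state=1)
dt.fit(X_train, y_train)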

Investigate this particular tree in two ways.

First, create a visual representation of your tree using the plot_tree function. When calling plot_tree, add feature_names=features to the function call, which will use the names of the features when describing the splits. Make an attempt to size the plot so that the tree is readable. If the tree is simply too large to create a reasonable graphic, use the max_depth parameter of plot_tree to limit the plotted depth of the tree to 3.

# create plot of tree here
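One possible starting point is sketched below; the figsize and the use of max_depth are judgment calls that you may need to adjust for your particular tree.

# plot the fitted tree, limiting the plotted depth in case the full
# tree is unreadable
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(dt, feature_names=features, max_depth=3, filled=True, ax=ax)
plt.show()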

Next, to get a sense of which feature variables your chosen tree model uses, run the code in the following cell. It will calculate the so-called “feature importance” for each feature. We will investigate these values in depth later in the course, but for now, realize that the higher the importance, the more often a feature was used for a split. Also, a higher importance suggests splits that are more important for overall model performance. Most importantly, any feature with an importance of zero was simply not used in the tree.

pd.DataFrame({
    "Feature Name": pd.Series(features),
    "Importance": dt.feature_importances_
}).sort_values("Importance", ascending=False)

Discussion

# use this cell to create any supporting statistics
# use this cell to create any supporting graphics
# if you would like to include additional graphics, 
# please add additional cells and create one graphic per cell
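As an example of a supporting statistic, you might report the test RMSE of your chosen model, which estimates its generalization error; a sketch, assuming the dt model fit above:

# estimate generalization error using the held-out test dataset
test_rmse = np.sqrt(mean_squared_error(y_test, dt.predict(X_test)))
print(f"Test RMSE: {test_rmse:.2f}")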

Graded discussion: Which model did you determine was best? Would you advise the bank manager to put the model into practice? Why or why not? (For this question, you only need to consider performance and practicality.) You may calculate and report, or create and display, any additional statistics or graphics to support your claim. Add any cells needed to do so within the Discussion section and above this cell. Hint: If you choose to do neither, you have very likely not provided sufficient justification for your answer.

Un-graded discussion: Was the machine learning you just did ethical? Was it legal? (Thankfully, as this was simulated data, you haven’t done anything wrong!) A very important lesson to learn now that you will have the power of machine learning: Just because you can, does not mean you should.

Comment on the ethics of collecting and using this data to reverse engineer a credit rating system.

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.

For Staff Use Only

The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!

# params for testing
green_check = '\u2705'

# testing
assert len(knn_val_rmse) == len(k_values)
assert len(tree_val_rmse) == len(min_split_values)
assert all(np.array(knn_val_rmse) > 0)
assert all(np.array(tree_val_rmse) > 0)
assert isinstance(dt, DecisionTreeRegressor)

# success
print(f"{green_check} Everything looks good! {green_check}")