import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from ISLP import load_data
Lab 02: Creating a Credit Rating
Introduction and Data
For this lab, we will use the Credit card data from the ISLP package. To use this data, you will need to install the ISLP package by running the following:
%pip install ISLP
Do not include code for installation in any cell of this document as installation code only needs to be run once. Simply delete any temporary cells you create while installing packages.
In addition to ISLP, we'll need to import several other packages and modules; these are imported in the cell at the top of this document.
To access the relevant data as a pandas data frame, we will use the following:
Credit = load_data("Credit")
Credit.head()
|   | ID | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|----|--------|-------|--------|-------|-----|-----------|--------|---------|---------|-----------|---------|
| 0 | 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 1 | 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 2 | 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 3 | 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 4 | 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
A data dictionary can be found here: ISLP: Credit Card Balance Data
Importantly, note that this is not real data. It is simulated.
The ID variable serves no purpose, so we'll remove it before additional preprocessing. We will also remove the Limit variable because it would not make sense to have access to this variable for the purpose we will describe shortly.
Credit = Credit.drop(columns=['ID'])
Credit = Credit.drop(columns=['Limit'])
Credit.dtypes
Income float64
Rating int64
Cards int64
Age int64
Education int64
Gender object
Student object
Married object
Ethnicity object
Balance int64
dtype: object
We have mostly numeric data, except for the Gender, Student, Married, and Ethnicity variables. While it isn't technically needed in the setup for this lab, we will convert these variables to categorical.
Credit['Gender'] = Credit['Gender'].astype('category')
Credit['Student'] = Credit['Student'].astype('category')
Credit['Married'] = Credit['Married'].astype('category')
Credit['Ethnicity'] = Credit['Ethnicity'].astype('category')
Credit.head()
|   | Income | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|--------|--------|-------|-----|-----------|--------|---------|---------|-----------|---------|
| 0 | 14.891 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 1 | 106.025 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 2 | 104.593 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 3 | 148.924 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 4 | 55.882 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |
Next, we’ll need to convert all of the categorical variables to be numeric, through the use of dummy variables.
Credit = pd.get_dummies(Credit)
Credit.head()
|   | Income | Rating | Cards | Age | Education | Balance | Gender_ Male | Gender_Female | Student_No | Student_Yes | Married_No | Married_Yes | Ethnicity_African American | Ethnicity_Asian | Ethnicity_Caucasian |
|---|--------|--------|-------|-----|-----------|---------|--------------|---------------|------------|-------------|------------|-------------|----------------------------|-----------------|---------------------|
| 0 | 14.891 | 283 | 2 | 34 | 11 | 333 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 106.025 | 483 | 3 | 82 | 15 | 903 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 104.593 | 514 | 4 | 71 | 11 | 580 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 148.924 | 681 | 3 | 36 | 11 | 964 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 55.882 | 357 | 2 | 68 | 16 | 331 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
In doing so, each categorical variable is now represented by at least two 0 or 1 numeric variables.
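To see how this encoding works on a single variable, here is a small toy example; the series and its values are purely illustrative and not part of the lab.

```python
import pandas as pd

# purely illustrative: a small categorical series
student = pd.Series(['No', 'Yes', 'No'], name='Student')

# get_dummies creates one indicator column per level
# (values are 0/1 or True/False depending on your pandas version)
pd.get_dummies(student, prefix='Student')
```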
Credit.shape
(400, 15)
We have a total of 400 observations of 15 variables, one of which will be the target.
Setup: Suppose you work for a small local bank, perhaps a credit union, that has a credit card product offering. For years, you relied on credit agencies to provide a rating of your customers’ credit; however, this costs your bank money. One day, you realize that it might be possible to reverse engineer your customers’ (and thus potential customers’) credit ratings based on the credit ratings that you have already purchased, as well as the demographic and credit card information that you already have, such as age, education level, income, etc.
Goal: Use k-nearest neighbors and decision trees to predict credit ratings based on features such as income, number of credit cards, and demographic information.
Let’s note the target variable and the feature variables.
target = ['Rating']
features = [
    'Income',
    'Cards',
    'Age',
    'Education',
    'Balance',
    'Gender_ Male',
    'Gender_Female',
    'Student_No',
    'Student_Yes',
    'Married_No',
    'Married_Yes',
    'Ethnicity_African American',
    'Ethnicity_Asian',
    'Ethnicity_Caucasian'
]
Next, we will move from pandas to numpy for use with sklearn.
X = Credit[features].to_numpy()
y = Credit[target].to_numpy()
Then, before any modeling, we will split the data into the relevant train, validation, and test datasets.
# create train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# create validation-train and validation datasets
X_vtrain, X_val, y_vtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
As a reminder:

- _train indicates a full train dataset
- _vtrain is a validation-train dataset that we will use to fit models
- _val is a validation dataset that we will use to select models, in this case, to select a value of k, based on models that were fit to a validation-train dataset
- _test is a test dataset that we will use to report an estimate of generalization error, after first refitting a chosen model to a full train dataset
Model Training
For this lab, you will train both a k-nearest neighbors and a decision tree model. For both, use all available features when training. That is, do not modify any of the X_ arrays.
For training a k-nearest neighbors model, consider each of the following \(k\) values.
k_values = [1, 5, 10, 25, 50, 100, 150, 200]
For training a decision tree, try each of the following values for min_samples_split, as defined within the DecisionTreeRegressor.
min_split_values = [2, 5, 10, 25, 50, 100, 150, 200]
When creating k-nearest neighbors models, include algorithm='brute' in any calls to KNeighborsRegressor. Similarly, when creating decision tree models, include random_state=1 to control randomization that can arise when fitting trees.
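For reference, a single model of each type could be constructed as follows; the specific hyperparameter values shown here (k = 10 and min_samples_split = 25) are arbitrary placeholders, not recommendations.

```python
# illustrative instantiations only; the hyperparameter values are placeholders
knn_example = KNeighborsRegressor(n_neighbors=10, algorithm='brute')
tree_example = DecisionTreeRegressor(min_samples_split=25, random_state=1)
```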
Create two lists:

- knn_val_rmse: the validation RMSE for each of the k-nearest neighbors models, as defined by k_values.
- tree_val_rmse: the validation RMSE for each of the decision tree models, as defined by min_split_values.
# delete this comment and use this cell to create knn_val_rmse
# delete this comment and use this cell to create tree_val_rmse
After training these models, determine which model is “best” within each of the two families. To do so, store the chosen value of \(k\) in a variable named k_best and the chosen value of min_samples_split in a variable named m_best.
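If the RMSE lists are stored in the same order as the candidate value lists, one possible way (a sketch, not the only way) to pick the winners is with np.argmin.

```python
# sketch: select the hyperparameter value with the lowest validation RMSE
k_best = k_values[int(np.argmin(knn_val_rmse))]
m_best = min_split_values[int(np.argmin(tree_val_rmse))]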
# delete this comment and use this cell to define `k_best` and `m_best`.
While in practice we would select whichever is “best” from either model, we will somewhat arbitrarily focus on the decision trees for instructive purposes.
Fit your selected decision tree model to the full training dataset. Call this model dt.
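A minimal sketch of that refit, assuming m_best has been defined as described above:

```python
# sketch: refit the chosen decision tree on the full training data
dt = DecisionTreeRegressor(min_samples_split=m_best, random_state=1)
dt.fit(X_train, y_train)
```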
# delete this comment and create `dt` here
Investigate this particular tree in two ways.
First, create a visual representation of your tree by using the plot_tree function. When calling plot_tree, add feature_names=features to the function call, which will use the names of the features when describing the splits. Make an attempt to size the plot such that the tree is readable. If the tree is simply too large to create a reasonable graphic, use the max_depth parameter of plot_tree to limit the plotted depth to 3.
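One way this might look is sketched below; the figure size is just a starting point to adjust, and max_depth only limits what is drawn, not the fitted tree itself.

```python
# sketch: plot the fitted tree with feature names, limiting the drawn depth
fig, ax = plt.subplots(figsize=(24, 12))
plot_tree(dt, feature_names=features, max_depth=3, filled=True, ax=ax)
plt.show()
```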
# create plot of tree here
Next, to get a sense of which feature variables your chosen tree model uses, run the code in the following cell. It will calculate the so-called “feature importance” for each feature. We will investigate these values in depth later in the course, but for now, realize that the higher the importance, the more often a feature was used for a split. Also, a higher importance suggests splits that are more important for overall model performance. Most importantly, any feature with an importance of zero was simply not used in the tree.
pd.DataFrame({
    "Feature Name": pd.Series(features),
    "Importance": dt.feature_importances_
}).sort_values("Importance", ascending=False)
Discussion
# use this cell to create any supporting statistics
# use this cell to create any supporting graphics
# if you would like to include additional graphics,
# please add additional cells and create one graphic per cell
Graded discussion: Which model did you determine was best? Would you advise the bank manager to put the model into practice? Why or why not? (For this question, you only need to consider performance and practicality.) You may calculate and report, or create and display, any additional statistics or graphics to support your claim. Add any cells needed to do so within the Discussion section and above this cell. Hint: If you choose to do neither, you have very likely not provided sufficient justification for your answer.
Un-graded discussion: Was the machine learning you just did ethical? Was it legal? (Thankfully, as this was simulated data, you haven’t done anything wrong!) A very important lesson to learn now that you will have the power of machine learning: Just because you can, does not mean you should.
Comment on the ethics of collecting and using this data to reverse engineer a credit rating system.
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.
For Staff Use Only
The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!
# params for testing
green_check = '\u2705'
# testing
assert len(knn_val_rmse) == len(k_values)
assert len(tree_val_rmse) == len(min_split_values)
assert all(np.array(knn_val_rmse) > 0)
assert all(np.array(tree_val_rmse) > 0)
assert isinstance(dt, DecisionTreeRegressor)
# success
print(f"{green_check} Everything looks good! {green_check}")