Lab 06: Building a Fraud Detector

Author

Your Name Here

Published

October 20, 2023

Introduction and Data

Goal: The goal of this lab is to create a fraud detector that can be used as part of an automated banking system. It should predict whether or not each transaction is fraudulent.

To do this, you’ll need to import the following:

# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
import warnings

# machine learning
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

You are free to import additional packages and modules as you see fit, and you will almost certainly need to.

The data for this lab originally comes from Kaggle.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
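To make the imbalance concrete, the short sketch below redoes the arithmetic from the counts quoted above. It also computes the accuracy of a hypothetical "model" that labels every transaction as not fraud, which is one way to see why accuracy alone can be misleading here. These numbers describe the original Kaggle data, not the modified subset used in this lab.

```python
# Counts quoted in the dataset description above (original Kaggle data).
n_fraud = 492
n_total = 284_807

# Proportion of transactions that are fraud.
fraud_rate = n_fraud / n_total

# Accuracy of a hypothetical classifier that always predicts "not fraud".
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

A detector that catches no fraud at all would still look excellent by accuracy, so metrics such as precision and recall are likely to be more informative.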

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and takes value 1 in case of fraud and 0 otherwise.

We are providing a slightly modified subset of the data for this lab. Beyond subsetting, we have:

  • Removed the Time variable as it is misleading.
  • Slightly modified the ratio of fraud to not fraud.

Note that PCA is a method that we will learn about later in the course. For now, know that it takes some number of features as inputs and outputs the same number of features or fewer, which retain most of the information in the original features. You can assume things like location and type of purchase were among the original input features. (Ever had a credit card transaction denied while traveling?)
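As a rough illustration of that idea, here is a minimal sketch using scikit-learn's PCA on synthetic data. The data, the variable names, and the choice of two components are all assumptions made for this example; this is not how the lab data was produced.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): 5 features, where the
# last 3 are noisy linear combinations of the first 2.
base = rng.normal(size=(200, 2))
noise = 0.05 * rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

# Reduce the 5 input features to 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("share of variance retained:", pca.explained_variance_ratio_.sum())
```

Because the synthetic features are nearly redundant, two components keep almost all of the variance. Real data will usually need more components to retain a similar share.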

We present the train data both as a complete data frame and as separate X and y data. The former will be useful for calculating summary statistics. The latter will be useful for model training.

Note that we are not providing a test dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)
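One possible way to estimate performance without a test set is stratified cross-validation, sketched below. The synthetic data from make_classification, the plain logistic regression, and the average precision metric are all assumptions chosen for illustration, not a recommended solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratified folds keep the fraud / not fraud ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One score per fold; average precision is one metric suited to imbalance.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring="average_precision",
)

print("per-fold scores:", scores)
print("mean score:", scores.mean())
```

Stratification matters with rare positives: without it, a fold could end up with very few fraud cases, which makes that fold's score unstable.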

train = pd.read_csv("https://cs307.org/lab/lab-06/data/credit_train.csv")
train
# create X and y for train data
X_train = train.drop("Class", axis=1)
y_train = train["Class"]

Summary Statistics (Graded Work)

What summary statistics should be calculated? See the relevant assignment on PrairieLearn!

Model Training (Graded Work)

For this lab, you may train models however you’d like!

The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the type and shape of X_train and y_train should not be changed, and they should be the input to your models.
    • That is, any pipeline must start from these. After that, do whatever!
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
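The sketch below shows one possible object that satisfies these rules: a Pipeline tuned with GridSearchCV, which exposes both predict and predict_proba after fitting. It uses synthetic stand-in data rather than the lab data, and the scaler, parameter grid, and scoring choice are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train (data frame) and y_train (series).
X_arr, y_arr = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"V{i}" for i in range(1, 7)])
y_demo = pd.Series(y_arr, name="Class")

# All preprocessing lives inside the pipeline, so the inputs stay unmodified.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical grid over the regularization strength C.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

# The fitted search object delegates both methods to the best pipeline.
labels = search.predict(X_demo)
probs = search.predict_proba(X_demo)
print(search.best_params_)
print(labels.shape, probs.shape)
```

Keeping the scaler inside the pipeline means it is refit on each training fold during the search, which avoids leaking information from the validation folds.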

You will submit your chosen model to an autograder for checking. It will calculate your model's performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against test data multiple times. But you know this is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you’ll only need to submit once!

# use this cell to train models

To submit your model to the autograder, you will need to serialize it. In the following cell, replace ______ with the model you have found.

dump(______, "fraud_detector.joblib")

After you run this cell, a file will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the relevant lab on PrairieLearn.
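Before submitting, it may be worth checking that the serialized file loads back into a working model. The sketch below does a round trip with a small stand-in model and a temporary file; the model, the data, and the file name demo_model.joblib are assumptions for illustration, not the lab's model or file.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model fit on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("round trip preserved predictions:", same_predictions)
print("has predict_proba:", hasattr(restored, "predict_proba"))
```

The same check could be run against fraud_detector.joblib by loading it and calling predict and predict_proba on X_train.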

Discussion

# use this cell to create and print any supporting statistics
# the autograder will give you tn, fp, fn, tp for the test data
# with that information, you can calculate any and all relevant metrics
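As a starting point, the helper below turns the four confusion matrix counts into common metrics. The function name confusion_metrics and the example counts are hypothetical; substitute the values the autograder reports.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute common classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    # Guard each ratio against a zero denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,  # share of flagged transactions that are fraud
        "recall": recall,  # share of fraud that gets flagged
        "false_positive_rate": false_positive_rate,
        "f1": f1,
    }


# Hypothetical counts for illustration only, not real autograder output.
metrics = confusion_metrics(tn=9900, fp=20, fn=10, tp=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision and recall map directly onto the two kinds of error in the discussion prompt: false positives lower precision, and false negatives lower recall.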

Graded discussion: Do you think that your model is good enough for a bank to use in production? Justify your answer. Describe the potential real-world risks of both false positives and false negatives in this case.

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Be sure that you have added your name at the top of this notebook.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.