Lab 04: Creating a Wine Review System

Author

Your Name Here

Published

September 22, 2023

Introduction and Data

For this lab, we will utilize the “Wine Quality” dataset from the UCI Machine Learning Repository.

UCI MLR: Wine Quality

While the data could be obtained from this website, we will provide a specific and slightly pre-processed version of the data. However, this is a good website to be aware of! It houses many datasets that are useful for practice training machine learning models.

To complete this lab, you’ll need to import the following:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error

To access the relevant data as a pandas data frame, we will use the following:

wine_red = pd.read_csv("https://cs307.org/lab/lab-04/data/winequality-red.csv", delimiter=";")
wine_white = pd.read_csv("https://cs307.org/lab/lab-04/data/winequality-white.csv", delimiter=";")

Notice that we are actually importing two datasets here, one for red wine and one for white wine. This is mostly to demonstrate that both exist. However, we will focus on the white wine dataset.

wine_white

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.0	0.27	0.36	20.7	0.045	45.0	170.0	1.00100	3.00	0.45	8.8	6
1	6.3	0.30	0.34	1.6	0.049	14.0	132.0	0.99400	3.30	0.49	9.5	6
2	8.1	0.28	0.40	6.9	0.050	30.0	97.0	0.99510	3.26	0.44	10.1	6
3	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.99560	3.19	0.40	9.9	6
4	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.99560	3.19	0.40	9.9	6
...	...	...	...	...	...	...	...	...	...	...	...	...
4893	6.2	0.21	0.29	1.6	0.039	24.0	92.0	0.99114	3.27	0.50	11.2	6
4894	6.6	0.32	0.36	8.0	0.047	57.0	168.0	0.99490	3.15	0.46	9.6	5
4895	6.5	0.24	0.19	1.2	0.041	30.0	111.0	0.99254	2.99	0.46	9.4	6
4896	5.5	0.29	0.30	1.1	0.022	20.0	110.0	0.98869	3.34	0.38	12.8	7
4897	6.0	0.21	0.38	0.8	0.020	22.0	98.0	0.98941	3.26	0.32	11.8	6

4898 rows × 12 columns

Each of the 4898 rows represents a particular wine. 11 of the variables are physicochemical measurements, and the last variable, quality, is a (subjective) rating of quality.

For background on the dataset, the UCI website contains a minimal data dictionary:

UCI MLR: Wine Quality

However, more specifically, the columns that represent the features are:

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

The target is quality. Its meaning, from the original paper, is:

Regarding the preferences, each sample was evaluated by a minimum of three sensory assessors (using blind tastes), which graded the wine in a scale that ranges from 0 (very bad) to 10 (excellent).

In this case, the features are the physicochemical data. That is, these are chemical properties of the individual wines that can be measured in a lab. We will not focus on the details of these measurements, but if you are interested, they are described in the original paper. For the purposes of this lab, assume that there is some fixed cost to process and obtain these 11 measurements for any wine, but that doing so is far cheaper than paying humans to taste and review the wine.

The target is the sensory data, as “measured” by humans tasting the wine and reviewing it.

Goal: Find a model that is useful for predicting the quality of a wine based on its physicochemical properties, for the purpose of potentially removing the need for human testers.

Before we begin modeling, let’s take care of some pre-processing, in particular:

Specifying the features and target
Moving from pandas to numpy
Scaling the X data
- Note: We are simply going to do this before the train-test split. This is not necessarily recommended in practice, but instead is for ease of completing the lab. Because we are scaling all the data immediately, you do not have to worry about scaling at all, neither at train time nor at test time.
Splitting the data for training, validation, and testing

# specify target and feature variables
target = ['quality']
features = [
    'fixed acidity',
    'volatile acidity',
    'citric acid',
    'residual sugar',
    'chlorides',
    'free sulfur dioxide',
    'total sulfur dioxide',
    'density',
    'pH',
    'sulphates',
    'alcohol'
]

# create numpy arrays
X = wine_white[features].to_numpy()
y = wine_white[target].to_numpy().ravel()

# scale the data, warning: do not do it like this in practice (more on this later)
scale = StandardScaler()
scale.fit(X)
X = scale.transform(X)

# create train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# create validation-train and validation datasets
X_vtrain, X_val, y_vtrain, y_val = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
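For contrast with the shortcut used above, here is a minimal sketch of the ordering usually preferred in practice: split first, then fit the scaler on the train rows only. The data is synthetic, and the names X_demo, y_demo, X_tr, X_te, and scaler are assumptions made for this illustration. They are not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the wine features (assumption: 500 rows, 11 columns)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(500, 11))
y_demo = rng.integers(3, 10, size=500)

# split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# learn the column means and standard deviations from the train rows only
scaler = StandardScaler()
scaler.fit(X_tr)

# apply the same train-based transformation to both pieces
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# train columns are centered exactly; test columns are only approximately centered
print(np.abs(X_tr_scaled.mean(axis=0)).max())
print(np.abs(X_te_scaled.mean(axis=0)).max())
```

The test columns do not have a mean of exactly zero, which is expected: they are transformed with statistics estimated from the train data, just as new wines would be.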

As a reminder:

_train indicates a full train dataset
_vtrain is a validation-train dataset that we will use to fit models
_val is a validation dataset that we will use to select models, in this case, select a value of a tuning parameter such as alpha, based on models that were fit to a validation-train dataset.
_test is a test dataset that we will use to report an estimate of generalization error, after first refitting a chosen model to a full train dataset
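The roles of these four datasets can be sketched on synthetic data as follows. This is only an illustration of the fit, select, refit, and report pattern, not a solution to the lab: the data comes from make_regression, the alpha grid is a hypothetical one chosen for the demo, and Ridge with RMSE is just one possible choice of model and metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data standing in for the wine data (assumption)
X_demo, y_demo = make_regression(
    n_samples=600, n_features=11, noise=10.0, random_state=42
)

# two-stage split: train/test, then validation-train/validation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
X_vtr, X_va, y_vtr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.20, random_state=42
)

# candidate tuning parameter values (hypothetical grid for this demo)
alphas = [0.01, 0.1, 1, 10, 100]

# fit each candidate to validation-train, then score it on validation
val_rmse = {}
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_vtr, y_vtr)
    val_rmse[alpha] = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))

# select the candidate with the lowest validation RMSE
best_alpha = min(val_rmse, key=val_rmse.get)

# refit the chosen model to the full train data, then use the test data once
final_model = Ridge(alpha=best_alpha)
final_model.fit(X_tr, y_tr)
test_rmse = np.sqrt(mean_squared_error(y_te, final_model.predict(X_te)))

print(f"best alpha: {best_alpha}, test RMSE: {test_rmse:.3f}")
```

Note that the test data is touched only at the very end, after the model has been chosen.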

Model Training (Graded Work)

Your job will be to find a reasonable model from within each of the following broad categories:

Linear Model
Lasso
Ridge
Decision Tree
Random Forest

Linear Model

For the linear model, simply fit a linear model with all available feature variables.

Lasso

Fit a Lasso model using all available features, and consider the following potential values of alpha.

alpha = [0.0001, 0.001, 0.01, 0.1, 1]

Ridge

Fit a Ridge model using all available features, and consider the same values of alpha as the Lasso model.

Decision Tree

Fit a Decision Tree with max_depth = 5 using all available features.
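To see what max_depth controls, the following sketch compares an unrestricted tree with a depth-limited one on synthetic data. The dataset and the names tree_unlimited and tree_depth5 are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# synthetic data with 11 features (assumption, not the wine data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=11, noise=5.0, random_state=42
)

# a tree with no depth limit keeps splitting until its leaves are pure
tree_unlimited = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)

# a tree limited to depth 5 can have at most 2**5 = 32 leaves
tree_depth5 = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_demo, y_demo)

# compare the depth and the number of leaves of the two trees
print(tree_unlimited.get_depth(), tree_unlimited.get_n_leaves())
print(tree_depth5.get_depth(), tree_depth5.get_n_leaves())
```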

Random Forest

Fit a Random Forest with max_features = 'sqrt' using all available features.
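As a rough illustration of what max_features = 'sqrt' does, the sketch below fits a small forest to synthetic data with 11 features and inspects how many features each tree considers at a split. The dataset, the choice of n_estimators=25, and the name features_per_split are assumptions for this demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic data with 11 features, matching the number of wine features (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=11, noise=5.0, random_state=42
)

# a small forest; n_estimators=25 is an arbitrary choice to keep the demo fast
rf = RandomForestRegressor(n_estimators=25, max_features="sqrt", random_state=42)
rf.fit(X_demo, y_demo)

# each fitted tree records the number of features it considers at every split
features_per_split = rf.estimators_[0].max_features_

# compare against the square root of the feature count
print(features_per_split, np.sqrt(X_demo.shape[1]))
```

With 11 features, each split considers only a random subset of about the square root of 11 features rather than all of them, which makes the individual trees less correlated with one another.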

For this lab, you are asked to train, validate, and select a model by yourself! You will write all the necessary code from here!

# use this cell for the linear model

# use this cell for the lasso model

# use this cell for the ridge model

# use this cell for the decision tree model

# use this cell for the random forest model

Because we will not be checking for exact values of validation metrics, etc., be sure to very clearly include relevant metrics in your discussion.
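As a small worked example of the three metrics imported at the top of this notebook, the sketch below computes each on a tiny set of made-up ratings. The arrays y_true_demo and y_pred_demo are hypothetical values chosen so the arithmetic is easy to check by hand. RMSE is obtained here by taking the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error

# made-up quality ratings and predictions (hypothetical values)
y_true_demo = np.array([5, 6, 7, 6])
y_pred_demo = np.array([5.5, 6.0, 6.0, 6.5])

# absolute errors are 0.5, 0, 1, 0.5, so the MAE is 2 / 4 = 0.5
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# squared errors are 0.25, 0, 1, 0.25, so the MSE is 1.5 / 4 = 0.375
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# MAPE averages |error| / |true value|; sklearn returns a fraction, not a percent
mape = mean_absolute_percentage_error(y_true_demo, y_pred_demo)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAPE: {mape:.3f}")
```

MAE and RMSE are both in the units of the quality rating, which makes them straightforward to interpret in a discussion.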

Discussion

# use this cell to create and print any supporting statistics

# use this cell to create any supporting graphics
# if you would like to include additional graphics, 
# please add additional cells and create one graphic per cell

Graded discussion: Which model did you determine was best? Why? Would you use it to replace wine testers? Are there any limitations of this model you would want someone who will use it to know about? Are there any aspects of your model outside its ability to predict that you find useful? (Use any statistics or graphics you see fit to provide justifications.)

Un-graded discussion: Is this actually a regression problem? Comment on the pros and cons of treating this task as a regression or a classification.

Submission

Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.

Be sure that you have added your name at the top of this notebook.

Once you’ve reviewed the lab policy document, head to Canvas to submit your lab.

For Staff Use Only

The remainder of this notebook should not be modified. Any cells below here are for course staff use only. Any modification to the cells below will result in a severe lab grade reduction. However, please feel free to run the cells to check your work before submitting!

# params for testing
green_check = '\u2705'

# testing
assert 1 == 1

# success
print(f"{green_check} Everything looks good! {green_check}")