# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
# machine learning
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
Lab 07: Airline Tweets Sentiment Analysis
Introduction and Data
Goal: The goal of this lab is to create a sentiment classifier that can automatically classify tweets at US airlines as one of three sentiments: negative, neutral, or positive.
To do this, you’ll need, at a minimum, the imports shown at the top of this notebook.
You are free to import additional packages and modules as you see fit, and you will almost certainly need to.
The data for this lab originally comes from Kaggle.
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
TODO: What modifications have been done?
We present the modified train data as both a complete data frame and as separate X and y data. The former will be useful for calculating summary statistics; the latter will be useful for model training.
Note that we are not providing a test dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)
train = pd.read_csv("https://cs307.org/lab/lab-07/data/tweets_train.csv")
train
|  | airline_sentiment | airline | text |
| --- | --- | --- | --- |
| 0 | negative | United | @united at its worse. Can't figure how to pack... |
| 1 | negative | Delta | @JetBlue I did not report the updated info - d... |
| 2 | negative | Delta | @JetBlue I'll give u a chance but I don't thin... |
| 3 | negative | United | @united Yo yo yo stuck on the tarmac for over ... |
| 4 | negative | US Airways | @USAirways yes, I was rebooked the next day (W... |
| ... | ... | ... | ... |
| 10975 | positive | Southwest | @SouthwestAir replacing @vitaminwater with bee... |
| 10976 | negative | American | @AmericanAir at LAX and your service reps just... |
| 10977 | negative | Southwest | @SouthwestAir Been on hold for over an hour - ... |
| 10978 | negative | United | @united we would...how do I contact you to dis... |
| 10979 | neutral | Delta | @JetBlue that's ok! It just sure seemed like i... |

10980 rows × 3 columns
The airline column is only provided for illustrative purposes and should not be used for model training. It is removed in the code below.
# create X and y for train data
X_train = train["text"]
y_train = train["airline_sentiment"]
Bag-of-Words
To use the text of the tweets as input to machine learning models, you will need to do some preprocessing; this text cannot simply be fed into the models we have seen.
X_train
0 @united at its worse. Can't figure how to pack...
1 @JetBlue I did not report the updated info - d...
2 @JetBlue I'll give u a chance but I don't thin...
3 @united Yo yo yo stuck on the tarmac for over ...
4 @USAirways yes, I was rebooked the next day (W...
...
10975 @SouthwestAir replacing @vitaminwater with bee...
10976 @AmericanAir at LAX and your service reps just...
10977 @SouthwestAir Been on hold for over an hour - ...
10978 @united we would...how do I contact you to dis...
10979 @JetBlue that's ok! It just sure seemed like i...
Name: text, Length: 10980, dtype: object
To do so, we will create a so-called bag-of-words. Let’s see what that looks like with a smaller set of strings.
word_counter = CountVectorizer()
word_counts = word_counter.fit_transform(
    [
        "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo",
        "The quick brown fox jumps over the lazy dog",
        "",
    ]
).todense()
print(word_counter.vocabulary_)
{'buffalo': 1, 'the': 8, 'quick': 7, 'brown': 0, 'fox': 3, 'jumps': 4, 'over': 6, 'lazy': 5, 'dog': 2}
sorted(list(word_counter.vocabulary_.keys()))
['brown', 'buffalo', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
print(word_counts)
[[0 8 0 0 0 0 0 0 0]
[1 0 1 1 1 1 1 1 2]
[0 0 0 0 0 0 0 0 0]]
pd.DataFrame(word_counts, columns=sorted(list(word_counter.vocabulary_.keys())))
|  | brown | buffalo | dog | fox | jumps | lazy | over | quick | the |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Essentially, we’ve created a number of feature variables, each one counting how many times a word in the vocabulary appears in a sample’s text.
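One detail worth noting (a small illustrative sketch, not part of the lab's required steps): once a CountVectorizer has been fit, transforming new text reuses the learned vocabulary, and any words it has not seen before are simply ignored.

# transform a new string with the already-fit vectorizer;
# "cat" is not in the learned vocabulary, so it is dropped
word_counter.transform(["the buffalo jumps over the lazy cat"]).todense()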
Let’s find the 100 most common words in the train tweets at the airlines.
top_100_counter = CountVectorizer(max_features=100)
X_top_100 = top_100_counter.fit_transform(X_train)

print("Top 100 Words:")
print(top_100_counter.get_feature_names_out())
print("")
Top 100 Words:
['about' 'after' 'airline' 'all' 'am' 'americanair' 'amp' 'an' 'and' 'any'
'are' 'as' 'at' 'back' 'bag' 'be' 'been' 'but' 'by' 'call' 'can'
'cancelled' 'co' 'customer' 'delayed' 'do' 'don' 'flight' 'flightled'
'flights' 'for' 'from' 'gate' 'get' 'got' 'guys' 'had' 'has' 'have'
'help' 'hold' 'hour' 'hours' 'how' 'http' 'if' 'in' 'is' 'it' 'jetblue'
'just' 'late' 'like' 'me' 'more' 'my' 'need' 'no' 'not' 'now' 'of' 'on'
'one' 'or' 'our' 'out' 'over' 'phone' 'plane' 'please' 'service' 'so'
'southwestair' 'still' 'thank' 'thanks' 'that' 'the' 'there' 'they'
'this' 'time' 'to' 'today' 'united' 'up' 'us' 'usairways' 've'
'virginamerica' 'was' 'we' 'what' 'when' 'why' 'will' 'with' 'would'
'you' 'your']
X_top_100_dense = X_top_100.todense()
X_top_100_dense
matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 2, 1, 0],
[0, 0, 0, ..., 0, 0, 0]])
X_top_100.shape
(10980, 100)
plane_idx = np.where(top_100_counter.get_feature_names_out() == "plane")
plane_count = np.sum(X_top_100.todense()[:, plane_idx])
print('The Word "plane" Appears:', plane_count)
The Word "plane" Appears: 482
Note that you’ll need to do this same process, but within a pipeline! You might also consider looking into other techniques to process text for input to models.
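To give a sense of what that might look like (a minimal sketch only, not a tuned or required solution; the max_features value is an arbitrary placeholder), a pipeline could chain a CountVectorizer with a simple classifier such as MultinomialNB, both of which are already imported above:

# a minimal example pipeline: bag-of-words followed by naive Bayes
# (the settings here are placeholders, not recommendations)
example_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer(max_features=1000)),
        ("classifier", MultinomialNB()),
    ]
)
example_pipeline.fit(X_train, y_train)
example_pipeline.predict_proba(X_train[:5])

Because the vectorizer lives inside the pipeline, the same preprocessing is applied automatically to any new text passed to predict or predict_proba.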
Summary Statistics (Graded Work)
What summary statistics should be calculated? See the relevant assignment on PrairieLearn!
Model Training (Graded Work)
For this lab, you may train models however you’d like! The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the type and shape of X_train and y_train should not be changed.
  - The number of features can and should be modified via a pipeline, but the pipeline must start from the given X_train.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your serialized model must be less than 5MB.
  - Be aware: some models use more disk space than others!
You will submit your chosen model to an autograder for checking. It will calculate your model's performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against the test data multiple times, but you know that is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you'll only need to submit once!
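For example (a sketch only, reusing the pipeline idea from above; the grid values are arbitrary placeholders), GridSearchCV can cross-validate a few vectorizer settings before you commit to a submission:

# cross-validate a small, arbitrary grid of vectorizer settings
# (parameter values here are illustrative placeholders)
search_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)
param_grid = {
    "vectorizer__max_features": [500, 1000, 2000],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}
search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)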
# use this cell to train models
To submit your model to the autograder, you will need to serialize it. In the following cell, replace _____ with the model you have found.
"airline_sentiment.joblib", compress=3) dump(______,
After you run this cell, a file will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the lab on PrairieLearn.
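If you would like to confirm the 5MB size limit before submitting (an optional sketch, assuming the file name used above), you can check the file's size on disk and verify that it still loads:

# optional sanity check: file size in MB and a reload of the serialized model
import os
size_mb = os.path.getsize("airline_sentiment.joblib") / (1024 * 1024)
print(f"Serialized model size: {size_mb:.2f} MB")
reloaded_model = load("airline_sentiment.joblib")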
Discussion
# use this cell to create and print any supporting statistics
# the autograder will give you a test confusion matrix
# with that information, you can calculate any and all relevant metrics
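As one possible approach (a sketch using a made-up confusion matrix; replace the placeholder counts with the values the autograder reports), overall accuracy and per-class precision and recall can be computed directly from the matrix with numpy:

# hypothetical 3x3 confusion matrix (rows: true class, columns: predicted class)
# replace these placeholder counts with the autograder's values
cm = np.array(
    [
        [600, 50, 20],
        [80, 150, 30],
        [30, 20, 120],
    ]
)
accuracy = np.trace(cm) / np.sum(cm)
recall_per_class = np.diag(cm) / cm.sum(axis=1)
precision_per_class = np.diag(cm) / cm.sum(axis=0)
print("Accuracy:", accuracy)
print("Per-class recall:", recall_per_class)
print("Per-class precision:", precision_per_class)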
Graded discussion: Describe how an airline could use a model like this as part of their operations. In doing so, what are the potential mistakes the model could make, and what are the severities of these mistakes?
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Be sure that you have added your name at the top of this notebook.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.