# basics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import dump, load
# machine learning
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
Lab 07: Airline Tweets Sentiment Analysis
Introduction and Data
Goal: The goal of this lab is to create a sentiment classifier that can automatically classify tweets at US airlines as one of three sentiments: negative, neutral, or positive.
To do this, you’ll need, at a minimum, the imports shown at the top of this notebook.
You are free to import additional packages and modules as you see fit, and you will almost certainly need to.
The data for this lab originally comes from Kaggle.
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
TODO: What modifications have been done?
We present the modified train data as both a complete data frame and as separate X and y data. The former will be useful for calculating summary statistics; the latter will be useful for model training.
Note that we are not providing a test dataset. Instead, the test dataset will live within the autograder, and once you submit, you will receive feedback and metrics based on the test data. (Therefore, cross-validation or a validation set will be your friend here.)
train = pd.read_csv("https://cs307.org/lab/lab-07/data/tweets_train.csv")
train
|  | airline_sentiment | airline | text |
| --- | --- | --- | --- |
| 0 | negative | United | @united at its worse. Can't figure how to pack... |
| 1 | negative | Delta | @JetBlue I did not report the updated info - d... |
| 2 | negative | Delta | @JetBlue I'll give u a chance but I don't thin... |
| 3 | negative | United | @united Yo yo yo stuck on the tarmac for over ... |
| 4 | negative | US Airways | @USAirways yes, I was rebooked the next day (W... |
| ... | ... | ... | ... |
| 10975 | positive | Southwest | @SouthwestAir replacing @vitaminwater with bee... |
| 10976 | negative | American | @AmericanAir at LAX and your service reps just... |
| 10977 | negative | Southwest | @SouthwestAir Been on hold for over an hour - ... |
| 10978 | negative | United | @united we would...how do I contact you to dis... |
| 10979 | neutral | Delta | @JetBlue that's ok! It just sure seemed like i... |

10980 rows × 3 columns
The airline column is only provided for illustrative purposes and should not be used for model training. It is removed in the code below.
# create X and y for train data
X_train = train["text"]
y_train = train["airline_sentiment"]
Bag-of-Words
To use the text of the tweets as input to machine learning models, you will need to do some preprocessing; this text cannot simply be fed into the models we have seen.
X_train
0 @united at its worse. Can't figure how to pack...
1 @JetBlue I did not report the updated info - d...
2 @JetBlue I'll give u a chance but I don't thin...
3 @united Yo yo yo stuck on the tarmac for over ...
4 @USAirways yes, I was rebooked the next day (W...
...
10975 @SouthwestAir replacing @vitaminwater with bee...
10976 @AmericanAir at LAX and your service reps just...
10977 @SouthwestAir Been on hold for over an hour - ...
10978 @united we would...how do I contact you to dis...
10979 @JetBlue that's ok! It just sure seemed like i...
Name: text, Length: 10980, dtype: object
To do so, we will create a so-called bag-of-words. Let’s see what that looks like with a smaller set of strings.
word_counter = CountVectorizer()
word_counts = word_counter.fit_transform(
    [
        "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo",
        "The quick brown fox jumps over the lazy dog",
        "",
    ]
).todense()
print(word_counter.vocabulary_)
{'buffalo': 1, 'the': 8, 'quick': 7, 'brown': 0, 'fox': 3, 'jumps': 4, 'over': 6, 'lazy': 5, 'dog': 2}
sorted(list(word_counter.vocabulary_.keys()))
['brown', 'buffalo', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the']
print(word_counts)
[[0 8 0 0 0 0 0 0 0]
[1 0 1 1 1 1 1 1 2]
[0 0 0 0 0 0 0 0 0]]
pd.DataFrame(word_counts, columns=sorted(list(word_counter.vocabulary_.keys())))
|  | brown | buffalo | dog | fox | jumps | lazy | over | quick | the |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Essentially, we’ve created a number of feature variables, each one counting how many times a word in the vocabulary appears in a sample’s text.
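One detail worth noting (a small illustrative sketch, not part of the lab's required steps): once a CountVectorizer has been fit, transforming new text reuses the learned vocabulary, and any words it has not seen before are simply ignored.

# transform a new string with the already-fit vectorizer;
# "cat" is not in the learned vocabulary, so it is dropped
word_counter.transform(["the buffalo jumps over the lazy cat"]).todense()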
Let’s find the 100 most common words in the train tweets at the airlines.
top_100_counter = CountVectorizer(max_features=100)
X_top_100 = top_100_counter.fit_transform(X_train)

print("Top 100 Words:")
print(top_100_counter.get_feature_names_out())
print("")
Top 100 Words:
['about' 'after' 'airline' 'all' 'am' 'americanair' 'amp' 'an' 'and' 'any'
'are' 'as' 'at' 'back' 'bag' 'be' 'been' 'but' 'by' 'call' 'can'
'cancelled' 'co' 'customer' 'delayed' 'do' 'don' 'flight' 'flightled'
'flights' 'for' 'from' 'gate' 'get' 'got' 'guys' 'had' 'has' 'have'
'help' 'hold' 'hour' 'hours' 'how' 'http' 'if' 'in' 'is' 'it' 'jetblue'
'just' 'late' 'like' 'me' 'more' 'my' 'need' 'no' 'not' 'now' 'of' 'on'
'one' 'or' 'our' 'out' 'over' 'phone' 'plane' 'please' 'service' 'so'
'southwestair' 'still' 'thank' 'thanks' 'that' 'the' 'there' 'they'
'this' 'time' 'to' 'today' 'united' 'up' 'us' 'usairways' 've'
'virginamerica' 'was' 'we' 'what' 'when' 'why' 'will' 'with' 'would'
'you' 'your']
X_top_100_dense = X_top_100.todense()
X_top_100_dense
matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 2, 1, 0],
[0, 0, 0, ..., 0, 0, 0]])
X_top_100.shape
(10980, 100)
plane_idx = np.where(top_100_counter.get_feature_names_out() == "plane")
plane_count = np.sum(X_top_100.todense()[:, plane_idx])
print('The Word "plane" Appears:', plane_count)
The Word "plane" Appears: 482
Note that you’ll need to do this same process, but within a pipeline! You might also consider looking into other techniques to process text for input to models.
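To give a sense of what that might look like (a minimal sketch only, not a tuned or required solution; the max_features value is an arbitrary placeholder), a pipeline could chain a CountVectorizer with a simple classifier such as MultinomialNB, both of which are already imported above:

# a minimal example pipeline: bag-of-words followed by naive Bayes
# (the settings here are placeholders, not recommendations)
example_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer(max_features=1000)),
        ("classifier", MultinomialNB()),
    ]
)
example_pipeline.fit(X_train, y_train)
example_pipeline.predict_proba(X_train[:5])

Because the vectorizer lives inside the pipeline, the same preprocessing is applied automatically to any new text passed to predict or predict_proba.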
Summary Statistics (Graded Work)
What summary statistics should be calculated? See the relevant assignment on PrairieLearn!
Model Training (Graded Work)
For this lab, you may train models however you’d like! The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the type and shape of X_train and y_train should not be changed.
  - The number of features can and should be modified via a pipeline, but the pipeline must start from the given X_train.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your serialized model must be less than 5MB.
  - Be aware: some models use more disk space than others!
You will submit your chosen model to an autograder for checking. It will calculate your model's performance on the test data. Notice that, because you will have unlimited attempts, this somewhat encourages checking against the test data multiple times, but you know that is bad in practice. Also, if you use cross-validation to find a good model before submitting, hopefully you'll only need to submit once!
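For example (a sketch only, reusing the pipeline idea from above; the grid values are arbitrary placeholders), GridSearchCV can cross-validate a few vectorizer settings before you commit to a submission:

# cross-validate a small, arbitrary grid of vectorizer settings
# (parameter values here are illustrative placeholders)
search_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)
param_grid = {
    "vectorizer__max_features": [500, 1000, 2000],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}
search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)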
# use this cell to train models
To submit your model to the autograder, you will need to serialize it. In the following cell, replace _____ with the model you have found.
"airline_sentiment.joblib", compress=3) dump(______,
After you run this cell, a file will be written in the same folder as this notebook that you should submit to the autograder. See the relevant question in the lab on PrairieLearn.
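If you would like to confirm the 5MB size limit before submitting (an optional sketch, assuming the file name used above), you can check the file's size on disk and verify that it still loads:

# optional sanity check: file size in MB and a reload of the serialized model
import os
size_mb = os.path.getsize("airline_sentiment.joblib") / (1024 * 1024)
print(f"Serialized model size: {size_mb:.2f} MB")
reloaded_model = load("airline_sentiment.joblib")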
Discussion
# use this cell to create and print any supporting statistics
# the autograder will give you a test confusion matrix
# with that information, you can calculate any and all relevant metrics
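As one possible approach (a sketch using a made-up confusion matrix; replace the placeholder counts with the values the autograder reports), overall accuracy and per-class precision and recall can be computed directly from the matrix with numpy:

# hypothetical 3x3 confusion matrix (rows: true class, columns: predicted class)
# replace these placeholder counts with the autograder's values
cm = np.array(
    [
        [600, 50, 20],
        [80, 150, 30],
        [30, 20, 120],
    ]
)
accuracy = np.trace(cm) / np.sum(cm)
recall_per_class = np.diag(cm) / cm.sum(axis=1)
precision_per_class = np.diag(cm) / cm.sum(axis=0)
print("Accuracy:", accuracy)
print("Per-class recall:", recall_per_class)
print("Per-class precision:", precision_per_class)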
Graded discussion: Describe how an airline could use a model like this as part of their operations. In doing so, what are the potential mistakes the model could make, and what are the severities of these mistakes?
Submission
Before submitting, please review the Lab Policy document on the course website. This document contains additional directions for completing and submitting your lab. It also defines the grading rubric for the lab.
Be sure that you have added your name at the top of this notebook.
Once you’ve reviewed the lab policy document, head to Canvas to submit your lab notebook.