One of the trending topics in Natural Language Processing (NLP) is sentiment analysis. Sentiment analysis involves extracting subjective information from documents such as posts and reviews to determine the opinion with respect to products, services, events, or ideas.
This project uses the customer review data from Amazon.com to perform a supervised binary (positive or negative) sentiment classification analysis. We use various data pre-processing techniques and demonstrate their effectiveness in improving the classification. We also compare three machine learning models, namely, the multinomial Naive Bayes classification model (MultinomialNB), the logistic regression model (LogisticRegression), and the linear support vector classification model (LinearSVC).
The results of the analysis show that adding negation handling and n-gram modeling techniques to the data preprocessing can significantly increase model accuracy. The results also indicate that the LinearSVC model provides the best prediction accuracy.
# user defined function to indicate processing progress
import my_utils
import html
import time
import functools
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as sw
import string
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
import pickle
The data comes from the "Amazon product data" website maintained by Dr. Julian McAuley at UCSD. We choose the smaller subset of the customer review data from the Kindle store of Amazon.com. The data is in JSON format and contains 982,619 reviews and metadata spanning May 1996 - July 2014.
import json
data = pd.read_json('Kindle_Store_5.json', lines=True)
data.head()
Reviews with overall rating of 1, 2, or 3 are labeled as negative ("neg"), and reviews with overall rating of 4 or 5 are labeled as positive ("pos").
data.columns.values
# Make a new column named 'pos_neg', which has value 'neg' if the overall rating is 1, 2, or 3,
# and value 'pos' if the overall rating is 4 or 5.
data.loc[data.overall.isin([1,2,3]), 'pos_neg'] = 'neg'
data.loc[data.overall.isin([4,5]), 'pos_neg'] = 'pos'
# Choose only the columns 'pos_neg' and 'reviewText'
df = data[['pos_neg', 'reviewText']]
df.head()
Since the dataset is imbalanced (more than 84% of the reviews are positive), we undersample the positive reviews (the majority class) so that they match the number of negative reviews exactly.
# Check whether there is any missing data
df.isnull().sum()
data.shape
# Check the distribution of the positive and negative reviews
df.pos_neg.value_counts()
# Sample positive reviews to get a balanced dataset
neg = df.loc[df.pos_neg=='neg']
pos = df.loc[df.pos_neg=='pos'].sample(n=df.pos_neg.value_counts()['neg'], random_state=42)
print(type(pos))
print("pos:", len(pos), ", neg:", len(neg))
Data preprocessing uses the following steps: unescape HTML entities; expand the contractions "can't" and "n't"; optionally perform negation handling (prefix "not_" to each word that follows "not" or "no", up to the next punctuation mark); lower-case and tokenize; optionally add bigram/trigram tokens; optionally remove stop words; and finally lemmatize the tokens.
lemmatizer = nltk.WordNetLemmatizer()
stopwords = sw.words('english')
stopwords = stopwords + ['not_' + w for w in stopwords]
# transform punctuation to blanks
trans_punct = str.maketrans(string.punctuation,' '*len(string.punctuation))
# pad punctuation with blanks
pad_punct = str.maketrans({key: " {0} ".format(key) for key in string.punctuation})
# remove "_" from string.punctuation
invalidChars = str(string.punctuation.replace("_", ""))
def preprocessing(line, ngram=1, neg_handling=True, remove_stop=False):
    """
    Preprocessing the review texts
    @params:
        line         - Required: the input text (Str)
        ngram        - Optional: number n in the n-gram model (Int, 1, 2, or 3)
        neg_handling - Optional: whether to perform negation handling (Boolean)
        remove_stop  - Optional: whether to remove the stop words (Boolean)
    """
    line = html.unescape(str(line))
    line = str(line).replace("can't", "can not")
    line = str(line).replace("n't", " not")
    if neg_handling:
        line = str(line).translate(pad_punct)    # If performing negation handling, pad punctuation with blanks
        line = nltk.word_tokenize(line.lower())  # Word normalization and tokenization
        tokens = []
        negated = False
        for t in line:
            if t in ['not', 'no']:
                negated = not negated
            elif t in string.punctuation or not t.isalpha():
                negated = False  # Punctuation ends the scope of a negation
            else:
                tokens.append('not_' + t if negated else t)  # add "not_" prefix to words behind "not" or "no"
    else:
        line = str(line).translate(trans_punct)  # If not performing negation handling, remove punctuation
        line = nltk.word_tokenize(line.lower())  # Word normalization and tokenization
        tokens = line
    if ngram == 2:
        bi_tokens = list(nltk.bigrams(line))
        bi_tokens = list(map('_'.join, bi_tokens))
        bi_tokens = [i for i in bi_tokens if all(j not in invalidChars for j in i)]
        tokens = tokens + bi_tokens
    if ngram == 3:
        bi_tokens = list(nltk.bigrams(line))
        bi_tokens = list(map('_'.join, bi_tokens))
        bi_tokens = [i for i in bi_tokens if all(j not in invalidChars for j in i)]
        tri_tokens = list(nltk.trigrams(line))
        tri_tokens = list(map('_'.join, tri_tokens))
        tri_tokens = [i for i in tri_tokens if all(j not in invalidChars for j in i)]
        tokens = tokens + bi_tokens + tri_tokens
    if remove_stop:
        line = [lemmatizer.lemmatize(t) for t in tokens if t not in stopwords]
    else:
        line = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(line)
line = "I don't think this book has any decent information!!! It is full of typos and factual errors that I can't ignore."
preprocessing(line, ngram=1, neg_handling=False, remove_stop=False)
preprocessing(line, ngram=1, neg_handling=False, remove_stop=True)
preprocessing(line, ngram=1, neg_handling=True, remove_stop=False)
preprocessing(line, ngram=3, neg_handling=True, remove_stop=False)
# Preprocessing the positive reviews
pos_data = []
n_pos = len(pos)
for i, p in enumerate(pos['reviewText']):
    pos_data.append(preprocessing(p, ngram=3))
    my_utils.print_progress(bar_length=50, decimals=0, iteration=i+1, total=n_pos, prefix='Preprocessing pos data: ')
# Preprocessing the negative reviews
neg_data = []
n_neg = len(neg)
for i, n in enumerate(neg['reviewText']):
    neg_data.append(preprocessing(n, ngram=3))
    my_utils.print_progress(bar_length=50, decimals=0, iteration=i+1, total=n_neg, prefix='Preprocessing neg data: ')
# Combine the preprocessed data
data = pos_data + neg_data
labels = np.concatenate((pos['pos_neg'].values, neg['pos_neg'].values))
We split the whole dataset randomly into training, validation, and testing sets in the proportions 60%, 20%, and 20%, respectively.
# Split the dataset into training, validation, and test sets (60-20-20)
train_data, rest_data, train_labels, rest_labels = train_test_split(data, labels, test_size=0.4,
                                                                    stratify=labels, random_state=1234)
valid_data, test_data, valid_labels, test_labels = train_test_split(rest_data, rest_labels, test_size=0.5,
                                                                    stratify=rest_labels, random_state=1234)
print("training size = ", len(train_data), "validation size = ", len(valid_data), "testing size = ", len(test_data))
We use vectorization to turn the collection of text documents into numerical feature vectors. To extract numerical features from text content, we use the Bag of Words strategy: tokenize the documents and give an integer id to each possible token, count the occurrences of tokens in each document, and reweight the counts so that tokens occurring in the majority of documents carry diminishing importance (tf-idf).
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus, while completely ignoring the relative position information of the words in the document.
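As a minimal illustration of this representation (the two toy sentences are invented for this example; `get_feature_names_out` requires scikit-learn 1.0 or later):
# A toy corpus of two "documents"
toy_corpus = ["the book was great", "the book was not_great"]
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_corpus)  # 2 x n_tokens sparse matrix
print(toy_vec.get_feature_names_out())  # the token represented by each column
print(toy_counts.toarray())             # one row per document, one count per token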
# Push all tokens and compute the frequency of words
tokens = [word for line in train_data for word in nltk.word_tokenize(line)]
word_features = nltk.FreqDist(tokens)
print(word_features)
# Print the 10 most common words
word_features.most_common(10)
# Remove features (words) which occur only once (This is to be used in the basic modeling process)
topwords = [fpair[0] for fpair in list(word_features.most_common(len(word_features))) if fpair[1]>=2]
len(topwords)
# Convert a collection of raw documents to a matrix of TF-IDF features.
# Equivalent to CountVectorizer followed by TfidfTransformer.
# Fitting on a single document that joins all the topwords fixes the
# vocabulary to topwords; note that with only one fitting document the idf
# weights are all equal, so the transform amounts to normalized term frequencies.
tf_vec = TfidfVectorizer()
tf_vec.fit_transform([' '.join(topwords)])
# Extract features from training set
# Vocabulary is from topwords
train_features = tf_vec.transform(train_data)
# Sparse matrix of shape (n_train_data, n_features)
train_features.shape
# Extract features from test set
test_features = tf_vec.transform(test_data)
test_features.shape
Next we demonstrate the effectiveness of the negation handling and n-gram modeling techniques, and compare three machine learning algorithms, namely, the multinomial Naive Bayes classification model (MultinomialNB), the logistic regression model (LogisticRegression), and the linear support vector classification model (LinearSVC).
As a basic feature selection procedure, we remove features/tokens which occur only once to avoid over-fitting. We also use the default penalty parameter in each machine learning algorithm.
The following table illustrates the model accuracy on the testing dataset by using different preprocessing procedures and different machine learning algorithms:
Preprocessing procedure added | Number of features/tokens | MultinomialNB | LogisticRegression | LinearSVC |
---|---|---|---|---|
Basic preprocessing^ | 56,558 | 0.8329 | 0.8453 | 0.8485 |
Adding negation handling | 71,853 | 0.8262 | 0.8519 | 0.8562 |
Adding bigrams and trigrams | 2,027,753 | 0.8584 | 0.8675 | 0.8731 |
^ Basic preprocessing includes uni-gram modeling but no negation handling.
The above table clearly shows that adding negation handling and n-gram modeling techniques can significantly increase the model accuracy. The table also indicates that the LinearSVC model provides the best prediction accuracy.
We omit the code for the model accuracy evaluation with basic preprocessing. In the rest of the code in this chapter, we perform the modeling processes based on the full preprocessing (negation handling plus bigram/trigram modeling) and the basic feature selection above (removing tokens that occur only once):
mnb_model = MultinomialNB()
mnb_model
# Train Model
mnb_model.fit(train_features, train_labels)
# Predict
pred = mnb_model.predict(test_features)
print(pred)
# Metrics
accuracy = metrics.accuracy_score(test_labels, pred)
print(accuracy)
print(metrics.classification_report(y_true=test_labels, y_pred=pred, digits=4))
lgr_model = LogisticRegression()
print(lgr_model, end='\n'*2)
lgr_model.fit(train_features, train_labels)
lgr_pred = lgr_model.predict(test_features)
print('Accuracy = %.5f' % metrics.accuracy_score(test_labels, lgr_pred))
print(metrics.classification_report(y_pred=lgr_pred, y_true=test_labels, digits=4))
svc_model = LinearSVC()
print(svc_model, end='\n'*2)
svc_model.fit(train_features, train_labels)
svc_pred = svc_model.predict(test_features)
print('Accuracy = %.5f' % metrics.accuracy_score(test_labels, svc_pred))
print(metrics.classification_report(y_pred=svc_pred, y_true=test_labels, digits=4))
The models in the table above were trained with a rather coarse feature selection procedure: we simply removed features/tokens which occur only once.
To reach better prediction power, we can fine-tune the number of features for each algorithm using the validation dataset. Here we perform all of the preprocessing procedures, including negation handling and bigram/trigram modeling.
def train_with_n_topwords(n, model_name='MultinomialNB', tfidf=True, valid=True, alpha=1.0):
    """
    Training the dataset using the selected model and settings
    @params:
        n          - Required: the number of features used to train the model (Int)
        model_name - Optional: the model name ('MultinomialNB' | 'LogisticRegression' | 'LinearSVC')
        tfidf      - Optional: whether to perform the tf-idf transformation (Boolean)
        valid      - Optional: whether to evaluate on the validation set or the test set (Boolean)
        alpha      - Optional: the penalty parameter of the training model (Float)
    """
    if model_name not in ['MultinomialNB', 'LinearSVC', 'LogisticRegression']:
        print("Wrong model name.")
        return
    # Keep the n most frequent tokens as the vocabulary
    topwords = [fpair[0] for fpair in list(word_features.most_common(n))]
    if tfidf:
        vec = TfidfVectorizer()
    else:
        vec = CountVectorizer()
    vec.fit_transform([' '.join(topwords)])
    # Model
    if model_name == 'MultinomialNB':
        model = MultinomialNB(alpha=alpha)
    elif model_name == 'LinearSVC':
        model = LinearSVC(C=alpha)
    elif model_name == 'LogisticRegression':
        model = LogisticRegression(C=alpha)
    train_X = vec.transform(train_data)
    model.fit(train_X, train_labels)
    if valid:
        valid_X = vec.transform(valid_data)
        pred = model.predict(valid_X)
        metr = metrics.accuracy_score(valid_labels, pred)
    else:
        test_X = vec.transform(test_data)
        pred = model.predict(test_X)
        metr = metrics.accuracy_score(test_labels, pred)
    print("N of topwords:", n, "alpha:", alpha, "accuracy:", metr)
    return metr, vec, model
# MultinomialNB
possible_n = [100000 * i for i in range(1, 21)]
mnb_tfidf_accuracies = []
for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name='MultinomialNB')[0]
    mnb_tfidf_accuracies.append([n, metr])
mnb_accu = pd.DataFrame(mnb_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(mnb_accu.topwords, mnb_accu.accuracy, label='MultinomialNB')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Multinomial Naive Bayes Model")
fig.savefig("model_mnb_large_100k.png")
mnb_accu[mnb_accu.accuracy==max(mnb_accu.accuracy)]
best_n_topwords_mnb = mnb_accu[mnb_accu.accuracy==max(mnb_accu.accuracy)].iloc[0,0]
best_n_topwords_mnb
# Logistic
possible_n = [100000 * i for i in range(1, 21)]
lgr_tfidf_accuracies = []
for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name="LogisticRegression")[0]
    lgr_tfidf_accuracies.append([n, metr])
lgr_accu = pd.DataFrame(lgr_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(lgr_accu.topwords, lgr_accu.accuracy, label='LogisticRegression')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Logistic Regression Model")
fig.savefig("model_lgr_large_100k.png")
lgr_accu[lgr_accu.accuracy==max(lgr_accu.accuracy)]
best_n_topwords_lgr = lgr_accu[lgr_accu.accuracy==max(lgr_accu.accuracy)].iloc[0,0]
best_n_topwords_lgr
# Linear SVC
possible_n = [100000 * i for i in range(1, 21)]
svc_tfidf_accuracies = []
for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name="LinearSVC")[0]
    svc_tfidf_accuracies.append([n, metr])
svc_accu = pd.DataFrame(svc_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(svc_accu.topwords, svc_accu.accuracy, label='LinearSVC')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Linear SVC Model")
fig.savefig("model_svc_large_100k.png")
svc_accu[svc_accu.accuracy==max(svc_accu.accuracy)]
best_n_topwords_svc = svc_accu[svc_accu.accuracy==max(svc_accu.accuracy)].iloc[0,0]
best_n_topwords_svc
fig = plt.figure(figsize=(10,6))
plt.plot(mnb_accu.topwords, mnb_accu.accuracy, label='MultinomialNB')
plt.plot(lgr_accu.topwords, lgr_accu.accuracy, label='LogisticRegression')
plt.plot(svc_accu.topwords, svc_accu.accuracy, label='LinearSVC')
plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("MultinomialNB vs. LinearSVC vs. LogisticRegression models")
fig.savefig("model_all3_large_100k.png")
The plot of model accuracy vs. number of features above clearly shows that LinearSVC consistently has the highest accuracy, followed by LogisticRegression, with MultinomialNB last. The following table summarizes the best number of features and the model accuracy on the validation set.
| | MultinomialNB | LogisticRegression | LinearSVC |
---|---|---|---|
Best number of features | 1,000,000 | 500,000 | 1,700,000 |
Accuracy on validation set | 0.8580 | 0.8697 | 0.8746 |
To get better accuracy, we can perform a grid search over the number of features and the penalty parameter at the same time.
As a demonstration, we only complete the search for the LinearSVC model; the corresponding grid-search code for the other two models is shown for reference.
# MultinomialNB
mnb_grid_tfidf_accuracies = []
for n in [100000 * i for i in range(1, 21)]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="MultinomialNB", alpha=alpha)[0]
        mnb_grid_tfidf_accuracies.append([n, alpha, metr])
mnb_grid_accu = pd.DataFrame(mnb_grid_tfidf_accuracies, columns=['topwords', 'alpha', 'accuracy'])
mnb_grid_accu[mnb_grid_accu.accuracy==max(mnb_grid_accu.accuracy)]
# Logistic
lgr_grid_tfidf_accuracies = []
for n in [100000 * i for i in range(1, 21)]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="LogisticRegression", alpha=alpha)[0]
        lgr_grid_tfidf_accuracies.append([n, alpha, metr])
lgr_grid_accu = pd.DataFrame(lgr_grid_tfidf_accuracies, columns=['topwords', 'C', 'accuracy'])
lgr_grid_accu[lgr_grid_accu.accuracy==max(lgr_grid_accu.accuracy)]
As a simple demonstration, and to save training time, we fix the number of features at its best value (found above with the default penalty parameter C=1.0) and tune only the penalty parameter C.
# Linear SVC
svc_grid_tfidf_accuracies = []
for n in [best_n_topwords_svc]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="LinearSVC", alpha=alpha)[0]
        svc_grid_tfidf_accuracies.append([n, alpha, metr])
svc_grid_accu = pd.DataFrame(svc_grid_tfidf_accuracies, columns=['topwords', 'C', 'accuracy'])
svc_grid_accu[svc_grid_accu.accuracy==max(svc_grid_accu.accuracy)]
best_alpha_svc = svc_grid_accu[svc_grid_accu.accuracy==max(svc_grid_accu.accuracy)].iloc[0,1]
best_alpha_svc
| | MultinomialNB | LogisticRegression | LinearSVC | LinearSVC (with penalty parameter C=0.4) |
---|---|---|---|---|
Best number of features | 1,000,000 | 500,000 | 1,700,000 | 1,700,000 |
Accuracy on testing set | 0.8585 | 0.8682 | 0.8730 | 0.8742 |
# MultinomialNB model accuracy based on the default model penalty parameter alpha=1.0
mnb_accuracy, mnb_vec, mnb_model = train_with_n_topwords(n=best_n_topwords_mnb, model_name="MultinomialNB", valid=False)
# LogisticRegression model accuracy based on the default model penalty parameter C=1.0
lgr_accuracy, lgr_vec, lgr_model = train_with_n_topwords(n=best_n_topwords_lgr, model_name="LogisticRegression", valid=False)
# LinearSVC model accuracy based on the default model penalty parameter C=1.0
svc_accuracy, svc_vec, svc_model = train_with_n_topwords(n=best_n_topwords_svc, model_name="LinearSVC", valid=False)
# LinearSVC model accuracy based on the tuned model penalty parameter C=0.4
svc_accuracy, svc_vec, svc_model = train_with_n_topwords(n=best_n_topwords_svc, model_name="LinearSVC", valid=False,
                                                         alpha=best_alpha_svc)
def predict_new(text, model_name='LinearSVC'):
    """
    Predict the sentiment of the 'text'.
    @params:
        text       - Required: the text to predict (Str)
        model_name - Optional: the model name ('LinearSVC' | 'MultinomialNB' | 'LogisticRegression')
    """
    sentence = preprocessing(text, ngram=3)  # apply the same preprocessing as in training
    if model_name == 'LinearSVC':
        features = svc_vec.transform([sentence])
        pred = svc_model.predict(features)
    elif model_name == 'MultinomialNB':
        features = mnb_vec.transform([sentence])
        pred = mnb_model.predict(features)
    elif model_name == 'LogisticRegression':
        features = lgr_vec.transform([sentence])
        pred = lgr_model.predict(features)
    return pred[0]
# An example of new review text
review_text = 'This book seems to have some decent information, but it is full of typos and factual errors.'
predict_new(review_text, model_name='MultinomialNB')
predict_new(review_text, model_name='LogisticRegression')
predict_new(review_text, model_name='LinearSVC')
Use module "pickle" to save the model for furture use.
# Save the vectorizer
with open('tf_vec.pkl', 'wb') as pkl_file:
    pickle.dump(svc_vec, pkl_file)
# Save the model
with open('svc_model.pkl', 'wb') as pkl_file:
    pickle.dump(svc_model, pkl_file)
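To reuse them later (for example, when the web app starts), the saved objects can be reloaded with pickle.load; a minimal sketch:
# Reload the saved vectorizer and model
with open('tf_vec.pkl', 'rb') as pkl_file:
    loaded_vec = pickle.load(pkl_file)
with open('svc_model.pkl', 'rb') as pkl_file:
    loaded_model = pickle.load(pkl_file)
# The reloaded objects behave like the originals
print(loaded_model.predict(loaded_vec.transform([preprocessing('Great book!', ngram=3)])))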
To demonstrate this project, we wrote a web application with Flask, a lightweight WSGI web application framework. You can enter some review text and let the app analyze the sentiment of your entry. Have fun!
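The app itself is not included here; as a hypothetical sketch (the route, form field, and inline template below are invented for illustration), its core could look like this:
from flask import Flask, request, render_template_string

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def analyze():
    sentiment = None
    if request.method == 'POST':
        # Apply the same preprocessing as in training, then predict
        sentence = preprocessing(request.form['review'], ngram=3)
        sentiment = svc_model.predict(svc_vec.transform([sentence]))[0]
    return render_template_string(
        '<form method="post"><textarea name="review"></textarea>'
        '<button type="submit">Analyze</button></form>'
        '<p>Sentiment: {{ s }}</p>', s=sentiment)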
Here are some closing thoughts on this project.
We tried removing stop words during the preprocessing stage, but it decreased the model accuracy. We only used the stop-word list from the nltk package, which may be the reason. An approach worth trying is to manually select stop words related to the topic of the dataset, as sketched below. That said, since we have a very large dataset and also perform tf-idf reweighting, removing stop words may not be necessary.
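For instance, a manually curated list can be passed directly to the vectorizer (the extra words are invented examples of topic-specific stop words for book reviews):
# Hypothetical topic-specific stop words, added to the nltk English list
custom_stopwords = sw.words('english') + ['book', 'kindle', 'read', 'story']
custom_vec = TfidfVectorizer(stop_words=custom_stopwords)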
To save tuning time, we did a rather coarse tuning of the number of features, in increments of 100,000 features. A finer tuning may yield better prediction power.
We can also tune the penalty parameter in each machine learning algorithm. We only tuned it for the LinearSVC model and used the default elsewhere; tuning this parameter in the other algorithms may also yield better prediction power.
We notice that, under the current three machine learning models, parameter tuning may not provide a significant accuracy boost. More advanced models such as Long Short-Term Memory (LSTM) networks may be adopted, as sketched below.
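A minimal sketch of that direction, assuming TensorFlow/Keras is available (the architecture and hyperparameters are illustrative, not tuned):
import tensorflow as tf
from tensorflow.keras import layers

max_words, seq_len = 20000, 200
tok = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
tok.fit_on_texts(train_data)  # raw review text would be more typical for a sequence model
X = tf.keras.preprocessing.sequence.pad_sequences(
    tok.texts_to_sequences(train_data), maxlen=seq_len)
y = (np.array(train_labels) == 'pos').astype(int)  # binary labels

lstm_model = tf.keras.Sequential([
    layers.Embedding(max_words, 64),        # learn 64-dimensional word embeddings
    layers.LSTM(64),                        # sequence model over the token ids
    layers.Dense(1, activation='sigmoid'),  # positive-class probability
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
lstm_model.fit(X, y, validation_split=0.2, epochs=2, batch_size=256)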