Sentiment Analysis of Amazon Customer Reviews

by Qiang (Jesse) Zhen

One of the trending topics in Natural Language Processing (NLP) is sentiment analysis. Sentiment analysis extracts subjective information from documents such as posts and reviews to determine the writer's opinion of products, services, events, or ideas.

This project uses customer review data from Amazon.com to perform a supervised binary (positive or negative) sentiment classification analysis. We apply various data preprocessing techniques and demonstrate their effectiveness in improving the classification. We also compare three machine learning models, namely the multinomial Naive Bayes classification model (MultinomialNB), the logistic regression model (LogisticRegression), and the linear support vector classification model (LinearSVC).

The results of the analysis show that adding negation handling and n-gram modeling to the data preprocessing significantly increases the model accuracy. The results also indicate that the LinearSVC model provides the best prediction accuracy.

1. Data Understanding

1.1 Import modules

In [1]:
# user-defined function to indicate processing progress
import my_utils

import html
import time
import functools
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as sw
import string
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

import pickle
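
The my_utils module imported above is the author's own helper and is not shown. A minimal stand-in with the same call signature might look like this (a hypothetical implementation, saved as my_utils.py next to the notebook):

# my_utils.py -- hypothetical stand-in for the author's progress-bar helper
import sys

def print_progress(iteration, total, prefix='', bar_length=50, decimals=0):
    """Print an in-place text progress bar."""
    fraction = iteration / float(total)
    filled = int(round(bar_length * fraction))
    bar = '=' * filled + '-' * (bar_length - filled)
    percent = '{0:.{1}f}'.format(100 * fraction, int(decimals))
    sys.stdout.write('\r%s |%s| %s%% ' % (prefix, bar, percent))
    if iteration == total:
        sys.stdout.write('\n')
    sys.stdout.flush()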

1.2 Load data

The data comes from the website "Amazon product data" managed by Dr. Julian McAuley from UCSD. We choose the smaller subset of the customer review data from the Kindle store of Amazon.com. The data is in the JSON format, which contains 982,619 reviews and metadata spanning May 1996 - July 2014.

In [2]:
import json
data = pd.read_json('Kindle_Store_5.json', lines=True)
In [3]:
data.head()
Out[3]:
asin helpful overall reviewText reviewTime reviewerID reviewerName summary unixReviewTime
0 B000F83SZQ [0, 0] 5 I enjoy vintage books and movies so I enjoyed ... 05 5, 2014 A1F6404F1VG29J Avidreader Nice vintage story 1399248000
1 B000F83SZQ [2, 2] 4 This book is a reissue of an old one; the auth... 01 6, 2014 AN0N05A9LIJEQ critters Different... 1388966400
2 B000F83SZQ [2, 2] 4 This was a fairly interesting read. It had ol... 04 4, 2014 A795DMNCJILA6 dot Oldie 1396569600
3 B000F83SZQ [1, 1] 5 I'd never read any of the Amy Brewster mysteri... 02 19, 2014 A1FV0SX13TWVXQ Elaine H. Turley "Montana Songbird" I really liked it. 1392768000
4 B000F83SZQ [0, 1] 4 If you like period pieces - clothing, lingo, y... 03 19, 2014 A3SPTOKDG7WBLN Father Dowling Fan Period Mystery 1395187200

1.3 Generate the sentiment label

Reviews with overall rating of 1, 2, or 3 are labeled as negative ("neg"), and reviews with overall rating of 4 or 5 are labeled as positive ("pos").

  • Make a new column named 'pos_neg':
    • pos_neg = 'neg' when overall = 1, 2, or 3
    • pos_neg = 'pos' when overall = 4 or 5
In [4]:
data.columns.values
Out[4]:
array(['asin', 'helpful', 'overall', 'reviewText', 'reviewTime',
       'reviewerID', 'reviewerName', 'summary', 'unixReviewTime'], dtype=object)
In [5]:
# Make a new column named 'pos_neg', which has value 'neg' if the overall rating is 1, 2, or 3,
# and value 'pos' if the overall rating is 4 or 5.
data.loc[data.overall.isin([1,2,3]), 'pos_neg'] = 'neg'
data.loc[data.overall.isin([4,5]), 'pos_neg'] = 'pos'

1.4 Select the required columns

  • Choose only the columns 'pos_neg' and 'reviewText'
In [6]:
# Choose only the columns 'pos_neg' and 'reviewText'
df = data[['pos_neg', 'reviewText']]
In [7]:
df.head()
Out[7]:
pos_neg reviewText
0 pos I enjoy vintage books and movies so I enjoyed ...
1 pos This book is a reissue of an old one; the auth...
2 pos This was a fairly interesting read. It had ol...
3 pos I'd never read any of the Amy Brewster mysteri...
4 pos If you like period pieces - clothing, lingo, y...

2. Data Preparation

2.1 Under-sampling

Since the dataset is imbalanced (more than 84% of the reviews are positive), we undersample the positive reviews (the majority class) so that their number exactly matches the number of negative reviews.

In [8]:
# Check whether there is any missing data
df.isnull().sum()  
Out[8]:
pos_neg       0
reviewText    0
dtype: int64
In [9]:
data.shape
Out[9]:
(982619, 10)
In [14]:
# Check the distribution of the positive and negative reviews
df.pos_neg.value_counts()
Out[14]:
pos    829277
neg    153342
Name: pos_neg, dtype: int64
In [11]:
# Sample positive reviews to get a balanced dataset
neg = df.loc[df.pos_neg=='neg']
pos = df.loc[df.pos_neg=='pos'].sample(n=df.pos_neg.value_counts()['neg'], random_state=42)
In [13]:
print(type(pos))
print("pos:", len(pos), ", neg:", len(neg))
<class 'pandas.core.frame.DataFrame'>
pos: 153342 , neg: 153342

2.2 Data preprocessing

The data preprocessing uses the following steps:

  • Use HTMLParser to un-escape the text
  • Change "can't" to "can not", and change "n't" to "not" (This is useful for the negation handling process)
  • Pad punctuations with blanks
    • Note: if negation handling is not performed, punctuation is simply removed
  • Word normalization: lowercase every word
  • Word tokenization
  • Perform negation handling
    • A major challenge in sentiment analysis is handling negations.
    • The algorithm used here comes from Narayanan, Arora, and Bhatia's paper "Fast and accurate sentiment classification using an enhanced Naive Bayes model" (link)
    • The algorithm:
      • Use a state variable to store the negation state
      • Transform a word preceded by a "not" or "no" into "not_" + word
      • Whenever the negation state variable is set, the words read are treated as “not_” + word
      • The state variable is reset when a punctuation mark is encountered or when there is double negation
  • Use bigram and/or trigram models
    • Information about sentiment is often conveyed by adjectives or, more specifically, by certain combinations of adjectives.
    • This information can be captured by adding features like consecutive pairs of words (bigrams) or even triplets of words (trigrams).
  • Remove stopwords (optional)
  • Word lemmatization

2.2.1 Define the preprocessing function

In [15]:
lemmatizer = nltk.WordNetLemmatizer()
stopwords = sw.words('english')
stopwords = stopwords + ['not_' + w for w in stopwords]

# transform punctuation to blanks
trans_punct = str.maketrans(string.punctuation,' '*len(string.punctuation)) 

# pad punctuation with blanks
pad_punct = str.maketrans({key: " {0} ".format(key) for key in string.punctuation}) 
# remove "_" from string.punctuation
invalidChars = str(string.punctuation.replace("_", ""))  
In [18]:
def preprocessing(line, ngram=1, neg_handling=True, remove_stop=False):
    """
    Preprocessing the review texts
    @params:
        line                       - Required: the input text (Str)
        ngram                  - Optional: number n in the n-gram model(Int, 1, 2, or 3)
        neg_handling       - Optional: whether to perform negation handling (Boolean)
        remove_stop        -Optional: whether to remove the stop words (Boolean)
    """
        
    line = html.unescape(str(line))
    line = str(line).replace("can't", "can not")
    line = str(line).replace("n't", " not")
    
    if neg_handling:
        line = str(line).translate(pad_punct)  # If performing negation handling, pad punctuations with blanks
        line = nltk.word_tokenize(line.lower()) # Word normalization and tokenization
        tokens = []
        negated = False
        for t in line:
            if t in ['not', 'no']:
                negated = not negated
            elif t in string.punctuation or not t.isalpha():
                negated = False
            else:
                tokens.append('not_' + t if negated else t)  # prefix words that follow "not" or "no" with "not_"
    else:
        line = str(line).translate(trans_punct)  # If not performing negation handling, remove punctuations
        line = nltk.word_tokenize(line.lower()) # Word normalization and tokenization
        tokens = line
    
    if ngram >= 2:  # add bigram features
        bi_tokens = list(map('_'.join, nltk.bigrams(line)))
        bi_tokens = [i for i in bi_tokens if all(j not in invalidChars for j in i)]
        tokens = tokens + bi_tokens

    if ngram == 3:  # additionally add trigram features
        tri_tokens = list(map('_'.join, nltk.trigrams(line)))
        tri_tokens = [i for i in tri_tokens if all(j not in invalidChars for j in i)]
        tokens = tokens + tri_tokens
     
    if remove_stop:
        line = [lemmatizer.lemmatize(t) for t in tokens if t not in stopwords]
    else:
        line = [lemmatizer.lemmatize(t) for t in tokens] 
    
    return ' '.join(line)

2.2.2 An example

In [30]:
line = "I don't think this book has any decent information!!! It is full of typos and factual errors that I can't ignore."
In [32]:
preprocessing(line, ngram=1, neg_handling=False, remove_stop=False)
Out[32]:
'i do not think this book ha any decent information it is full of typo and factual error that i can not ignore'
In [33]:
preprocessing(line, ngram=1, neg_handling=False, remove_stop=True)
Out[33]:
'think book decent information full typo factual error ignore'
In [34]:
preprocessing(line, ngram=1, neg_handling=True, remove_stop=False)
Out[34]:
'i do not_think not_this not_book not_has not_any not_decent not_information it is full of typo and factual error that i can not_ignore'
In [35]:
preprocessing(line, ngram=3, neg_handling=True, remove_stop=False)
Out[35]:
'i do not_think not_this not_book not_has not_any not_decent not_information it is full of typo and factual error that i can not_ignore i_do do_not not_think think_this this_book book_has has_any any_decent decent_information it_is is_full full_of of_typos typos_and and_factual factual_errors errors_that that_i i_can can_not not_ignore i_do_not do_not_think not_think_this think_this_book this_book_has book_has_any has_any_decent any_decent_information it_is_full is_full_of full_of_typos of_typos_and typos_and_factual and_factual_errors factual_errors_that errors_that_i that_i_can i_can_not can_not_ignore'
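
Note that the WordNet lemmatizer treats every token as a noun by default, which is why "has" becomes "ha" and "typos" becomes "typo" in the outputs above (and why "wa" will show up among the most frequent tokens in section 3.1).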

2.2.3 Preprocessing the data

  • Perform data preprocessing with negation handling, tri-gram modeling, without removing the stop words
In [17]:
# Preprocessing the positive reviews
pos_data = []
n_pos = len(pos)
for i, p in enumerate(pos['reviewText']):
    pos_data.append(preprocessing(p, ngram=3))
    my_utils.print_progress(bar_length=50, decimals=0, iteration=i+1, total=n_pos, prefix='Preprocessing pos data: ')
Preprocessing pos data:  |==================================================| 100% 
In [18]:
# Preprocessing the negative reviews
neg_data = []
n_neg = len(neg)
for i, n in enumerate(neg['reviewText']):
    neg_data.append(preprocessing(n, ngram=3))
    my_utils.print_progress(bar_length=50, decimals=0, iteration=i+1, total=n_neg, prefix='Preprocessing neg data: ')
Preprocessing neg data:  |==================================================| 100% 
In [21]:
# Combine the preprocessed data
data = pos_data + neg_data
labels = np.concatenate((pos['pos_neg'].values, neg['pos_neg'].values))

2.3 Split dataset to training, validation, and test sets

We randomly split the whole dataset into training, validation, and test sets in the proportion 60%/20%/20%.

In [25]:
# split the dataset to training, validation, test sets by 60-20-20
train_data, rest_data, train_labels, rest_labels = train_test_split(data, labels, test_size=0.4, 
                                                                    stratify=labels, random_state=1234)
valid_data, test_data, valid_labels, test_labels = train_test_split(rest_data, rest_labels, test_size=0.5, 
                                                                    stratify=rest_labels, random_state=1234)
print("training size = ", len(train_data), "validation size = ", len(valid_data), "testing size = ", len(test_data))
training size =  184010 validation size =  61337 testing size =  61337

3. Feature Extraction

We use vectorization to turn the collection of text documents into numerical feature vectors. To extract numerical features from the text content, we use the Bag of Words strategy:

  • tokenizing strings and giving an integer id to each possible token, using white-spaces as token separators
  • counting the occurrences of tokens in each document
  • normalizing and weighting with tf-idf, which diminishes the importance of tokens that occur in the majority of documents

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus, while completely ignoring the relative position information of the words in the document.
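
As a toy illustration (not part of the pipeline below), the one-step TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer; the two documents here are made up for the example:

In [ ]:
# Toy check: TfidfVectorizer == CountVectorizer + TfidfTransformer
docs = ['good book good story', 'bad book not_good plot']

counts = CountVectorizer().fit_transform(docs)        # token occurrence counts
tfidf_two_step = TfidfTransformer().fit_transform(counts)

tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print((tfidf_two_step != tfidf_one_step).nnz == 0)    # True: identical matrices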

3.1 Compute the frequency of words

In [26]:
# Pool all tokens and compute the word frequencies
tokens = [word for line in train_data for word in nltk.word_tokenize(line)]
word_features = nltk.FreqDist(tokens)
In [27]:
print(word_features)
<FreqDist with 8128698 samples and 56232053 outcomes>
In [28]:
# Print the 10 most common words
word_features.most_common(10)
Out[28]:
[('the', 904966),
 ('a', 629222),
 ('and', 605310),
 ('i', 583304),
 ('to', 503199),
 ('is', 455444),
 ('of', 369937),
 ('it', 332547),
 ('this', 290662),
 ('wa', 271098)]
In [134]:
# Remove features (words) which occur only once (This is to be used in the basic modeling process)
topwords = [fpair[0] for fpair in list(word_features.most_common(len(word_features))) if fpair[1]>=2] 
len(topwords) 
Out[134]:
2029001

3.2 Vectorizer and Tf–idf term weighting

In [135]:
# Convert a collection of raw documents to a matrix of TF-IDF features.
# Equivalent to CountVectorizer followed by TfidfTransformer.
tf_vec = TfidfVectorizer()

# Fitting on a single document made of all the topwords fixes the vocabulary;
# a one-document corpus carries no document-frequency information, so every
# term receives the same idf weight here.
tf_vec.fit_transform([' '.join(topwords)])
Out[135]:
<1x2027753 sparse matrix of type '<class 'numpy.float64'>'
	with 2027753 stored elements in Compressed Sparse Row format>
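
As noted in the comments above, fitting on a single joined document only serves to fix the vocabulary, since every term gets the same idf weight. (Re-tokenizing the joined topwords also drops some entries, such as single-character tokens, which is why 2,029,001 topwords became 2,027,753 features.) A sketch of a more explicit way to achieve the same effect, assuming uniform weights are intended:

In [ ]:
# Fix the vocabulary directly and disable idf; transform() then yields
# l2-normalized term counts, analogous to the one-document fit above.
alt_vec = TfidfVectorizer(vocabulary=topwords, use_idf=False)
alt_features = alt_vec.fit_transform(train_data)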

3.3 Feature Extraction

In [136]:
# Extract features from training set
# Vocabulary is from topwords
train_features = tf_vec.transform(train_data)
In [137]:
# Array[n_train_data * n_features]
train_features.shape
Out[137]:
(184010, 2027753)
In [138]:
# Extract features from test set
test_features = tf_vec.transform(test_data)
In [139]:
test_features.shape
Out[139]:
(61337, 2027753)

4. Basic Modeling

Next we demonstrate the effectiveness of negation handling and n-gram modeling techniques, and compare three machine learning algorithms, namely, the multinomial Naive Bayes classification model (MultinomialNB), the Logistic regression model (LogisticRegression), and the linear support vector classification model (LinearSVC).

As a basic feature selection procedure, we remove features/tokens which occur only once to avoid over-fitting. We also use the default penalty parameter in each machine learning algorithm.

The following table illustrates the model accuracy on the testing dataset by using different preprocessing procedures and different machine learning algorithms:

Preprocessing procedure       Number of features/tokens   MultinomialNB   LogisticRegression   LinearSVC
Basic preprocessing^                             56,558          0.8329               0.8453      0.8485
Adding negation handling                         71,853          0.8262               0.8519      0.8562
Adding bigrams and trigrams                   2,027,753          0.8584               0.8675      0.8731

^ Basic preprocessing uses uni-gram modeling without negation handling.

The table above clearly shows that adding negation handling and n-gram modeling significantly increases the model accuracy. It also indicates that the LinearSVC model provides the best prediction accuracy.

We omit the code for model accuracy evaluation with basic preprocessing. In the rest of this chapter, the modeling is based on the following data preprocessing and feature selection:

  • Data preprocessing:
    • Negation handling
    • Tri-gram modeling
    • Without removing stop words
  • Feature selection:
    • Removing the features which occur only once
    • 2,029,001 topwords are kept (out of 8,128,698 distinct tokens), yielding 2,027,753 features after vectorization

4.1 Multinomial Naive Bayes classification model (MultinomialNB)

In [140]:
mnb_model = MultinomialNB()
mnb_model
Out[140]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [141]:
# Train Model
mnb_model.fit(train_features, train_labels)
Out[141]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [142]:
# Predict
pred = mnb_model.predict(test_features)
print(pred)
['neg' 'pos' 'pos' ... 'neg' 'pos' 'neg']
In [143]:
# Metrics
accuracy = metrics.accuracy_score(test_labels, pred)
print(accuracy)
0.8583889006635473
In [144]:
print(metrics.classification_report(y_true=test_labels, y_pred=pred, digits=4))
             precision    recall  f1-score   support

        neg     0.8560    0.8618    0.8589     30668
        pos     0.8608    0.8550    0.8579     30669

avg / total     0.8584    0.8584    0.8584     61337

4.2 Logistic regression model (LogisticRegression)

In [145]:
lgr_model = LogisticRegression()
print(lgr_model, end='\n'*2)


lgr_model.fit(train_features, train_labels)
lgr_pred = lgr_model.predict(test_features)

print('Accuracy = %.5f' % metrics.accuracy_score(test_labels, lgr_pred))
print(metrics.classification_report(y_pred=lgr_pred, y_true=test_labels, digits=4))
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Accuracy = 0.86749
             precision    recall  f1-score   support

        neg     0.8711    0.8626    0.8668     30668
        pos     0.8639    0.8724    0.8681     30669

avg / total     0.8675    0.8675    0.8675     61337

4.3 Linear support vector classification model (LinearSVC)

In [146]:
svc_model = LinearSVC()
print(svc_model, end='\n'*2)

svc_model.fit(train_features, train_labels)
svc_pred = svc_model.predict(test_features)

print('Accuracy = %.5f' % metrics.accuracy_score(test_labels, svc_pred))
print(metrics.classification_report(y_pred=svc_pred, y_true=test_labels, digits=4))
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Accuracy = 0.87314
             precision    recall  f1-score   support

        neg     0.8798    0.8644    0.8720     30668
        pos     0.8667    0.8819    0.8742     30669

avg / total     0.8733    0.8731    0.8731     61337

5. Fine-tuning the Number of Features

The models reported in the table above were trained with a rather coarse feature selection procedure: we simply removed the features/tokens which occur only once.

To improve the prediction power, we can fine-tune the number of features for each algorithm using the validation dataset. Here we perform all of the preprocessing procedures, including negation handling and bigram/trigram modeling.

5.1 Define the training function

In [105]:
def train_with_n_topwords(n, model_name='MultinomialNB', tfidf=True, valid=True, alpha=1.0):
    """
    Training the dataset using selected model and settings
    @params:
        n                       - Required: the number of features used to train the model (Int)
        model_name    - Optional: the model name('MultinomialNB' | 'LogisticRegression' | 'LinearSVC')
        tfidf                  - Optional: whether to perform the tfidf transformation (Boolean)
        valid                 - Optional: whether to use the validation set or the test set (Boolean)
        alpha                - Optional: the penalty parameter in the training model (Float)
    """
            
    if model_name not in ['MultinomialNB', 'LinearSVC', 'LogisticRegression']:
        print("Wrong model name.")
        return
    
    topwords = [fpair[0] for fpair in list(word_features.most_common(n))]
    
    if tfidf:
        vec = TfidfVectorizer()
    else:
        vec = CountVectorizer()
        
    vec.fit_transform([' '.join(topwords)])
    
    # Model
    if model_name == 'MultinomialNB':
        model = MultinomialNB(alpha=alpha)
    elif model_name == 'LinearSVC':
        model = LinearSVC(C=alpha)
    elif model_name == 'LogisticRegression':
        model = LogisticRegression(C=alpha)

    train_X = vec.transform(train_data)
    model.fit(train_X, train_labels)   
    
    if valid: 
        valid_X = vec.transform(valid_data)
        pred = model.predict(valid_X)
        metr = metrics.accuracy_score(valid_labels, pred)
    else:
        test_X = vec.transform(test_data)
        pred = model.predict(test_X)
        metr = metrics.accuracy_score(test_labels, pred)
    
    print("N of topwords:", n, "alpha:", alpha, "accuracy:", metr)
    return metr, vec, model

5.2 Fine-tuning MultinomialNB

In [83]:
#MultinomialNB
possible_n = [100000 * i for i in range(1, 21)]

mnb_tfidf_accuracies = []

for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name='MultinomialNB')[0]
    mnb_tfidf_accuracies.append([n, metr])
N of topwords: 100000 alpha: 1.0 accuracy: 0.8542315405057307
N of topwords: 200000 alpha: 1.0 accuracy: 0.8564487992565661
N of topwords: 300000 alpha: 1.0 accuracy: 0.8567585633467565
N of topwords: 400000 alpha: 1.0 accuracy: 0.8571335409296184
N of topwords: 500000 alpha: 1.0 accuracy: 0.8572639679149616
N of topwords: 600000 alpha: 1.0 accuracy: 0.8576552488709914
N of topwords: 700000 alpha: 1.0 accuracy: 0.8577856758563347
N of topwords: 800000 alpha: 1.0 accuracy: 0.85789979946851
N of topwords: 900000 alpha: 1.0 accuracy: 0.857916102841678
N of topwords: 1000000 alpha: 1.0 accuracy: 0.8579650129611817
N of topwords: 1100000 alpha: 1.0 accuracy: 0.8579650129611817
N of topwords: 1200000 alpha: 1.0 accuracy: 0.8578182826026705
N of topwords: 1300000 alpha: 1.0 accuracy: 0.8578834960953421
N of topwords: 1400000 alpha: 1.0 accuracy: 0.8578182826026705
N of topwords: 1500000 alpha: 1.0 accuracy: 0.85789979946851
N of topwords: 1600000 alpha: 1.0 accuracy: 0.857916102841678
N of topwords: 1700000 alpha: 1.0 accuracy: 0.8577856758563347
N of topwords: 1800000 alpha: 1.0 accuracy: 0.8577041589904951
N of topwords: 1900000 alpha: 1.0 accuracy: 0.8578019792295026
N of topwords: 2000000 alpha: 1.0 accuracy: 0.8577693724831668
In [84]:
mnb_accu = pd.DataFrame(mnb_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(mnb_accu.topwords, mnb_accu.accuracy, label='MultinomialNB')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Multinomial Naive Bayes Model")
fig.savefig("model_mnb_large_100k.png")
In [85]:
mnb_accu[mnb_accu.accuracy==max(mnb_accu.accuracy)]
Out[85]:
topwords accuracy
9 1000000 0.857965
10 1100000 0.857965
In [87]:
best_n_topwords_mnb = mnb_accu[mnb_accu.accuracy==max(mnb_accu.accuracy)].iloc[0,0]
best_n_topwords_mnb
Out[87]:
1000000

5.3 Fine-tuning LogisticRegression

In [88]:
# Logistic
possible_n = [100000 * i for i in range(1, 21)]

lgr_tfidf_accuracies = []

for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name="LogisticRegression")[0]
    lgr_tfidf_accuracies.append([n, metr])
N of topwords: 100000 alpha: 1.0 accuracy: 0.8695730146567324
N of topwords: 200000 alpha: 1.0 accuracy: 0.8695567112835646
N of topwords: 300000 alpha: 1.0 accuracy: 0.8695567112835646
N of topwords: 400000 alpha: 1.0 accuracy: 0.8696219247762362
N of topwords: 500000 alpha: 1.0 accuracy: 0.8697034416420757
N of topwords: 600000 alpha: 1.0 accuracy: 0.8695404079103967
N of topwords: 700000 alpha: 1.0 accuracy: 0.869475194417725
N of topwords: 800000 alpha: 1.0 accuracy: 0.869475194417725
N of topwords: 900000 alpha: 1.0 accuracy: 0.8694262842982213
N of topwords: 1000000 alpha: 1.0 accuracy: 0.8694262842982213
N of topwords: 1100000 alpha: 1.0 accuracy: 0.8692958573128781
N of topwords: 1200000 alpha: 1.0 accuracy: 0.8692795539397101
N of topwords: 1300000 alpha: 1.0 accuracy: 0.8692795539397101
N of topwords: 1400000 alpha: 1.0 accuracy: 0.8691817337007027
N of topwords: 1500000 alpha: 1.0 accuracy: 0.8691654303275348
N of topwords: 1600000 alpha: 1.0 accuracy: 0.8691002168348632
N of topwords: 1700000 alpha: 1.0 accuracy: 0.8691165202080311
N of topwords: 1800000 alpha: 1.0 accuracy: 0.8690676100885273
N of topwords: 1900000 alpha: 1.0 accuracy: 0.8689697898495199
N of topwords: 2000000 alpha: 1.0 accuracy: 0.8688719696105124
In [89]:
lgr_accu = pd.DataFrame(lgr_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(lgr_accu.topwords, lgr_accu.accuracy, label='LogisticRegression')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Logistic Regression Model")
fig.savefig("model_lgr_large_100k.png")
In [90]:
lgr_accu[lgr_accu.accuracy==max(lgr_accu.accuracy)]
Out[90]:
topwords accuracy
4 500000 0.869703
In [91]:
best_n_topwords_lgr = lgr_accu[lgr_accu.accuracy==max(lgr_accu.accuracy)].iloc[0,0]
best_n_topwords_lgr
Out[91]:
500000

5.4 Fine-tuning LinearSVC

In [92]:
# Linear SVC
possible_n = [100000 * i for i in range(1, 21)]

svc_tfidf_accuracies = []

for i, n in enumerate(possible_n):
    metr = train_with_n_topwords(n, model_name="LinearSVC")[0]
    svc_tfidf_accuracies.append([n, metr])
N of topwords: 100000 alpha: 1.0 accuracy: 0.8726217454391314
N of topwords: 200000 alpha: 1.0 accuracy: 0.8736814646950454
N of topwords: 300000 alpha: 1.0 accuracy: 0.8736325545755417
N of topwords: 400000 alpha: 1.0 accuracy: 0.8733227904853514
N of topwords: 500000 alpha: 1.0 accuracy: 0.8736651613218775
N of topwords: 600000 alpha: 1.0 accuracy: 0.873583644456038
N of topwords: 700000 alpha: 1.0 accuracy: 0.8738771051730603
N of topwords: 800000 alpha: 1.0 accuracy: 0.8740075321584035
N of topwords: 900000 alpha: 1.0 accuracy: 0.874105352397411
N of topwords: 1000000 alpha: 1.0 accuracy: 0.8741542625169147
N of topwords: 1100000 alpha: 1.0 accuracy: 0.8743172962485939
N of topwords: 1200000 alpha: 1.0 accuracy: 0.8743172962485939
N of topwords: 1300000 alpha: 1.0 accuracy: 0.8744803299802729
N of topwords: 1400000 alpha: 1.0 accuracy: 0.8745944535924483
N of topwords: 1500000 alpha: 1.0 accuracy: 0.8744803299802729
N of topwords: 1600000 alpha: 1.0 accuracy: 0.8744151164876013
N of topwords: 1700000 alpha: 1.0 accuracy: 0.8746107569656162
N of topwords: 1800000 alpha: 1.0 accuracy: 0.8744803299802729
N of topwords: 1900000 alpha: 1.0 accuracy: 0.8745129367266087
N of topwords: 2000000 alpha: 1.0 accuracy: 0.8744966333534409
In [93]:
svc_accu = pd.DataFrame(svc_tfidf_accuracies, columns=['topwords', 'accuracy'])
fig = plt.figure(figsize=(10,6))
plt.plot(svc_accu.topwords, svc_accu.accuracy, label='LinearSVC')
#plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("Linear SVC Model")
fig.savefig("model_svc_large_100k.png")
In [94]:
svc_accu[svc_accu.accuracy==max(svc_accu.accuracy)]
Out[94]:
topwords accuracy
16 1700000 0.874611
In [95]:
best_n_topwords_svc = svc_accu[svc_accu.accuracy==max(svc_accu.accuracy)].iloc[0,0]
best_n_topwords_svc
Out[95]:
1700000

5.5 Summary of the three models

In [96]:
fig = plt.figure(figsize=(10,6))
plt.plot(mnb_accu.topwords, mnb_accu.accuracy, label='MultinomialNB')
plt.plot(lgr_accu.topwords, lgr_accu.accuracy, label='LogisticRegression')
plt.plot(svc_accu.topwords, svc_accu.accuracy, label='LinearSVC')
plt.legend()
plt.xlabel("Number of features")
plt.ylabel("Accuracy")
plt.title("MultinomialNB vs. LinearSVC vs. LogisticRegression models")
fig.savefig("model_all3_large_100k.png")

The plot of model accuracy vs. number of features above clearly shows that LinearSVC consistently has the highest accuracy, with LogisticRegression second and MultinomialNB last. The following table summarizes the best number of features and the corresponding model accuracy on the validation set.

                             MultinomialNB   LogisticRegression   LinearSVC
Best number of features          1,000,000              500,000   1,700,000
Accuracy on validation set          0.8580               0.8697      0.8746

6. Grid Search for Parameter Tuning

To get a better accuracy, we can perform a grid search over the number of features and the penalty parameter at the same time.

As a demonstration, we only run the search for the LinearSVC model; the grid-search code for MultinomialNB and LogisticRegression is listed below but left unexecuted.
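
As an aside, scikit-learn's Pipeline with GridSearchCV could automate a similar joint search. A hedged sketch (note the differences: it scores by cross-validation on the training set rather than on our fixed validation set, and max_features approximates the frequency-based topwords selection):

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
param_grid = {'tfidf__max_features': [500000, 1000000, 1700000],
              'svc__C': [0.2, 0.4, 0.6, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
# search.fit(train_data, train_labels)   # expensive on ~184k documents
# print(search.best_params_, search.best_score_)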

6.1 Grid Search on MultinomialNB

In [ ]:
# MultinomialNB
mnb_grid_tfidf_accuracies = []

for n in [100000 * i for i in range(1, 21)]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="MultinomialNB", alpha=alpha)[0]  
        mnb_grid_tfidf_accuracies.append([n, alpha, metr])
In [ ]:
mnb_grid_accu = pd.DataFrame(mnb_grid_tfidf_accuracies, columns=['topwords', 'alpha', 'accuracy'])
mnb_grid_accu[mnb_grid_accu.accuracy==max(mnb_grid_accu.accuracy)]

6.2 Grid Search on LogisticRegression

In [ ]:
# Logistic
lgr_grid_tfidf_accuracies = []

for n in [100000 * i for i in range(1, 21)]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="LogisticRegression", alpha=alpha)[0]  
        lgr_grid_tfidf_accuracies.append([n, alpha, metr])
In [ ]:
lgr_grid_accu = pd.DataFrame(lgr_grid_tfidf_accuracies, columns=['topwords', 'C', 'accuracy'])
lgr_grid_accu[lgr_grid_accu.accuracy==max(lgr_grid_accu.accuracy)]

6.3 Grid Search on LinearSVC

As a simple demonstration, and in order to save training time, we select only the best number of features (found above with penalty parameter C=1.0) and tune the penalty parameter C.

  • Note: To get a better accuracy, we should tune the number of features and the penalty parameter at the same time.
In [150]:
# Linear SVC
svc_grid_tfidf_accuracies = []

for n in [best_n_topwords_svc]:
    for alpha in [0.1 * j for j in range(1, 11)]:
        metr = train_with_n_topwords(n, model_name="LinearSVC", alpha=alpha)[0]  
        svc_grid_tfidf_accuracies.append([n, alpha, metr])
N of topwords: 1700000 alpha: 0.1 accuracy: 0.8704207900614637
N of topwords: 1700000 alpha: 0.2 accuracy: 0.8755074424898511
N of topwords: 1700000 alpha: 0.30000000000000004 accuracy: 0.876241094282407
N of topwords: 1700000 alpha: 0.4 accuracy: 0.8763552178945824
N of topwords: 1700000 alpha: 0.5 accuracy: 0.8759639369385526
N of topwords: 1700000 alpha: 0.6000000000000001 accuracy: 0.8754911391166832
N of topwords: 1700000 alpha: 0.7000000000000001 accuracy: 0.8753444087581721
N of topwords: 1700000 alpha: 0.8 accuracy: 0.8750998581606534
N of topwords: 1700000 alpha: 0.9 accuracy: 0.8748390041899669
N of topwords: 1700000 alpha: 1.0 accuracy: 0.8746107569656162
In [151]:
svc_grid_accu = pd.DataFrame(svc_grid_tfidf_accuracies, columns=['topwords', 'C', 'accuracy'])
svc_grid_accu[svc_grid_accu.accuracy==max(svc_grid_accu.accuracy)]
Out[151]:
topwords C accuracy
3 1700000 0.4 0.876355
In [156]:
best_alpha_svc = svc_grid_accu[svc_grid_accu.accuracy==max(svc_grid_accu.accuracy)].iloc[0,1]
best_alpha_svc
Out[156]:
0.4

7. Model Accuracy on the Test Set

Summary: model accuracy on the test set

                          MultinomialNB   LogisticRegression   LinearSVC   LinearSVC (C=0.4)
Best number of features       1,000,000              500,000   1,700,000           1,700,000
Accuracy on test set             0.8585               0.8682      0.8730              0.8742

7.1 MultinomialNB model accuracy

In [109]:
# MultinomialNB model accuracy based on the default smoothing parameter alpha=1.0
mnb_accuracy, mnb_vec, mnb_model = train_with_n_topwords(n=best_n_topwords_mnb, model_name="MultinomialNB", valid=False)
N of topwords: 1000000 alpha: 1.0 accuracy: 0.8584867209025547

7.2 LogisticRegression model accuracy

In [108]:
# LogisticRegression model accuracy based on the default model penalty parameter C=1.0
lgr_accuracy, lgr_vec, lgr_model = train_with_n_topwords(n=best_n_topwords_lgr, model_name="LogisticRegression", valid=False)
N of topwords: 500000 alpha: 1.0 accuracy: 0.8682035313106282

7.3 LinearSVC model accuracy

In [107]:
# LinearSVC model accuracy based on the default model penalty parameter C=1.0
svc_accuracy, svc_vec, svc_model = train_with_n_topwords(n=best_n_topwords_svc, model_name="LinearSVC", valid=False)
N of topwords: 1700000 alpha: 1.0 accuracy: 0.8729804196488253
In [157]:
# LinearSVC model accuracy based on the tuned model penalty parameter C=0.4
svc_accuracy, svc_vec, svc_model = train_with_n_topwords(n=best_n_topwords_svc, model_name="LinearSVC", valid=False, 
                                                         alpha=best_alpha_svc)
N of topwords: 1700000 alpha: 0.4 accuracy: 0.8742357793827543

8. Prediction on New Entries

In [113]:
def predict_new(text, model_name='LinearSVC'):
    """
    Predict the sentiment of 'text'.
    @params:
        text       - Required: the text to predict (Str)
        model_name - Optional: the model name ('LinearSVC' | 'MultinomialNB' | 'LogisticRegression')
    """

    # preprocess the raw text with the same settings used in training
    sentence = preprocessing(text, ngram=3)
    if model_name == 'LinearSVC':
        features = svc_vec.transform([sentence])
        pred = svc_model.predict(features)
    elif model_name == 'MultinomialNB':
        features = mnb_vec.transform([sentence])
        pred = mnb_model.predict(features)
    elif model_name == 'LogisticRegression':
        features = lgr_vec.transform([sentence])
        pred = lgr_model.predict(features)

    return pred[0]
In [117]:
# An example of new review text
review_text = 'This book seems to have some decent information, but it is full of typos and factual errors.'
In [119]:
predict_new(review_text, model_name='MultinomialNB')
Out[119]:
'neg'
In [120]:
predict_new(review_text, model_name='LogisticRegression')
Out[120]:
'neg'
In [118]:
predict_new(review_text, model_name='LinearSVC')
Out[118]:
'neg'

9. Model Output

Use module "pickle" to save the model for furture use.

In [121]:
# Save the vectorizer
with open('tf_vec.pkl', 'wb') as pkl_file:
    pickle.dump(svc_vec, pkl_file)
In [122]:
# Save the model
with open('svc_model.pkl', 'wb') as pkl_file:
    pickle.dump(svc_model, pkl_file)
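
A quick sketch of how the saved files might be loaded back and used later, for example by the web application in the next chapter:

In [ ]:
# Reload the vectorizer and the model, then score a new review
with open('tf_vec.pkl', 'rb') as pkl_file:
    loaded_vec = pickle.load(pkl_file)
with open('svc_model.pkl', 'rb') as pkl_file:
    loaded_model = pickle.load(pkl_file)

loaded_model.predict(loaded_vec.transform([preprocessing(review_text, ngram=3)]))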

10. Web Application

To demonstrate this project, we wrote a Flask (a lightweight WSGI web application framework [link]) web application. You can enter some review text and let the app analyze the sentiment of your entry. Have fun!

NLP Sentiment Analysis App
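
A minimal sketch of what such an app might look like (a hypothetical app.py; it assumes the pickled files from chapter 9, and that the preprocessing() function from section 2.2.1 is defined or imported in the app module):

# app.py -- hypothetical minimal version of the sentiment web app
from flask import Flask, request, render_template_string
import pickle
# assumes preprocessing() from section 2.2.1 is available here

app = Flask(__name__)

with open('tf_vec.pkl', 'rb') as f:
    vec = pickle.load(f)
with open('svc_model.pkl', 'rb') as f:
    model = pickle.load(f)

PAGE = """
<form method="post">
  <textarea name="review" rows="6" cols="60"></textarea><br>
  <input type="submit" value="Analyze sentiment">
</form>
<p>{{ result }}</p>
"""

@app.route('/', methods=['GET', 'POST'])
def analyze():
    result = ''
    if request.method == 'POST':
        cleaned = preprocessing(request.form['review'], ngram=3)
        result = 'Sentiment: ' + model.predict(vec.transform([cleaned]))[0]
    return render_template_string(PAGE, result=result)

if __name__ == '__main__':
    app.run()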

11. Ending Discussion

Here are some closing thoughts on this project.

  • We tried removing stop words during the preprocessing stage, but it decreased the model accuracy. We only used the stop-word list from the nltk package, which may be the reason. An approach worth trying is to manually select stop words related to the topic of the dataset (a small sketch follows this list). However, since we have a very large dataset and also perform tf-idf reweighting, removing stop words may not be necessary.

  • To save tuning time, we tuned the number of features rather coarsely, in increments of 100,000 features. A finer tuning may yield better prediction power.

  • We can also tune the penalty parameter of each machine learning algorithm. So far we only tuned it for LinearSVC, and only after fixing the number of features; tuning it jointly with the number of features for each algorithm may yield better prediction power.

  • We notice that, with the current three machine learning models, parameter tuning does not provide a significant accuracy boost. More advanced models such as Long Short-Term Memory (LSTM) networks may be adopted (a minimal sketch closes this chapter).
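
Regarding the first point, a minimal sketch of hand-extending the nltk stop-word list with domain-specific words (the words chosen here are purely illustrative):

In [ ]:
# Hypothetical domain-specific stop words for Kindle book reviews
domain_words = ['book', 'kindle', 'story', 'author', 'read']
custom_stopwords = set(sw.words('english')) | set(domain_words)
custom_stopwords |= {'not_' + w for w in custom_stopwords}
# this set could replace the global `stopwords` used by preprocessing()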
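
Regarding the last point, a minimal, illustrative LSTM sketch using tf.keras (an assumption: TensorFlow 2 is installed; the layer sizes and training settings are placeholders, and the real inputs would be the raw balanced review texts with 0/1 labels from chapter 2):

In [ ]:
import numpy as np
import tensorflow as tf

# placeholder corpus; in practice use the balanced review texts and 0/1 labels
texts = np.array(['i loved this book', 'full of typos and factual errors'])
y = np.array([1, 0])

vectorize = tf.keras.layers.TextVectorization(max_tokens=20000,
                                              output_sequence_length=200)
vectorize.adapt(texts)

model = tf.keras.Sequential([
    vectorize,                               # raw strings -> integer sequences
    tf.keras.layers.Embedding(20000, 128),   # token ids -> dense vectors
    tf.keras.layers.LSTM(64),                # sequence model
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(texts, y, epochs=2)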