In this project, we performed sentiment analysis on the tweets provided in the dataset. Three classifiers, namely Logistic Regression, Bernoulli Naive Bayes, and Support Vector Machine, were used together with Term Frequency - Inverse Document Frequency (TF-IDF) features. The performance of the algorithms was compared using the area under the ROC curve (AUC).
The complete code is accessible here
In this project, we analyzed the data collected from this Kaggle data set. The data set contains 1.6 million tweets, extracted using the Twitter API. The following are the attributes of the data set:
target - Represents the sentiment of the tweet (Positive or Negative)
ids - Represents the tweet id
date - Represents the date of the tweet
flag - Represents the query. If there is no query, then this value is NO_QUERY.
user - User name of the person who tweeted
text - Tweet text
In this section, all the required libraries are imported. Please make sure that all of these libraries are installed beforehand. The libraries used include regex, numpy, pandas, seaborn, wordcloud, matplotlib, scikit-learn, and nltk.
# Library to work with regular expressions, required while cleaning the data
import re
# Used to work with numpy arrays during data exploration, e.g. for counts
import numpy as np
# Used to store the CSV data in data frames and perform operations such as groupby
import pandas as pd
# The libraries below are used for plotting various graphs during data exploration
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# The library below is used for stemming and lemmatization
from nltk.stem import WordNetLemmatizer
# Scikit Learn is the library used to apply various algorithms as shown below
# Import for the Support Vector Machine (SVM) classifier
from sklearn.svm import LinearSVC
# Import to implement the Bernoulli Naive Bayes algorithm for classification
from sklearn.naive_bayes import BernoulliNB
# Logistic Regression for classification
from sklearn.linear_model import LogisticRegression
# Import to split the complete dataset into train and test data
from sklearn.model_selection import train_test_split
# Import for working with Term Frequency - Inverse Document Frequency (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
# Used to assess the models via accuracy, F1-score, etc., to build the confusion matrix, and to draw the AUC-ROC curve
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
We followed a step-by-step process while analysing the Twitter data, sketched in code after this list:
Importing the dataset into pandas data frame
Analyzing the shape of the dataset and the data types of the attributes
Checking the unique 'target' values
Data visualization of the target variables using seaborn and matplotlib
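A minimal sketch of these steps, assuming the Kaggle Sentiment140 file name, encoding, and 0 (negative) / 4 (positive) label values; adjust these for your copy of the data set.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The raw CSV ships without a header row, so the column names are passed explicitly
DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
dataset = pd.read_csv('training.1600000.processed.noemoticon.csv',
                      encoding='ISO-8859-1', names=DATASET_COLUMNS)

print(dataset.shape)               # number of rows and columns
print(dataset.dtypes)              # data type of each attribute
print(dataset['target'].unique())  # distinct sentiment labels

# Distribution of positive vs. negative tweets
sns.countplot(x='target', data=dataset)
plt.show()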
Code for cleaning the dataset can be accessed here
After data exploration, various cleaning and pre-processing steps were performed. Data pre-processing involved removing punctuation, URLs, and numbers, deleting unnecessary words, and converting the corpus to lowercase for better generalization.
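A minimal sketch of such a cleaning pass is given below. The regular-expression patterns are illustrative rather than the project's original ones, and the final tokenization step (a plain whitespace split here) is an assumption added so that the stemming function further down can iterate over word lists.
import re
import string

def clean_text(text):
    text = text.lower()                                # lowercase for generalization
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'[0-9]+', '', text)                 # remove numbers
    # remove punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

dataset['text'] = dataset['text'].apply(clean_text)
# Tokenize into word lists for the stemming and lemmatization steps below
dataset['text'] = dataset['text'].apply(str.split)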
Subsequently, we performed stemming and lemmatization for better results.
# Stemming
import nltk
st = nltk.PorterStemmer()

def stemming_on_text(data):
    # Expects a list of tokens; stems each word
    text = [st.stem(word) for word in data]
    return text

dataset['text'] = dataset['text'].apply(stemming_on_text)
dataset['text'].head()
# Lemmatization
import nltk
nltk.download('wordnet')
lm = nltk.WordNetLemmatizer()

def lemmatizer_on_text(data):
    # Expects a list of tokens; lemmatizes each word and joins the result back into a string
    text = [lm.lemmatize(word) for word in data]
    return " ".join(text)

dataset['text'] = dataset['text'].apply(lemmatizer_on_text)
dataset['text'].head()
After pre-processing, we plotted word clouds for the positive and negative tweets.
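A sketch of the word cloud step; the label values 0 (negative) and 4 (positive) follow the Sentiment140 encoding and are an assumption here.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for label, name in [(0, 'Negative'), (4, 'Positive')]:
    # Join all tweets with the given label into a single string
    words = ' '.join(dataset[dataset['target'] == label]['text'])
    wc = WordCloud(max_words=1000, width=1600, height=800).generate(words)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(name + ' tweets')
    plt.show()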
After plotting the word clouds, the dataset was split into training and testing sets. Subsequently, the text was transformed using the TF-IDF vectorizer, fitted on the training split only.
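A sketch of the split and vectorization; the 95/5 split ratio and the n-gram/feature settings are illustrative choices rather than the project's exact parameters.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    dataset['text'], dataset['target'], test_size=0.05, random_state=42)

# Fit the vectoriser on the training split only, then transform both splits
vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
X_train = vectoriser.fit_transform(X_train)
X_test = vectoriser.transform(X_test)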
In this project, we used three models, namely Logistic Regression, Bernoulli Naive Bayes, and Support Vector Machine.
After training the models, we applied evaluation measures to check their performance. Specifically, we used the accuracy score, F1-score, confusion matrix, and the ROC-AUC curve.
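The snippets below call model_Evaluate, whose definition is not shown in this excerpt. A minimal version, assuming the X_test and y_test names from the split above, could print the classification report and plot the confusion matrix:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

def model_Evaluate(model):
    y_pred = model.predict(X_test)
    # Precision, recall, F1-score, and accuracy per class
    print(classification_report(y_test, y_pred))
    # Confusion matrix rendered as a heatmap
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted values')
    plt.ylabel('Actual values')
    plt.show()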
# Bernoulli Naive Bayes Classification and model evaluation
BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)
# Support Vector Machine Classification and model evaluation
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)
# Logistic Regression Classification and model evaluation
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)
Considering the accuracy, F1-score, and ROC-AUC scores, Logistic Regression performed better than the SVM and Bernoulli Naive Bayes classifiers.
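For reference, a sketch of how the ROC comparison could be drawn from the predictions above. Since y_pred1, y_pred2, and y_pred3 are hard class labels, each curve reduces to a single operating point; passing decision scores (e.g. from decision_function or predict_proba) would trace a fuller curve. The pos_label=4 value follows the Sentiment140 encoding assumed earlier.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

for name, y_pred in [('BNB', y_pred1), ('SVC', y_pred2), ('LR', y_pred3)]:
    fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=4)
    plt.plot(fpr, tpr, label=name + ' (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves for the three classifiers')
plt.legend()
plt.show()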