In this project, we performed sentiment analysis on the tweets provided in the dataset. Three classifiers, namely Logistic Regression, Bernoulli Naive Bayes, and Support Vector Machine, were used together with Term Frequency - Inverse Document Frequency (TF-IDF) features. The performance of the algorithms was compared using the area under the ROC curve (AUC).
The complete code is accessible here
In this project, we analyzed the data collected from this Kaggle data set. The data set contains 1.6 million tweets, extracted using the Twitter API. The following are the attributes of the data set:
target - Represents the sentiment of the tweet (Positive or Negative)
ids - Represents the tweet id
date - Represents the date of the tweet
flag - Represents the query. If there is no query, then this value is NO_QUERY.
user - User name of the person who tweeted
text - Tweet text
In this section, all the required libraries are imported. Please make sure that all of these libraries are installed beforehand. The libraries used include regex, numpy, pandas, seaborn, wordcloud, matplotlib, scikit-learn, and nltk.
# Library to work with regular expressions, required while cleaning the data
import re
# Used to work with numpy arrays during data exploration, e.g. for counts
import numpy as np
# Used to store the CSV data in data frames and perform operations such as groupby
import pandas as pd
# The libraries below are used for plotting various graphs during data exploration
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# The library below is used for stemming and lemmatization
from nltk.stem import WordNetLemmatizer
# Scikit Learn is the library used to apply various algorithms as shown below
# Import for the Support Vector Machine (SVM) classifier
from sklearn.svm import LinearSVC
# Import to implement the Bernoulli Naive Bayes algorithm for classification
from sklearn.naive_bayes import BernoulliNB
# Logistic Regression for classification
from sklearn.linear_model import LogisticRegression
# Import to split the complete dataset into train and test data
from sklearn.model_selection import train_test_split
# Import for working with Term Frequency - Inverse Document Frequency (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
# Used to assess the models via accuracy, F1-score, etc., to build the confusion matrix, and to draw the AUC-ROC curve
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
We followed a step-by-step process while analysing the Twitter data, sketched in code after this list:
Importing the dataset into pandas data frame
Analyzing the shape of the dataset and the data types of the attributes
Checking the unique 'target' values
Data visualization of the target variables using seaborn and matplotlib
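A minimal sketch of these steps, assuming the Kaggle Sentiment140 file name, encoding, and 0 (negative) / 4 (positive) label values; adjust these for your copy of the data set.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The raw CSV ships without a header row, so the column names are passed explicitly
DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
dataset = pd.read_csv('training.1600000.processed.noemoticon.csv',
                      encoding='ISO-8859-1', names=DATASET_COLUMNS)

print(dataset.shape)               # number of rows and columns
print(dataset.dtypes)              # data type of each attribute
print(dataset['target'].unique())  # distinct sentiment labels

# Distribution of positive vs. negative tweets
sns.countplot(x='target', data=dataset)
plt.show()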
Code for cleaning the dataset can be accessed here
After data exploration, various cleaning and pre-processing steps were performed. Data pre-processing involved removing punctuation, URLs, and numbers, deleting unnecessary words, and converting the corpus to lowercase for better generalization.
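A minimal sketch of such a cleaning pass is given below. The regular-expression patterns are illustrative rather than the project's original ones, and the final tokenization step (a plain whitespace split here) is an assumption added so that the stemming function further down can iterate over word lists.
import re
import string

def clean_text(text):
    text = text.lower()                                # lowercase for generalization
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'[0-9]+', '', text)                 # remove numbers
    # remove punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

dataset['text'] = dataset['text'].apply(clean_text)
# Tokenize into word lists for the stemming and lemmatization steps below
dataset['text'] = dataset['text'].apply(str.split)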
Subsequently, we performed stemming and lemmatization for better results.
# Stemming
import nltk
st = nltk.PorterStemmer()

def stemming_on_text(data):
    # Expects a list of tokens; stems each word
    text = [st.stem(word) for word in data]
    return text

dataset['text'] = dataset['text'].apply(stemming_on_text)
dataset['text'].head()
# Lemmatization
import nltk
nltk.download('wordnet')
lm = nltk.WordNetLemmatizer()

def lemmatizer_on_text(data):
    # Expects a list of tokens; lemmatizes each word and joins the result back into a string
    text = [lm.lemmatize(word) for word in data]
    return " ".join(text)

dataset['text'] = dataset['text'].apply(lemmatizer_on_text)
dataset['text'].head()
After pre-processing, we plotted word clouds for the positive and negative tweets.
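A sketch of the word cloud step; the label values 0 (negative) and 4 (positive) follow the Sentiment140 encoding and are an assumption here.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for label, name in [(0, 'Negative'), (4, 'Positive')]:
    # Join all tweets with the given label into a single string
    words = ' '.join(dataset[dataset['target'] == label]['text'])
    wc = WordCloud(max_words=1000, width=1600, height=800).generate(words)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(name + ' tweets')
    plt.show()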
After plotting the word clouds, the dataset was split into training and testing sets. Subsequently, the text was transformed using the TF-IDF vectorizer, fitted on the training split only.
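A sketch of the split and vectorization; the 95/5 split ratio and the n-gram/feature settings are illustrative choices rather than the project's exact parameters.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    dataset['text'], dataset['target'], test_size=0.05, random_state=42)

# Fit the vectoriser on the training split only, then transform both splits
vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
X_train = vectoriser.fit_transform(X_train)
X_test = vectoriser.transform(X_test)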
In this project, we used three models, namely Logistic Regression, Bernoulli Naive Bayes, and Support Vector Machine.
After training the models, we applied evaluation measures to check their performance. Specifically, we used the accuracy score, F1-score, confusion matrix, and the ROC-AUC curve.
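The snippets below call model_Evaluate, whose definition is not shown in this excerpt. A minimal version, assuming the X_test and y_test names from the split above, could print the classification report and plot the confusion matrix:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

def model_Evaluate(model):
    y_pred = model.predict(X_test)
    # Precision, recall, F1-score, and accuracy per class
    print(classification_report(y_test, y_pred))
    # Confusion matrix rendered as a heatmap
    cf_matrix = confusion_matrix(y_test, y_pred)
    sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted values')
    plt.ylabel('Actual values')
    plt.show()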
# Bernoulli Naive Bayes Classification and model evaluation
BNBmodel = BernoulliNB()
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)
# Support Vector Machine Classification and model evaluation
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)
# Logistic Regression Classification and model evaluation
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)
Considering the accuracy, F1-score, and ROC-AUC scores, Logistic Regression performed better than the SVM and Bernoulli Naive Bayes classifiers.
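For reference, a sketch of how the ROC comparison could be drawn from the predictions above. Since y_pred1, y_pred2, and y_pred3 are hard class labels, each curve reduces to a single operating point; passing decision scores (e.g. from decision_function or predict_proba) would trace a fuller curve. The pos_label=4 value follows the Sentiment140 encoding assumed earlier.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

for name, y_pred in [('BNB', y_pred1), ('SVC', y_pred2), ('LR', y_pred3)]:
    fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=4)
    plt.plot(fpr, tpr, label=name + ' (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curves for the three classifiers')
plt.legend()
plt.show()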