Twitter Sentiment Analysis


The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist sentiment associated with it. So, the task is to distinguish racist tweets from other tweets.


Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist and label ‘0’ denotes the tweet is not racist, your objective is to predict the labels on the test dataset.


import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import warnings
from google.colab import files
import io

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

Upload Dataset

uploaded = files.upload()

Overview Of Dataset


The pandas ‘DataFrame.append()’ function appended the rows of another dataframe to the end of the given dataframe and returned a new dataframe object. It was removed in pandas 2.0, so ‘pd.concat()’ is used here to the same effect.

combine = pd.concat([train, test], ignore_index=True, sort=True)
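As a minimal sketch of the combining step, the snippet below joins two hypothetical toy frames with ‘pd.concat()’. The column names and values are assumptions made for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frames: the test frame has no 'label' column
train = pd.DataFrame({"id": [1, 2], "label": [0, 1], "tweet": ["a", "b"]})
test = pd.DataFrame({"id": [3], "tweet": ["c"]})

# Stack the rows, renumber the index and sort the columns by name
combine = pd.concat([train, test], ignore_index=True, sort=True)
print(combine)
```

The test rows end up with a missing label (NaN), which is what later lets the labelled training rows be separated from the rest.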

Pattern Function

def remove_pattern(text, pattern):
    # Find every substring of the text that matches the pattern
    r = re.findall(pattern, text)
    # Delete each match from the text
    for i in r:
        text = re.sub(i, "", text)
    return text
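To see what the function does, here is a small sketch that repeats its definition (so the snippet runs on its own) and applies it to a made-up tweet. The sample text is hypothetical.

```python
import re

def remove_pattern(text, pattern):
    # Find every substring that matches the pattern, then delete each one
    for match in re.findall(pattern, text):
        text = re.sub(match, "", text)
    return text

sample = "@user thanks for the #lyft credit, @driver_42 was great"
cleaned = remove_pattern(sample, r"@[\w]*")
print(cleaned)
```

For this sample a single call to re.sub(r"@[\w]*", "", sample) gives the same result.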

Removing Twitter Handles

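Here NumPy’s ‘np.vectorize()’ is used to apply the function to every tweet in the column without writing an explicit for loop. Note that it is a convenience wrapper: NumPy’s documentation describes its implementation as essentially a for loop, so it should not be expected to run much faster than one.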

combine['Tidy_Tweets'] = np.vectorize(remove_pattern)(combine['tweet'], r"@[\w]*")
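The call above can be tried on a small array. This is a sketch with hypothetical tweets, and the function definition is repeated so the snippet is self-contained.

```python
import re
import numpy as np

def remove_pattern(text, pattern):
    for match in re.findall(pattern, text):
        text = re.sub(match, "", text)
    return text

# Hypothetical tweets; the pattern string is broadcast to every element
tweets = np.array(["@user hello there everyone", "no handle", "@a and @b_c"])
tidy = np.vectorize(remove_pattern)(tweets, r"@[\w]*")
print(tidy.tolist())
```

np.vectorize calls remove_pattern once per element and collects the results into a new array.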

Removing Punctuation, Numbers And Special Characters

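Here ‘str.replace()’ is used to replace punctuation, numbers and special characters with white space. The pattern is a regular expression, so ‘regex=True’ is passed explicitly (recent pandas versions treat the pattern as a literal string by default).

combine['Tidy_Tweets'] = combine['Tidy_Tweets'].str.replace("[^a-zA-Z#]", " ", regex=True)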

Removing Stop Words

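Here a ‘lambda’ (an anonymous function whose expression is evaluated and returned) keeps only the words longer than three characters, which drops most short stop words, and ‘join()’ rebuilds the string from the remaining words.

combine['Tidy_Tweets'] = combine['Tidy_Tweets'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))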
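The two cleaning steps above can be sketched on a single string with the standard ‘re’ module. The helper name clean_tweet and the sample tweet are hypothetical; the regular expression and the length threshold are the ones used in the article.

```python
import re

def clean_tweet(tweet, min_len=4):
    # Replace everything except letters and '#' with a space
    letters_only = re.sub(r"[^a-zA-Z#]", " ", tweet)
    # Keep words of at least min_len characters (len(w) > 3 in the article)
    return " ".join(w for w in letters_only.split() if len(w) >= min_len)

result = clean_tweet("when a father is dysfunctional & so selfish... #run 2day!!")
print(result)  # when father dysfunctional selfish #run
```

Note that the length filter also removes short content words ("day" here), not only stop words.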


Now we will tokenize all the cleaned tweets in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens. We tokenize our tweets because we will apply Stemming in the next step.

tokenized_tweet = combine['Tidy_Tweets'].apply(lambda x: x.split())


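Stemming is the process of stripping suffixes such as “ing”, “ly”, “es” and “s” from each word of a tokenized tweet.

from nltk.stem import PorterStemmer

ps = PorterStemmer()  # assumed: ps is not defined in the text; NLTK's Porter stemmer fits the ps.stem() call
tokenized_tweet = tokenized_tweet.apply(lambda x: [ps.stem(i) for i in x])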

Combining Back To Tweet

for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])

combine['Tidy_Tweets'] = tokenized_tweet

Extract Hashtags

def Hashtags_Extract(x):
    hashtags = []
    # Collect the hashtags (without the '#') of every tweet
    for i in x:
        ht = re.findall(r'#(\w+)', i)
        hashtags.append(ht)
    return hashtags

Positive Hashtags

word_freq_positive = nltk.FreqDist(ht_positive_unnest)

Negative Hashtags

word_freq_negative = nltk.FreqDist(ht_negative_unnest)
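The lists ht_positive_unnest and ht_negative_unnest are not defined in the text. A plausible sketch of how they might be built is given below, assuming that label 0 marks the positive (non-racist) tweets and label 1 the negative ones. The toy tweets are hypothetical, and collections.Counter stands in for nltk.FreqDist (which is a Counter subclass) so the snippet needs only the standard library.

```python
import re
from collections import Counter

def hashtags_extract(tweets):
    """Return one list of hashtags (without '#') per tweet."""
    return [re.findall(r"#(\w+)", tweet) for tweet in tweets]

# Hypothetical toy data: label 0 = not racist, label 1 = racist
tweets = ["#love this #sunny day", "so much #hate here", "#love wins"]
labels = [0, 1, 0]

ht_positive = hashtags_extract(t for t, y in zip(tweets, labels) if y == 0)
ht_negative = hashtags_extract(t for t, y in zip(tweets, labels) if y == 1)

# Flatten the nested lists into one flat list of hashtags per class
ht_positive_unnest = [tag for tags in ht_positive for tag in tags]
ht_negative_unnest = [tag for tags in ht_negative for tag in tags]

word_freq_positive = Counter(ht_positive_unnest)
word_freq_negative = Counter(ht_negative_unnest)
print(word_freq_positive.most_common(2))
```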

Extracting Features from cleaned Tweets

Bag-of-Words

Bag of Words is a method to extract features from text documents, which can then be used to train machine learning algorithms. It builds a vocabulary of all the unique words occurring in the documents of the training set and represents each document by the number of times each vocabulary word occurs in it. It is called the bag-of-words approach because only the number of occurrences matters, not the sequence or order of the words. So, let’s apply this feature extraction technique to our dataset. Scikit-Learn provides a class called CountVectorizer to perform this task.

bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
bow = bow_vectorizer.fit_transform(combine['Tidy_Tweets'])
df_bow = pd.DataFrame(bow.todense())
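To make the idea concrete, here is a hand-rolled sketch of the count matrix on a hypothetical three-document corpus. It is not CountVectorizer itself and ignores its filtering options (max_df, min_df, max_features, stop_words); it only shows the kind of table that results.

```python
from collections import Counter

# Hypothetical toy corpus of already-cleaned tweets
docs = ["love love sunny", "sunny hate", "love hate hate"]

# Vocabulary: every unique word in the corpus, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = term count
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['hate', 'love', 'sunny']
print(rows)        # [[0, 2, 1], [1, 0, 1], [2, 1, 0]]
```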


TF-IDF stands for Term Frequency-Inverse Document Frequency, and the TF-IDF weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The weight is the product of two terms:

  1. The first term is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document.
  2. The second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.
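The two terms can be worked through by hand on a hypothetical toy corpus. This sketch follows the plain definitions given above; scikit-learn's TfidfVectorizer by default uses a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math

# Hypothetical toy corpus of already-cleaned, tokenized tweets
docs = [d.split() for d in ["love love sunny", "sunny hate", "love hate hate"]]

def tf(term, doc):
    """Occurrences of term in doc, divided by the number of words in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing term)."""
    n_containing = sum(term in doc for doc in docs)
    # Assumes the term occurs in at least one document (else division by zero)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "love" fills 2 of the 3 words of the first tweet and occurs in 2 of 3 tweets
score = tf_idf("love", docs[0], docs)
print(round(score, 4))  # 0.2703
```

A word that appears in every document gets an IDF of log(1) = 0, so its weight vanishes no matter how often it occurs.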


Scikit-Learn provides a class for this known as TfidfVectorizer.

tfidf = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(combine['Tidy_Tweets'])
df_tfidf = pd.DataFrame(tfidf_matrix.todense())

Splitting our dataset into Training and Validation Set

train_bow = bow[:31962]
train_tfidf_matrix = tfidf_matrix[:31962]
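The names x_train_tfidf, x_valid_tfidf and y_train_tfidf used when fitting the model are not defined in the text. One way they might be produced is with train_test_split, which is imported at the top. The sketch below uses a hypothetical toy matrix, and the test_size and random_state values are assumptions rather than the article's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix and the training labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Hold out 30% of the labelled rows for validation (assumed ratio)
x_train, x_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(x_train.shape, x_valid.shape)  # (7, 2) (3, 2)
```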

Applying ML Models

1. Supervised Machine Learning

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.

It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers, the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised learning problems can be further grouped into regression and classification problems.

Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.

Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”. Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.

Some popular examples of supervised machine learning algorithms are:

Linear regression for regression problems.

Random forest for classification and regression problems.

Support vector machines for classification problems.

2. Unsupervised Machine Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.

Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:

k-means for clustering problems.

Apriori algorithm for association rule learning problems.

Fitting the Random Forest Classifier model

rfc = RandomForestClassifier(n_estimators=10, criterion="entropy")
rfc.fit(x_train_tfidf, y_train_tfidf)

Predicting the Labels

prediction_tfidf = rfc.predict(x_valid_tfidf)

Calculating the Accuracy
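The accuracy computation itself is not shown. A minimal end-to-end sketch of fitting, predicting and scoring is given below, using accuracy_score and classification_report from the imports at the top. The data is synthetic (generated with make_classification), so the printed numbers say nothing about the tweet dataset, and the split settings and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
x_train, x_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same classifier settings as in the article, with a fixed seed for repeatability
rfc = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
rfc.fit(x_train, y_train)

# predict() returns class labels; accuracy is the share of correct labels
prediction = rfc.predict(x_valid)
accuracy = accuracy_score(y_valid, prediction)
print(accuracy)
print(classification_report(y_valid, prediction))
```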


Highest Accuracy -> Random Forest Classifier 95%


