Now we are jumping into the machine learning approach.
Unlike the lexicon approach, the machine learning approach is more complicated to implement. Sentiment in text can be analyzed with either supervised or unsupervised learning, depending on whether the data is labeled. An unsupervised approach does not require initial training labels with which sentiment is associated. For unsupervised learning in text analytics, clustering is the usual data segmentation method, and K-means or DBSCAN are commonly used to cluster text. For supervised learning, support vector machines (SVM) and Naïve Bayes are the common algorithms used for text classification problems.
For the machine learning approach, we need to do some extra pre-processing work first.
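The snippets below assume a few standard imports and NLTK resources are already in place, along with the review DataFrame df (with the raw text in a 'reviewText' column) loaded earlier in this article. Here is a minimal setup sketch; adjust the package and resource names to your environment.
#Minimal setup assumed by the following snippets
import re
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm, metrics
from sklearn.metrics import confusion_matrix

#NLTK resources used by the pre-processing steps
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')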
1. Cleaning the text
This step removes all special characters and numbers and converts the string to lowercase. Without converting to lowercase, we would run into issues when creating vectors from these words, because the same word in different cases would produce different features. The result can be seen in the column called “cleanedReview”.
#Define a function to clean the text
def clean(text):
    #Remove all special characters and numerals, leaving only letters
    text = re.sub('[^A-Za-z]+', ' ', str(text))
    #Lowercase conversion
    text = text.lower()
    return text

#Clean the text in the review column
df['cleanedReview'] = df['reviewText'].apply(clean)
df.head()
2. Tokenization
Tokenization refers to breaking a sentence down into a sequence of words called tokens. I will perform word-level tokenization using the NLTK function word_tokenize().
#Define a function to tokenize the text
def tokenizeText(text):
    tokens = nltk.word_tokenize(text)
    return tokens

#Apply the tokenize function to the cleaned text
df['tokenReview'] = df['cleanedReview'].apply(tokenizeText)
df.head()
3. Stopwords removal
Stopwords are words that carry very little meaning or useful information, for example ‘a’, ‘the’, ‘I’, and ‘you’. Figure 12 shows the list of English ‘stopwords’.
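If you want to inspect the list yourself, NLTK ships it with the stopwords corpus; a quick peek (the exact contents depend on your NLTK version):
#Peek at NLTK's built-in English stopword list
print(len(stopwords.words('english')))    #around 179 words, depending on the NLTK version
print(stopwords.words('english')[:15])    #['i', 'me', 'my', 'myself', 'we', 'our', ...]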
The implementation of stop-word removal is shown below. After removing all the stop words, the tokenized words are joined back into a sentence, but this time all articles and other low-information words have been eliminated. The result is shown in the ‘stopwordRemoved’ column.
#Define a function to remove stopwords
def stopword(text):
    new_text = " ".join(ele for ele in text if ele.lower() not in set(stopwords.words('english')))
    return new_text

#Apply the stopword function to the tokenized text
df['stopwordRemoved'] = df['tokenReview'].apply(stopword)
df.head()
4. Enrichment — POS tagging
After tokenization comes POS tagging, where POS stands for Part-of-Speech. This step feeds the next one, lemmatization, which is explained below. The idea is to tag each word with its part of speech: for example, VBN is a past-participle verb, NNS is a plural noun, and JJ is an adjective. The figure below shows the result in the column called “posReview”.
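To make those tag abbreviations concrete, here is a tiny illustration with nltk.pos_tag on a made-up sentence (the exact tags can vary slightly with the tagger version):
#Quick look at Penn Treebank tags on a sample sentence
print(nltk.pos_tag(nltk.word_tokenize("dogs chased the small cat")))
#Typically: [('dogs', 'NNS'), ('chased', 'VBD'), ('the', 'DT'), ('small', 'JJ'), ('cat', 'NN')]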
#Define a function to POS-tag the text
def POS(text):
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    newlist = []
    for word, tag in tags:
        if word.lower() not in set(stopwords.words('english')):
            newlist.append(tuple([word, tag]))
    return newlist

#Apply the POS function to the text with stopwords removed
df['posReview'] = df['stopwordRemoved'].apply(POS)
df[["cleanedReview","tokenReview", "stopwordRemoved","posReview"]].head()
5. Normalization
A stem is the part of a word responsible for its lexical meaning. The two techniques commonly used for normalizing words are stemming and lemmatization. Stemming chops off word endings and keeps the core of the word (which is not always a real word), while lemmatization maps the word back to its dictionary root, the lemma. An example is shown in the figure below.
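For a quick feel of the difference, here is a small sketch comparing the two on a couple of words (the outputs in the comments are what Porter stemming and WordNet lemmatization typically produce):
#Compare stemming and lemmatization on a couple of words
ps_demo = PorterStemmer()
lem_demo = WordNetLemmatizer()
print(ps_demo.stem("studies"), ps_demo.stem("caring"))                        #studi care  (chopped stems, not always real words)
print(lem_demo.lemmatize("studies"), lem_demo.lemmatize("caring", wn.VERB))   #study care  (dictionary roots)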
wordnet_lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()

#Map Penn Treebank tags to WordNet POS constants (default to noun)
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN

#Lemmatize each (word, tag) pair and join the lemmas back into a string
def lemmatize(pos_data):
    lemma_rew = " "
    for word, pos in pos_data:
        #lemma = ps.stem(word)   #stemming alternative, kept for reference
        lemma = wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew

df['Lemma'] = df['posReview'].apply(lemmatize)
df[["cleanedReview","tokenReview", "stopwordRemoved","posReview","Lemma"]].head()
6. Vectorization
In this approach, Natural Language Processing (NLP) plays a crucial role in text modeling. The data needs to go through feature engineering, the step of converting text into vectors, before any machine learning modeling. For example, Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) are NLP techniques that convert variable-length texts into fixed-length vectors, turning messy unstructured text into something a machine learning model can use.
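As a toy illustration of the idea (using a hypothetical three-review corpus), a bag-of-words count matrix turns each text, whatever its length, into a vector over the same fixed vocabulary:
#Toy bag-of-words example: three short "documents" become fixed-length count vectors
from sklearn.feature_extraction.text import CountVectorizer
toy = ["great phone great battery", "terrible battery", "great value"]
cv = CountVectorizer()
bow = cv.fit_transform(toy)
print(cv.get_feature_names_out())   #['battery' 'great' 'phone' 'terrible' 'value']
print(bow.toarray())
#[[1 2 1 0 0]
# [1 0 0 1 0]
# [0 1 0 0 1]]
TF-IDF builds the same kind of fixed-length vector but additionally down-weights words that appear in many documents, which is what the vectorizer below does.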
# sublinear_tf is set to True to use a logarithmic form for frequency
# min_df is the minimum number of documents a word must be present in to be kept
# norm is set to l2, to ensure all our feature vectors have a Euclidean norm of 1
# ngram_range is set to (1, 3) to indicate that we want to consider unigrams, bigrams and trigrams
# stop_words is set to "english" to remove common words
# ("a", "the", ...) and reduce the number of noisy features
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=10, norm='l2', ngram_range=(1, 3), stop_words='english')
X_train_vc = vectorizer.fit_transform(df["Lemma"])

#Inspect the resulting feature matrix (use get_feature_names() on older scikit-learn versions)
pd.DataFrame(X_train_vc.toarray(), columns=vectorizer.get_feature_names_out())
Now it’s time to train the models with different algorithms.
a.) K-means (Unsupervised learning)
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The K-means algorithm starts with an initial set of randomly placed centroids, which are used as the centers of the clusters, and then performs iterative calculations to optimize the positions of those centroids.
If you want to try this algorithm, you might need to delete the label column first, because unsupervised learning doesn’t use labels to form the clusters.
from sklearn.cluster import KMeans
from wordcloud import WordCloud

#desc_matrix is the TF-IDF feature matrix built earlier (e.g. X_train_vc from the vectorization step)
km = KMeans(n_clusters = 3, init = "k-means++", max_iter = 100, n_init = 1)
km.fit(desc_matrix)
clusters = km.labels_.tolist()

cluster = {'review': df['Lemma'].tolist(), 'star': df['starReview'].tolist(),
           'cluster': clusters}
clusterResult = pd.DataFrame(cluster)
clusterResult['cluster'].value_counts()

# Visualize the size of each cluster
fig = plt.figure(figsize = (5,5))
sns.countplot(x = 'cluster', data = clusterResult)

# Split the clusters into positive, negative and neutral groups
# (the cluster-to-sentiment mapping below is assigned by inspecting the clusters)
positive = clusterResult[clusterResult['cluster'] == 1]
negative = clusterResult[clusterResult['cluster'] == 2]
neutral = clusterResult[clusterResult['cluster'] == 0]

#Positive
text3 = ' '.join([word for word in positive['review']])
plt.figure(figsize = (20,15), facecolor = 'None')
wordcloud3 = WordCloud(max_words = 300, background_color = 'white', width = 1600, height = 800).generate(text3)
plt.title("Positive")
plt.imshow(wordcloud3)

#Negative
text4 = ' '.join([word for word in negative['review']])
plt.figure(figsize = (20,15), facecolor = 'None')
wordcloud4 = WordCloud(max_words = 300, background_color = 'white', width = 1600, height = 800).generate(text4)
plt.title("Negative")
plt.imshow(wordcloud4)

#Neutral
text5 = ' '.join([word for word in neutral['review']])
plt.figure(figsize = (20,15), facecolor = 'None')
wordcloud5 = WordCloud(max_words = 300, background_color = 'white', width = 1600, height = 800).generate(text5)
plt.title("Neutral")
plt.imshow(wordcloud5)
It seems this approach did not do well for sentiment classification: it groups the words by topic or meaning rather than by the feeling they express, as you can see in the word cloud visualizations.
b.) Naïve Bayes (Supervised learning)
Next, let’s go with Naive Bayes, a probability-based classification method that relies on Bayes’ theorem. Naive Bayes models are a class of incredibly simple and quick classifiers that are frequently appropriate for very high-dimensional datasets. Because they are so fast and have so few configurable parameters, they end up being a very useful quick-and-dirty baseline for a classification problem.
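As a quick intuition for Bayes’ theorem with made-up numbers: if 60% of reviews are positive and the word “great” appears in 30% of positive reviews but only 5% of negative ones, then a review containing “great” is very likely positive.
#Toy Bayes' rule calculation with made-up numbers: P(pos | "great")
p_pos, p_neg = 0.6, 0.4                  #class priors (illustrative only)
p_great_pos, p_great_neg = 0.30, 0.05    #P("great" | class), illustrative only
posterior = (p_great_pos * p_pos) / (p_great_pos * p_pos + p_great_neg * p_neg)
print(round(posterior, 2))               #0.9
Naive Bayes does the same kind of computation, but multiplies the (assumed independent) probabilities of every word in the review.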
#Vectorize features
#df_ready is the pre-processed DataFrame holding the lemmatized reviews ('Lemma') and their sentiment labels ('label')
vect = TfidfVectorizer(ngram_range=(1, 3)).fit(df_ready['Lemma'])
feature_names = vect.get_feature_names_out()   #use get_feature_names() on older scikit-learn versions
X = df_ready['Lemma']
Y = df_ready['label']
X = vect.transform(X)

#Build the model
#Train and test split (test size = 0.2, ratio = 80 / 20)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=24)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
#Define a function to evaluate the model
def evaluate(y_true, y_pred):
    acc = metrics.accuracy_score(y_true, y_pred)
    conf = confusion_matrix(y_true, y_pred)
    # prec = metrics.precision_score(y_true, y_pred)
    # sensitivity = metrics.recall_score(y_true, y_pred)
    return {'accuracy': acc,
            'confusion': conf
            # "precision": prec,
            # "sensitivity": sensitivity
            }
#Naive Bayes with 5-fold cross-validation
kf = KFold(n_splits = 5)
results_Naive_Bayes = []
for train_index, val_index in kf.split(X_train, y_train):
    X_trData, X_valData = X_train[train_index], X_train[val_index]
    y_trData, y_valData = y_train.iloc[train_index], y_train.iloc[val_index]

    # Model generation using Multinomial Naive Bayes
    clf = MultinomialNB(alpha = 0.6).fit(X_trData, y_trData)

    # Evaluate the model on the validation fold
    y_true = y_valData
    y_pred = clf.predict(X_valData)
    res = evaluate(y_true, y_pred)
    results_Naive_Bayes.append(res)
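The confusion-matrix plot below reads from results_Naive_Bayes_final, which the snippet above does not create; a minimal sketch, assuming it comes from refitting the model on the full training split and evaluating on the held-out test set:
#Refit on the full training split and evaluate on the held-out test set
#(assumed source of results_Naive_Bayes_final used in the plot below)
clf_final = MultinomialNB(alpha = 0.6).fit(X_train, y_train)
results_Naive_Bayes_final = [evaluate(y_test, clf_final.predict(X_test))]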
#Performance on the test set
#Visualize the confusion matrix
Class_label = ["Negative", "Neutral", "Positive"]
df_con_NB_op = pd.DataFrame(results_Naive_Bayes_final[0]['confusion'], index = Class_label, columns = Class_label)

sns.heatmap(df_con_NB_op, annot = True, cmap = "YlGnBu", fmt = 'd')
plt.title("Naive Bayes Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()
c.) SVM (Supervised learning)
A support vector machine is a supervised machine learning classifier. It uses a hyperplane as the decision boundary that separates the data points into classes. The data points that lie closest to the hyperplane are called support vectors; they determine the position of the hyperplane and maximize the margin of the classifier.
#SVM with 5-fold cross-validation
kf = KFold(n_splits = 5)
resultsSVM = []
for train_index, val_index in kf.split(X_train, y_train):
    X_trData, X_valData = X_train[train_index], X_train[val_index]
    y_trData, y_valData = y_train.iloc[train_index], y_train.iloc[val_index]

    # Model generation using a support vector classifier
    clf2 = svm.SVC()
    clf2.fit(X_trData, y_trData)

    # Evaluate the model on the validation fold
    y_true = y_valData
    y_pred = clf2.predict(X_valData)
    res = evaluate(y_true, y_pred)
    resultsSVM.append(res)
#Performance on the test set
#Visualize the confusion matrix
#(results_SVM_final is assumed to be built the same way as results_Naive_Bayes_final above,
# by refitting the SVM on the full training split and evaluating on the test set)
Class_label = ["Negative", "Neutral", "Positive"]
df_con_SVM_op = pd.DataFrame(results_SVM_final[0]['confusion'], index = Class_label, columns = Class_label)

sns.heatmap(df_con_SVM_op, annot = True, cmap = "YlGnBu", fmt = 'd')
plt.title("SVM Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()