How to perform Exploratory Data Analysis on text data for Natural Language Processing
Exploratory Data Analysis (EDA) for text data is more than counting characters and terms. To take your EDA to the next level, you can look at each word and categorize it or you can analyze the overall sentiment of a text.
In this article, we will look at some intermediate EDA techniques for text data:
To simplify the examples, we will use 450 positive reviews (rating == 5) and 450 negative reviews (rating == 1). This reduces the dataset to 900 rows, reduces the number of rating classes to two, and balances the positive and negative reviews.
Additionally, we will only use two columns: the review text and the rating.
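The balanced subset described above can be sketched with pandas. The column names ("text", "rating") match the article, but the toy DataFrame and the random_state are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the full review dataset (assumed structure)
df = pd.DataFrame({
    "text": [f"review {i}" for i in range(2500)],
    "rating": [1, 2, 3, 4, 5] * 500,
})

# Draw 450 reviews from each of the two extreme rating classes
positive = df[df["rating"] == 5].sample(450, random_state=42)
negative = df[df["rating"] == 1].sample(450, random_state=42)

# Keep only the two columns we need
df = pd.concat([positive, negative])[["text", "rating"]].reset_index(drop=True)

print(df.shape)  # (900, 2)
```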
The DataFrame’s head of the reduced dataset looks like this:
In the fundamental EDA techniques, we covered the most frequent words and bi-grams and noticed that adjectives like “great” and “perfect” were among the most frequent words in the positive reviews.
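As a refresher, the word and bi-gram frequency counts from the fundamental techniques can be sketched with nothing more than the standard library; the token list here is made up for illustration:

```python
from collections import Counter

tokens = ["great", "fit", "great", "fit", "perfect", "dress", "great", "fit"]

# Most frequent single words
print(Counter(tokens).most_common(2))   # [('great', 3), ('fit', 3)]

# Most frequent bi-grams: pair each token with its successor
bigrams = list(zip(tokens, tokens[1:]))
print(Counter(bigrams).most_common(1))  # [(('great', 'fit'), 3)]
```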
With POS tagging, you can refine the EDA of the most frequent terms. E.g., you could explore which adjectives or verbs are most common.
POS tagging takes every token in a text and categorizes it as a noun, verb, adjective, and so on, as shown below:
If you are curious about how I visualized this sentence, you can check out my tutorial here:
To check which POS tags are the most common, we will start by creating a corpus of all review texts in the DataFrame:
corpus = df["text"].values.tolist()
Next, we’ll tokenize the entire corpus as preparation for POS tagging.
from nltk import word_tokenize
tokens = word_tokenize(" ".join(corpus))
Then, we’ll POS tag each token in the corpus with the coarse tag set “universal”:
import nltk
tags = nltk.pos_tag(tokens, tagset = "universal")
As in the Term Frequency analysis of the previous article, we will create a list of tags by removing all stopwords. Additionally, we will only include words of a specific tag, e.g. adjectives.
Then all we have to do is use the Counter class as in the previous article.
from collections import Counter
from nltk.corpus import stopwords
import seaborn as sns

tag = "ADJ"
stop = set(stopwords.words("english"))

# Get all tokens that are tagged as adjectives
tags = [word for word, pos in tags if ((pos == tag) & (word not in stop))]

# Count most common adjectives
most_common = Counter(tags).most_common(10)

# Visualize most common tags as bar plots
words, frequency = [], []
for word, count in most_common:
    words.append(word)
    frequency.append(count)

sns.barplot(x = frequency, y = words)
Below, you can see the top 10 most common adjectives for the negative and positive reviews:
From this analysis, we can see that words like "small", "fit", "big", and "large" are most common. This might indicate that customers are more often disappointed with a piece of clothing's fit than with, e.g., its quality.
The main idea of sentiment analysis is to get an understanding of whether a text has a positive or negative tone. E.g., the sentence “I love this top.” has a positive sentiment, and the sentence “I hate the color.” has a negative sentiment.
You can use TextBlob for simple sentiment analysis as shown below:
from textblob import TextBlob

blob = TextBlob("I love the cut")
blob.polarity
Polarity is an indicator of whether a statement is positive or negative and is a number between -1 (negative) and 1 (positive). The sentence “I love the cut” has a polarity of 0.5, while the sentence “I hate the color” has a polarity of -0.8.
The combined sentence “I love the cut but I hate the color” has a polarity of -0.15.
For multiple sentences in a text, you can get the polarity of each sentence as shown below:
text = "I love the cut. I get a lot of compliments. I love it."
[sentence.polarity for sentence in TextBlob(text).sentences]
This code returns an array of polarities of [0.5, 0.0, 0.5]. That means that the first and last sentences have a positive sentiment while the second sentence has a neutral sentiment.
If we apply this sentiment analysis to the whole DataFrame like this,
import numpy as np

df["polarity"] = df["text"].map(lambda x: np.mean([sentence.polarity for sentence in TextBlob(x).sentences]))
we can plot a boxplot comparison with the following code:
sns.boxplot(data = df,
y = "polarity",
x = "rating")
Below, you can see the polarity boxplots for the negative and positive reviews:
As you would expect, we can see that negative reviews (rating == 1) have an overall lower polarity than positive reviews (rating == 5).
In this article, we looked at some intermediate EDA techniques for text data:
- Part-of-Speech Tagging: We looked at Part-of-Speech tagging and how to use it to get the most frequent adjectives as an example.
- Sentiment Analysis: We looked at sentiment analysis and explored the review texts’ polarities.
Below you can find all code snippets for quick copying: