Get rid of meaningless TFIDF features and make your model breathe fresh air with this simple step.
TFIDF remains one of my favorite word embedding techniques, even though GPT-3 and other transformer-based models have long taken the state of the art by storm. It’s simple to grasp and a good starting point for Natural Language Processing and information retrieval. I still use it from time to time to train baseline models, as it can be implemented quickly. And some problems simply don’t require the SOTA.
Unfortunately, I cannot help but feel physically uncomfortable thinking back to the times I trained models on TFIDF features without properly validating or selecting the extracted features. In other words, naively configuring the parameters of the feature extractor by tracking only model performance, instead of understanding the underlying extracted features.
In this blog, I will show you one simple, often overlooked step to extract the most meaningful features from your dataset and boost your model’s performance.
Text data can contain a large vocabulary, and many tokens can be mistaken for meaningful vocabulary. Prior to feature extraction with TFIDF, it is important to understand how clean your textual data is. It is good practice to clean and normalize your data as much as possible: filter out stop words, symbols, and numbers, and lemmatize words. For example, when working with Twitter data, you could remove mentions and URLs, as they will likely not be useful for making predictions.
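As a minimal sketch of this kind of normalization, here is a small cleaning function for tweets. The regex patterns and the function name are illustrative choices, not part of the original pipeline:

```python
import re

def clean_tweet(text: str) -> str:
    """Normalize a raw tweet: drop mentions, URLs, symbols, and numbers."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_tweet("@united my flight to NYC was delayed 3 hours! https://t.co/xyz"))
# → "my flight to nyc was delayed hours"
```

Stop-word removal and lemmatization (e.g. with NLTK or spaCy) would slot in after this step.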
Ultimately, we want features that are representative and make sense for our model to learn. Most importantly, we want to keep the number of features limited, so we do not end up with sparse vectors and unnecessarily high dimensions. The goal is to make room for the best possible features for the model to learn, and to filter out the noise that somehow receives a TFIDF score in our dataset. The way to do that is to first make sure you understand your textual data, normalize it if needed, and then apply some form of feature selection to the extracted features.
The first question you may ask is, why is feature selection even necessary? Isn’t the whole point of TFIDF to extract meaningful features from a large set of possible features? Well, yes. However, TFIDF does not guarantee that the extracted features will be effective. In other words, your TFIDF vectorizer may extract words and characters that carry little meaning for the classes you are trying to predict. Because of this, we should apply a feature selection method that keeps the TFIDF features most informative for predicting the target classes.
So what is the CHI² test?
“Pearson’s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.”
Simply put, it’s a test to determine whether two categorical variables are independent. It is a good exercise to take a toy example and manually calculate the CHI² statistic from the observed and expected frequencies, in order to really grasp what the statistic and the CHI² distribution mean.
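Here is one such toy example, with made-up counts: a 2×2 contingency table asking whether the presence of the term “delayed” depends on tweet sentiment. The CHI² statistic is computed by hand from observed and expected frequencies:

```python
# Toy example: does the term "delayed" depend on the sentiment class?
# Observed tweet counts:  contains "delayed"   does not
observed = [[40, 60],   # negative tweets
            [5, 95]]    # positive tweets

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: row_total * col_total / total
        exp = row_totals[i] * col_totals[j] / total
        chi2 += (obs - exp) ** 2 / exp

print(round(chi2, 2))  # → 35.13
```

The expected counts under independence are 22.5 and 77.5 per row, and the resulting statistic of about 35.13 is far above the 3.84 critical value for one degree of freedom at the 5% level, so we would reject independence: “delayed” and sentiment are related.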
In this blog, we will calculate CHI² to select TFIDF terms that are dependent on the class we are trying to predict. Note that this method does not guarantee that every selected term is significant at the 5% level; that would require a more rigorous approach, which is out of scope for this blog.
To demonstrate feature selection with CHI², I will use the Twitter US Airline Sentiment data to train and test a sentiment prediction model. You can download the train and test data here.
Loading and cleaning the Twitter data
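A minimal sketch of this step, using a tiny in-memory stand-in for the CSV; the real Kaggle dataset is loaded the same way with `pd.read_csv`, and the column names `text` and `airline_sentiment` match that dataset:

```python
import pandas as pd

# Tiny stand-in for the Tweets.csv file from the Kaggle
# "Twitter US Airline Sentiment" dataset.
df = pd.DataFrame({
    "text": ["@VirginAmerica plus you've added commercials... http://t.co/x",
             "@united thanks, great flight!"],
    "airline_sentiment": ["negative", "positive"],
})

df["clean_text"] = (
    df["text"].str.lower()
      .str.replace(r"@\w+|https?://\S+", " ", regex=True)  # mentions, URLs
      .str.replace(r"[^a-z\s]", " ", regex=True)           # symbols, numbers
      .str.replace(r"\s+", " ", regex=True).str.strip()
)
X, y = df["clean_text"], df["airline_sentiment"]
print(X.tolist())  # → ['plus you ve added commercials', 'thanks great flight']
```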
Using CHI² to select meaningful features
In order to be able to compare both methods, we’ve added a flag to disable the CHI² feature selection.
Examine the extracted features (with and without CHI²)
Let’s look at some of the extracted features, with and without CHI² selection. We can clearly see an increase in the quality of the extracted terms, and more of the complex terms that express sentiment.
Training and evaluating the model
A multinomial Naive Bayes model is trained on the selected features. The confusion matrices below clearly show the difference in performance with and without CHI² feature selection.
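The full pipeline can be sketched as vectorizer → selector → classifier; the toy training and test tweets below stand in for the real dataset, and `k=10` is an illustrative choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# Toy stand-in for the airline tweets (the real dataset has ~14k tweets).
train_texts = [
    "delayed for hours terrible customer service",
    "worst flight ever rude staff",
    "lost my luggage awful experience",
    "cancelled flight no refund angry",
    "great crew smooth flight thank you",
    "friendly staff loved the service",
    "amazing experience will fly again",
    "on time and comfortable seats",
]
train_labels = ["negative"] * 4 + ["positive"] * 4

model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),  # keep the 10 most class-dependent terms
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

test_texts = ["rude staff and delayed flight", "loved the friendly crew thank you"]
test_labels = ["negative", "positive"]
print(confusion_matrix(test_labels, model.predict(test_texts),
                       labels=["negative", "positive"]))
```

Dropping the `SelectKBest` step from the pipeline gives the baseline for comparison.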
You can find the complete code in the following Colab notebook.
It’s extremely important to understand the features your model learns from to make predictions. You should never train models without properly validating your input features; doing so inevitably leads to the classic garbage in, garbage out.
In this blog, we used CHI² not only to reduce the dimensionality of our TFIDF vectors, but also to boost model performance by selecting the most meaningful features for our sentiment analysis problem.
I hereby hope to have improved your sentiment with this blog and as always, happy coding!