
Your TFIDF features are garbage. Here’s how to fix it. | by Kenan Ekici | Sep, 2022

September 11, 2022
in Machine Learning


Get rid of meaningless TFIDF features and make your model breathe fresh air with this simple step.

TFIDF remains one of my favorite word embedding techniques, despite the fact that GPT-3 and other transformer-based models have long taken the state of the art by storm. It is simple to grasp and a good starting point for Natural Language Processing and information retrieval. I still use it from time to time for training baseline models, as it can be implemented quickly. And some problems simply don't require the SOTA.

Unfortunately, I cannot help but feel physically uncomfortable thinking back to the times when I trained models on TFIDF features without properly validating or selecting the extracted features. In other words, I naively configured the parameters of the feature extractor by tracking only model performance, instead of understanding the underlying extracted features.

In this blog, I will show you one simple, often overlooked and underutilized step to extract the most meaningful features from your dataset and boost your model performance.

Text data can contain a large vocabulary, and many of its words can easily be mistaken for meaningful ones. Prior to doing feature extraction with TFIDF, it is important that you understand the cleanliness of your textual data. It is good practice to clean and normalize your data as much as possible by filtering out stop-words, symbols and numbers, and by lemmatizing words. For example, when working with Twitter data, you could remove mentions and URLs, as they will likely not be useful for making predictions.
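To make this concrete, here is a minimal sketch of such a cleaning step using NLTK. The regexes, the stop-word list, and the `clean_tweet` helper are illustrative choices, not the article's original code.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def clean_tweet(text: str) -> str:
    """Lowercase, strip mentions/URLs/symbols/numbers, drop stop-words, lemmatize."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)           # mentions
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # symbols and numbers
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("@united your app keeps crashing!! 3 times today https://t.co/abc"))
# prints something like: app keep crashing time today
```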

Ultimately, we want features that are representative and that make sense for our model to learn. And most importantly, we want to keep the number of features limited so we do not end up with sparse vectors and unnecessarily high dimensions. The goal is to make room for the best possible features for the model to learn, and to filter out the noise that somehow receives a TFIDF score in our dataset. The way to do that is to first make sure you understand your textual data, normalize it if needed, and then apply some type of feature selection on the extracted features.

The first question you may ask is: why is feature selection even necessary? Isn't the whole point of TFIDF to extract meaningful features from a large set of possible features? Well, yes. However, TFIDF does not guarantee that the extracted features will be effective. In other words, your TFIDF vectorizer may extract words and characters that carry little meaning for the classes you are trying to predict. Because of this, we must apply a feature selection method that selects the TFIDF features that are most informative for predicting your target classes.

So what is the CHI² test?

“Pearson’s chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.”

Simply put, it's a test to determine whether two categorical variables are independent. It is a good exercise to take a toy example and manually calculate the CHI² metric from the observed and expected frequencies, in order to really grasp what the metric and the CHI² distribution mean, as in the sketch below.
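As a quick illustration, the CHI² statistic for a single term versus the target class can be computed by hand from a contingency table and compared against SciPy. The counts below are entirely made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table (made-up counts): rows = term present / absent,
# columns = tweet is negative / not negative.
observed = np.array([[30, 10],
                     [70, 190]])

# Expected counts under independence: (row total x column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)

print(chi2_manual, chi2_scipy, p_value)  # the two CHI² values should be identical
```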

In this blog, we will calculate CHI² to select the TFIDF terms that are dependent on the class we are trying to predict. This method does not guarantee that you end up with terms below the 5% significance level; that would require a more rigorous approach, which is out of scope for this blog.

To demonstrate feature selection with CHI², I will use the Twitter US Airline Sentiment data to train and test a sentiment prediction model. You can download the train and test data here.

Loading and cleaning the Twitter data
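A sketch of what loading and cleaning could look like with pandas. The file names (`train.csv`, `test.csv`) and column names (`text`, `airline_sentiment`) are assumptions; adjust them to the files you downloaded.

```python
import pandas as pd

# File and column names are assumptions; adjust them to the downloaded files.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Apply the cleaning helper sketched earlier to the raw tweet text.
train_df["clean_text"] = train_df["text"].astype(str).apply(clean_tweet)
test_df["clean_text"] = test_df["text"].astype(str).apply(clean_tweet)

X_train_text, y_train = train_df["clean_text"], train_df["airline_sentiment"]
X_test_text, y_test = test_df["clean_text"], test_df["airline_sentiment"]

print(y_train.value_counts())
```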

Using CHI² to select meaningful features

In order to be able to compare both methods, we’ve added a flag to disable the CHI² feature selection.
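Below is a sketch of this step with scikit-learn. The `USE_CHI2` flag and the number of terms to keep (`K_BEST`) are illustrative names and values, not the article's exact code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

USE_CHI2 = True   # flag to switch the CHI² selection on and off
K_BEST = 2000     # number of TFIDF terms to keep; an arbitrary illustrative value

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

if USE_CHI2:
    # Keep only the terms whose frequencies depend most strongly on the target class.
    selector = SelectKBest(chi2, k=min(K_BEST, X_train.shape[1]))
    X_train = selector.fit_transform(X_train, y_train)
    X_test = selector.transform(X_test)
```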

Examine the extracted features (with and without CHI²)
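One way to inspect the vocabulary, assuming the `vectorizer` and `selector` from the previous sketch, is to print the top-ranked terms:

```python
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())

if USE_CHI2:
    # Terms that survived the selection, ranked by their CHI² score.
    mask = selector.get_support()
    selected, scores = feature_names[mask], selector.scores_[mask]
    print(selected[np.argsort(scores)[::-1]][:20])
else:
    # Without selection: simply the first entries of the full (alphabetical) vocabulary.
    print(feature_names[:20])
```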

Let's look at some of the extracted features, with and without CHI² selection. We can clearly see an increase in the quality of the extracted terms, as well as more complex terms for expressing sentiment.

Extracted TFIDF terms with CHI² selection (left) and without (right)

Training and evaluating the model
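A minimal training and evaluation sketch with scikit-learn, reusing the feature matrices built above:

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```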

A multinomial Naive Bayes model is trained on the selected features. The confusion matrices below clearly show the difference in performance with and without CHI² feature selection.

We can see the difference in performance with CHI² feature selection (left) and without (right)

You can find the complete code in the following Colab notebook.

It is extremely important to understand the features your model uses to make predictions. You should never train models without properly validating your input features; otherwise, you evidently end up with the classic garbage in, garbage out.

In this blog, we used CHI² not only to reduce the dimensionality of our TFIDF vectors, but also to boost model performance by extracting the most meaningful features for our sentiment analysis problem.

I hope to have improved your sentiment with this blog, and as always, happy coding!


