Using Execute Python operator for calculating Feature Importance in RapidMiner Studio | by Mostafa Saeidi | Jan, 2023

January 18, 2023


Feature Selection Importance

In machine learning, one of the main components of data preprocessing is feature selection. Each column in the dataset that is fed into a machine learning model is called a feature, also known as a variable or attribute. If we train a model on too many features, it can pick up unimportant or spurious patterns. Feature Selection is the process of choosing the most important features for developing a predictive model: it reduces the number of input variables by removing redundant or irrelevant features, keeping only those most relevant to our modeling approach. Feature selection provides the following benefits:

1) Simplifies modeling by making the results easier to explain

2) Reduces training time (and scoring time) and storage requirements

3) Can improve model accuracy

We can classify the feature selection methods for supervised approaches (labeled datasets) into three main groups:

1) Filter methods: A statistical metric is used to score features and remove irrelevant ones. Information gain, Fisher score, and the ANOVA F-value are some examples of filter methods (a minimal sketch follows this list).

2) Wrapper methods: Different subsets of features are trained and evaluated against one another, which can also reveal interactions between features. Popular wrapper methods include Backward Elimination, Forward Selection, and Recursive Feature Elimination.

3) Embedded methods: A machine learning algorithm calculates the importance of each feature as part of its own training. Lasso Regression and tree-based models such as Random Forest (via its feature importances) are among the most popular embedded methods.
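To make the filter approach concrete, here is a minimal sketch using scikit-learn's SelectKBest with a univariate F-test score; the data is synthetic and purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Filter method: score each feature with the univariate F-test and keep the top 3
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
```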

In this article, we will focus on embedded methods and implement feature selection based on an XGBoost model.

Hands-on Experience in RapidMiner

We use the latest version of RapidMiner Studio (V10.0) to implement an example.

EDA

We use the House Price dataset [1], which contains 81 columns describing almost every aspect of 1,460 residential homes in Ames, Iowa. This dataset is mainly used to predict the final sale price of each home based on the described features. “SalePrice” is our dependent variable (DV).

By performing a quick exploratory data analysis (EDA) over the dataset, we can see that some features are more relevant to “SalePrice” than others. For example, in the following plot there is an almost linear relationship between “OverallQual” and “SalePrice”. Moreover, we colorized this plot by “YearBuilt”, which shows that newer homes tend to have higher overall quality and sale prices.

Figure 1: Scatter plot of “SalePrice” and “OverallQual”. Color is based on “YearBuilt”
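A quick way to reproduce this kind of check in plain Python is to rank the numeric features by their correlation with the target. A minimal sketch, assuming the Kaggle training file [1] has been saved locally as train.csv (the file name is an assumption):

```python
import pandas as pd

# Assumption: the Kaggle House Prices training set [1], saved locally as train.csv
df = pd.read_csv("train.csv")

# Rank numeric features by their correlation with the target
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr.head(10))  # "OverallQual" typically appears near the top
```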

As mentioned above, this dataset has 81 features, which complicates the EDA process and the interpretation of the final model. As the first preprocessing step, we use the Turbo Prep tool in RapidMiner to remove features with low-quality data. The tool automatically flags features with high nominal or integer ID-ness, too many missing values, high stability (near-constant values), or too many categories, and removes them.

Figure 2: Removing low-quality features in Turbo Prep

After performing Turbo Prep, we managed to reduce the number of variables from 81 to 58.

Then, we removed highly correlated features by using the Remove Correlated Attributes operator.
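A rough Python analogue of what this operator does is sketched below, continuing from the df loaded above. It only considers numeric columns, and the 0.95 threshold is an assumption rather than the operator's actual setting:

```python
import numpy as np

# Drop one feature from every pair whose absolute correlation exceeds the threshold
corr_matrix = df.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```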

Since many of the features are categorical, we use the Nominal to Numerical operator to convert them into numerical variables. Because the initial dataset contains many categorical variables, this encoding increases the number of features to 229.
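In Python, this step roughly corresponds to dummy coding with pandas (again continuing from df; whether this matches the operator's exact coding scheme is an assumption):

```python
import pandas as pd

# Expand each categorical (object-typed) column into 0/1 indicator columns
categorical_cols = df.select_dtypes(include="object").columns
df_encoded = pd.get_dummies(df, columns=list(categorical_cols))

print(df_encoded.shape)  # the column count grows with the number of categories
```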

Feature Selection

The next step is feature selection. We plan to develop an XGBoost model in Python for calculating the importance of input variables.

The importance score shows how valuable each variable is in the trained model. The more a variable is used to make key decisions in the decision trees, the higher its importance. This importance is calculated for every input variable, and the variables are then ranked and compared with each other. For a single decision tree, importance is calculated from the amount by which each split point improves the performance measure, weighted by the number of observations in the node. The feature importances are then averaged across all of the decision trees within the model [3].
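As a point of reference, XGBoost exposes several importance definitions; the "gain" type corresponds to the split-improvement measure described above. A small sketch on synthetic, purely illustrative data:

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)

model = XGBRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# "gain" averages the improvement contributed by splits on each feature;
# "weight" simply counts how often a feature is used to split
booster = model.get_booster()
print(booster.get_score(importance_type="gain"))
print(booster.get_score(importance_type="weight"))
```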

To do this, we use the Execute Python operator in RapidMiner Studio to complete our workflow. The Execute Python operator is part of the Python Scripting extension, which is available in the Marketplace and accessible via the Extensions menu. The Python Scripting extension allows you to execute Python scripts within a RapidMiner process.

Figure 3: Python Scripting extension available in RapidMiner Marketplace

The following picture shows the steps we take to build this process in RapidMiner.

Figure 4: RapidMiner process (workflow) for feature selection

How to use Execute Python operator in RapidMiner

In the Execute Python operator, we define “SalePrice” as the DV; the rest of the variables are the independent variables (IVs). We then train an XGBoost model (XGBRegressor) and read its feature_importances_ attribute to extract the importance of each feature. Finally, we build a ranked list of the most important features and pass it to the operator’s output.

Figure 5: The feature selection script within Execute Python operator
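In outline, the script could look like the sketch below. We assume the extension's usual convention of an rm_main function that receives and returns pandas DataFrames; the actual script in Figure 5 may differ in details such as hyperparameters:

```python
import pandas as pd
from xgboost import XGBRegressor

def rm_main(data):
    # "SalePrice" is the dependent variable; all remaining columns are IVs
    y = data["SalePrice"]
    X = data.drop(columns=["SalePrice"])

    # Train the model (random_state is an illustrative choice)
    model = XGBRegressor(random_state=42)
    model.fit(X, y)

    # Rank the features from most to least important and return the table
    importance = pd.DataFrame({
        "feature": X.columns,
        "importance": model.feature_importances_,
    }).sort_values("importance", ascending=False)
    return importance
```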

We connect the output port to the result port and run the process. The table of important features opens in the Results view; the top 20 are shown in the following table:

Figure 6: The list of top 20 important features for predicting “SalePrice”

By identifying the most important features, we can use them to develop a model that predicts “SalePrice”, the sale price of each home.

Final words

Feature selection is always an important step for simplifying the modeling or EDA of datasets with many columns. Although many powerful feature selection methods are available in RapidMiner, we can always use the Execute Python operator to implement any other method of our choice.

References:

[1] House Prices – Advanced Regression Techniques, Kaggle: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

[2] RapidMiner Studio Help (within the software)

[3] Feature Importance and Feature Selection With XGBoost in Python, Machine Learning Mastery: https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/


