In machine learning, feature selection is the process of selecting a subset of the predictor (independent) variables that matter most to model building. Its benefits are numerous: it helps avoid the curse of dimensionality, improves generalization by reducing overfitting, shortens model training time, can improve accuracy, and yields a computationally cheaper model.

The idea behind feature selection is that real-world data contains features that are irrelevant to model building, and removing them does not cost us a significant amount of information. However, it is important to note that feature selection is not the same as feature extraction.

The techniques can be either model-specific or model-agnostic, and they fall into three groups:

- Filter methods.
- Wrapper methods.
- Embedded methods.

I will illustrate the filter methods using the Boston house price dataset. The dataset has fourteen attributes, one of which is the target: the median value of house prices in Boston (MEDV). Removing the target leaves us with thirteen predictor variables.

**FILTER METHODS**

These methods are fast, easy to compute, and less computationally expensive than the other approaches. They are usually treated as part of the preprocessing steps and are well suited for quickly screening out irrelevant features. Because filter techniques are not algorithm-specific, the features they select are not tuned to any particular type of predictive model. They are general-purpose and will usually yield lower performance than wrapper methods. To get a better model, it is advisable to follow a filter method with a wrapper or embedded method.

The following are techniques that can be used as filter methods.

i. *Basic method:* This involves removing constant and quasi-constant features. Constant features take the same value in every observation, while quasi-constant features, as the name suggests, take the same value in the vast majority of observations, e.g. when one value appears 99.1% of the time.

Luckily, scikit-learn's VarianceThreshold can remove all constant and quasi-constant variables from our dataset by adjusting its threshold argument.

- VarianceThreshold takes a threshold argument: set to 0, it removes only constant variables; set to 0.01, it removes both constant and quasi-constant variables. Afterwards, the get_support method shows which variables were retained. Taking the sum over get_support on the Boston data returns 13, the same number of features we started with, which implies the dataset contains no constant or quasi-constant features.
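A minimal sketch of how VarianceThreshold behaves, using a small synthetic array rather than the Boston data (the column construction here is purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: columns 0-1 are ordinary features, column 2 is constant,
# and column 3 is quasi-constant (non-zero in only 1% of rows).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[:, 2] = 1.0
X[:, 3] = 0.0
X[:10, 3] = 1.0

# threshold=0 would drop only the exactly constant column;
# threshold=0.01 also drops the quasi-constant one (variance ~0.0099).
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
print(selector.get_support())        # boolean mask of retained columns
print(selector.get_support().sum())  # number of features kept
```

On the Boston data the same sum comes out to 13, i.e. nothing is removed.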

ii. *Univariate feature selection:* This works by selecting the best features of the dataset based on a univariate statistical test, e.g. the chi-squared test or Pearson correlation.

Scikit-learn implements this through classes such as SelectKBest and SelectPercentile.

For example, SelectKBest takes as input an appropriate statistical test as its scoring function along with the desired number of features.
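A short sketch with SelectKBest; since load_boston has been removed from recent scikit-learn releases, synthetic regression data stands in for the housing dataset here:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 13 features, 5 of which actually drive the target.
X, y = make_regression(n_samples=500, n_features=13, n_informative=5,
                       noise=10.0, random_state=42)

# Keep the 5 features with the strongest univariate F-statistic.
skb = SelectKBest(score_func=f_regression, k=5)
X_new = skb.fit_transform(X, y)
print(X_new.shape)  # (500, 5)
```

For a classification problem, the scoring function would change accordingly, e.g. to chi2 or f_classif.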

iii. *Correlation matrix*: Whether or not a dataset has a large number of variables, an effective way to understand the relationships between them is correlation, which can also be used to predict one variable from another. An important feature is one that is highly correlated with the target variable and uncorrelated with the other predictor variables. Using Pearson correlation, the returned coefficients range from +1 to -1. If two predictor variables have a coefficient of 0, they are uncorrelated and changing one will not affect the other. When two features are highly correlated, however, we can decide which one to drop and which to keep.

Plotting the correlation matrix as a heat map, the very light pink and deep purple cells mark the highly correlated pairs. A closer look at the map shows that the variable 'TAX' is highly correlated with another independent variable, 'RAD', so we must drop one of the two.
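One way to automate the drop is to scan the upper triangle of the absolute correlation matrix. The data below is synthetic, with columns named 'RAD' and 'TAX' after the pair flagged above purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rad = rng.normal(size=300)
df = pd.DataFrame({
    "RAD": rad,
    "TAX": 0.9 * rad + rng.normal(scale=0.1, size=300),  # strongly tied to RAD
    "CRIM": rng.normal(size=300),                        # independent column
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once,
# then mark one member of every pair with |r| > 0.8 for removal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)               # ['TAX']
print(list(reduced.columns))
```

The 0.8 cutoff is an arbitrary illustrative choice; the right threshold depends on the problem.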

**WRAPPER METHODS**

Unlike filter methods, which are not model-specific, wrapper methods use a predictive model to add features to, or remove them from, the subset used in training. Since a new model is trained on each candidate subset, they are computationally expensive; in exchange, they usually find the best-performing feature subset for a given model. Common examples include Forward Selection, Backward Elimination, Recursive Feature Elimination, and Recursive Feature Elimination with cross-validation.

i. **Forward Selection**: This is an iterative method. It starts with no features; in each iteration, the feature that best improves model performance is added, until adding new features no longer improves the model. The best features are determined by a pre-set evaluation criterion, such as ROC AUC for classification problems or R² for regression problems. MLxtend is a Python package that allows us to perform forward selection.

ii. **Backward Elimination:** Here, we start with all the features and remove the least important one at each iteration. This continues until no improvement is observed from removing further features.
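Both directions can be sketched with scikit-learn's SequentialFeatureSelector (the MLxtend package mentioned above offers a similar SequentialFeatureSelector class). The data and subset size here are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Forward selection: start from the empty set and greedily add the feature
# that most improves cross-validated R^2, until 3 features are selected.
forward = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=3,
                                    direction="forward",
                                    scoring="r2", cv=5)
forward.fit(X, y)
print(forward.get_support())  # mask of the 3 chosen features

# Backward elimination is the same API, starting from all features
# and removing the least useful one at each step.
backward = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="backward",
                                     scoring="r2", cv=5)
backward.fit(X, y)
```

Note that scikit-learn's version stops at a requested subset size, whereas stopping "when performance no longer improves" requires configuring a tolerance or using MLxtend's implementation.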

**EMBEDDED METHODS**

Embedded feature selection methods are unique in that they combine the attributes of the other two approaches: algorithms that use this method perform feature selection as a built-in part of model training. Common examples are LASSO and Ridge regularization.
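As a minimal sketch of the embedded idea, an L1-penalized (LASSO) regression shrinks the coefficients of uninformative features to exactly zero while fitting, so the fitted coefficients double as a selection mask. The data and the alpha value below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first three columns actually drive the target.
y = 5 * X[:, 0] + 4 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=0.5, size=300)

# The L1 penalty zeroes out irrelevant coefficients during fitting;
# in practice alpha would be tuned, e.g. with LassoCV.
lasso = Lasso(alpha=0.5).fit(X, y)
mask = lasso.coef_ != 0
X_selected = X[:, mask]
print(mask)              # True only for the three informative columns
print(X_selected.shape)
```

Ridge (L2) regularization, by contrast, shrinks coefficients toward zero without eliminating them outright, so it ranks features rather than selecting them.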

*I will write a detailed article about the Wrapper Method and Embedded Method in later posts.*
