Machine learning works on a simple rule: garbage in, garbage out. By garbage, I mean noise in the data. This matters even more when the number of features is very large. You need not use every feature at your disposal to build a model. You can help your algorithm by feeding it only the features that are really important. I have myself seen feature subsets give better results than the complete feature set for the same algorithm.

**Top reasons to use feature selection are:**

It enables the machine learning algorithm to train faster.

It can improve the accuracy of a model.

It can reduce overfitting.

**FILTER METHODS**

**Numerical Input, Numerical Output**

Pearson’s Correlation: Its value ranges from -1 to +1, where 0 means no linear correlation. The value must be interpreted; often a value below -0.5 or above 0.5 indicates a notable correlation, while values closer to 0 suggest a weaker one.

Pearson’s correlation coefficient = covariance (X, Y) / (stdv(X) * stdv(Y))

where covariance is a statistical measure of how two variables vary together. For n paired observations with sample means xm and ym: Cov(x, y) = SUM [(xi - xm) * (yi - ym)] / (n - 1)
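To make the formula concrete, here is a minimal sketch in Python that computes Pearson’s coefficient directly from the covariance and standard deviations. The function name `pearson_corr` and the sample data are my own illustrations; NumPy is assumed to be available.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson's r = Cov(x, y) / (stdv(x) * stdv(y)), using sample (n - 1) statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Sample covariance: SUM[(xi - xm) * (yi - ym)] / (n - 1)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    # ddof=1 gives the sample standard deviation, matching the (n - 1) above
    return cov / (x.std(ddof=1) * y.std(ddof=1))

# Illustrative data only: y grows roughly linearly with x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson_corr(x, y))
```

The result should agree with NumPy’s built-in `np.corrcoef(x, y)[0, 1]`, which is the more practical choice in real code.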

Spearman’s Rank Correlation: Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. Further, the two variables being considered may have a non-Gaussian distribution.

In this case, Spearman’s correlation coefficient (named for Charles Spearman) can be used to summarize the strength of the association between the two data samples. It can also be used when the relationship between the variables is linear, but it will have slightly less power (e.g. it may produce lower coefficient scores).

Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of the values in each sample. This is a common approach in non-parametric statistics, i.e. statistical methods where we do not assume a particular distribution of the data, such as Gaussian.

Spearman’s correlation coefficient = covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))
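The rank-based formula can be sketched the same way: convert each sample to ranks, then apply Pearson’s formula to the ranks. The helper names `average_ranks` and `spearman_corr` are hypothetical, and giving tied values the average of their ranks is an assumption on my part (it is the usual convention).

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    # Replace the ranks of each group of tied values with their mean
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_corr(x, y):
    """Spearman's coefficient = Pearson's coefficient computed on the ranks."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Illustrative data only: a monotonic but nonlinear relationship (y = x^3)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
print("Pearson: ", np.corrcoef(x, y)[0, 1])
print("Spearman:", spearman_corr(x, y))
```

Because y always increases with x here, the ranks line up perfectly and Spearman’s coefficient reaches its maximum, while Pearson’s coefficient is lower because the relationship is not a straight line.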

If you are unsure of the distribution of two variables or of the kind of relationship between them, the Spearman correlation coefficient is a good tool to use.
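As a rough sketch of how such a coefficient can serve as a filter, the snippet below scores each feature by its absolute Spearman correlation with the target and keeps the top k. Everything here is an assumption for illustration: the synthetic data, the function name `spearman_scores`, the choice of k = 2, and the shortcut of ranking by a double `argsort`, which is only valid when there are no tied values.

```python
import numpy as np

def spearman_scores(X, y):
    """Absolute Spearman correlation of each column of X with y (assumes no ties)."""
    # A double argsort turns values into 0-based ranks when all values are distinct
    y_rank = np.argsort(np.argsort(y))
    scores = []
    for j in range(X.shape[1]):
        x_rank = np.argsort(np.argsort(X[:, j]))
        scores.append(abs(np.corrcoef(x_rank, y_rank)[0, 1]))
    return np.array(scores)

# Synthetic data: only features 0 and 2 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

scores = spearman_scores(X, y)
k = 2  # number of features to keep (arbitrary for this sketch)
selected = np.argsort(scores)[::-1][:k]
print("scores:  ", scores.round(2))
print("selected:", sorted(selected.tolist()))
```

In this setup the two informative features should score well above the two noise features. Note that a filter like this looks at one feature at a time, so it cannot detect features that are only useful in combination.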
