Discriminant Analysis is one of many algorithms intended for dimensionality reduction. It projects all independent variables onto a new dimension. Below is the linear equation of the projection.
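In the usual notation (W for the weight vector, X for the original features, as described below), one standard way to write this projection is:

```latex
X_{\text{projection}} = W^{T} X
```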

X_projection is the new feature obtained by transforming X (the original features) through multiplication by W (the weights) that maximizes between-class scatter and minimizes within-class scatter. It resembles the linear equation in linear regression. In linear regression, W minimizes the cost function, the **sum of squared differences**. Then what is W in discriminant analysis trying to maximize?

It is called Fisher’s Criterion (J). It is defined as **the sum of between-class scatter** (SB) over **the sum of within-class scatter** (SW).
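In symbols, the criterion is simply the ratio of the two scatters:

```latex
J = \frac{S_B}{S_W}
```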

**Between-class** scatter is simply the squared difference between the means of each class (between class means and overall mean for multi-class).
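For the binary case this can be sketched as (writing μ₁ and μ₂ for the two class means):

```latex
S_B = (\mu_1 - \mu_2)^2
```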

And **within-class** scatter is the squared difference between each class sample and its class mean.
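In symbols, summing over every class c and every sample x in that class (with μ_c the mean of class c):

```latex
S_W = \sum_{c} \sum_{x \in c} (x - \mu_c)^2
```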

The J function for the binary case can be simplified to
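A standard form of this simplification (with s₁² and s₂² denoting the scatter of each class) is:

```latex
J = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}
```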

For the multi-class problem, instead of taking the squared difference between class means, we take the squared difference between each class mean and the overall mean.

And the J function for multi-class equals
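A common way to write this (with N_c the number of samples in class c and μ the overall mean; the weighting by N_c is an assumption of this sketch):

```latex
J = \frac{\sum_{c} N_c \,(\mu_c - \mu)^2}{\sum_{c} \sum_{x \in c} (x - \mu_c)^2}
```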

One of the reasons squaring is needed is that an unsigned number works better for describing how close something is. For example, consider two samples that, after removing the mean, have deviations of 3 and -4. The difference between the first sample and the mean is 3, and between the second sample and the same mean is -4. Which sample is closest to the mean? If we define “close” as the smallest difference, -4 is the smallest, so we would be saying that 4 steps backwards is closer than 3 steps forward. That is obviously incorrect: 3 steps forward is closer than 4 steps backwards. Therefore we square all of the differences to get rid of negative numbers and avoid this mistake.
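The 3-versus-minus-4 example above can be checked in a few lines of Python (the variable names are illustrative):

```python
# Deviations of two samples after removing the mean.
deviations = [3, -4]

# Ranking by the raw signed difference wrongly picks -4 as "closest".
closest_raw = min(deviations)                             # -4

# Squaring removes the sign, so distances compare correctly.
closest_squared = min(deviations, key=lambda d: d ** 2)   # 3
```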

To find the W that maximizes J (Fisher’s Criterion) we take the derivative of J. A derivative lets us find the maxima or minima of a function, because at a maximum or minimum the slope (the derivative) equals zero.

In the process of deriving the maximizer of J, both the binary and the multi-class versions of J yield the same solution, so maximizing either one suffices for the general case.

Here I am going to use the binary J to carry out the maximization.

**step 1:** let’s prepare the objective function.

The reason for this step is that to take the derivative of a function with respect to one of its variables, that variable needs to appear explicitly in the function.

Consider

Therefore the J function that we want to maximize equals

Thus

In matrix notation, the square of a scalar expression A can be written as AA’ (A times its transpose). Therefore

Then factor W and W’ out of each term

The middle part of the numerator and of the denominator is nothing but SB and SW, respectively.
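Putting the steps above together, the standard derivation runs as follows (a reconstruction in matrix notation, with μ₁ and μ₂ the class means):

```latex
J(W) = \frac{\left(W^{T}\mu_1 - W^{T}\mu_2\right)^{2}}
            {\sum_{c}\sum_{x \in c}\left(W^{T}x - W^{T}\mu_c\right)^{2}}
     = \frac{W^{T}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T}\,W}
            {W^{T}\left[\sum_{c}\sum_{x \in c}(x - \mu_c)(x - \mu_c)^{T}\right]W}
     = \frac{W^{T} S_B\, W}{W^{T} S_W\, W}
```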

**step 2:** maximize J with respect to W

If **f** and **g** are functions, then the derivative of **f/g** equals **(f’g − fg’)/g²**. Using this quotient rule of derivatives we have
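Applied to J = W’SBW / W’SWW, the quotient rule gives (a sketch of the standard result):

```latex
\frac{\partial J}{\partial W}
  = \frac{2\,S_B W \,(W^{T} S_W W) \;-\; 2\,S_W W \,(W^{T} S_B W)}{(W^{T} S_W W)^{2}}
```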

We can simplify W’W into W²

We want to find the maximum, where the slope of the tangent line equals zero, so we set the derivative equal to 0.

We can eliminate the denominator by multiplying both sides by it; since the other side is zero, it simply vanishes.

Simplify the function by moving the second term to the other side.

To simplify further, we divide both sides by the same denominator.

Thus we get
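In standard notation the resulting condition is (recognizing that W’SBW / W’SWW is J itself):

```latex
S_B\, W = J \, S_W\, W
```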

Now let’s solve for W

We can see that it somewhat resembles the eigendecomposition equation.

We can further see the resemblance below
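Side by side, the two equations line up as:

```latex
S_W^{-1} S_B\, W = J\, W
\qquad \text{vs.} \qquad
A\,v = \lambda\, v
```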

Therefore, solving for the eigenvalues and eigenvectors of the matrix SW⁻¹SB will maximize the J function.

Discriminant analysis projects the independent variables into a new dimension where the ratio of between-class scatter to within-class scatter is maximized. That means we get more distance between classes, which makes them easier to classify. In other words, besides dimensionality reduction, it can also be used as a classification algorithm.
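The whole procedure can be sketched in Python with NumPy (the toy data and variable names below are mine, not from any particular library):

```python
import numpy as np

# Two toy classes in 2-D, separated along the first axis.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([5, 0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

overall_mean = X.mean(axis=0)
Sb = np.zeros((2, 2))  # between-class scatter
Sw = np.zeros((2, 2))  # within-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    d = (mu_c - overall_mean).reshape(-1, 1)
    Sb += len(Xc) * d @ d.T                 # weighted by class size
    Sw += (Xc - mu_c).T @ (Xc - mu_c)

# Solving Sw^{-1} Sb W = J W: the eigenvector with the largest
# eigenvalue is the direction W that maximizes J.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
W = eigvecs[:, np.argmax(eigvals.real)].real

# Project the data onto the new dimension.
projection = X @ W
```

Along the projection, the class means end up far apart relative to the within-class spread, which is exactly the ratio J rewards.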
