Discriminant Analysis is one of many algorithms intended for dimensionality reduction. It projects the independent variables onto a new dimension. The projection is the linear equation

X_projection = X · W
X_projection is the new feature obtained by transforming X (the original features) through multiplication with W (the weights), where W is chosen to maximize between-class scatter and minimize within-class scatter. It resembles the linear equation in linear regression, where W minimizes the sum-of-squared-differences cost function. So what is W in discriminant analysis trying to maximize?
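The projection step can be sketched in a few lines of NumPy. The data and the weight vector below are made up purely to show the shapes involved; in practice W is learned from the data, as derived later in this article.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 original features
X = np.array([
    [2.0, 3.0, 1.0],
    [1.5, 2.5, 0.5],
    [3.0, 3.5, 1.5],
    [7.0, 8.0, 6.0],
    [6.5, 7.5, 5.5],
    [8.0, 9.0, 6.5],
])

# W maps the 3 original features down to 1 discriminant axis.
# This W is a made-up weight vector just to illustrate the shapes;
# discriminant analysis would learn it from the data.
W = np.array([[0.5], [0.3], [0.2]])

# The projection is a plain matrix multiplication: X_projection = X @ W
X_projection = X @ W

print(X_projection.shape)  # (6, 1): one new feature per sample
```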
It is called Fisher's criterion (J). It is defined as the ratio of the between-class scatter (SB) to the within-class scatter (SW): J = SB / SW.
Between-class scatter is simply the squared difference between the class means (or, in the multi-class case, between each class mean and the overall mean).
And within-class scatter is the squared difference between each class sample and its class mean.
The J function for the binary case can be written simply as

J = (m1 − m2)² / (s1² + s2²)

where m1 and m2 are the two class means and s1², s2² are the two within-class scatters.
For the multi-class problem, instead of taking the squared difference between the two class means, we take the squared difference between each class mean and the overall mean.
And the J function for the multi-class case equals

J = Σk (mk − m)² / Σk Σx∈k (x − mk)²

where mk is the mean of class k and m is the overall mean.
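A quick numerical sketch of the binary criterion, using two made-up 1-D classes:

```python
import numpy as np

# Two hypothetical 1-D classes, just to illustrate the binary criterion
class_a = np.array([1.0, 2.0, 3.0])
class_b = np.array([7.0, 8.0, 9.0])

m_a, m_b = class_a.mean(), class_b.mean()  # class means: 2.0 and 8.0

sb = (m_a - m_b) ** 2  # between-class scatter: squared mean difference
sw = ((class_a - m_a) ** 2).sum() + ((class_b - m_b) ** 2).sum()  # within-class scatter

J = sb / sw
print(J)  # 36 / 4 = 9.0
```

Well-separated, tight classes give a large J; overlapping, spread-out classes give a small one.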
One of the reasons squaring is needed is that an unsigned number describes closeness better. For example, consider two samples whose deviations from the mean are 3 and −4. Which sample is closer to the mean? If we define "close" as "smallest difference", −4 wins, which would mean 4 steps backwards is closer than 3 steps forward. That is obviously incorrect: 3 steps forward is closer than 4 steps backwards. Squaring the differences removes the negative signs and avoids this mistake.
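The forward/backward example above in a couple of lines of Python:

```python
# Two deviations from the mean: 3 steps forward and 4 steps backward
deviations = [3, -4]

# Comparing raw (signed) differences wrongly ranks -4 as "smaller" than 3
print(min(deviations))  # -4: misleading

# Squaring removes the sign, so distances compare correctly
squared = [d ** 2 for d in deviations]
closest = deviations[squared.index(min(squared))]
print(closest)  # 3: the sample 3 steps away really is closer
```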
To find the W that maximizes J (Fisher's criterion), we take the derivative of J. Derivatives let us locate the maxima or minima of a function, because at a maximum or minimum the derivative equals zero.
In the process of deriving the maximizer of J, the binary and multi-class forms of J yield the same solution, so maximizing either one is sufficient in general.
Here I am going to use the binary form of J.
step 1: prepare the objective function.
The reason for this step is that to take the derivative of a function with respect to one of its variables, that variable must appear explicitly in the function, so we rewrite J in terms of W.
Therefore the J function that we want to maximize, expressed in the projected space, equals

J(W) = (W'm1 − W'm2)² / Σk Σx∈k (W'x − W'mk)²
In matrix operations, a scalar A satisfies A² = AA'. Since W'm1 − W'm2 is a scalar, the numerator becomes

(W'(m1 − m2))² = W'(m1 − m2)(m1 − m2)'W

and likewise for each squared term in the denominator.
Then factor all the W and W' out of the sums:

J(W) = W'[(m1 − m2)(m1 − m2)']W / W'[Σk Σx∈k (x − mk)(x − mk)']W
The bracketed middle part in both the numerator and the denominator is nothing but SB and SW, so

J(W) = W'SBW / W'SWW
step 2: maximize J with respect to W
If f and g are functions of W, the derivative of f/g equals (f'g − g'f)/g². Applying this quotient rule of derivatives to J(W), we have

dJ/dW = [2SBW(W'SWW) − 2SWW(W'SBW)] / (W'SWW)²
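As a sanity check, the quotient-rule gradient can be compared against finite differences. The scatter matrices below are random symmetric positive-definite stand-ins, not computed from real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric positive-definite 2x2 "scatter" matrices
A = rng.standard_normal((2, 2))
SB = A @ A.T + np.eye(2)  # stands in for the between-class scatter
B = rng.standard_normal((2, 2))
SW = B @ B.T + np.eye(2)  # stands in for the within-class scatter

W = rng.standard_normal(2)

def J(w):
    # Fisher's criterion as a ratio of two quadratic forms
    return (w @ SB @ w) / (w @ SW @ w)

# Quotient rule: dJ/dW = [2*SB@W*(W'SW W) - 2*SW@W*(W'SB W)] / (W'SW W)^2
den = W @ SW @ W
analytic = (2 * SB @ W * den - 2 * SW @ W * (W @ SB @ W)) / den ** 2

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (J(W + eps * e) - J(W - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```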
Note that W'SBW and W'SWW are scalars, which lets us cancel and rearrange them like ordinary numbers in the steps below.
We want to find the maximum, where the slope of the tangent line equals zero, therefore we set the derivative equal to zero:

[2SBW(W'SWW) − 2SWW(W'SBW)] / (W'SWW)² = 0
We can eliminate the denominator by multiplying both sides by it; since the right-hand side is zero, it stays zero:

2SBW(W'SWW) − 2SWW(W'SBW) = 0
Simplify by dividing through by 2 and moving the second term to the other side:

SBW(W'SWW) = SWW(W'SBW)
To simplify further, we divide both sides by the scalar W'SWW:

SBW = SWW · (W'SBW) / (W'SWW)
Since (W'SBW) / (W'SWW) is exactly J, we thus get

SBW = J · SWW
Now let's solve for W by multiplying both sides by SW⁻¹:

SW⁻¹SBW = JW
We can see that this resembles the eigendecomposition equation Av = λv.
The resemblance is exact: A corresponds to SW⁻¹SB, the eigenvector v corresponds to W, and the eigenvalue λ corresponds to J.
Therefore, solving for the eigenvalues and eigenvectors of the matrix SW⁻¹SB maximizes the J function: the eigenvector with the largest eigenvalue is the W we want, and that eigenvalue is the maximized value of J itself.
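Putting it together, a minimal NumPy sketch: build SB and SW from two made-up 2-D classes, eigendecompose SW⁻¹SB, and check that the top eigenvector's J equals the top eigenvalue.

```python
import numpy as np

# Two hypothetical 2-D classes
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]])
X2 = np.array([[6.0, 6.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter: outer product of the mean difference
d = (m1 - m2).reshape(-1, 1)
SB = d @ d.T

# Within-class scatter: deviations from each class mean, summed over classes
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Eigendecomposition of SW^-1 SB; the top eigenvector maximizes J
vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
W = vecs[:, np.argmax(vals.real)].real

# J at the top eigenvector equals the top eigenvalue
J_opt = (W @ SB @ W) / (W @ SW @ W)
print(np.isclose(J_opt, vals.real.max()))  # True
```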
Discriminant analysis projects the independent variables onto a new dimension where the ratio of between-class scatter to within-class scatter is maximized. That means there is more distance between classes, which makes them easier to classify. In other words, besides dimensionality reduction, it can also be used as a classification algorithm.
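To illustrate the classification use, here is a toy nearest-projected-mean classifier along the learned discriminant axis. All data are made up for illustration:

```python
import numpy as np

# Two hypothetical 2-D classes
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 1
X2 = np.array([[6.0, 6.0], [7.0, 8.0], [8.0, 7.0]])   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
d = (m1 - m2).reshape(-1, 1)
SB = d @ d.T                                           # between-class scatter
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2) # within-class scatter

# Discriminant axis: top eigenvector of SW^-1 SB
vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
W = vecs[:, np.argmax(vals.real)].real

def classify(x):
    # Project the sample and both class means onto W, pick the nearer mean
    p, p1, p2 = x @ W, m1 @ W, m2 @ W
    return 1 if abs(p - p1) < abs(p - p2) else 2

print(classify(np.array([2.0, 2.0])))  # 1: lands near class 1's projection
print(classify(np.array([7.0, 7.0])))  # 2: lands near class 2's projection
```

The comparison of projected distances is unchanged by the eigenvector's arbitrary sign and scale, so no normalization is needed.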