This technique assumes that the column you are working on is normally distributed, or at least approximately so.
About 95% of the population lies within μ ± 2σ
About 99.7% of the population lies within μ ± 3σ
If a value lies outside the μ ± 3σ boundaries, you can treat it as an outlier.
First, you check whether the data is normally distributed; if it is, you compute the range μ ± 3σ. You consider all rows outside that range to be outliers.
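The range check described above can be sketched in a few lines of pandas. This is only an illustration: the `ages` series is made-up data (1,000 values drawn from a normal distribution plus two extreme values), and the variable names are assumptions, not part of the original write-up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 normally distributed ages plus two extreme values.
rng = np.random.default_rng(0)
ages = pd.Series(
    np.append(rng.normal(loc=40, scale=5, size=1000), [5.0, 95.0]),
    name="age",
)

mu = ages.mean()
sigma = ages.std()  # pandas uses the sample standard deviation (ddof=1) by default

# Boundaries of the mu ± 3*sigma range.
lower = mu - 3 * sigma
upper = mu + 3 * sigma

# Every value outside the range is treated as an outlier.
outliers = ages[(ages < lower) | (ages > upper)]
print(f"range: [{lower:.2f}, {upper:.2f}], outliers found: {len(outliers)}")
```

Whether the column is close enough to normal is a judgment call you still have to make first, for example by looking at a histogram of the column.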
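You might be wondering why this technique is called the z-score technique. The formula for calculating the z-score is
z = (xi − μ) / σ
where xi is a value in the column, μ is the column mean, and σ is its standard deviation. A value outside μ ± 3σ is exactly a value whose z-score is greater than 3 or less than −3.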
Suppose you have an age column. You calculate the z-score for each value xi in the age column; that is how you Z-transform the entire column.
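As a rough sketch of that Z-transform, assuming a made-up `age` column (200 normally distributed values plus one extreme age of 120), each value has the column mean subtracted and is then divided by the column standard deviation. The names `ages`, `z`, and `is_outlier` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical age column: 200 typical values plus one extreme value at the end.
rng = np.random.default_rng(1)
ages = pd.Series(
    np.append(rng.normal(loc=40, scale=5, size=200), [120.0]),
    name="age",
)

# Z-transform: subtract the mean, divide by the standard deviation.
z = (ages - ages.mean()) / ages.std()

# A z-score beyond ±3 is the same condition as lying outside mu ± 3*sigma.
is_outlier = z.abs() > 3
print(ages[is_outlier])
```

After the transform the column has mean 0 and standard deviation 1, so the single threshold of 3 works regardless of the original units.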
Once an outlier is detected, how do you treat it? There are two possibilities: trimming and capping.
Suppose 5 values do not lie within μ ± 3σ, i.e. 5 are outliers. In the case of trimming, you remove all five rows.
The problem with trimming is that sometimes too many rows are flagged as outliers, so a significant portion of your data gets removed. That is bad.
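A minimal sketch of trimming, assuming a hypothetical DataFrame `df` with an `age` column that contains 500 typical values and 5 extreme ones. It also prints how much of the data was dropped, which is the number to watch given the concern above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500 typical ages plus 5 extreme values.
rng = np.random.default_rng(2)
df = pd.DataFrame(
    {"age": np.append(rng.normal(loc=40, scale=5, size=500), [1, 2, 110, 115, 120])}
)

mu = df["age"].mean()
sigma = df["age"].std()
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Trimming: keep only the rows that fall inside mu ± 3*sigma.
trimmed = df[(df["age"] >= lower) & (df["age"] <= upper)]

removed = len(df) - len(trimmed)
print(f"removed {removed} of {len(df)} rows ({removed / len(df):.1%})")
```

If the removed share turns out to be large, that is a sign trimming may be throwing away too much, and capping may be the better option.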
In capping, depending on whether these 5 values are on the lower or upper side, you cap their values.
Suppose the upper boundary (μ + 3σ) is 80 and the lower boundary (μ − 3σ) is 60.
If three of your values are outliers (85, 0, and 90), then how will you transform/cap them?
You change 85 to 80, 0 to 60, and 90 to 80. That's it, i.e. you replace each outlier with the maximum or minimum allowed value.
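Capping with the boundaries from this example (60 on the lower side, 80 on the upper side) can be sketched with pandas `Series.clip`. The two in-range values, 72 and 65, are added here purely for illustration, to show that values inside the boundaries are left alone.

```python
import pandas as pd

# Boundaries from the example: lower = 60, upper = 80.
lower, upper = 60, 80

# The three outliers from the example (85, 0, 90) plus two in-range values.
values = pd.Series([85, 0, 90, 72, 65])

# Capping: values below the lower boundary become 60, values above the upper become 80.
capped = values.clip(lower=lower, upper=upper)
print(capped.tolist())  # [80, 60, 80, 72, 65]
```

Unlike trimming, capping keeps every row, so the size of your data does not change.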