Multicollinearity is a well-known challenge in multiple regression. The term refers to high correlation between two or more explanatory variables, i.e. predictors.
Two basic kinds of multicollinearity:
- Structural multicollinearity: This type occurs when we create a model term from other terms. In other words, it is a byproduct of the model we specify rather than something present in the data itself. For example, if you square X to model curvature, X and X² are clearly correlated.
- Data multicollinearity: This type is present in the data itself rather than being an artifact of our model. Observational studies are more likely to exhibit this kind of multicollinearity.
According to Graham’s study, multicollinearity in multiple regression leads to:
- inaccurate parameter estimates,
- decreased power, and
- exclusion of significant predictors.
When the X variables themselves are related to each other, this problem is called multicollinearity.
Consider the following regression:
Lawyer Salary = β0 + β1(Years of Experience) + β2(Age) + εi
- β1: the marginal effect on salary of one additional year of experience, holding the other variables constant.
- β2: the marginal effect on salary of one additional year of age, holding the other variables constant.
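A hypothetical simulation of this regression (all variable names and numbers below are made up for illustration) shows why the two coefficients are hard to separate: age is nearly a linear function of experience, so the predictors are highly correlated.

```python
# Illustrative sketch: fit the salary regression by ordinary least squares
# on synthetic data where age ≈ experience + 22, so the two predictors
# are almost collinear.
import numpy as np

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)
age = experience + 22 + rng.normal(0, 1, n)      # nearly collinear with experience
salary = 50_000 + 2_000 * experience + 100 * age + rng.normal(0, 5_000, n)

# Design matrix with an intercept column, then OLS via least squares.
X = np.column_stack([np.ones(n), experience, age])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)

print(f"corr(experience, age) = {np.corrcoef(experience, age)[0, 1]:.3f}")
print(f"estimated beta1 (experience) = {beta[1]:.1f}")
print(f"estimated beta2 (age)        = {beta[2]:.1f}")
```

Because experience and age move together, the individual estimates of β1 and β2 can swing widely from sample to sample even when their sum is estimated well.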
Why do we care about multicollinearity?
Example: Lawyer Salary = β0 + β1(Years of Experience) + β2(Age) + εi
Check the correlations between all pairs of X variables.
How much correlation is too much?
A common rule of thumb: if a pairwise correlation exceeds 0.9, we consider it a problem.
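This rule-of-thumb check can be sketched as a small helper that flags every predictor pair whose absolute correlation exceeds the threshold (the function name and variable names here are illustrative, not from the source):

```python
import numpy as np

def flag_high_correlations(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for every predictor pair with |r| > threshold."""
    corr = np.corrcoef(X, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], corr[i, j]))
    return flagged

# Illustrative data: "age" is nearly a linear function of "experience".
rng = np.random.default_rng(1)
experience = rng.normal(size=100)
age = experience + rng.normal(scale=0.1, size=100)
other = rng.normal(size=100)
X = np.column_stack([experience, age, other])
flagged = flag_high_correlations(X, ["experience", "age", "other"])
print(flagged)
```

Only the experience/age pair is flagged; the unrelated third predictor passes the check.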
Variance Inflation Factors
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + εi
Create an auxiliary regression for each X variable, regressing it on all the other predictors. For example, for X1:
X1 = α0 + α2X2 + α3X3 + α4X4 + u
Find the R² of each auxiliary regression, then compute the VIF from that R²:
VIF_j = 1 / (1 - R²_j)
How high is too high a VIF?
Again, a rule of thumb: VIFs above 10 are problematic.
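The auxiliary-regression procedure above can be sketched directly: for each predictor, regress it on the others, take the R², and compute 1 / (1 - R²). This is a minimal numpy implementation (function name is illustrative):

```python
import numpy as np

def vif(X):
    """VIF for each column of X (rows = observations, columns = predictors)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]                                  # predictor being checked
        others = np.delete(X, j, axis=1)             # all remaining predictors
        A = np.column_stack([np.ones(n), others])    # intercept + others
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        # R² of the auxiliary regression, then VIF = 1 / (1 - R²).
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two collinear predictors get VIFs well above the rule-of-thumb cutoff of 10, while the independent one stays near 1.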
Option 1: Do nothing. This is reasonable if:
i) the model is used for prediction only;
ii) the correlated variables are not of particular interest to the study question;
iii) the correlation is not extreme.
Option 2: Remove one of the correlated variables. This makes sense if the variables provide essentially the same information.
NOTE: Beware of omitted variable bias! Dropping a variable that belongs in the model can bias the remaining estimates, though when the variables are highly redundant the risk is usually small.
Option 3: Combine the correlated variables.
Ex: include a “seniority” score combining ‘experience’ and ‘age’.
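One simple way to build such a combined score (this particular construction, averaging z-scores with equal weights, is an assumption for illustration, not prescribed by the source):

```python
import numpy as np

def seniority_score(experience, age):
    """Hypothetical 'seniority' score: average of the two variables' z-scores."""
    z_exp = (experience - experience.mean()) / experience.std()
    z_age = (age - age.mean()) / age.std()
    return (z_exp + z_age) / 2

rng = np.random.default_rng(3)
experience = rng.uniform(0, 30, 100)
age = experience + 22 + rng.normal(0, 2, 100)
seniority = seniority_score(experience, age)
```

The single score then replaces both correlated predictors in the regression.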
Option 4: Use partial least squares or principal component analysis.
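As a minimal sketch of the principal-component route, assuming numpy only: standardize the correlated columns and replace them with their first principal component, computed via the SVD (the function name is illustrative).

```python
import numpy as np

def first_principal_component(X):
    """Scores of the first PC of the standardized columns of X,
    plus the share of total variance it explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[0]                        # projection onto the leading direction
    share = s[0] ** 2 / (s ** 2).sum()        # fraction of variance captured by PC1
    return scores, share

rng = np.random.default_rng(4)
experience = rng.uniform(0, 30, 100)
age = experience + 22 + rng.normal(0, 2, 100)
pc1, share = first_principal_component(np.column_stack([experience, age]))
print(f"PC1 explains {share:.1%} of the variance")
```

When two predictors are highly correlated, the first component captures nearly all of their joint variance, so using pc1 in place of both removes the collinearity while keeping most of the information.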