Load the dataset, do some data cleaning stuff, build the model, run the results BAM BAM BAM!!!
Nope. Not that easy.
Well, at least it shouldn’t be that simple. All of the steps mentioned above are indeed obligatory, yes. But if you go into this machine learning thing, it demands some extra work before you build your model. Not that complicated, but certainly mandatory. If you skip that part, at the end of the day you will still have a model that seemingly works…
…but that could not be further from the truth.
Assumptions, my dear friend, assumptions.
Morpheus lingers around the room and looks into your eyes:
Assumptions are everywhere. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth.
The assumptions that we’re gonna talk about today are not that complicated, and no, we will not talk about the fact that the world is actually a digital image constructed upon our assumptions about it.
The assumptions that we’re gonna talk about today are statistical assumptions. Before building your model, there are certain a priori thoughts that must be validated. Without testing them, your model is statistically garb-, I mean, your model might be inaccurate, so to speak.
Linear Model Assumptions
Let’s look into the assumptions regarding linear models. There are four assumptions that must be met, which are:
- Linearity (Obvious)
- Normality (Obvious as well)
- Homoscedasticity, i.e. no heteroscedasticity (Man what the f-)
- Independence (Your errors must be independent of each other, and if you have several predictor variables, they must not have collinearity issues.)
That’s right, you must check these one by one before building your model. One by one. Each of them. Yes.
Luckily, you and I are blessed with an R package that can check whether a model satisfies the above assumptions. How beautiful, isn’t it?
The package that I’m referring to is gvlma.
You can find it on CRAN. It was developed by Edsel A. Pena and Elizabeth H. Slate and is currently maintained by Elizabeth H. Slate.
One simple function and it’s done.
How to Install
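The original install snippet did not survive here, so this is a minimal sketch, assuming the usual CRAN route. The `requireNamespace()` guard is my addition, it just skips the download if the package is already there.

```r
# Install gvlma from CRAN only if it is not already available.
# (The guard is an assumption, not part of the original post.)
if (!requireNamespace("gvlma", quietly = TRUE)) {
  install.packages("gvlma")
}
```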
How to Deploy
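Again a sketch rather than the original snippet: presumably "deploying" here just means attaching the package to your session, which would look like this.

```r
# Attach the package so that gvlma() is available in the session.
library(gvlma)
```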
Build Your Model
We will use the built-in Orange dataset to predict circumference from age.
View(Orange)
m <- lm(circumference ~ age, data = Orange)
Validate the Assumptions
Use the gvlma() function to run the validation.
validation_m <- gvlma(m)
summary(validation_m)
Explore the Results
As you can see, we have a green light. All assumptions are accepted.
Check the Model
Since our assumptions are satisfied and suitable for a linear model, it’s time to look into model results.
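The model output is missing at this point, so here is a minimal sketch of how you could pull it up yourself, assuming the same `m` as above. It only uses base R, and I'm not reproducing any numbers here, run it and see.

```r
# Refit the model from the article so this block runs on its own.
m <- lm(circumference ~ age, data = Orange)

# Coefficients, standard errors, p-values and R-squared in one table.
summary(m)
```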
Plot the Validation Summary
As we have a linear regression model with quite a high R-squared, let’s honor it with the gvlma package by plotting the validation_m object, so that we can further investigate the assumption check.
To draw the plot we’ll use a gvlma function:
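The plotting call itself got lost, so take this as a sketch under the assumption that it was the package's plot method for gvlma objects, called through the generic `plot()`. The model and validation object are rebuilt so the block runs on its own.

```r
library(gvlma)

# Rebuild the model and the validation object from the earlier steps.
m <- lm(circumference ~ age, data = Orange)
validation_m <- gvlma(m)

# Dispatches to the gvlma plot method and draws the diagnostic plots.
plot(validation_m)
```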
Yeah, that was all. That’s what we usually get when R’s simplicity meets talented statisticians. Special thanks to the authors of this package.
P.S: You can dive deep into collinearity validation by checking VIF scores.
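If you want to see what a VIF score actually is, here is a rough base R sketch. Our Orange model has a single predictor, so there is nothing to be collinear with. That's why this example switches to the built-in mtcars dataset with three predictors, which is purely my assumption for illustration, and `manual_vif()` is a hypothetical helper name. Packages such as car ship a ready-made `vif()` function if you'd rather not roll your own.

```r
# Hypothetical example: mtcars and these predictors are not from the article.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
# Needs a model with at least two predictors.
manual_vif <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_scores <- manual_vif(fit)
print(vif_scores)
```

A score close to 1 means the predictor shares little with the others, and the bigger it gets, the more that predictor is explained by the rest.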