An easy-to-read mathematical explanation of Simple and Multiple Linear Regression Formulas
Most people know about Linear Regression and its applications, but only a few have gone deeper and asked themselves where the equations and mathematical formulas they are using come from (yes, I am one of those few).
Since this does not seem to be a very popular topic on the internet, and many articles and videos don't take the time to explain these concepts, I have set out to do it in this post.
But first, for those who have just landed here and have no idea about Linear Regression, let's do a brief introduction (if you already know Linear Regression and just want to know where the formulas come from, feel free to skip this section and go straight to the second and third ones).
Note. Throughout this article I am using the book An Introduction to Statistical Learning as a reference. If you are more interested in Linear Regression or Machine Learning, I encourage you to have a look at it; it is really worth it.
All the plots and formulas have been generated by me.
Linear regression is a Machine Learning algorithm whose purpose is to fit a set of data points using a linear model (a straight line in 2D) so that afterward we can make predictions or make inferences about the data.
In a Linear Regression problem, we have a set of predictor variables X₁, X₂, …, Xp and a unique response variable Y, and the aim is to explain the response variable with the predictors using a linear model.
The difference between Simple and Multiple Linear regression is the number of predictors:
– 1 predictor (X): Simple Linear Regression
– 2 or more predictors (X₁, X₂, …, Xp): Multiple Linear regression
To understand this better, let’s introduce a simple example where we just have 1 predictor variable (Simple Linear Regression).
Imagine we have gathered some data about the performance of 100 data science students in a statistics exam. We are studying the grade obtained by each student against the number of hours they spent studying, and we would like to draw the straight line that best fits the data, so that we can determine whether there is a relationship between the grade and the hours of study.
In this scenario, the predictor X is the hours of study and the response variable Y is the grade obtained by the student.
Since we only have 1 predictor, our linear model has the following shape:
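In the standard notation (with ε denoting the error term, β0 the intercept, and β1 the slope), the model is:

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
```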
Now we can carry out a Simple Linear Regression analysis and obtain the following line:
Finally, the results obtained for β0 and β1 are:
– β0 = 1.95361788
– β1 = 0.29338499
This means that if we don’t study anything, i.e. we study 0h (X = 0), our average grade will be 1.95361788.
And that for every hour of study our grade will increase an average of 0.29338499 points.
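As a sketch of how such estimates can be reproduced numerically (the exam data below is simulated around the reported coefficients, since the original dataset is not provided):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical exam data: hours of study (X) and grade (Y) for 100 students,
# simulated around the coefficients reported above.
hours = rng.uniform(0, 25, size=100)
grades = 1.95 + 0.29 * hours + rng.normal(0, 0.4, size=100)

# Least-squares fit of a straight line: grades ~ beta0 + beta1 * hours.
# np.polyfit returns the coefficients from highest degree down: [beta1, beta0].
beta1, beta0 = np.polyfit(hours, grades, deg=1)

print(f"beta0 (intercept): {beta0:.4f}")  # average grade with 0 hours of study
print(f"beta1 (slope): {beta1:.4f}")      # average grade gain per extra hour
```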
At this point, we are in a position to draw some additional inferences from the data (see the next section), but I will limit myself to listing them, since they require other concepts and statistical tools of Linear Regression that are beyond the scope of this article.
(It gets on my nerves when people say that, because it feels like they are hiding information from you; but otherwise this article would be as long as a book, and we don't want that. If you are more interested in Linear Regression, I encourage you to read An Introduction to Statistical Learning, which I am using as the reference book for this article, or you can just follow me for more statistics and data science content.)
Why and when should we use Linear Regression?
Linear regression is not used for prediction very often, but rather for inference (obtaining useful information and conclusions about the data), since it offers a rather inflexible fit.
Note that we are not forcing the line to pass through the points (what in mathematics is called interpolation), since that wouldn't be possible with a single straight line; instead, we are looking for the line that passes closest to them.
However, Linear Regression can also be very useful when analyzing data. Among the inferences we can draw using Linear Regression are the following:
- Determine whether there is a relationship between one variable (or a group of variables) and the response, and measure how strong that relationship is.
- Compute the effect of each predictor variable on the response and find out which predictor contributes the most.
- Determine whether the relationship is linear and how accurately we can make future predictions of the response using a linear model.
- Find out whether there is synergy (interaction) between the predictors and how we can improve our linear model to make better predictions.
- Compute the trend of the data.
After a not-so-brief introduction to Linear Regression (I apologize), it is time to get to the true theme and purpose of this article: THE FORMULAS (I promise to be brief and stick to the point).
Simple Linear Regression is used when we have only 1 predictor variable X that we want to use to explain the response variable Y.
But before we start trying out values for β0 and β1, we need a criterion to decide which line is better than the others.
So, how do we choose the best-fitting line? RSS (Residual Sum of Squares)
In order to pick the best-fitting line, we need to establish a fitting measurement that tells us how well or how poorly a line fits the data (measurements of a model's performance are called loss functions).
In our linear regression problem, a good fitting measurement is to take the difference between the value Ŷ predicted by the fitted line and the true value Y from our data, square the result so that we only get positive values, and sum over all the data points.
This measurement is called Residual Sum of Squares (RSS) and mathematically it is expressed by the formula:
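With ŷᵢ = β₀ + β₁xᵢ denoting the value the line predicts for the i-th data point, the standard form of this formula is:

```latex
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
```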
We have just defined how we are going to measure which line fits best (great!), but how do we get to the easy and fast computational formulas?
Objective: Minimize the loss function RSS with respect to the parameters β0 and β1 (i.e. find the β0 and β1 that minimize RSS)
1. Developing the expression:
2. Now, if we take the partial derivatives with respect to the parameters and set them equal to zero:
Thus obtaining the following system:
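Written out explicitly, the standard derivation gives:

```latex
% Partial derivatives of RSS set to zero
\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad
\frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0
% Dividing by -2 and rearranging yields the normal equations
n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad
\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
```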
Note. How do we know this is a minimum and not a maximum?
Intuitively, we know that we are trying to minimize a loss function that describes how good our model is. Therefore, there is no upper limit for the loss (our model can always be worse and lead to higher losses), but there is a lower limit where the error is as close as possible to 0 and cannot go lower (the model has its limitations, and it is not possible for the loss to be exactly 0).
Mathematically, the point (β0, β1) is what multivariable calculus calls a stationary point, and we can classify it by computing the second partial derivatives (the second partial derivative test).
Finally, we also know this is a global minimum and not just a local one: the RSS is a convex (quadratic) function of (β0, β1), and the system of first derivatives has a unique solution, so the single stationary point must be the global minimum.
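For completeness, the second partial derivative test here amounts to checking the Hessian of the RSS:

```latex
H = 2 \begin{pmatrix} n & \sum_{i} x_i \\ \sum_{i} x_i & \sum_{i} x_i^2 \end{pmatrix}
```

By the Cauchy–Schwarz inequality, n∑xᵢ² ≥ (∑xᵢ)², so H is positive definite whenever the xᵢ are not all equal, confirming that the stationary point is a minimum.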
3. To solve the system we can write the expression in matrix form:
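In the notation above, that matrix form is:

```latex
\begin{pmatrix} n & \sum_{i} x_i \\ \sum_{i} x_i & \sum_{i} x_i^2 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
=
\begin{pmatrix} \sum_{i} y_i \\ \sum_{i} x_i y_i \end{pmatrix}
```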
And therefore we obtain the final equation for Simple Linear Regression
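Solving that 2×2 system yields the well-known closed-form estimates (with x̄ and ȳ denoting the sample means of X and Y):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```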
Multiple Linear Regression is an extension of the Simple model for more than 1 predictor. In this case, we have a set of predictor variables X₁, X₂, …, Xp that we want to use to explain the response variable Y.
The procedure is the same as in the Simple model. To simplify the mathematical notation, I will explain the formula for 2 predictors, X₁ and X₂, but the procedure is identical for more predictors (3, 4, …).
Objective: Same as in the Simple model: minimize the loss function RSS with respect to the parameters β0, β1, and β2 (i.e. find the β0, β1, and β2 that minimize RSS)
1. Developing the expression:
2. Take the partial derivatives with respect to the parameters (all three of them! β0, β1, and β2) and set them equal to zero:
Order the system:
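For 2 predictors, with xᵢ₁ and xᵢ₂ denoting the i-th observations of X₁ and X₂, the ordered system (the normal equations) is:

```latex
n \beta_0 + \beta_1 \sum_i x_{i1} + \beta_2 \sum_i x_{i2} = \sum_i y_i \\
\beta_0 \sum_i x_{i1} + \beta_1 \sum_i x_{i1}^2 + \beta_2 \sum_i x_{i1} x_{i2} = \sum_i x_{i1} y_i \\
\beta_0 \sum_i x_{i2} + \beta_1 \sum_i x_{i1} x_{i2} + \beta_2 \sum_i x_{i2}^2 = \sum_i x_{i2} y_i
```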
3. And write the expression in matrix form:
And therefore we obtain the final equation for Multiple Linear Regression:
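With X the n×3 design matrix whose columns are a column of ones, X₁, and X₂, and y the response vector, the standard solution of the normal equations is:

```latex
\hat{\beta} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y
```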
Finally, we can generalize the above system to p predictors:
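As a numerical sanity check of the matrix formula, here is a minimal sketch with 2 predictors (the data and the true coefficients below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 200 observations of 2 predictors and a response
# generated as y = 1.0 + 2.0*x1 - 0.5*x2 + noise (coefficients are made up).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, size=n)

# Design matrix with a leading column of ones for the intercept beta0.
X = np.column_stack([np.ones(n), x1, x2])

# Normal equations: (X^T X) beta = X^T y. Solving the linear system is
# numerically preferable to explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # approximately [1.0, 2.0, -0.5]
```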