## All you need to know about Linear Regression is here (including an application in Python)

If you’re approaching Machine Learning, one of the first models you may encounter is Linear Regression. It’s probably the easiest model to understand, but don’t underestimate it: there are a lot of things to understand and master.

If you’re a beginner in Data Science or an aspiring Data Scientist, you’re probably facing some difficulties because there are a lot of resources out there, but they are fragmented. I know how you’re feeling, and this is why I created this complete guide: I want to give you all the knowledge you need without having to search for anything else.

So, if you want to have complete knowledge of Linear Regression, this article is for you. You can study it deeply and re-read it whenever you need it the most. Also, consider that, to cover this topic, we’ll need some knowledge generally associated with regression analysis: we’ll cover it in depth.

And… you’ll excuse me if I link a resource you’ll need: in the past, I wrote an article on some topics related to Linear Regression so, to have a complete overview, I advise you to read it (I’ll link it later, when we need it).

**Table of Contents:**

- What do we mean by "regression analysis"?
- Understanding correlation
- The difference between correlation and regression
- The Linear Regression model
- Assumptions for the Linear Regression model
- Finding the line that best fits the data
- Graphical methods to validate your model
- An example in Python

Here we’re studying Linear Regression, but what do we mean by “regression analysis”? Paraphrasing from Wikipedia:

Regression analysis is a mathematical technique used to find a functional relationship between a dependent variable and one or more independent variable(s).

In other words, we know that in mathematics we can define a function like so: `y=f(x)`. Generally, `y` is called the dependent variable and `x` the independent one. So, we express `y` in relation to `x`, using a certain function `f`. The aim of regression analysis is, then, to find the function `f`.

Now, this seems easy, but it is not. And I know you know it. Here is why it is not easy:

- We know `x` and `y`. For example, if we are working with tabular data (with `Pandas`, for example), `x` are the features and `y` is the label.
- Unfortunately, the data rarely follow a very clear path. So our job is to find the best function `f` that **approximates** the relationship between `x` and `y`.

So, let me summarize it: regression analysis aims to find an estimated relationship (a good one!) between the dependent and the independent variable(s).

Now, let’s visualize why this process may be difficult. Consider the following code and its outcome:

```python
import numpy as np
import matplotlib.pyplot as plt

# Create random linear data
a = 130
x = 6*np.random.rand(a,1)-3
y = 0.5*x+5+np.random.rand(a,1)

# Labels
plt.xlabel('x')
plt.ylabel('y')

# Plot a scatterplot
plt.scatter(x,y)
```

Now, tell me: can the relationship between `x` and `y` be a line? So… can this data be approximated by a line? Like the following, for example:

Stop reading for a moment and think about that.

Well, it could. And how about the following one?

Well, even this could! So, what’s the best one? And why not another one?

This is the aim of regression: to find the best-estimated function that can approximate the given data. And it does so using some methodologies: we’ll cover them later in this article. We’ll apply them to the Linear Regression model but some of them can be used with any other regression technique. Don’t worry: I’ll be very specific so you don’t get confused.

Quoting from Wikipedia:

In statistics, correlation is any statistical relationship, whether causal or not, between two random variables. Although in the broadest sense, “correlation” may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related.

In other words, **correlation** is a statistical measure that expresses the **linear relationship between variables**.

We can say that two variables are correlated if each value of the first variable corresponds to a value for the second variable, following a path. If two variables are highly correlated, the path would be linear, because the correlation describes the linear relation between the variables.

## The math behind the correlation

This is a comprehensive guide, as promised. So, I want to cover the math behind the correlation, but don’t worry: we’ll make it easy so that you can understand it even if you’re not specialized in math.

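We generally refer to the correlation coefficient, also known as the **Pearson correlation coefficient**. This gives an estimate of the correlation between two variables. Suppose we have two variables, `a` and `b`, and each of them takes `n` values. We can calculate the correlation coefficient as follows:

$$r = \frac{\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n}(a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n}(b_i - \bar{b})^2}}$$

Where we have:

- the mean value of `a` (but it applies to both variables, `a` and `b`):

$$\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$$

The coefficient always lies between -1 and 1.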

If we have a 0 correlation coefficient, it means that the data points do not tend to increase or decrease following a linear path, because we have no correlation.

Let us have a look at some plots of correlation coefficients with different values (image from Wikipedia here):

As we can see, when the correlation coefficient is equal to 1 or -1 the tendency of the data points is clearly to be along a line. But, as the correlation coefficient deviates from the two extreme values, the distribution of the data points deviates from a linear path. Finally, for the correlation coefficient of 0, the distribution of the data can be anything.

So, when we get a correlation coefficient of 0 we can’t say anything about the distribution of the data, but we can investigate it (if needed) with a regression analysis.
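To make the calculation concrete, here is a minimal sketch of the Pearson coefficient written with NumPy. The function name `pearson_r` and the sample values are made up for illustration; `np.corrcoef` is NumPy's own routine and is used only as a cross-check.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally sized 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Deviations of each value from the mean of its own variable
    da = a - a.mean()
    db = b - b.mean()
    # Sum of joint deviations divided by the product of the two spreads
    return np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2))

# Hypothetical sample values, chosen only for illustration
a = [1, 2, 3, 4, 5]
b_linear = [2, 4, 6, 8, 10]    # perfectly linear in a
b_parabola = [4, 1, 0, 1, 4]   # symmetric parabola centered on a = 3

r_linear = pearson_r(a, b_linear)
r_parabola = pearson_r(a, b_parabola)
r_numpy = np.corrcoef(a, b_linear)[0, 1]   # NumPy's built-in, for comparison

print(r_linear, r_parabola, r_numpy)
```

The parabola is a useful reminder of the point made above: its coefficient is 0 even though `b_parabola` depends entirely on `a`, just not linearly.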

So, correlation and regression are linked but are different:

- Correlation analyzes the tendency of variables to be linearly distributed.
- Regression is the study of the relationship between variables.

We have two kinds of Linear Regression models: the Simple and the Multiple ones. Let’s see them both.

## The Simple Linear Regression model

The goal of the Simple Linear Regression is to model the relationship between a single feature and a continuous label. This is the mathematical equation that describes this ML model:

`y = wx + b`

The parameter `b` (also called “bias”) represents the y-axis intercept (it is the value of `y` when `x=0`), and `w` is the weight coefficient. Our goal is to learn the weight `w` and the bias `b` that describe the relationship between `x` and `y`. These parameters will later be used to predict the response for new values of `x`.

Let’s consider a practical example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Show scatterplot
plt.scatter(x, y)
```

The question is: can this data distribution be approximated with a line? Well, we could create something like that:

```python
import numpy as np
import matplotlib.pyplot as plt

# Create data
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Create basic scatterplot
plt.plot(x, y, 'o')

# Obtain m (slope) and b (intercept) of a line
m, b = np.polyfit(x, y, 1)

# Add linear regression line to scatterplot
plt.plot(x, m*x+b)

# Labels
plt.xlabel('x variable')
plt.ylabel('y variable')
```

Well, as in the example we’ve seen above, it could be a line but it could be a general curve.

And, in a moment we’ll see how we can say if the data distribution can be better described by a line or by a general curve.
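As a side note on where the slope and intercept of that line come from: `np.polyfit` performs a least-squares fit, and for a single feature the least-squares values of `w` and `b` can also be written in closed form. The sketch below assumes ordinary least squares as the fitting criterion and reuses the same `x` and `y` as above; `y_new` is just an illustrative prediction for `x = 10`.

```python
import numpy as np

# Same data as in the example above
x = np.array([1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 8, 9])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Closed-form least-squares estimates for y = w*x + b
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()

# np.polyfit with degree 1 should give the same slope and intercept
m_ref, b_ref = np.polyfit(x, y, 1)

print(f"closed form: w = {w:.3f}, b = {b:.3f}")
print(f"np.polyfit:  m = {m_ref:.3f}, b = {b_ref:.3f}")

# Use the learned parameters to predict the response for a new value of x
y_new = w * 10 + b
```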

## The Multiple Linear Regression model

Since reality is complex, the typical cases we’ll face are related to the Multiple Linear Regression case. We mean that the feature `x` is not a single one: we’ll have multiple features. For example, if we work with tabular data, a data frame with 9 columns has 8 features and 1 label: this means that our problem is eight-dimensional.

As we can understand, this case is very complicated to visualize and the equation of the line has to be expressed with vectors and matrices, becoming:

$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^{T}\mathbf{x} + b$$

So, the equation of the line becomes the sum of all the weights (`w`) multiplied by the corresponding independent variables (`x`), plus the bias `b`, and it can even be written as the product of two matrices.
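Here is a small sketch of that idea with NumPy: the same predictions are computed once as an explicit sum of weights times features and once as a matrix product. The feature matrix `X`, the weights `w`, and the bias `b` are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical data: 4 observations with 3 features each
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 1.5],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
w = np.array([0.5, -1.0, 2.0])   # one weight per feature (made-up values)
b = 3.0                          # bias / intercept (made-up value)

# Explicit sum of weight * feature for each observation, plus the bias
y_sum = np.array([sum(w_j * x_j for w_j, x_j in zip(w, row)) + b for row in X])

# The same predictions written as a single matrix-vector product
y_mat = X @ w + b

print(y_sum)
print(y_mat)
```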

Now, to apply the Linear Regression model, our data should respect some assumptions. These are:

1. **Linearity**: the relationship between the dependent variable and independent variables should be linear. This means that a change in the independent variable should result in a proportional change in the dependent variable, following a linear path.
2. **Independence**: the observations in the dataset should be independent of each other. This means that the value of one observation should not depend on the value of another observation.
3. **Homoscedasticity**: the variance of the residuals should be constant across all levels of the independent variable. In other words, the spread of the residuals should be roughly the same across all levels of the independent variable.
4. **Normality**: the residuals should be normally distributed. In other words, the distribution of the residuals should be a normal (or bell-shaped) curve.
5. **No multicollinearity**: the independent variables should not be highly correlated with each other. If two or more independent variables are highly correlated, it can be difficult to distinguish the individual effects of each variable on the dependent variable.

Unfortunately, testing all these hypotheses is not always possible, especially in the case of the Multiple Linear Regression model. Anyway, there is a statistical way to test them: it’s based on the `p-value`, and maybe you’ve heard of it before. We won’t cover this test here for two reasons:

- It’s a general test, not specifically related to the Linear Regression model. So, it needs a specific treatment in a dedicated article.
- I’m one of those (maybe one of the few) who believe that calculating the `p-value` is not always a must when we need to analyze data. For this reason, I’ll create in the future a dedicated article on this controversial topic. But just for the sake of curiosity: since I’m an engineer, I have a very practical approach, and I like applied mathematics. I wrote an article on this topic here:

So, above we were reasoning which one of the following can be the best fit:

To understand if the best model is the left one (the line) or the right one (a general curve) we proceed as follows:

- We split the data we have into the training and the test set.
- We validate both models on both sets, testing how well our models generalize their learning.
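The two steps above can be sketched with NumPy alone. This is only an illustration under a few assumptions: the data are synthetic (a line plus uniform noise, as in the first example), the split keeps 20% of the points for testing, the candidate models are polynomials of degree 1 and 2 fitted with `np.polyfit`, and the mean squared error is used as the score. The helper `mse` and the dictionary `results` are names invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible

# Synthetic data: a line plus uniform noise
x = 6 * rng.random(130) - 3
y = 0.5 * x + 5 + rng.random(130)

# Shuffle the indices and keep 20% of the points as the test set
idx = rng.permutation(x.size)
test_idx, train_idx = idx[:26], idx[26:]

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

results = {}
for degree in (1, 2):
    # Fit on the training set only
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    # Evaluate the same fitted model on both sets
    results[degree] = {
        "train": mse(y[train_idx], np.polyval(coeffs, x[train_idx])),
        "test": mse(y[test_idx], np.polyval(coeffs, x[test_idx])),
    }
    print(f"degree {degree}: {results[degree]}")
```

If the two models score about the same on the test set, the simpler one (the line) is usually the more sensible choice.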

We won’t cover the polynomial model here (useful for general curves), but consider that there are two approaches to validate ML models:

- The analytical one.
- The graphical one.

Generally speaking, we’ll use both to get a better understanding of the performance of the model. Anyway, **generalizing** means that our ML model learns from the training set and **correctly applies its learning to the test set**. If it doesn’t, we try another ML model. Here’s the process:

This means that **an ML model generalizes well when it performs well on both the training and the test set**.

I’ve discussed the analytical way to validate an ML model in the case of linear regression in the following article:

I advise you to read it because we’ll use some metrics discussed there in the example at the end of this article.

Of course, the metrics discussed can be applied to any ML model in the case of a regression problem. But you’re lucky: I’ve used the linear model as an example.

The graphical ways to validate an ML model in the case of a regression problem are discussed in the next paragraph.

Let’s see three graphical ways to validate our ML models.

## 1. The residual analysis plot

This method is specific to the Linear Regression model and consists in visualizing how the residuals are distributed. Here’s what we expect:

To plot this we can use the built-in function `sns.residplot()` in `Seaborn` (here’s the documentation).

A plot like that is good because we want to see the data points randomly distributed around the horizontal axis. One of the **assumptions of the linear regression model**, in fact, is that the **residuals must be normally distributed** (assumption n°4 listed above). If the residuals are normally distributed, it means that the errors between the observed values and the predicted ones are randomly distributed around zero, with no clear pattern or trend; and this is exactly the case in our plot. So, in these cases, our ML model may be a good one.

Instead, if there is a particular pattern in our residual plot, our model is not good for our ML problem. For example, consider the following:

In this case, we can see that there is a parabolic trend: this means that our model (the Linear model) is not good to solve our ML problem.
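To see both situations side by side, here is a small sketch that fits a line to two synthetic datasets (one truly linear, one parabolic) and plots the residuals with plain Matplotlib. The data, the noise level, and the variable names are assumptions made for this illustration; `sns.residplot()` should give a similar picture with less code.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # fixed seed for reproducibility
x = np.linspace(-3, 3, 100)

# Two hypothetical datasets: one linear, one parabolic, both with Gaussian noise
datasets = {
    "linear data": 0.5 * x + 5 + rng.normal(0, 0.3, x.size),
    "parabolic data": 0.5 * x**2 + 5 + rng.normal(0, 0.3, x.size),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
residuals = {}
for ax, (name, y) in zip(axes, datasets.items()):
    w, b = np.polyfit(x, y, 1)           # fit a line to the data
    residuals[name] = y - (w * x + b)    # residual = actual - predicted
    ax.scatter(x, residuals[name], s=10)
    ax.axhline(0, color='r')             # reference line at zero
    ax.set_title(f'Residuals of a line fitted to {name}')
    ax.set_xlabel('x')
axes[0].set_ylabel('residual')
```

In the left panel the points should scatter around zero with no visible trend; in the right panel they should trace a parabola, which is the signal that a line is the wrong model.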

## 2. The actual vs. predicted values plot

Another plot we may use to validate our ML model is the **actual vs. predicted plot**. In this case, we plot a graph having the actual values on the horizontal axis and the predicted values on the vertical axis. The goal is to find the data points distributed as close as possible to a line, in the case of Linear Regression. We can even use the method in the case of a polynomial regression: in this case, we’d expect the data distributed as close as possible to a generic curve.

Suppose we have a result as follows:

The above graph shows that the predicted data points are distributed along a line. It is not a perfect linear distribution, so the linear model may not be ideal.

If, for our specific problem, we have `y_train` (the label on the training set) and we’ve calculated `y_train_pred` (the prediction on the training set), we can plot the graph like so:

```python
import matplotlib.pyplot as plt

# Scatterplot of y_train and y_train_pred
plt.scatter(y_train, y_train_pred)
plt.plot(y_train, y_train, color='r') # Plot the reference line (predicted = actual)

# Labels
plt.title('ACTUAL VS PREDICTED VALUES')
plt.xlabel('ACTUAL VALUES')
plt.ylabel('PREDICTED VALUES')
```

## 3. The Kernel Density Estimation (KDE) plot

The last graph we want to talk about to validate our ML models is the Kernel Density Estimation (KDE) plot. This is a general method and can be used to validate both regression and classification models.

The KDE is the application of a **kernel smoother** for probability density estimation. A kernel smoother is a statistical method that is used to estimate a function as the weighted average of the neighbor observed data. The kernel defines the weight, giving a higher weight to closer data points.
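To make the idea of "weights defined by a kernel" more tangible, here is a minimal sketch of a KDE with a Gaussian kernel written in NumPy. The function name `kde_gaussian`, the sample data, and the bandwidth of `0.4` are all assumptions chosen for illustration; libraries like `Seaborn` pick a bandwidth automatically by default.

```python
import numpy as np

def kde_gaussian(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every observation
    u = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel: the closer the observation, the higher its weight
    weights = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so that the curve integrates to 1
    return weights.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=500)   # hypothetical observations
grid = np.linspace(-6, 6, 601)
density = kde_gaussian(samples, grid, bandwidth=0.4)  # bandwidth picked by hand

# Sanity check: the area under a density curve should be close to 1
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under the estimated density: {area:.3f}")
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual data points more closely.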

To understand the usefulness of a smoother function, see the graph below:

It is helpful to approximate our data points with a smoothing function if we want to compare two quantities. In the case of an ML problem, in fact, we typically like to see the comparison between the actual labels and the labels predicted by our model, so we use the KDE to compare two smoothed functions.

Let’s say we have predicted our labels using a linear regression model. We want to compare the KDE for our training set’s actual and predicted labels. We can do so with `Seaborn` by invoking the method `sns.kdeplot()` (here’s the documentation).

Suppose we have the following result:

As we can see, the comparison between the actual and the predicted label is easy to do, since we are comparing two smoothed functions; in a case like that, our model is good because the curves are very similar.

In fact, what we expect from a “good” ML model is:

- The curves are as similar as possible to bell curves.
- The two curves are as similar as possible to each other.

Now, let’s apply all the things we’ve learned so far. We’ll use the famous “Ames Housing” dataset, which is perfect for our purposes.

This dataset has 80 features, but for simplicity, we’ll work with just a subset of them, which are:

- `Overall Qual`: the rating of the overall material and finish of the house on a scale from 1 (bad) to 10 (excellent).
- `Overall Cond`: the rating of the overall condition of the house on a scale from 1 (bad) to 10 (excellent).
- `Gr Liv Area`: the above-ground living area, measured in square feet.
- `Total Bsmt SF`: the total basement area, measured in square feet.
- `SalePrice`: the sale price, in USD.

We’ll consider our `SalePrice` column as the target (label) variable, and the other columns as the features.

## Exploratory Data Analysis EDA

Let’s import our data, create a subset with the mentioned features, and display some statistics:

```python
import pandas as pd

# Define the columns
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
           'Total Bsmt SF', 'SalePrice']

# Create dataframe (the file is tab-separated)
df = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
                 sep='\t', usecols=columns)

# Show statistics
df.describe()
```

An important observation here is that the mean values of the columns have very different ranges (the `Overall Qual` mean value is `6.09`, while the `Gr Liv Area` mean value is `1499.69`). This tells us an important fact: we have to scale the features.

## Data preparation

What does “**features scaling**” mean?

Scaling a feature means that its values are brought to a common range, typically between 0 and 1 or centered on 0. There are two typical methods to scale the features:

**Normalization (min-max scaling):** this is a method of scaling numeric data so that it has a minimum value of zero and a maximum value of one. Suppose *c* is a value reached by our feature; to scale it (*c*′ is the new value of *c* after the normalization process):

$$c' = \frac{c - c_{min}}{c_{max} - c_{min}}$$

(Strictly speaking, *mean normalization* subtracts the mean of the feature, rather than its minimum, in the numerator, so the scaled values are centered on zero.)

Let’s see an example in Python:

```python
# Create a list of numbers
data = [1, 2, 3, 4, 5]

# Find min and max values
data_min = min(data)
data_max = max(data)

# Normalize the data
data_normalized = [(x - data_min) / (data_max - data_min) for x in data]

# Print the normalized data
print(f'normalized data: {data_normalized}')

# >>>
# normalized data: [0.0, 0.25, 0.5, 0.75, 1.0]
```

**Standardization** (or z-score normalization): this method transforms a variable so that it has a mean of zero and a standard deviation of one. The formula is the following (*c*′ is the new value of *c* after the normalization process, μ is the mean of the feature, and σ is its standard deviation):

$$c' = \frac{c - \mu}{\sigma}$$

Let’s see an example in Python:

```python
import numpy as np

# Original data
data = [1, 2, 3, 4, 5]

# Calculate mean and standard deviation
mean = np.mean(data)
std = np.std(data)

# Standardize the data (float() keeps plain Python floats in the printed list)
data_standardized = [float((x - mean) / std) for x in data]

# Print the standardized data
print(f'standardized values: {data_standardized}')
print(f'mean of standardized values: {np.mean(data_standardized)}')
print(f'std. dev. of standardized values: {np.std(data_standardized): .2f}')

# >>>
# standardized values: [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
# mean of standardized values: 0.0
# std. dev. of standardized values:  1.00
```

As we can see, the standardized data have a mean of 0 and a standard deviation of 1, as we wanted. The good news is that we can use the library `scikit-learn` to standardize the features, and we’re going to do it in a moment.

Feature scaling is an important thing to do when working on an ML problem, for a simple reason:

- If we perform exploratory data analysis with features that are not scaled, when calculating the mean values (for example, during the calculation of the coefficient of correlation) we’ll get numbers that are very different from each other. If we take a look at the statistics we got above when we invoked the `df.describe()` method, we can see that, for each column, we get a very different value of the mean. If we scale the features, instead, all of them end up on a comparable scale (between 0 and 1 with normalization, or centered on 0 with a standard deviation of 1 with standardization): and this will help us mathematically.

Now, this dataset has some `NaN` values. We won’t show them for brevity (try it on your own), but we’ll remove them. Also, we’ll calculate the correlation matrix:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Drop NaNs from dataframe
df = df.dropna(axis=0)

# Apply mask
mask = np.triu(np.ones_like(df.corr()))

# Heat map for correlation coefficient (annotations with one decimal)
sns.heatmap(df.corr(), annot=True, fmt=".1f", mask=mask)
```

So, with `np.triu(np.ones_like(df.corr()))` we have created a mask that is useful to display a triangular correlation matrix, which is more readable (especially when we have many more features than in this case).

So, there is a moderate `0.6` correlation between `Total Bsmt SF` and `SalePrice`, quite a high `0.7` correlation between `Gr Liv Area` and `SalePrice`, and a high `0.8` correlation between `Overall Qual` and `SalePrice`. Also, there is a moderate correlation between `Overall Qual` and `Gr Liv Area` (`0.6`) and between `Overall Qual` and `Total Bsmt SF` (`0.5`).

Here there’s no multicollinearity, so no features are highly correlated with each other (so, our features satisfy hypothesis n°5 listed above). If we had found some highly correlated features, we could have deleted one of them, because **two highly correlated features have nearly the same effect on the label** (**this applies to every general ML model: if two features are highly correlated, we can drop one of the two**).
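For datasets with many features, reading the heat map by eye gets tedious, so here is a sketch of how the same check could be automated. Everything in it is an assumption made for illustration: the toy data frame `toy` (with a feature `area_sqm` that almost duplicates `area_sqft`), the `0.9` threshold, and the rule of dropping the second feature of each highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # fixed seed for reproducibility
n = 200
area = rng.normal(1500, 400, n)

# Hypothetical features: area_sqm is almost a copy of area_sqft in other units
toy = pd.DataFrame({
    "area_sqft": area,
    "area_sqm": area * 0.0929 + rng.normal(0, 1, n),
    "quality": rng.integers(1, 11, n).astype(float),
})

threshold = 0.9  # arbitrary cut-off for "highly correlated"

# Absolute correlations, keeping only the strict upper triangle
# so that each pair of features is inspected once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair above the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(f"dropped: {to_drop}")
print(f"remaining: {list(reduced.columns)}")
```

Which feature of a pair to keep is a judgment call; here the choice simply follows the column order.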

Finally, we subdivide the data frame `df` into `X` (the features) and `y` (the label) and scale the features:

```python
from sklearn.preprocessing import StandardScaler

# Define the features
X = df.iloc[:,:-1]

# Define the label
y = df.iloc[:,-1]

# Scale the features
scaler = StandardScaler() # Create the scaler
X = scaler.fit_transform(X) # Fit the scaler to the features and scale them
```

## Fitting the linear regression model

Now we have to split the features `X` and the label `y` into the training and the test set, and fit the Linear Regression model on the training set. Then, we calculate R² for both sets:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the LR model
reg = LinearRegression().fit(X_train, y_train)

# Calculate R^2
coeff_det_train = reg.score(X_train, y_train)
coeff_det_test = reg.score(X_test, y_test)

# Print metrics (rounded to two decimals)
print(f" R^2 for training set: {coeff_det_train:.2f}")
print(f" R^2 for test set: {coeff_det_test:.2f}")

# >>>
#  R^2 for training set: 0.77
#  R^2 for test set: 0.73
```

**Notes:**

1) Your results can be slightly different because `train_test_split` splits the data randomly (passing a fixed `random_state` makes the split reproducible).

2) Here we can see generalization in action: we fitted the Linear Regression model to the training set with `reg = LinearRegression().fit(X_train, y_train)`. Then, we calculated R^2 on the training and test sets with `coeff_det_train = reg.score(X_train, y_train)` and `coeff_det_test = reg.score(X_test, y_test)`.

In other words: we don't fit the model to the test set. We fit the model to the training set and we calculate the scores and predictions (see the next snippet of code with KDE) on both sets to see how our model generalizes to new, unseen data (the data of the test set).

So we get an R² of 0.77 on the training set and 0.73 on the test set, which are quite good, suggesting the Linear model is a good one to solve this ML problem.
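If you’re curious about what `reg.score()` computes, R² is one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below checks this by hand on synthetic data; the data, the coefficients used to generate it, and the names `ss_res`, `ss_tot`, and `r2_manual` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)  # fixed seed for reproducibility

# Hypothetical data: two features, a linear relationship, and Gaussian noise
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(0, 0.5, 100)

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# scikit-learn's own value, for comparison
r2_sklearn = reg.score(X, y)

print(f"manual R^2:  {r2_manual:.4f}")
print(f"sklearn R^2: {r2_sklearn:.4f}")
```

An R² of 1 means the predictions match the labels exactly, while an R² of 0 means the model does no better than always predicting the mean of the labels.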

Let’s see the KDE plots for both sets:

```python
# Calculate predictions
y_train_pred = reg.predict(X_train) # train set
y_test_pred = reg.predict(X_test) # test set

# KDE train set
ax = sns.kdeplot(y_train, color='r', label='Actual Values') #actual values
sns.kdeplot(y_train_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')

# Show legend
plt.legend()
```

```python
# KDE test set
ax = sns.kdeplot(y_test, color='r', label='Actual Values') #actual values
sns.kdeplot(y_test_pred, color='b', label='Predicted Values', ax=ax) #predicted values

# Show title
plt.title('Actual vs Predicted values')

# Show legend
plt.legend()
```

We’ve obtained an R² of 0.73 on the test set, which is good (but remember: the higher, the better), and this plot confirms that the linear model is indeed a good model to solve this ML problem. This is why I love the KDE plot: it is a very powerful tool, as we can see.

Also, this shows why we shouldn’t rely on just one method to validate our ML model: a combination of one analytical method with one graphical one generally gives us the right insights to decide whether to change our ML model or not. In this case, the Linear Regression model is a good choice to make predictions.

I hope you’ll find this article useful. I know it’s very long, but I wanted to give you all the knowledge you need on this topic, so that you can return to it whenever you need it the most.

Some of the things we’ve discussed here are general topics, while others are specific to the Linear Regression model. Let’s summarize them:

- The definition of **regression** is, of course, a general definition.
- **Correlation** generally refers to the Linear model. In fact, as we said before, correlation is the tendency of two variables to be linearly dependent. There are ways to define non-linear correlations, but we leave them for other articles (for now, just consider that they exist).
- We’ve discussed the Simple and the Multiple Linear Regression models with their assumptions (the assumptions apply to both models).
- When talking about how to find the line that best fits the data, we’ve referred to the article “Mastering the Art of Regression Analysis: 5 Key Metrics Every Data Scientist Should Know”. There, we find all the metrics we need to know to solve a regression analysis. So, this is a general topic that applies to any regression model, including the Linear one, of course.
- We’ve shown three methods to validate our ML models: 1) **the residual analysis plot**, which applies to Linear Regression models; 2) **the actual vs. predicted values plot**, which can be applied to Linear and Polynomial models; 3) the **KDE plot**, which can be applied to any ML model, even in the case of a classification problem.

Finally, I want to remind you that we’ve spent a couple of lines stressing the fact that we can avoid using `p-values` to test the hypotheses of our ML models. I’m writing an article on this topic very soon, but, as you can see, the KDE has shown us that our Linear model is good to solve this ML problem, and we haven’t validated our hypotheses with `p-values`.

*So far in this article, we’ve used some plots. You can **clone this repo** I’ve created so that you can import the code and use it to easily plot the graphs. If you have some difficulties, you find examples of usages on my projects on GitHub. If you have any other difficulties, you can **contact me** and I’ll help you.*
