Linear & Logistic: The Relationship Between Regression Models
Linear regression’s cost function minimizes the distance between the data points and the fitted line, and its output is unbounded, so that output can’t be interpreted as a probability.
Logistic regression is one of the most popular and easiest methods for solving classification tasks. It has its limitations, but even in the deep learning era (post-2014) it is still widely used.
In this post, let’s find out how it differs from linear regression and how the two are related. To recap, linear regression is not very useful for classification problems. For example, below we are trying to find an optimal decision boundary to differentiate between malignant and benign tumors. Because a least-squares fit is sensitive to outliers, even two outliers (at the far right) are enough to pull the fitted line and shift the decision boundary, so the model’s predictions become unreliable.
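The effect of the two outliers can be reproduced with a small sketch. The tumor sizes and labels here are made up for illustration (they are not the data behind the figure), and the helper names `fit_line` and `boundary` are hypothetical. The code fits a least-squares line to 0/1 labels and reports the tumor size at which the line crosses 0.5.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def boundary(b0, b1, threshold=0.5):
    """Tumor size at which the fitted line crosses the threshold."""
    return (threshold - b0) / b1


# Made-up tumor sizes; label 0 = benign, 1 = malignant.
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Boundary without outliers, then with two malignant outliers at the far right.
clean = boundary(*fit_line(sizes, labels))
shifted = boundary(*fit_line(sizes + [30, 40], labels + [1, 1]))

print(round(clean, 2), round(shifted, 2))
```

On this toy data the boundary sits at 4.5 without the outliers and moves to about 5.7 once they are added, so the malignant tumor of size 5 ends up on the benign side of the line.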
Now that we have a visual understanding of linear regression’s limitations, let’s see how it connects to logistic regression and how logistic regression handles the outliers. To start gently, we are going to review the basic equation used in linear regression.
Here the input features, denoted by the symbol ‘x’, are multiplied by the model’s parameters, denoted by the symbol ‘β’ (beta). Each input is scaled up or down linearly, depending on the value of its beta, and the scaled inputs are added together.
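For reference, the linear regression equation is commonly written as follows. The intercept term \beta_0 and the feature count k are notation assumed here, not taken from the original figure.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
```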
The next step is a simple concept, but it may be hard to follow because of how the equation looks. We are going to put the equation for linear regression inside ANOTHER equation, and the final function looks like the one shown below.
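In standard notation, nesting the linear equation inside the second one gives the form below. As a labeled assumption, p stands for the predicted probability of the positive class, and \beta_0 and k follow the usual textbook convention.

```latex
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k)}}
```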
Although it might look complicated at first glance, it’s really not that hard to understand if we dissect it piece by piece. Let’s visualize the sigmoid function once more.
As a recap, the sigmoid function is a mathematical function that only outputs numbers between 0 and 1, and those numbers can be interpreted as probabilities.
So even if we give the sigmoid function a value of 10 or 1000, both inputs are mapped to a value that is practically 1. The same holds for negative numbers: -10 and -1000 are both mapped to a value that is practically 0. So what makes all this possible? Again, it’s the sigmoid function, so let’s take a look at ONLY the function itself.
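Written on its own, with the input called n, the sigmoid function has the standard form below. The name S is chosen here for illustration.

```latex
S(n) = \frac{1}{1 + e^{-n}}
```

As n grows large and positive, e^{-n} shrinks toward 0 and S(n) approaches 1. As n grows large and negative, e^{-n} blows up and S(n) approaches 0.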
It looks very familiar to us, doesn’t it? If we just replace the symbol ’n’ with our equation for linear regression, we get the equation for logistic regression! So in summary, we can understand logistic regression in two steps…
1. Use the linear regression equation to calculate the intermediate value.
2. Use the sigmoid function to squish the intermediate value between 0 and 1.
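The two steps above can be sketched in a few lines of Python. The function names and the parameter values (`betas=[0.8]`, `intercept=-4.0`) are hypothetical and picked by hand for illustration; they are not fitted to any data.

```python
import math


def linear_step(x, betas, intercept):
    """Step 1: the linear regression equation gives the intermediate value."""
    return intercept + sum(b * xi for b, xi in zip(betas, x))


def sigmoid(n):
    """Step 2: squish n into the range 0..1 (written to avoid overflow)."""
    if n >= 0:
        return 1.0 / (1.0 + math.exp(-n))
    e = math.exp(n)
    return e / (1.0 + e)


def predict_proba(x, betas, intercept):
    """Logistic regression prediction: the sigmoid of the linear step."""
    return sigmoid(linear_step(x, betas, intercept))


# One feature with hand-picked (hypothetical) parameters.
print(predict_proba([5.0], betas=[0.8], intercept=-4.0))

# Large positive or negative inputs saturate close to 1 or 0.
for n in (10, 1000, -10, -1000):
    print(n, sigmoid(n))
```

With these parameters the intermediate value for an input of 5.0 is 0, which the sigmoid maps to a probability of 0.5. The branch on the sign of `n` keeps `math.exp` from overflowing for very negative inputs such as -1000.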
This is a simple way to understand logistic regression, and why it’s needed. When we go back and apply logistic regression on the same dataset that linear regression had trouble with…
It’s easy to see the dramatic difference. Thanks to logistic regression’s ability to limit its output to the range between 0 and 1, we are able to handle classification problems like this one. And just as the parameters of a linear regression describe the relationship between two variables, the parameters of a logistic regression describe a specific relationship between two variables.
Logistic regression explains how a one-unit change in one variable multiplies the odds of the outcome by a certain factor, the odds ratio.
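This follows from the logistic equation itself. As a short sketch in standard notation (the symbols p, \beta_j and x_j are assumed here), the odds are an exponential of the linear part, so raising one variable x_j by one unit while holding the others fixed multiplies the odds by e^{\beta_j}:

```latex
\text{odds} = \frac{p}{1 - p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}
\quad\Longrightarrow\quad
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
```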
So how does this concept get applied in our business? In the above table, we can see an underlined number 2.27. What that number represents can be rewritten as follows…
When looking at factors associated with cancer versus no cancer, an increase in the number of diagnosed STDs had a weighting of 2.27. Compared to other factors like hormonal contraceptives (such as birth control pills), it had a much larger effect on cancer development.
But always remember, correlation does not imply causation.
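To make a weighting like 2.27 concrete, here is a minimal sketch that assumes the number is an odds ratio. That is an assumption: if the table instead reports a raw coefficient, the odds ratio would be its exponential. The 10% baseline probability and the helper name `apply_odds_ratio` are hypothetical and chosen only for illustration.

```python
def apply_odds_ratio(p, odds_ratio):
    """Return the new probability after multiplying the odds of p by odds_ratio."""
    odds = p / (1.0 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline probability of 10%; 2.27 is read as an odds ratio (assumption).
baseline = 0.10
print(round(apply_odds_ratio(baseline, 2.27), 3))
```

Under these assumptions a 10% baseline becomes roughly 20%. The odds are multiplied by 2.27, but the probability itself is not.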
Logistic Regression & Data Science
Logistic regression is often one of the tools used to gain business insights, and it plays a small but crucial part in the data scientist’s workflow. Usually, at the very beginning, it is best practice to take a closer look at the data.
Just by taking a closer look at the data, we can find significant relationships between the variables that make up our business. This means we can gain deep insight into our day-to-day operations, learn which pipeline or variable is affecting profits, losses, and margins, and much more. In the case above, if the average age of the customers who did purchase was higher, we should change our marketing strategy to target that demographic.
Data science ISN’T about creating charts or graphs; it’s about telling a story, a business story.
A rookie mistake among data scientists is doing too much to get so little. If you are aiming for clarity, less is more. There is nothing wrong with using more advanced methods to build a better model, and methods like SMOTE or recursive feature elimination should be used if they fit your needs in creating a workable solution for your business goals.
But never forget that you also need proper metrics. By aligning (or sometimes reformulating) business KPIs with the right tools, such as logistic regression, you not only know that you are heading in the right direction, but you can also make an impact on the organization in a timely manner, even while using the simplest methodologies.
In conclusion, logistic regression is a simple yet powerful method for finding out how a change in one variable changes the odds of an outcome by a certain weight. We also saw its relationship with linear regression and how it can handle cases where linear regression fails. Finally, we saw how it’s actually used among data scientists. I also found a cool cheat sheet, linked above 📝.