The Logistic Regression model is one you will undoubtedly come across as you study machine learning, and there are a few things every machine learning specialist should know about it.
In this article, we’ll discuss what Logistic Regression is all about so that we can understand it when we come across it. Before we continue, it is important to note that Logistic Regression is used when we want to predict an outcome that is dichotomous or categorical in nature.
What is Logistic Regression?
Logistic Regression is a classification algorithm that predicts the category of a dependent variable based on a set of independent variables. Because logistic regression is a classification algorithm that works with labeled data, it is classified as a supervised learning technique.
For example, suppose you want to investigate whether a person is predisposed to a particular disease. In that case, you would work with a dataset containing medical factors related to the disease, collected from both diseased and non-diseased people. Those factors are our independent variables, while the result is our dependent variable, because it depends on the factors we feed in.
You may be wondering why, if the definition above sounds like Regression, the model belongs to the Classification family. This is because the dependent variable is categorical. Whenever we predict a categorical variable, the task is classification, which is why Logistic Regression is not a type of Regression model despite its name.
Why is it important?
Logistic Regression is one of the simplest machine learning methods to implement, train, and interpret. It is one of the less complex classification methods.
Logistic Regression is well suited to predicting discrete values. It performs well on linearly separable datasets (those in which a straight line separates the two data classes) and achieves acceptable accuracy on simple data.
Logistic Regression follows these basic rules:
- We can predict discrete values (0,1,2…n) and binary events using Logistic Regression.
- Its primary goal is to predict the probability of occurrence rather than the variable’s value.
- Logistic Regression always works with a categorical dependent variable.
- To solve classification-based problems, we employ Logistic Regression. E-mail spam detection is a typical example.
- The parameters of a logistic regression model are fitted using maximum likelihood estimation (a method that finds the parameter values under which the observed data are most probable).
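The rules above can be illustrated with a minimal sketch (using made-up data, not a real dataset): the model's primary output is a probability of occurrence, which is then thresholded into a discrete class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: one feature, label 1 when the feature is large
X = np.array([[0.5], [1.0], [1.5], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Primary goal: the probability of occurrence of the event (class 1)
print(model.predict_proba([[4.2]])[0, 1])  # a probability above 0.5
# Thresholding that probability gives the discrete predicted class
print(model.predict([[4.2]])[0])
```

Here `predict_proba` exposes the probability the model actually estimates, and `predict` simply applies a 0.5 cutoff to it.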
Surprisingly, Logistic Regression has a wide range of machine learning applications. Fortunately, technology is all around us, and it is reasonable to assume that we have all encountered Logistic Regression applications at some point.
Let’s take a look at some of our community’s most important sectors today and see how Logistic Regression has been applied in these sectors:
Science: Machine learning, in general, has been highly beneficial to science. We’ve discovered certain diseases and conditions that have caused harm to humanity, but developers have been able to create programs that can help detect, prevent, and control these diseases.
Take, for example, cancer, a disease that has plagued humanity for decades and centuries. Luckily, developers have been able to create ML programs that can detect this disease in human cells at an early stage.
They were able to obtain datasets containing diseased and healthy people’s health factors, and with these factors, they created, trained, and tested a model that predicts whether a person is infected or not. This can be accomplished using the Logistic Regression model.
Services: Along with Logistic Regression, the service industry has many machine learning applications. Everything from hotel reservations to flight reservations to catering services has Logistic Regression written all over it.
Many companies use ML to attempt to anticipate the consumer’s intentions or level of satisfaction. For example, suppose a restaurant wants to predict the type of food their customers frequently prefer. In that case, they obtain data containing sales records of the products (food) they offer and other factors such as age, food nutrients, and so on, and use these data to predict their output. They accomplish this by utilizing the Logistic Regression model.
Cybersecurity: Detecting spam e-mails, credit card fraud, and other cybercrimes has become a possibility thanks to the employment of Logistic Regression.
There are other applications of Logistic Regression; the ones listed above are just a few examples.
Types of Logistic Regression
There are three types of Logistic Regression:
- Binary Logistic Regression
- Ordinal Logistic regression
- Multinomial Logistic Regression
Binary Logistic Regression is one in which the dependent variable has exactly two categories. The result (dependent variable) is binary.
Using e-mail spam detection as an example, after entering the datasets required to determine our output, the response would be yes, indicating that it is a spam e-mail, or no, indicating that it is not a spam e-mail. Because our output contains a binary response variable, it is a binary logistic regression model.
Ordinal Logistic Regression has three or more categories with a natural ordering of levels, meaning that the dependent variable is in an ordered form.
Consider the results of a specific test in which the dependent variable takes one of the values Poor, Average, or Excellent. Although the results are categorical, they follow an ordered sequence.
In Multinomial Logistic Regression, the dependent variable also belongs to three or more categories, but unlike ordinal logistic regression, there is no natural ordering of levels among these categories.
For example, you might want to predict the subject that will be the most popular in the next ten years using a list of Mathematics, English, and Geography. Because our dependent variable has three possible outcomes that are not in any way ordered, multinomial logistic Regression is used.
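As an illustrative sketch of that subject-popularity example (with entirely made-up features and labels), scikit-learn’s `LogisticRegression` handles a multinomial target with three unordered classes directly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features (e.g., two popularity indicators) for each observation,
# labeled with one of three unordered subject categories
X = np.array([[1, 0], [2, 1], [0, 3], [1, 4], [5, 1], [6, 0]])
y = np.array(['Mathematics', 'Mathematics', 'English', 'English',
              'Geography', 'Geography'])

model = LogisticRegression().fit(X, y)

print(model.classes_)            # the three unordered categories
print(model.predict([[0, 4]]))   # predicted subject for a new observation
```

The model learns one set of coefficients per class; because the classes have no order, swapping their labels around would not change the fit.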
Cons of the Logistic Regression Model
The logistic regression model has a significant flaw in that it is prone to overfitting, especially when the number of features used in the model is greater than the number of observations.
NOTE: Overfitting occurs when a model fits the training data too closely, capturing noise rather than the underlying pattern, and therefore generalizes poorly to new data.
- Another flaw to consider is that Logistic Regression predicts an outcome based on the set of independent variables it is given; using inappropriate variables as input will distort the outcome.
- The Logistic Regression model predicts only categorical values, which can be a limitation to having a precise outcome of our data. For example, instead of having an outcome that tells us the exact height of the person, we’ll have a categorical outcome like Tall, Short, and Average.
How does it differ from Linear Regression?
The first distinction is that Logistic Regression is concerned with classifying data, whereas Linear Regression is concerned with predicting continuous values.
Logistic regression parameters are estimated using maximum likelihood estimation, whereas linear regression parameters are estimated using least squares. A Logistic Regression model’s best-fit line is an S-shaped curve, whereas a Linear Regression model’s best-fit line is a straight line.
The logistic function serves as the foundation for the logistic model. It is represented by the following equation:

f(x) = 1 / (1 + e^(-x))
It is important to note that the logistic function only takes values between 0 and 1: as its input approaches positive infinity, the predicted outcome approaches 1, and as its input approaches negative infinity, the predicted outcome approaches 0.
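A quick numeric check of this behavior (plain Python, no ML library needed):

```python
import math

def logistic(x):
    # The logistic (sigmoid) function: 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

print(logistic(0))    # 0.5, the midpoint of the curve
print(logistic(10))   # very close to 1 for large positive inputs
print(logistic(-10))  # very close to 0 for large negative inputs
```

No matter how extreme the input, the output stays strictly between 0 and 1, which is what lets us interpret it as a probability.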
Below is a mini project on how to build a logistic regression model that predicts whether a person has lung cancer using certain factors that we would consider as our independent variable.
We start by importing the libraries needed for this mini-project.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
Next, we want our program to read the dataset we’ll be using for this program; we can get the dataset here.
lung_cancer = pd.read_csv('survey lung cancer.csv')
Note: If your dataset isn’t in the same directory as your file, you’ll need to include the path and the dataset’s name.
Before we proceed, we must ensure that our data is balanced to avoid biased predictions, as imbalanced data tends to favor the majority class regardless of the accuracy score. To do this, we check the class counts of the target variable.
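The check itself is a single `value_counts` call; a sketch on a tiny synthetic frame (standing in for the survey data, with the 270/39 split implied by the resampling step below) looks like this:

```python
import pandas as pd

# Stand-in for the survey data; in the article the real frame is loaded
# with pd.read_csv, so this synthetic split is only illustrative
lung_cancer = pd.DataFrame({'LUNG_CANCER': ['YES'] * 270 + ['NO'] * 39})

# Count how many rows fall into each class of the target variable
print(lung_cancer['LUNG_CANCER'].value_counts())
```

A large gap between the class counts, as here, is what tells us the data is imbalanced.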
Our data is imbalanced; therefore, we must balance it before continuing. So to do this, we’ll upsample the minority class of the data using replacement.
from sklearn.utils import resample
lung_cancer_majority = lung_cancer[lung_cancer['LUNG_CANCER'] == 'YES']
lung_cancer_minority = lung_cancer[lung_cancer['LUNG_CANCER'] == 'NO']
lung_cancer_minority_upsampled = resample(lung_cancer_minority, replace=True, n_samples=270, random_state=42)
lung_cancer_updated = pd.concat([lung_cancer_minority_upsampled, lung_cancer_majority])
lung_cancer_updated['LUNG_CANCER'].value_counts()
Our data is now balanced, so we can proceed.
For this mini project, we will only look at four factors in the dataset: age, smoking, coughing, and chronic disease. These factors will be our X feature variables, and the outcome ‘LUNG_CANCER’ will be our target variable.
X = lung_cancer_updated[['SMOKING', 'AGE', 'COUGHING', 'CHRONIC DISEASE']].values
y = lung_cancer_updated['LUNG_CANCER'].values
We have to train and test our model; to do this, we split the data into training and testing sets.
We use the train_test_split function, which by default places 75% of our data in the training set and the remaining 25% in the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X.shape, y.shape)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
This becomes our output:
(540, 4) (540,)
(405, 4) (405,)
(135, 4) (135,)
Out of the 540 data points we have, 405 go into the training set and 135 into the test set, while both sets keep the same number of features (4).
Next up, we create our Logistic Regression model, train it, and evaluate its accuracy on both sets.
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
The accuracy on both the training data and the test data comes out above 70%, which is a decent score; we can now proceed to make predictions.
We want to predict the outcome of the first five rows of our data.
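The prediction step itself is a one-line call to the model’s predict method; here is a self-contained sketch (with synthetic data standing in for the lung-cancer frame, so the actual labels will differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the four lung-cancer features and binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Predict the outcome of the first five rows of the data
print(model.predict(X[:5]))
```

On the real dataset the same call returns five ‘YES’/‘NO’ labels, one per row.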
We’ve created a mini Logistic Regression model that tells us, given certain factors, whether a person has lung cancer or not. This is a binary Logistic Regression model, as the output takes the form of a binary event.
This article aimed to teach us about the Logistic Regression model including how it is best used, as well as its limitations. We completed a mini-project, but we can go further and conduct research on various projects involving Logistic Regression. The more we practice, the better we will become.
I hope you found this article helpful.