Linear regression is usually the first algorithm people meet in machine learning, and one of the simplest. It is a supervised learning model: we provide a data set whose features and labels are known, and expect the algorithm to learn the pattern in that data and use it to predict the outcome for new inputs.
Examples of linear regression —
- Predicting the price of a house given house features
- Predicting the sale of ice cream based on the season (temperature outside)
- Predicting the impact of SAT/GRE scores on college admissions
In each of the above examples, the outcome is assumed to change roughly linearly with the input. For example, ice cream sales rise as the temperature rises.
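As a minimal sketch of what "linear relationship" means, the model assumes sales = intercept + slope × temperature. The numbers below are made up and exactly linear, purely for illustration; they are not real sales data.

```python
import numpy as np

# Made-up, exactly linear toy data: sales = 10 + 3 * temperature
temperature = np.array([20.0, 25.0, 30.0, 35.0])
sales = 10.0 + 3.0 * temperature

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(temperature, sales, deg=1)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because the toy data lie exactly on a line, the fit recovers the slope and intercept used to generate them. Real data would scatter around the line.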
In this article, we will work on the admission prediction dataset, where we will predict the chance of getting admission into the desired college/university based on the given factors.
Let’s investigate the data set first—
As the image shows, we must predict the chance of admission into the college from seven features: GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, and Research.
Importing all the required libraries —
We have imported the libraries for the following reasons-
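The original import cell and its list of reasons are not reproduced here. A plausible set for the steps in this article might look like the sketch below, with the reason for each import given as a comment. It assumes scikit-learn is the modelling library, which the article does not name.

```python
# Assumed imports for this walkthrough (the article's own list is not shown)
import pickle                                          # save and reload the trained model

import numpy as np                                     # numerical arrays
import pandas as pd                                    # load and clean the tabular data
from sklearn.linear_model import LinearRegression      # the regression model itself
from sklearn.model_selection import train_test_split   # split into train and test sets
from sklearn.preprocessing import StandardScaler       # standard scaling of the features
```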
After importing the libraries, we will import the data set using the pandas library, and then we will generate a detailed report of the data set using pandas profiling.
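A minimal sketch of this step follows. The inline CSV is a made-up stand-in so the example runs on its own, and the file name in the comment is hypothetical. The report helper assumes the `ydata-profiling` package (the current name of pandas-profiling) is installed; it is defined here but not called.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file. With the actual data you would
# call something like pd.read_csv("admission_data.csv") (hypothetical name).
csv_text = """Serial No.,GRE Score,TOEFL Score,CGPA,Chance of Admit
1,330,115,9.2,0.90
2,320,,8.5,0.75
3,,105,8.0,0.65
4,325,110,8.8,0.80
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows of the data set
print(df.isnull().sum())  # missing values per column


def make_report(frame, path):
    """Write an HTML profiling report (needs the ydata-profiling package)."""
    from ydata_profiling import ProfileReport
    ProfileReport(frame, title="Admission data report").to_file(path)
```

Calling `make_report(df, "admission_report.html")` would then produce the detailed report described above.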
Key observations —
- The Spearman correlation shows varying levels of collinearity. CGPA, GRE Score, and TOEFL Score are highly correlated with the chance of admit. Every column except Serial No. is correlated with the chance of admit.
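For readers without the profiling report, the same Spearman correlations can be computed directly with pandas. This is a sketch on made-up values; the column names are assumed to match the data set.

```python
import pandas as pd

# Made-up values, only to show the call
df = pd.DataFrame({
    "Serial No.": [3, 1, 5, 2, 4],
    "GRE Score": [300, 310, 320, 330, 340],
    "CGPA": [7.0, 7.8, 8.1, 8.9, 9.4],
    "Chance of Admit": [0.50, 0.60, 0.70, 0.80, 0.90],
})

# Rank-based (Spearman) correlation between every pair of columns
spearman = df.corr(method="spearman")

# Correlation of each feature with the label, strongest first
target_corr = (
    spearman["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)
print(target_corr)
```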
After exploring the data, we will now do feature engineering —
Firstly, we will fill in the missing values.
We can fill in the missing data using the mean, median, or mode of each column, depending upon a range of factors, which we will discuss later.
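A minimal sketch of the three options on made-up values is shown below. Which statistic goes with which column here is a common rule of thumb, not necessarily the choice made in the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
df = pd.DataFrame({
    "GRE Score": [320.0, np.nan, 310.0, 330.0],
    "TOEFL Score": [110.0, 105.0, np.nan, 115.0],
    "University Rating": [3.0, 3.0, np.nan, 4.0],
})

# Mean: reasonable for roughly symmetric numeric columns
df["GRE Score"] = df["GRE Score"].fillna(df["GRE Score"].mean())

# Median: more robust when a column has outliers or is skewed
df["TOEFL Score"] = df["TOEFL Score"].fillna(df["TOEFL Score"].median())

# Mode: suits discrete or categorical-like columns
rating_mode = df["University Rating"].mode()[0]
df["University Rating"] = df["University Rating"].fillna(rating_mode)

print(df.isnull().sum())  # every count should now be zero
```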
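All the null values have been filled. We can confirm this with df.isnull().sum(), which should return zero for every column, or with df.describe(), where the count of every column should equal the number of rows.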
The Serial No. column adds no value to our analysis; hence it can be dropped.
df.drop(columns=['Serial No.'], inplace=True)
Splitting our data set —
After cleaning and filling in the data, let’s split our data for model training. We will now separate the label (y, the value we want to predict) from the features (X, the input values) in the data set.
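A minimal sketch of the separation, on made-up rows. The label column name "Chance of Admit" is an assumption; in the real file it may be spelled slightly differently.

```python
import pandas as pd

# Made-up rows; column names assumed to match the data set
df = pd.DataFrame({
    "GRE Score": [320, 310, 330],
    "TOEFL Score": [110, 105, 115],
    "CGPA": [8.5, 8.0, 9.1],
    "Chance of Admit": [0.75, 0.65, 0.90],
})

y = df["Chance of Admit"]                 # label: what we want to predict
X = df.drop(columns=["Chance of Admit"])  # features: everything else

print(X.columns.tolist())
print(y.name)
```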
Among the feature columns, GRE Score spans a much larger range of values than the others. Features on very different scales can distort scale-sensitive methods and make the coefficients hard to compare. To overcome this, we will apply standard scaling, which rescales each feature to zero mean and unit variance.
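Standard scaling can be sketched with scikit-learn's `StandardScaler`. Using scikit-learn here is an assumption, since the article does not name the library, and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
X = pd.DataFrame({
    "GRE Score": [300.0, 310.0, 320.0, 330.0, 340.0],
    "CGPA": [7.0, 7.5, 8.0, 8.5, 9.0],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (value - column mean) / column std

print(X_scaled.mean(axis=0))  # each column mean is now close to 0
print(X_scaled.std(axis=0))   # each column std is now close to 1
```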
Next, we will check our data set for multicollinearity, i.e. whether the features are strongly correlated with one another. Strong multicollinearity makes the estimated coefficients unstable and hard to interpret.
There are several ways to check for multicollinearity; here, we will use the VIF (variance inflation factor) to measure it.
- VIF starts at 1 and has no upper limit
- VIF = 1, no correlation between the independent variable and the other variables
- VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others
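For reference, the VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The sketch below computes it with scikit-learn on synthetic data; `vif_scores` is a hypothetical helper written for this illustration. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif_scores(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of a 2-D array."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]                   # feature being explained
        others = np.delete(X, j, axis=1)   # all remaining features
        r2 = LinearRegression().fit(others, target).score(others, target)
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)


# Synthetic data: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_scores(np.column_stack([a, b, c]))

print(vifs)  # a and c get large VIFs, b stays near 1
```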
All the values are below 5; hence we can conclude that our data set does not show severe multicollinearity.
Finally, we are ready to train our model; let’s now split our data set into two parts, a training set and a test set.
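A minimal sketch of the split using scikit-learn's `train_test_split` on synthetic data. The 75/25 ratio and the `random_state` value are assumptions; the article does not state the ones it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and label, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=100)

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (75, 3) (25, 3)
```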
After splitting the data, we pass the training portion to the model for training.
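Training can be sketched as below, assuming scikit-learn's `LinearRegression`. The toy data are noise-free and built from known coefficients, so the fit can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy training data: y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = 1.0 + 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1]

model = LinearRegression()
model.fit(X_train, y_train)  # ordinary least squares fit

print(model.coef_, model.intercept_)  # close to [2, 3] and 1
```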
At this point, we can save the model for later use. To keep the whole thing portable and handy, we used the pickle library.
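A minimal sketch of saving and reloading a model with pickle. The file name is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny toy model so there is something to save: y = 1 + 2*x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "admission_lr_model.pkl")  # hypothetical name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)          # serialise the fitted model to disk
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)  # restore it later, e.g. in another script

print(loaded_model.predict(np.array([[4.0]])))
```

If the features were standardised before training, the fitted scaler would need to be saved the same way so that new inputs can be transformed identically.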
As we have completed our training phase, let’s check the model’s score and predict the outcome.
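A sketch of scoring and predicting on synthetic data, assuming scikit-learn. For its regressors, `score` returns the coefficient of determination R² on the data passed in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear signal plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

r2 = model.score(X_test, y_test)     # R^2 on the held-out test set
predictions = model.predict(X_test)  # predicted labels for unseen rows

print(f"R^2 on test data: {r2:.3f}")
print(predictions[:5])
```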
Our model achieves a decent score of approximately 81 per cent. This can be improved in several ways that we will discuss later.
Here we covered the most basic form of linear regression. There are also more advanced, regularised variants such as lasso, ridge, and elastic net.