## Everything about Data Transformation, Polynomial Regression, and Nonlinear Regression

A simple linear regression (SLR) model is straightforward to construct when the relationship between the target variable and the predictor variables is linear. Things become more complicated when the relationship between the dependent variable and the independent variables is nonlinear. In this article, I’ll show you three different approaches to building a regression model on the same nonlinear dataset:

1. Polynomial regression

2. Data transformation

3. Nonlinear regression

The dataset that I have considered has been taken from Kaggle: https://www.kaggle.com/datasets/yasserh/student-marks-dataset

The data consists of students’ Marks along with their study time and number of courses.

If you examine the relationship of the target variable “Marks” with respect to study time and number of courses, you will find that it is non-linear.

I first built a linear regression model with sklearn’s LinearRegression() and defined a helper function to calculate various metrics for each model.
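A minimal sketch of this baseline, assuming a synthetic stand-in for the Kaggle data (the column names and the quadratic shape of the Marks/study-time relationship are illustrative assumptions, not taken from the original code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def print_metrics(y_true, y_pred):
    """Report R2, RSS, MSE, and RMSE for a set of predictions."""
    rss = float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    mse = mean_squared_error(y_true, y_pred)
    print(f"R2-Square Value: {r2_score(y_true, y_pred):.4f}")
    print(f"RSS: {rss:.3f}")
    print(f"MSE: {mse:.3f}")
    print(f"RMSE: {np.sqrt(mse):.3f}")

# Synthetic stand-in for the Kaggle data: Marks grows roughly quadratically
# with study time, plus a linear contribution from the number of courses.
rng = np.random.default_rng(0)
time_study = rng.uniform(0, 8, 200)
number_courses = rng.integers(3, 9, 200).astype(float)
marks = 0.9 * time_study**2 + 1.5 * number_courses + rng.normal(0, 1, 200)

X = np.column_stack([time_study, number_courses])
X_train, X_test, y_train, y_test = train_test_split(X, marks, random_state=42)

slr = LinearRegression().fit(X_train, y_train)
print_metrics(y_test, slr.predict(X_test))
r2_slr = r2_score(y_test, slr.predict(X_test))
```

On the real dataset the equivalent model scored an R² of roughly 94%, which the rest of the article uses as the baseline to beat.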

Non-linear regression models a relationship between independent variables *x* and a dependent variable *y* that results in a non-linear function. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree *k* (the maximum power of *x*):

*y* = a*x*³ + b*x*² + c*x* + d

Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example: *y* = log(*x*)

Or even something more complicated, such as: *y* = log(a*x*³ + b*x*² + c*x* + d)

*But what happens if we have more than one independent variable?*

For 2 predictors, the equation of the polynomial regression (of degree 2) becomes:

*Y* = 𝜃0 + 𝜃1*x*1 + 𝜃2*x*2 + 𝜃3*x*1² + 𝜃4*x*2² + 𝜃5*x*1*x*2

where,

– *Y* is the target,

– *x*1, *x*2 are the predictors or independent variables,

– 𝜃0 is the bias,

– and 𝜃1, 𝜃2, 𝜃3, 𝜃4, and 𝜃5 are the weights in the regression equation.

For n predictors, the equation covers all feasible combinations of various order polynomials. This is known as Multi-dimensional Polynomial Regression and is notoriously difficult to implement. We will construct polynomial models of varying degrees and evaluate their performance. But first, let’s prepare the dataset for training.

We can establish a pipeline and pass the degree and class of models that we wish to utilize to produce a polynomial of various degrees. This is what the code below does for us:
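The original code block did not survive, so here is a minimal sketch of such a pipeline. The synthetic data mimicking the dataset is an assumption for self-containment; on the real data you would fit on the Kaggle features instead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def make_poly_pipeline(degree):
    """Expand the features to the given polynomial degree, then fit OLS."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("lr", LinearRegression()),
    ])

# Synthetic stand-in for the student-marks data (Marks ~ quadratic in study time)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 8, 200), rng.integers(3, 9, 200)])
y = 0.9 * X[:, 0] ** 2 + 1.5 * X[:, 1] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit polynomial models of degree 1 through 7 and record the test R2
scores = {}
for degree in range(1, 8):
    model = make_poly_pipeline(degree).fit(X_train, y_train)
    scores[degree] = r2_score(y_test, model.predict(X_test))
print(scores)
```

Because the underlying relationship is quadratic, the test R² jumps sharply between degree 1 and degree 2 and then plateaus, mirroring what the article reports on the real dataset.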

If you wish to view all of the coefficients and the intercept, use the following code block. Keep in mind that the number of coefficients varies with the degree of the polynomial:

The output shows one coefficient per polynomial term, along with the intercept, for each model.

This doesn’t tell us much about the performance of each model, so we will check the R² score.

So, we built polynomial models up to degree 7 using the sklearn pipeline method and found that degree 2 and above yielded an R² of 99.9% (compared to ~94% for SLR). We will now apply another technique for building a regression model on the same dataset.

The linear regression framework assumes that the relationship between the response and predictor variables is linear. To continue using the linear regression framework, we have to transform the data so that the relationships between the variables become linear.

**Some Guidelines for data transformations:**

- Both the response and the predictor variables can be transformed.
- If the residual plot reveals the presence of nonlinear relationships in the data, a straightforward strategy is to use nonlinear transformations of the predictors. In SLR, these transformations can be *log(x)*, *sqrt(x)*, *exp(x)*, *reciprocal*, and so on.
- It is critical that each regressor have a linear relationship with the target variable. Transforming the dependent variable is one method for addressing the non-linearity issue.

**In short, usually**:

- Transforming the y-values helps with the error terms and may help with non-linearity.
- Transforming the x-values mostly fixes the non-linearity.
- For further information on data transformation, see https://online.stat.psu.edu/stat462/node/155/.

In our dataset, when we plotted the dependent variable *“Marks”* against *“time of study”* and *“number of courses”*, we observed that Marks has a non-linear relationship with time of study. Hence, we will apply a transformation to the feature **time of study**.
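A minimal sketch of the squaring transformation, using a synthetic stand-in frame with the dataset's assumed column names; the correlation check is a quick numeric proxy for the scatter plots described below:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the dataset's column names
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time_study": rng.uniform(0, 8, 200),
    "number_courses": rng.integers(3, 9, 200),
})
df["Marks"] = 0.9 * df["time_study"] ** 2 + 1.5 * df["number_courses"]

# Square the nonlinear predictor so its relationship with Marks becomes linear
df["time_study_squared"] = df["time_study"] ** 2

corr_raw = df["Marks"].corr(df["time_study"])
corr_sq = df["Marks"].corr(df["time_study_squared"])
print(corr_raw, corr_sq)  # the squared feature correlates more strongly
```

The Pearson correlation against the squared feature is noticeably closer to 1, which is exactly what a straightened-out scatter plot would show.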

After applying the above transformation, we can plot *Marks* against the new feature *time_study_squared* to see whether the relationship has become linear.

Our dataset is now ready for building an SLR model. On this transformed dataset, we will now create a simple linear regression model with the sklearn LinearRegression() method. Printing the metrics after building the model gives the following result:

*R2-Square Value: 0.9996*
*RSS: 7.083*
*MSE: 0.071*
*RMSE: 0.266*

A significant improvement over the SLR model built on the raw dataset (without any data transformation): we got an R2-Square value of 99.9% as opposed to 94%. Now we’ll validate the various assumptions of an SLR model to see if it’s a good fit.

So in this section, we transformed the data itself. Knowing that the feature *time_study* is not linearly related to *Marks*, we created a new feature called **time_study_squared**, which is linearly related to *Marks*. Then we built an SLR model again and validated all of its assumptions, and observed that the new model satisfies them. Now it’s time to explore our next and last technique for building a different model on the same dataset.

For a non-linear regression problem, we can try *SVR()*, *KNeighborsRegressor()*, or *DecisionTreeRegressor()* from the sklearn library and compare model performance. Here, we will develop our non-linear model using sklearn’s **SVR()** for demonstration purposes. SVR supports a variety of **kernels**; kernels enable the otherwise linear SVM model to capture non-linear relationships in the data. We will test three alternative kernels with the SVR algorithm and observe how they affect model accuracy:

- rbf (default kernel for SVR)
- linear
- poly
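The three kernels can be compared in one loop. This is a sketch on synthetic stand-in data; the choice of `C=100` is an illustrative regularization setting I've assumed, not a value from the article:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the student-marks data
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 8, 200), rng.integers(3, 9, 200)])
y = 0.9 * X[:, 0] ** 2 + 1.5 * X[:, 1] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit one SVR per kernel and record the test R2 for each
scores = {}
for kernel in ["rbf", "linear", "poly"]:
    svr = SVR(kernel=kernel, C=100).fit(X_train, y_train)
    scores[kernel] = r2_score(y_test, svr.predict(X_test))
print(scores)
```

In practice, SVR is sensitive to feature scaling, so wrapping it in a pipeline with `StandardScaler` is usually advisable before reading too much into any single kernel's score.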

**i. SVR() using the rbf kernel**

And here are the model metrics: still a better R2-squared compared to our first SLR model.

*R2-Square Value: 0.9982*
*RSS: 4053558.081*
*MSE: 0.363*
*RMSE: 0.602*

A quick check on the error term distribution also seems to be OK.

**ii. SVR() using linear kernel**

And here are the model metrics when we used the linear kernel: the R2-squared value **dropped to ~93%**.

*R2-Square Value: 0.9350*
*RSS: 4063556.3*
*MSE: 13.201*
*RMSE: 3.633*

In this case too, the error terms appear to follow a near-normal distribution:

**iii. SVR() using the poly kernel**

And here are the model metrics with the SVR poly kernel: the R2-squared value is 97%, which is higher than with the linear kernel but lower than with the rbf kernel.

*R2-Square Value: 0.9798*
*RSS: 4000635.359*
*MSE: 4.087*
*RMSE: 2.022*

And here is the distribution of the error terms:

So, in this section, we created non-linear models using the sklearn **SVR** model with 3 different kernels. We got the best R2-squared value with the rbf kernel.

- r2-score with rbf kernel = 99.82%
- r2-score with linear kernel = 93.50%
- r2-score with poly kernel = 97.98%

In this post, we started with a dataset whose features were not linearly related to the target variable. Before investigating alternative strategies for building a regression model on a non-linear dataset, we constructed a simple linear regression model with an r2-score of 94%. We then investigated three distinct methods for modelling a nonlinear dataset: polynomial regression, data transformations, and a nonlinear regression model (SVR). We found that polynomial degrees of 2 and higher resulted in a 99.9% r2-score, whereas SVR with an rbf kernel resulted in a 99.82% r2-score. In general, whenever we have a nonlinear dataset, we should experiment with several strategies and see which works best.

*Find the data set and code here:* https://github.com/kg-shambhu/Non-Linear-Regression-Model

*You can contact me on LinkedIn:* https://www.linkedin.com/in/shambhukgupta/
