Linear regression with sklearn
Linear regression fits a linear equation to a dataset. For example, an apartment's square footage and its rent price are often linearly correlated.
After I show you how linear regression works in sklearn, we'll build a project that runs linear regression on a .csv file. The example data for this project will be the square footage of an apartment compared to the rent price.
Following simple linear regression, we’ll look at multiple linear regression, which is linear regression with multiple independent variables.
The numpy, pandas, sklearn, and matplotlib libraries must be installed before we can proceed. I use pip, but if you are using Anaconda Python, you can use conda instead.
As a first step, we’ll import the necessary libraries.
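A minimal import block might look like the following (the exact set of imports is an assumption based on what the rest of this module describes):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```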
Analyzing a randomized dataset with linear regression
First we'll generate x values, then y values from the function y = 2x with random noise of +/- 0.1 added to each point.
Note that the x values are arranged into a vertical (column) array. This is necessary because sklearn's LinearRegression expects its features as a 2D array of shape (n_samples, n_features).
Now let's take a look at our x and y values.
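A sketch of the data generation described above; the random seed, the x range, and the number of points (100) are assumptions, not from the original:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # seeded for reproducibility (assumption)

# 100 evenly spaced x values, reshaped into a vertical (column) array
# because sklearn expects features with shape (n_samples, n_features)
x = np.linspace(0, 1, 100).reshape(-1, 1)

# y = 2x with uniform random noise of +/- 0.1 added to each point
y = 2 * x.ravel() + rng.uniform(-0.1, 0.1, size=100)

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```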
We'll fit a model to our points using LinearRegression(), then plot the generated line against the original points.
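A sketch of the fit-and-plot step, regenerating the same noisy y = 2x data (seed and ranges are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Regenerate the noisy y = 2x data
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * x.ravel() + rng.uniform(-0.1, 0.1, size=100)

model = LinearRegression().fit(x, y)  # fit the model to our points
line = model.predict(x)               # the generated line

plt.scatter(x, y, label="original points")
plt.plot(x, line, color="red", label="fitted line")
plt.legend()
plt.show()
```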
It looks like our model fits really well. Let's check its coefficient and intercept to confirm: we should see a coefficient near 2 and an intercept close to 0.
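Checking the fitted parameters might look like this (data generation repeated here so the sketch is self-contained; seed is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * x.ravel() + rng.uniform(-0.1, 0.1, size=100)
model = LinearRegression().fit(x, y)

print("coefficient:", model.coef_[0])  # should be near 2
print("intercept:", model.intercept_)  # should be close to 0
```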
Let's check the average deviation per prediction now that the linear regression shows values close to those we expected. This can be accomplished by taking the square root of the Mean Squared Error (MSE); the MSE already averages the squared errors over the number of entries, so its square root gives the average error per prediction (the Root Mean Squared Error, or RMSE).
I’m going to define the mse and average error as functions because we’ll be checking them again later.
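A sketch of those two helper functions, checked against the noisy y = 2x model from above (the function names and test data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def mse(y_true, y_pred):
    """Mean squared error of the predictions."""
    return mean_squared_error(y_true, y_pred)

def average_error(y_true, y_pred):
    """Root mean squared error: the average deviation per prediction."""
    return np.sqrt(mse(y_true, y_pred))

# Try the helpers on the noisy y = 2x model
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100).reshape(-1, 1)
y = 2 * x.ravel() + rng.uniform(-0.1, 0.1, size=100)
model = LinearRegression().fit(x, y)
print("average error:", average_error(y, model.predict(x)))
```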
Our average error is also less than 0.1. Since the random noise we added to the function was +/- 0.1, this confirms that our linear regression model yields accurate predictions.
Now that we have worked through a small example, let's look at a more realistic application of linear regression. Our first step will be to read in a .csv and create our x and y arrays; then we'll build and examine this new model.
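The original .csv file isn't included here, so this sketch fabricates a stand-in dataset of the same shape; the filename 'rent.csv', the column names 'sqft' and 'price', and the value ranges are all assumptions. With the real file, the pd.read_csv call in the comment would replace the fabricated frame:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fabricated stand-in for the apartment data (assumed columns: sqft, price)
rng = np.random.default_rng(0)
sqft = rng.uniform(300, 1500, size=200)
price = 3.66 * sqft + rng.uniform(-100, 100, size=200)
df = pd.DataFrame({"sqft": sqft, "price": price})
# df = pd.read_csv("rent.csv")  # hypothetical real data file

x = df[["sqft"]].values  # 2D column array, as LinearRegression requires
y = df["price"].values
model = LinearRegression().fit(x, y)
```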
See how our line looks plotted against the original points, and check the average error per entry (it should be below 100).
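Plotting the fit and checking the average error might look like this, again using fabricated stand-in data since the original .csv isn't included (ranges and noise level are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fabricated stand-in for the apartment rent data
rng = np.random.default_rng(0)
sqft = rng.uniform(300, 1500, size=200)
price = 3.66 * sqft + rng.uniform(-100, 100, size=200)

x = sqft.reshape(-1, 1)
y = price
model = LinearRegression().fit(x, y)

order = np.argsort(x.ravel())  # sort so the line draws cleanly
plt.scatter(x, y, label="listings")
plt.plot(x[order], model.predict(x)[order], color="red", label="fitted line")
plt.xlabel("square footage")
plt.ylabel("rent price ($)")
plt.legend()
plt.show()

# Average error per entry: root mean squared error
avg_error = np.sqrt(np.mean((y - model.predict(x)) ** 2))
print("average error per entry:", avg_error)
```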
Cool! As expected, the model's average price error is within $100. The coefficient of 3.66 means rent in Seattle is very expensive, at $3.66 per square foot. Since we've assumed apartment prices vary directly with square footage, we expect an intercept near 0, and 1.32 is about right.
Next we'll fit a linear model that predicts the weight of hardwood trees in tons from their height and radius.
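Since 'trees.csv' isn't included here, this sketch fabricates stand-in data (the column names, units, and value ranges are assumptions; the weight formula mimics the roughly radius²-times-height scaling of real trees, which is why the data curves rather than lying on a plane):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fabricated stand-in for 'trees.csv' (assumed columns: height, radius, weight)
rng = np.random.default_rng(1)
height = rng.uniform(10, 30, size=300)    # metres (assumption)
radius = rng.uniform(0.2, 0.8, size=300)  # metres (assumption)
weight = 0.9 * np.pi * radius**2 * height + rng.normal(0, 0.5, size=300)  # tons
df = pd.DataFrame({"height": height, "radius": radius, "weight": weight})
# df = pd.read_csv("trees.csv")  # the real data

X = df[["height", "radius"]].values  # two independent variables
y = df["weight"].values
model = LinearRegression().fit(X, y)
```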
Now let's plot the model (in 3D!) and see how well it fits.
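A 3D scatter of actual versus predicted weights might look like this, using the same fabricated stand-in for the tree data (data generation is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fabricated stand-in for the tree data
rng = np.random.default_rng(1)
height = rng.uniform(10, 30, size=300)
radius = rng.uniform(0.2, 0.8, size=300)
weight = 0.9 * np.pi * radius**2 * height + rng.normal(0, 0.5, size=300)

X = np.column_stack([height, radius])
model = LinearRegression().fit(X, weight)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(height, radius, weight, label="actual")
ax.scatter(height, radius, model.predict(X), label="predicted")
ax.set_xlabel("height")
ax.set_ylabel("radius")
ax.set_zlabel("weight (tons)")
ax.legend()
plt.show()
```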
We can see visually from the plot that the points of our linear regression model predict the actual values quite well. Now we can confirm this using sklearn's .score() function.
This is a really good R² value: the closer R² is to 1, the more accurate the linear model. We can also examine the coefficients and the intercept.
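Scoring the model and inspecting its parameters might look like this (again on the fabricated stand-in data, so the printed values will differ from the tutorial's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fabricated stand-in for the tree data
rng = np.random.default_rng(1)
height = rng.uniform(10, 30, size=300)
radius = rng.uniform(0.2, 0.8, size=300)
weight = 0.9 * np.pi * radius**2 * height + rng.normal(0, 0.5, size=300)

X = np.column_stack([height, radius])
model = LinearRegression().fit(X, weight)

r2 = model.score(X, weight)  # R² of the fit; closer to 1 is better
print("R²:", r2)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```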
Based on our model's coefficients and intercept, we see that even though the plane fits well at these values, the model doesn't really make physical sense. An intercept of -3.76 is nonsensical: we are predicting the tree's weight in tons, so a tree with zero height and radius should weigh 0. Examining the 'trees.csv' data by eye, we notice that weight does increase with height and radius, but in the plot the points clearly form some kind of curve. The linear fit only works well here because the values in the data are so large.
That completes the linear regression module. It began with a one-dimensional linear regression example, followed by a larger example read in from a CSV file, and ended with a multiple linear regression example that was verified with a test set. Logistic regression will be covered in the next module.