Previous << Train and Evaluate Regression Models (2/4)
In the previous notebook, we used simple regression models to look at the relationship between the features of a bike rentals dataset and the number of rentals. In this notebook, we'll experiment with more complex models to improve our regression performance.
Let’s start by loading the bicycle sharing data as a Pandas DataFrame and viewing the first few rows. We’ll also split our data into training and test datasets.
# Import modules we'll need for this notebook
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load the training dataset
#!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']
bike_data[numeric_features + ['rentals']].describe()
print(bike_data.head())
# Separate features and labels
# After separating the dataset, we have numpy arrays named X (containing the features) and y (containing the labels).
X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
   instant    dteday  season  yr  mnth  holiday  weekday  workingday
0        1  1/1/2011       1   0     1        0        6           0
1        2  1/2/2011       1   0     1        0        0           0
2        3  1/3/2011       1   0     1        0        1           1
3        4  1/4/2011       1   0     1        0        2           1
4        5  1/5/2011       1   0     1        0        3           1

   weathersit      temp     atemp       hum  windspeed  rentals  day
0           2  0.344167  0.363625  0.805833   0.160446      331    1
1           2  0.363478  0.353739  0.696087   0.248539      131    2
2           1  0.196364  0.189405  0.437273   0.248309      120    3
3           1  0.200000  0.212122  0.590435   0.160296      108    4
4           1  0.226957  0.229270  0.436957   0.186900       82    5
Training Set: 511 rows
Test Set: 220 rows
Now we have the following four datasets:
- X_train: The feature values we’ll use to train the model
- y_train: The corresponding labels we’ll use to train the model
- X_test: The feature values we’ll use to validate the model
- y_test: The corresponding labels we’ll use to validate the model
Now we’re ready to train a model by fitting a suitable regression algorithm to the training data.
Experiment with Algorithms
The linear regression algorithm we used last time to train the model has some predictive capability, but there are many other kinds of regression algorithms we could try, including:
- Linear algorithms: Not just the Linear Regression algorithm we used above (which is technically an Ordinary Least Squares algorithm), but other variants such as Lasso and Ridge.
- Tree-based algorithms: Algorithms that build a decision tree to reach a prediction.
- Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.
Try Another Linear Algorithm
Let’s try training our regression model by using a Lasso algorithm. We can do this by just changing the estimator in the training code.
from sklearn.linear_model import Lasso

# Fit a lasso model on the training set
model = Lasso().fit(X_train, y_train) # Instead of LinearRegression().fit(X_train, y_train)
print(model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Lasso()

MSE: 201155.70593338402
RMSE: 448.5038527519959
R2: 0.605646863782449
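The same pattern works for any other linear estimator. As an illustration only (this sketch is not part of the original notebook's code), here's how you might swap in a Ridge model, which penalizes the squared size of the coefficients rather than their absolute size, and evaluate it the same way:

# Illustrative sketch (not in the original notebook): try Ridge, another
# regularized linear variant, using the same training and test split.
from sklearn.linear_model import Ridge

ridge_model = Ridge().fit(X_train, y_train)   # default regularization strength alpha=1.0
ridge_predictions = ridge_model.predict(X_test)
print("Ridge RMSE:", np.sqrt(mean_squared_error(y_test, ridge_predictions)))
print("Ridge R2:", r2_score(y_test, ridge_predictions))

You could plot its predictions against the actual labels exactly as we did for the Lasso model above.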
Try a Decision Tree Algorithm
As an alternative to a linear model, there's a category of machine learning algorithms that takes a tree-based approach: the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf nodes with the predicted label value.
It’s easiest to see how this works with an example. Let’s train a Decision Tree regression model using the bike rental data. After training the model, the following code will print the model definition and a text representation of the tree it uses to predict label values.
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

# Train the model
model = DecisionTreeRegressor().fit(X_train, y_train)
print(model, "\n")
# Visualize the model tree
tree = export_text(model)
print(tree)
DecisionTreeRegressor()

|--- feature_6 <= 0.45
| |--- feature_4 <= 0.50
| | |--- feature_7 <= 0.32
| | | |--- feature_8 <= 0.41
| | | | |--- feature_1 <= 2.50
| | | | | |--- feature_6 <= 0.29
| | | | | | |--- feature_6 <= 0.28
| | | | | | | |--- value: [515.00]
| | | | | | |--- feature_6 > 0.28
| | | | | | | |--- value: [558.00]
| | | | | |--- feature_6 > 0.29
| | | | | | |--- value: [317.00]
| | | | |--- feature_1 > 2.50
| | | | | |--- feature_9 <= 0.28
| | | | | | |--- feature_9 <= 0.22
| | | | | | | |--- value: [981.00]
| | | | | | |--- feature_9 > 0.22
| | | | | | | |--- value: [968.00]
| | | | | |--- feature_9 > 0.28
| | | | | | |--- feature_6 <= 0.30
| | | | | | | |--- value: [532.00]
| | | | | | |--- feature_6 > 0.30
| | | | | | | |--- value: [710.00]
| | | |--- feature_8 > 0.41
| | | | |--- feature_7 <= 0.25
| | | | | |--- feature_7 <= 0.18
| | | | | | |--- feature_8 <= 0.43
| | | | | | | |--- value: [284.00]
| | | | | | |--- feature_8 > 0.43
| | | | | | | |--- feature_7 <= 0.10
| | | | | | | | |--- value: [150.00]
| | | | | | | |--- feature_7 > 0.10
| | | | | | | | |--- feature_6 <= 0.17
| | | | | | | | | |--- feature_9 <= 0.34
| | | | | | | | | | |--- feature_6 <= 0.17
| | | | | | | | | | | |--- value: [68.00]
| | | | | | | | | | |--- feature_6 > 0.17
| | | | | | | | | | | |--- value: [67.00]
| | | | | | | | | |--- feature_9 > 0.34
| | | | | | | | | | |--- value: [73.00]
| | | | | | | | |--- feature_6 > 0.17
| | | | | | | | | |--- value: [117.00]
| | | | | |--- feature_7 > 0.18
| | | | | | |--- feature_9 <= 0.17
| | | | | | | |--- feature_7 <= 0.23
| | | | | | | | |--- value: [123.00]
| | | | | | | |--- feature_7 > 0.23
| | | | | | | | |--- value: [140.00]
| | | | | | |--- feature_9 > 0.17
| | | | | | | |--- feature_6 <= 0.19
| | | | | | | | |--- value: [333.00]
| | | | | | | |--- feature_6 > 0.19
| | | | | | | | |--- feature_8 <= 0.53
| | | | | | | | | |--- feature_3 <= 0.50
| | | | | | | | | | |--- value: [251.00]
| | | | | | | | | |--- feature_3 > 0.50
| | | | | | | | | | |--- feature_7 <= 0.21
| | | | | | | | | | | |--- value: [217.00]
| | | | | | | | | | |--- feature_7 > 0.21
| | | | | | | | | | | |--- value: [205.00]
| | | | | | | | |--- feature_8 > 0.53
| | | | | | | | | |--- feature_8 <= 0.55
| | | | | | | | | | |--- value: [288.00]
| | | | | | | | | |--- feature_8 > 0.55
| | | | | | | | | | |--- value: [275.00]
| | | | |--- feature_7 > 0.25
| | | | | |--- feature_9 <= 0.11
| | | | | | |--- value: [706.00]
| | | | | |--- feature_9 > 0.11
| | | | | | |--- feature_8 <= 0.54
| | | | | | | |--- feature_5 <= 1.50
| | | | | | | | |--- feature_7 <= 0.26
| | | | | | | | | |--- value: [309.00]
| | | | | | | | |--- feature_7 > 0.26
| | | | | | | | | |--- feature_0 <= 2.50
| | | | | | | | | | |--- feature_9 <= 0.16
| | | | | | | | | | | |--- value: [408.00]
| | | | | | | | | | |--- feature_9 > 0.16
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- feature_0 > 2.50
| | | | | | | | | | |--- feature_3 <= 5.50
| | | | | | | | | | | |--- value: [440.00]
| | | | | | | | | | |--- feature_3 > 5.50
| | | | | | | | | | | |--- value: [502.00]
| | | | | | | |--- feature_5 > 1.50
| | | | | | | | |--- value: [618.00]
| | | | | | |--- feature_8 > 0.54
| | | | | | | |--- feature_9 <= 0.18
| | | | | | | | |--- feature_7 <= 0.28
| | | | | | | | | |--- value: [318.00]
| | | | | | | | |--- feature_7 > 0.28
| | | | | | | | | |--- value: [354.00]
| | | | | | | |--- feature_9 > 0.18
...
| | | | | | | | |--- value: [204.00]
| | | | | | | |--- feature_8 > 0.90
| | | | | | | | |--- value: [217.00]
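The full tree printed above is very deep, which makes the printout hard to read. If you just want to inspect the tree's structure, one option (a sketch, not part of the original notebook) is to train a separate, depth-limited tree purely for display; the deep tree we evaluate below is left untouched:

# Optional sketch (not in the original notebook): train a separate shallow tree
# purely so export_text produces a short, readable printout. The deep tree in
# `model` is unchanged and is the one evaluated below.
shallow_tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
print(export_text(shallow_tree))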
So now we have a tree-based model, but is it any good?
Let’s evaluate it with the test data.
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
MSE: 227066.89545454545
RMSE: 476.5153674904362
R2: 0.5548496030068522
Try Ensemble Algorithm
Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of models that build on one another to improve predictive performance (referred to as boosting).
For example, let’s try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model.
from sklearn.ensemble import RandomForestRegressor

# Train the model
model = RandomForestRegressor().fit(X_train, y_train)
print(model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
RandomForestRegressor()

MSE: 111100.7554109091
RMSE: 333.3177994210767
R2: 0.7821939421050257
For good measure, let's also try a boosting ensemble algorithm. We'll use a Gradient Boosting estimator, which, like a Random Forest algorithm, builds multiple trees; but instead of building them all independently and averaging the results, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.
# Train the model
from sklearn.ensemble import GradientBoostingRegressor

# Fit a Gradient Boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print(model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
GradientBoostingRegressor()

MSE: 104431.1535934313
RMSE: 323.1580938077078
R2: 0.7952692778596857
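To recap, here's an optional sketch (not part of the original notebook) that refits each estimator we tried on the same split and prints its RMSE and R2, so the results can be compared side by side. It relies on the imports from the earlier cells, and the exact numbers will vary slightly for the tree-based models because of their built-in randomness.

# Recap sketch (not in the original notebook): compare the estimators we tried.
# All of these classes were imported in earlier cells.
estimators = [LinearRegression(), Lasso(), DecisionTreeRegressor(),
              RandomForestRegressor(), GradientBoostingRegressor()]
for estimator in estimators:
    fitted = estimator.fit(X_train, y_train)
    preds = fitted.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print('%s: RMSE %.2f, R2 %.3f' % (type(fitted).__name__, rmse, r2_score(y_test, preds)))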
Here, we’ve tried a number of new regression algorithms to improve performance. In our next notebook, we’ll look at tuning these algorithms to improve performance.
Happy learning!
Next >> Train and Evaluate Regression Models (4/4)