Would you like to take your data science skills to the next level? Are you interested in improving the accuracy of your models and making more informed decisions based on your data? Then it’s time to explore the world of bagging and boosting. With these powerful techniques, you can improve the performance of your models, reduce errors and make more accurate predictions.
Whether you are working on a classification problem, a regression analysis, or another data science project, bagging and boosting algorithms can play a crucial role. In this article, we (1) summarize the main idea of ensemble learning, introduce both (2) bagging and (3) boosting, and finally (4) compare the two methods to highlight similarities and differences.
So let’s get ready for bagging and boosting to succeed!
So when should we use it? Clearly, when we see overfitting or underfitting in our models. Let’s begin with the key concept behind bagging and boosting, which both belong to the family of ensemble learning techniques:
The main idea behind ensemble learning is to use multiple algorithms and models together for the same task. While a single model relies on only one algorithm to make predictions, bagging and boosting combine several models to achieve better and more consistent predictions than the individual learners.
Example: Image classification
The core idea is easiest to see in a toy example on image classification. Suppose we have a collection of images, each labeled with the kind of animal it shows, that we can use to train a model. In a traditional modeling approach, we would try several techniques and compare their accuracy to choose one over the others. Imagine we used logistic regression, a decision tree, and a support vector machine, which perform differently on the given data set.
In this example, a specific record is predicted as a dog by the logistic regression and decision tree models, while the support vector machine identifies it as a cat. Since each model has its own strengths and weaknesses on particular records, the key idea of ensemble learning is to combine all three models instead of selecting only the one with the highest accuracy.
The procedure is called aggregation or voting. It combines the predictions of all underlying models into one prediction that is assumed to be more precise than that of any sub-model standing alone.
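The voting step for the record above can be sketched in a few lines of plain Python. The dictionary of predictions is hypothetical and simply mirrors the dog/dog/cat situation described in the text; `majority_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

# Hypothetical predictions of the three models for one image
predictions = {
    "logistic_regression": "dog",
    "decision_tree": "dog",
    "support_vector_machine": "cat",
}

def majority_vote(labels):
    """Return the label that occurs most often among the given predictions."""
    return Counter(labels).most_common(1)[0][0]

ensemble_prediction = majority_vote(predictions.values())
print(ensemble_prediction)  # dog
```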
The next chart might be familiar to some of you, but it illustrates quite well the relationship and the tradeoff between bias and variance with respect to the test error rate.
The relationship between the variance and bias of a model is such that reducing one typically increases the other. To achieve optimal performance, the model should sit at the balance point where the test error rate is minimized and variance and bias are appropriately balanced.
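For squared-error loss, this tradeoff is commonly written as a decomposition of the expected test error at a point x. The symbols are introduced here for illustration and do not appear in the original text: f is the true function, \hat{f} the fitted model, and \sigma^2 the variance of the irreducible noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Only the first two terms depend on the model, which is why lowering one of them at the cost of the other can still reduce the total error.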
Ensemble learning can help to balance both extreme cases to a more stable prediction. One method is called bagging and the other is called boosting.
Let us focus first on bagging, which is short for bootstrap aggregation. Bootstrap aggregation addresses the right extreme of the previous chart by reducing the variance of the model to avoid overfitting.
For this purpose, the idea is to have multiple models of the same learning algorithm that are trained on random subsets of the original training data. Those random subsets are called bags and can contain any combination of the data (they are typically drawn with replacement, so a record may appear more than once in a bag). Each of those datasets is then used to fit an individual model, which produces individual predictions for the given data. Those predictions are then aggregated into one final prediction. The idea of this method is really close to our initial toy example with the cats and dogs.
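Drawing the bags can be sketched as follows. This assumes the usual bootstrap convention of sampling with replacement and bags of the same size as the training set; `draw_bags` and the ten-record data set are made up for illustration.

```python
import random

def draw_bags(data, n_bags, seed=0):
    """Draw n_bags bootstrap samples: same size as data, sampled with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_bags)]

training_data = list(range(10))  # stand-in for ten training records
bags = draw_bags(training_data, n_bags=3)

for bag in bags:
    # A bag usually repeats some records and leaves others out
    print(sorted(bag), "unique records:", len(set(bag)))
```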
Because each model is trained on a different random subset, averaging the results of the sub-models smooths out their individual errors and reduces the risk of overfitting. All models can be trained in parallel and are aggregated afterward.
The final ensemble aggregation uses either a simple average for regression problems or a simple majority vote for classification problems. For that, each model trained on its random sample produces a prediction for the given record. For the average, those predictions are summed up and divided by the number of created bags.
Simple majority voting works similarly but uses the predicted classes instead of numeric values. The algorithm identifies the class with the most predictions and takes this majority as the final aggregation. This is again very similar to our toy example, where two out of three algorithms predicted a picture to be a dog and the final aggregation was therefore a dog prediction.
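Returning to the regression case, the whole procedure (draw bags, fit one model per bag, average the predictions) can be sketched in plain Python. The base learner here is a 1-nearest-neighbour regressor, chosen only because it is short and has high variance; the toy data and the names `fit_1nn` and `bagged_predict` are assumptions of this sketch.

```python
import random
from statistics import mean

def fit_1nn(sample):
    """Return a 1-nearest-neighbour regressor fitted on (x, y) pairs."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def bagged_predict(train, x, n_bags=25, seed=0):
    """Average the predictions of models trained on bootstrap samples."""
    rng = random.Random(seed)
    models = [fit_1nn(rng.choices(train, k=len(train))) for _ in range(n_bags)]
    return mean(model(x) for model in models)

# Noisy observations of y = 2x (toy data, made up for this sketch)
noise = random.Random(42)
train = [(x, 2 * x + noise.gauss(0, 1)) for x in range(20)]

single = fit_1nn(train)(10.4)
bagged = bagged_predict(train, 10.4)
print(f"single model: {single:.2f}, bagged average: {bagged:.2f}")
```

The single model returns the noisy target of one neighbour, while the bagged prediction averages over the neighbours that happen to be present in each bag.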
A famous extension of the bagging method is the random forest algorithm, which builds on the idea of bagging but also uses random subsets of the features, not only subsets of the entries. Plain bagging, on the other hand, takes all given features into account.
Code example for bagging
In the following, we will explore some useful Python functions from the sklearn.ensemble library. The class BaggingClassifier has a few parameters, which can be looked up in the documentation, but the most important ones are base_estimator (renamed to estimator in scikit-learn 1.2), n_estimators, and max_samples.
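from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
# X_train, y_train and X_test are assumed to be defined already
# define base estimator
est = LogisticRegression()  # or est = SVC() or est = DecisionTreeClassifier()
# n_estimators defines the number of base estimators in the ensemble
# max_samples defines the number (or fraction) of samples to draw from X to train each base estimator
# the base estimator is passed positionally: its keyword is `estimator` in recent
# scikit-learn versions and `base_estimator` before version 1.2
bag_model = BaggingClassifier(est, n_estimators=10, max_samples=1.0)
bag_model = bag_model.fit(X_train, y_train)
prediction = bag_model.predict(X_test)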
- base_estimator: The underlying algorithm that is fitted on each random subset in the bagging procedure, provided as the first parameter (named estimator since scikit-learn 1.2). This could be, for example, logistic regression, support vector classification, decision trees, or many more.
- n_estimators: The number of estimators defines the number of bags you would like to create; the default value is 10.
- max_samples: The maximum number of samples defines how many samples should be drawn from X to train each base estimator. The default value is 1.0, which means that as many samples are drawn as there are entries in the training set. You could also use only 80% of the entries by setting it to 0.8.
After setting the scene, this model object works like many other models and can be trained using the fit() procedure with the X and y data from the training set. The corresponding predictions on test data can be made using predict().
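For readers who want to run the bagging example end to end, the following is a minimal sketch under a few assumptions: the data is synthetic (generated with make_classification), a decision tree is used as the base estimator, and max_samples is set to 0.8 as mentioned in the parameter description. The base estimator is passed positionally so that the call does not depend on the keyword name used by the installed scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in for a real data set (an assumption of this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Ten bags, each drawn with 80% of the training set size
bag_model = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, max_samples=0.8, random_state=0
)
bag_model.fit(X_train, y_train)

prediction = bag_model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, prediction))
```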
Boosting is a variation of the bagging idea that uses sequential processing instead of parallel calculations. While bagging aims to reduce the variance of the model, boosting aims to reduce the bias to avoid underfitting the data. With that idea in mind, boosting also starts from a random subset of the data and trains an average-performing (weak) model on it.
It then uses the misclassified entries of the weak model, together with some other random data, to create a new model. Therefore, the different models are not chosen randomly but are mainly influenced by the incorrectly classified entries of the previous model. The steps for this technique are the following:
- Train initial (weak) model
You create a subset of the data and train a weak learning model which is assumed to be the final ensemble model at this stage. You then analyze the results on the given training data set and can identify those entries that were misclassified.
- Update weights and train a new model
You create a new random subset of the original training data but weight those misclassified entries higher. This dataset is then used to train a new model.
- Aggregate the new model with the ensemble model
The next model should perform better on the more difficult entries and will be combined (aggregated) with the previous one into the new final ensemble model.
Essentially, we can repeat this process multiple times and continuously update the ensemble model until our prediction power is good enough. The key idea here is clearly to create models that are also able to predict the more difficult data entries. This can lead to a better fit of the model and reduce the bias.
In contrast to bagging, this technique uses weighted voting or weighted averaging: each model's prediction is multiplied by a coefficient that reflects how well the model performed. Therefore, boosting can reduce underfitting, but it might also tend to overfit sometimes.
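The steps above can be sketched with a small AdaBoost-style loop in plain Python. This is one concrete boosting variant, not the only one: it reweights all entries instead of drawing a new random subset, uses one-dimensional decision stumps as weak models, and expects labels of +1 and -1. The toy data and the helper names are made up for illustration.

```python
import math

def fit_stump(xs, ys, weights):
    """Find the threshold/polarity pair with the lowest weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds=3):
    """Train a weighted ensemble of stumps; labels must be +1 or -1."""
    weights = [1 / len(xs)] * len(xs)
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # better models get more say
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified entries, decrease the others
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, 1, 1, 1]

ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each stump on its own misclassifies at least two of the eight entries, while the weighted vote of three stumps fits this toy data.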
Code example for boosting
In the following, we will look at a similar code example, but for boosting. There are multiple boosting algorithms; besides gradient boosting, AdaBoost is one of the most popular.
- base_estimator: Similar to bagging, you need to define which underlying algorithm you would like to use (the parameter is named estimator since scikit-learn 1.2).
- n_estimators: The number of estimators defines the maximum number of iterations at which the boosting is terminated. It is called the “maximum” number because the algorithm stops on its own if a perfect fit is achieved earlier.
- learning_rate: Finally, the learning rate controls how much each new model contributes to the ensemble. Normally there is a trade-off between the number of iterations and the value of the learning rate. In other words: when choosing smaller values of the learning rate, you should consider more estimators, so that the ensemble of weak classifiers continues to improve.
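from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
# X_train, y_train and X_test are assumed to be defined already
# define base estimator (requires support for sample weighting)
est = LogisticRegression()  # or est = SVC(probability=True) or est = DecisionTreeClassifier(max_depth=1)
# n_estimators defines maximum number of estimators at which boosting is terminated
# learning_rate defines the weight applied to each classifier at each boosting iteration
# the base estimator is passed positionally: its keyword is `estimator` in recent
# scikit-learn versions and `base_estimator` before version 1.2
boost_model = AdaBoostClassifier(est, n_estimators=10, learning_rate=1.0)
boost_model = boost_model.fit(X_train, y_train)
prediction = boost_model.predict(X_test)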
The fit() and predict() procedures work similarly to the previous bagging example. As you can see, it is easy to use such functions from existing libraries. But of course, you can also implement your own algorithms to build both techniques.
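To try the trade-off between n_estimators and learning_rate, the following self-contained sketch trains AdaBoost twice on synthetic data: once with few estimators and a large learning rate, once with more estimators and a smaller learning rate. The data set, the decision stump as base estimator, and the two parameter pairs are assumptions chosen for illustration; which setting scores better depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in for a real data set (an assumption of this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stump = DecisionTreeClassifier(max_depth=1)  # a typical weak learner
scores = {}
for n_estimators, learning_rate in [(10, 1.0), (100, 0.1)]:
    boost_model = AdaBoostClassifier(
        stump, n_estimators=n_estimators, learning_rate=learning_rate, random_state=0
    )
    boost_model.fit(X_train, y_train)
    prediction = boost_model.predict(X_test)
    scores[(n_estimators, learning_rate)] = accuracy_score(y_test, prediction)

for setting, score in scores.items():
    print(setting, "test accuracy:", score)
```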
Now that we have briefly seen how bagging and boosting work, I would like to focus on comparing both methods with each other.
- Ensemble methods
In a general view, the similarities between both techniques start with the fact that both are ensemble methods with the aim to use multiple learners over a single model to achieve better results.
- Multiple samples & aggregation
To do that, both methods generate random samples and multiple training data sets. Bagging and boosting are also similar in that both arrive at the final decision by aggregating the underlying models: either by averaging the results or by taking a vote.
Finally, both aim to produce higher stability and better predictions for the data.
- Data partition | whole data vs. bias
While bagging draws random bags from the training data for all models independently, boosting gives higher importance to the data misclassified by previous models when training the upcoming ones. Therefore, the data partition is different here.
- Models | independent vs. sequences
Bagging creates independent models that are aggregated together. However, boosting updates the existing model with the new ones in a sequence. Therefore, the models are affected by previous builds.
- Goal | variance vs. bias
Another difference is the fact that bagging aims to reduce the variance, but boosting tries to reduce the bias. Therefore, bagging can help to decrease overfitting, and boosting can reduce underfitting.
- Function | weighted vs. non-weighted
Within the bagging technique, the final function to predict the outcome uses an equally weighted average or an equally weighted vote. Boosting uses a weighted majority vote or a weighted average, giving more weight to the models with better performance on the training data.
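The difference between the two aggregation functions can be sketched with the dog/cat example. The weights below are hypothetical and only chosen so that the two schemes disagree; `weighted_vote` is an illustrative helper, not a library function.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

predictions = ["dog", "dog", "cat"]

# Bagging-style aggregation: every model has the same say
equal = weighted_vote(predictions, [1.0, 1.0, 1.0])

# Boosting-style aggregation: hypothetical weights favouring the third model
weighted = weighted_vote(predictions, [0.3, 0.4, 1.2])

print(equal, weighted)  # dog cat
```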
It was shown that the main idea of both methods is to use multiple models together to achieve better predictions compared to single learning models. However, there is no general rule for choosing between bagging and boosting, since both have advantages and disadvantages.
While bagging decreases the variance and reduces overfitting, it will only rarely improve the bias. Boosting, on the other hand, decreases the bias but might overfit more than bagged models.
Coming back to the variance-bias tradeoff figure, I tried to visualize the extreme cases when each method seems appropriate. However, this does not mean that they achieve the results without any drawbacks. The aim should always be to keep bias and variance in a reasonable balance.
Bagging and boosting both use all given features and select only the entries randomly. Random forest, on the other hand, is an extension of bagging that also creates random subsets of the features. Therefore, random forest is used more often in practice than plain bagging.
Bühlmann, Peter. (2012). Bagging, Boosting and Ensemble Methods. Handbook of Computational Statistics. 10.1007/978-3-642-21551-3_33.
Machova, Kristina & Puszta, Miroslav & Barcák, Frantisek & Bednár, Peter. (2006). A comparison of the bagging and the boosting methods using the decision trees classifiers. Comput. Sci. Inf. Syst. 3. 57–72. 10.2298/CSIS0602057M.
Banerjee, Prashant. Bagging vs Boosting @kaggle: https://www.kaggle.com/prashant111/bagging-vs-boosting