
## With boosted decision tree algorithms, such as XGBoost, CatBoost, and LightBoost, you may outperform other models, but overfitting is a real danger. Learn how to split the data, optimize hyperparameters, and find the best-performing model without overtraining it using the HGBoost library.

Gradient boosting techniques have gained much popularity in recent years for classification and regression tasks. An important part is the tuning of hyperparameters to gain the best prediction performance. This requires searching across thousands of parameter combinations, which is not only a computationally intensive task but can also quickly lead to overtrained models. As a result, a model may not generalize to new (unseen) data, and the performance accuracy can be worse than expected. Luckily, there are *Bayesian optimization* techniques that can help to optimize the grid search and reduce the computational burden. But there is more to it, because an optimized grid search may still result in overtrained models. Carefully splitting your data into a train set, a test set, and an independent validation set is another important part that should be incorporated when optimizing hyperparameters. *This is where the HGBoost library comes into play!* HGBoost stands for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization for XGBoost, LightBoost, and CatBoost. It carefully splits the dataset into a train set, a test set, and an independent validation set. Within the train-test set, there is an inner loop for optimizing the hyperparameters using Bayesian optimization (with hyperopt) and an outer loop to score how well the top-performing models generalize based on k-fold cross-validation. As such, it makes the best attempt to select the most robust model with the best performance. *In this blog, I will first briefly discuss boosting algorithms and hyperparameter optimization. Then I will jump into how to train a model with optimized hyperparameters. I will demonstrate how to interpret and visually explain the optimized hyperparameter space and how to evaluate the model performance.*

Gradient boosting algorithms such as Extreme Gradient Boosting (*XGBoost*), Light Gradient Boosting (*LightBoost*), and *CatBoost* are powerful ensemble machine learning algorithms for predictive modeling that can be applied to tabular and continuous data, for both classification and regression tasks.

It is not surprising that boosted decision tree algorithms are very popular: these algorithms were involved in more than half of the winning solutions in machine learning challenges hosted at Kaggle [1]. *It is the combination of gradient boosting with decision trees that provides state-of-the-art results in many applications.* It is also *this* combination that makes the difference between *XGBoost, CatBoost, and LightBoost*. The common theme is that each boosting algorithm needs to find the best split for each leaf while minimizing the computational cost. The computational cost is high because the model needs to find the exact split for each leaf, which requires iteratively scanning through all the data. This process is challenging to optimize.

Roughly, there are two different strategies to compute the trees: *level-wise* and *leaf-wise*. The *level-wise* strategy grows the tree level by level; each node splits the data, and the nodes closer to the tree root are prioritized. The *leaf-wise* strategy grows the tree by splitting the data at the nodes with the highest change in loss. Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit. However, leaf-wise growth tends to excel on larger datasets, where it is considerably faster than level-wise growth [2]. Let's summarize the *XGBoost, LightBoost, and CatBoost* algorithms in terms of tree splitting and computational costs.
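The difference in growth order can be caricatured in a few lines of plain Python. This is purely a conceptual sketch with made-up gain values, not how any of the three libraries is actually implemented (real boosters derive gains from gradient statistics):

```python
import heapq

def leaf_wise_order(gains, max_splits):
    """Leaf-wise (best-first) growth: always split the candidate leaf
    with the highest loss reduction, wherever it sits in the tree."""
    heap = [(-gain, leaf) for leaf, gain in gains.items()]  # max-heap via negation
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(max_splits, len(heap)))]

def level_wise_order(levels, max_splits):
    """Level-wise growth: split every node of the current depth
    before moving one level deeper."""
    order = []
    for depth in sorted(levels):
        for node in levels[depth]:
            if len(order) == max_splits:
                return order
            order.append(node)
    return order

# Toy example: leaf C sits one level down but has the largest gain.
gains = {"A": 0.2, "B": 0.1, "C": 0.9}
print(leaf_wise_order(gains, 2))                        # → ['C', 'A']
print(level_wise_order({0: ["A"], 1: ["B", "C"]}, 2))   # → ['A', 'B']
```

With the same split budget, the leaf-wise strategy jumps straight to the highest-gain leaf, while the level-wise strategy works through the root level first; this is exactly the prioritization difference described above.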

## XGBoost.

Extreme Gradient Boosting (*XGBoost*) is one of the most popular gradient boosting techniques, in which the boosted model is internally made up of an ensemble of weak decision trees. Originally, *XGBoost* was based on a level-wise growth algorithm, but it has recently added an option for leaf-wise growth that implements split approximation using histograms. **The advantages** are that *XGBoost* is effective in learning, has strong generalization performance, and can capture nonlinear relationships. **The disadvantages** are that the hyperparameter tuning can be complex, that it does not perform well on sparse datasets, and that it can quickly result in high computational costs (both memory-wise and time-wise) since many trees are needed for very large datasets [2].

## LightBoost.

*LightBoost* or *LightGBM* is a gradient boosting framework that uses tree-based learning algorithms in which the trees are split vertically (or leaf-wise). Such an approach can be more efficient in loss reduction compared to a level-wise algorithm when growing the same leaf. In addition, *LightBoost* uses histogram-based algorithms, which bucket continuous feature values into discrete bins. This speeds up training and reduces memory usage. It is designed to be efficient and has many **advantages**, such as fast training, high efficiency, low memory usage, better accuracy, support for parallel and GPU learning, and the capability of handling large-scale data [3]. The computation and memory efficiency make *LightBoost* applicable to large amounts of data. **The disadvantages** are that it is sensitive to overfitting due to leaf-wise splitting, and that it has high complexity due to hyperparameter tuning.

## CatBoost.

*CatBoost* is also a high-performance method for gradient boosting on decision trees. *CatBoost grows a balanced tree using oblivious (or symmetric) trees for faster execution.* This means that, per feature, the values are divided into buckets by creating feature-split pairs, such as (temperature, <0), (temperature, 1–30), (temperature, >30), and so on. In each level of an oblivious tree, the feature-split pair that brings the lowest loss (according to a penalty function) is selected and is used for all the nodes of that level. It is described that *CatBoost* provides great results with default parameters and will therefore require less time on parameter tuning [4]. Another **advantage** is that *CatBoost* allows you to use non-numeric factors, instead of having to pre-process your data or spend time and effort turning it into numbers [4]. Some of the **disadvantages** are: it needs to build deep decision trees to recover dependencies in data in the case of features with high cardinality, and it does not work with missing values.

Each boosted decision tree has its own (dis)advantages and therefore it is important to have a good understanding of the dataset you are working with in combination with the application you aim to develop.

A few words about the definition of **hyperparameters**, and how they differ from *normal parameters*. Overall, we can say that *normal parameters* are optimized by the machine learning model itself, while hyperparameters *are not*. In the case of a regression model, the relationship between the input features and the outcome (or target value) is learned. During the training of the regression model, *the slope of the regression line is optimized by tuning the weights (under the assumption that the relationship with the target value is linear)*. Or in other words, the *model parameters* are learned during the training phase.

> Parameters are optimized by the machine learning model itself, while hyperparameters are outside the training process and require a meta-process for tuning.

**Then there is another set of parameters known as hyperparameters** (also named *nuisance parameters* in statistics). These values must be specified outside of the training procedure and are the input parameters for the model. Some models do not have any hyperparameters, others have a few (two or three or so) that can still be manually evaluated. Then there are models with a lot of hyperparameters. *I mean, really a lot.* Boosted decision tree algorithms, such as *XGBoost, CatBoost, and LightBoost*, are examples that have a lot of hyperparameters; think of the desired depth, the number of leaves in the tree, etc. You could use the default hyperparameters to train a model, but tuning the hyperparameters often has a big impact on the final prediction accuracy of the trained model [8]. In addition, different datasets require different hyperparameters. This is troubling because a model can easily contain tens of hyperparameters, which subsequently can result in (tens of) thousands of hyperparameter combinations that need to be evaluated to determine the model performance. This is called the *search space*. The computational burden (time-wise and memory-wise) can therefore be enormous. Optimizing the *search space* is thus beneficial.
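The parameter/hyperparameter distinction can be made concrete with a minimal sketch in plain Python: the slope `w` below is a *model parameter* that the training loop learns, while the learning rate and the number of epochs are *hyperparameters* chosen outside of it. The function name and values are illustrative only:

```python
def fit_slope(xs, ys, lr=0.01, epochs=500):
    """Learn the slope w of y ~ w*x by gradient descent.

    w is a model parameter: the training loop optimizes it.
    lr and epochs are hyperparameters: they steer the training loop
    itself and must be chosen outside of it."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]          # true slope is 2
w = fit_slope(xs, ys)
```

Changing `lr` or `epochs` does not change *what* is learned, only *how* the learning proceeds; that is precisely why such values need a meta-process (like a grid or Bayesian search) for tuning.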

For supervised machine learning tasks, it is important to split the data into separate parts to avoid overfitting when learning the model. Overfitting is when the model fits (or learns) the data too well and then fails to predict (new) unseen data. The most common manner is to split the dataset into a **train set** and an independent **validation set**. However, when we also perform hyperparameter tuning, such as in boosting algorithms, it also requires a **test set**. The model can now *see* the data, *learn* from the data, and finally, we can *evaluate* the model on unseen data. Such an approach does not only prevent overfitting but also helps to determine the *robustness* of the model, i.e., with a *k*-fold cross-validation approach. Thus, when we need to tune hyperparameters, we should separate the data into three parts, namely a *train set, test set, and validation set*.

- **The train set**: This is the part where the model sees and learns from the data. It typically consists of 70% or 80% of the samples and is used to determine the best fit across the thousands of possible hyperparameters (e.g., using a *k*-fold cross-validation scheme).
- **The test set**: This set typically contains 20% of the samples and can be used to evaluate the model performance, such as for a specific set of hyperparameters.
- **The validation set**: This set also typically contains 20% of the samples in the data, but is kept untouched until the very final model is trained. The model can now be evaluated in an unbiased manner. *It is important to realize that this set can only be used once.* Or in other words, if the model is further optimized after getting insights from the validation set, you need another independent set to determine the final model performance.
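HGBoost performs this split internally; the helper below is only a conceptual stand-in in plain Python (the function name and the 60/20/20 fractions are my own choices for illustration, and in practice you would reach for something like scikit-learn's `train_test_split`):

```python
import random

def train_test_val_split(samples, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle and cut the data into train / test / validation parts.

    The validation part is set aside first and stays untouched until
    the very final model evaluation."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_val = int(len(samples) * val_frac)
    n_test = int(len(samples) * test_frac)
    val = [samples[i] for i in idx[:n_val]]
    test = [samples[i] for i in idx[n_val:n_val + n_test]]
    train = [samples[i] for i in idx[n_val + n_test:]]
    return train, test, val

data = list(range(100))
train, test, val = train_test_val_split(data)
print(len(train), len(test), len(val))  # → 60 20 20
```

The key property is that the three parts are disjoint and together cover the whole dataset, so no sample the model learned from can leak into the final, unbiased evaluation.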

Setting up a nested cross-validation approach can be time-consuming and even a complicated task, but it is very important in case you want to create a robust model and prevent overfitting. *It is a good exercise to build such an approach yourself, but it is also implemented in the HGBoost library* [9].

*Before I discuss the workings of HGBoost, I will first briefly describe the Bayesian approach for large-scale optimization of models with different hyperparameter combinations.*

When using boosted decision tree algorithms, there can easily be tens of input hyperparameters, which can subsequently lead to (tens of) thousands of hyperparameter combinations that need to be evaluated. This is an important task because a specific combination of hyperparameters can result in more accurate predictions for the specific dataset. Although there are many hyperparameters to tune, some are more important than others. Moreover, some hyperparameters have little or no effect on the outcome, but without a smart approach, all combinations of hyperparameters need to be evaluated to find the best-performing model. This makes it a computational burden.

Hyperparameter optimization can make a big difference in the accuracy of a machine learning model.

Searching across combinations of parameters is often performed with *grid searches*. In general, there are two types that can be used for parameter tuning: *grid search* and *random search*. A *grid search* will iterate across the entire search space and is thus very thorough, but also very, very slow. On the other hand, a *random search* is fast as it will randomly sample the search space, and while this approach has been proven to be effective, it can easily miss the most important points in the search space.
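The contrast between the two can be shown with a toy objective in plain Python. The loss function and the parameter grid below are invented for illustration and have nothing to do with a real model:

```python
import itertools
import random

def loss(params):
    """Toy objective: lower is better, with the minimum at lr=0.1, depth=6."""
    return (params["lr"] - 0.1) ** 2 + (params["depth"] - 6) ** 2 / 100

grid = {"lr": [0.01, 0.05, 0.1, 0.3], "depth": [2, 4, 6, 8]}

# Grid search: every combination, exhaustive but slow (4 x 4 = 16 evaluations).
grid_trials = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
best_grid = min(grid_trials, key=loss)

# Random search: a fixed, smaller budget of samples from the same space.
rng = random.Random(0)
rand_trials = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(8)]
best_rand = min(rand_trials, key=loss)
```

Grid search is guaranteed to hit the best grid point but pays for every combination; random search with half the budget is cheaper yet may or may not land on it.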

*Luckily, a third option exists: sequential model-based optimization, also known as Bayesian optimization.* The goal of *Bayesian optimization* for parameter tuning is to determine the best set of hyperparameters within a limited number of iterations. The *Bayesian* optimization technique is an efficient method of function minimization and is implemented in the Hyperopt library. This efficiency makes it appropriate for optimizing the hyperparameters of machine learning algorithms that are slow to train [5]. In-depth details about Hyperopt can be found in this blog [6]. To summarize: it starts by sampling random combinations of parameters and computes the performance using a cross-validation scheme. During each iteration, the sample distribution is updated, and in such a manner it becomes more likely to sample parameter combinations with good performance. A great comparison between the traditional *grid search*, *random search*, and *Hyperopt* can be found in this blog.
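The core idea, sampling more densely where previous trials scored well, can be caricatured in a few lines. To be clear, this is *not* Hyperopt's TPE algorithm; it is a naive sketch of sequential, distribution-updating search over a one-dimensional toy function:

```python
import random

def objective(x):
    """Toy function to minimize; the optimum is at x = 3."""
    return (x - 3.0) ** 2

def adaptive_search(n_iter=60, seed=1):
    """Naive sequential optimization: after each trial, re-center the
    sampling distribution on the best point seen so far and narrow it.
    Hyperopt's TPE is far more principled, but the spirit is the same:
    later samples concentrate where earlier samples scored well."""
    rng = random.Random(seed)
    center, width = 0.0, 10.0
    best_x, best_y, history = None, float("inf"), []
    for _ in range(n_iter):
        x = rng.gauss(center, width)
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y
            center = best_x              # move the distribution toward the best trial
        history.append(best_y)           # best-so-far after each iteration
        width = max(width * 0.95, 0.5)   # gradually narrow the search
    return best_x, best_y, history

best_x, best_y, history = adaptive_search()
```

Unlike grid search, no evaluation budget is wasted on regions that earlier trials already showed to be poor; the best-so-far score can only improve over the iterations.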


*Hyperopt is incorporated into the HGBoost approach to perform the hyperparameter optimization. In the next section, I will describe how the different parts, splitting the dataset and optimizing the hyperparameters, are taken care of, including the objective function, the search space, and the evaluation of all trials.*

The Hyperoptimized Gradient Boosting library (*HGBoost*) is a Python package for hyperparameter optimization for *XGBoost*, *LightBoost*, and *CatBoost*. It splits the dataset into a *train*, *test*, and *independent validation set*. Within the train-test set, there is an inner loop for optimizing the hyperparameters using Bayesian optimization (using *hyperopt*) and an outer loop to score how well the best-performing models generalize using *k*-fold cross-validation. Such an approach has the advantage of not only selecting the model with the highest accuracy but also the model that generalizes best.

*HGBoost has many advantages: it makes the best attempt to detect the model that can best generalize, thereby reducing the possibility of selecting an overtrained model. It provides explainable results through insightful plots, such as a deep examination of the hyperparameter space, the performance of all evaluated models, the accuracy for the k-fold cross-validation and the validation set, and, last but not least, the best decision tree together with the most important features.*

What more does *hgboost* have to offer?

- It includes the most popular decision tree algorithms: *XGBoost, LightBoost, and CatBoost*.
- It uses the most popular hyperparameter optimization library for Bayesian optimization: *Hyperopt*.
- It splits the dataset into a train-test set and an independent validation set in an automated manner to score the model performance.
- The pipeline has a nested scheme with an inner loop for hyperparameter optimization and an outer loop with *k*-fold cross-validation to determine the most robust and best-performing model.
- It can handle *classification* and *regression* tasks.
- It is easy to go wild and create a *multi-class model* or an *ensemble of boosted decision tree models*.
- It takes care of unbalanced datasets.
- It creates explainable results for the hyperparameter search space and the model performance by creating insightful plots.
- It is open-source.
- It is documented with many examples.

The *HGBoost* approach consists of various steps in the pipeline. A schematic overview is depicted in **Figure 1**. Let's go through the steps.

**The first step** is to split the dataset into a *training set*, a *testing set*, and an *independent validation set*. *Keep in mind that the validation set is kept untouched during the entire training process and is used at the very end to evaluate the model performance.* Within the *train-test set* (**Figure 1A**), there is the inner loop for optimizing the hyperparameters using Bayesian optimization and the outer loop to test how well the best-performing models generalize on unseen test sets using external *k*-fold cross-validation. The search space depends on the available hyperparameters of the model type. The model hyperparameters that need optimization are shown in **code section 1**.

In the Bayesian optimization part, we minimize the cost function over a hyperparameter space by *exploring* the entire space. Unlike a traditional *grid search*, the *Bayesian approach* searches for the best set of parameters and optimizes the search space after each iteration. The models are internally evaluated using a *k*-fold cross-validation approach. Or in other words, we can run a fixed number of function evaluations and take the best trial. The default is set to 250 evaluations, which thus results in 250 models. *After this step, you could take the best-performing model (based on the AUC or any other metric) and then stop.* However, this is strongly discouraged, because we just searched throughout the entire search space for a set of parameters that now seems to fit best on the training data, without knowing how well it generalizes. Or in other words, the best-performing model may or may not be overtrained. *To avoid selecting an overtrained model, the k-fold cross-validation scheme will continue our quest to find the most robust model. Let's go to the second step.*

**The second step** is ranking the models (250 in this example) on the specified evaluation metric (the default is set to AUC), and then taking the top *p* best-performing models (the default is set to *p*=10). In such a manner, we do not rely on a single model with the best hyperparameters that *may or may not* be overfitted. *Note that all 250 models could be evaluated, but this is computationally intensive, and therefore we take the best p(erforming) models.* To determine how the top 10 performing models generalize, a *k*-fold cross-validation scheme is used for the evaluation. The default for *k* is 5 folds, which means that for each of the *p* models, we examine the performance across the *k*=5 folds. To ensure the cross-validation results are comparable between the top *p*=10 models, sampling is performed in a stratified manner. In total, we will evaluate *p* x *k* = 10 x 5 = 50 models.

**In the third step**, we have the *p* best-performing models, and we have computed their performance in a *k*-fold cross-validation approach. We can now compute the average accuracy (e.g., AUC) across the *k* folds, rank the models, and finally select the highest-ranked model. The cross-validation approach will help to select the most robust model, i.e., the one that can also generalize across different sets of data. *Note that other models may exist that result in better performance without the cross-validation approach, but these may be overtrained.*
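Steps two and three can be sketched in miniature with plain Python. The model dictionaries, scores, and the `evaluate` helper below are all hypothetical stand-ins (real folds would retrain the model on data); the point is the selection logic, not the modeling:

```python
import random

def select_robust_model(models, evaluate, p=3, k=5, seed=0):
    """Steps two and three in miniature:
    1) rank candidate models by their single test-set score,
    2) re-evaluate only the top-p candidates with k folds,
    3) pick the model with the best *average* fold score."""
    rng = random.Random(seed)
    # Step two: rank on the single test-set score (higher is better, like AUC).
    ranked = sorted(models, key=lambda m: m["test_score"], reverse=True)[:p]
    # Evaluate each top candidate across k folds.
    for m in ranked:
        m["cv_scores"] = [evaluate(m, fold, rng) for fold in range(k)]
        m["cv_mean"] = sum(m["cv_scores"]) / k
    # Step three: the winner is the candidate that generalizes best on average.
    return max(ranked, key=lambda m: m["cv_mean"])

def evaluate(model, fold, rng):
    # Hypothetical fold score: the model's "true skill" plus noise. An
    # overtrained model has a lower true skill than its lucky test score.
    return model["true_skill"] + rng.uniform(-0.02, 0.02)

models = [
    {"name": "overtrained", "test_score": 0.95, "true_skill": 0.80},
    {"name": "robust",      "test_score": 0.90, "true_skill": 0.88},
    {"name": "weak",        "test_score": 0.70, "true_skill": 0.70},
]
best = select_robust_model(models, evaluate, p=2, k=5)
print(best["name"])  # → robust
```

The "overtrained" candidate wins on the single test score but loses once the fold average exposes its weaker true skill, which is exactly why the pipeline ranks on the cross-validated mean rather than on one lucky number.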

**In the fourth step**, we examine the model accuracy on the independent validation set. This set has been untouched so far and will therefore give a good estimate. Because of our extensive model selection approach (checking whether the model can generalize, and limiting the possibility of selecting an overtrained model), we should not expect large differences in performance compared to the results seen earlier.

**In the very final step**, we can re-train the model using the optimized parameters on the entire dataset (**Figure 1C**). The next step (**Figure 1D**) is the interpretation of the results, for which insightful plots can be created.

Before we can use *HGBoost*, we need to *install* it first using the command line:

`pip install hgboost`

After the installation, we can import *HGBoost* and initialize it. We can keep the input parameters at their defaults (depicted in **code section 2**) or change them accordingly. In the next sections, we will go through the set of parameters.

The initialization is kept consistent for each task (*classification, regression, multi-class, or ensemble*), and each model (*XGBoost, LightBoost, and CatBoost*) which makes it very easy to switch between the different decision tree models or even tasks. *The output is also kept consistent* which is a dictionary that contains six keys as shown in ** code section 3**.

For demonstration, let's use the *Titanic* dataset [10] that can be imported using the function `import_example(data='titanic')`. This dataset is free to use and was part of the Kaggle competition *Machine Learning from Disaster*. The preprocessing step can be performed using the function `.preprocessing()`, which relies on the *df2onehot* library [11]. This function will encode the categorical values into one-hot and keep the continuous values untouched. I manually removed some features, such as *PassengerId* and *Name* (**code section 4**). *Note that in normal use cases, it is recommended to carefully do the Exploratory Data Analysis, feature cleaning, feature engineering, and so on. The largest performance gains typically follow from these first steps.*

Keep in mind that the preprocessing steps may differ when training a model with *XGBoost*, *LightBoost*, or *CatBoost*. For example: does the model require encoding of the features, or can it handle non-numeric features? Can the model handle missing values, or should those features be removed? See the section **A very brief introduction to Boosting Algorithms** for more details about the (dis)advantages of the three models, and do the preprocessing accordingly.
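The idea behind the one-hot encoding step can be sketched without pandas or df2onehot. This is a conceptual stand-in, not the library's implementation; the column names mimic the Titanic data:

```python
def one_hot(rows, categorical_cols):
    """Encode categorical columns as 0/1 indicator columns and keep the
    remaining (continuous) columns untouched."""
    # collect the observed categories per categorical column
    cats = {c: sorted({r[c] for r in rows}) for c in categorical_cols}
    out = []
    for r in rows:
        enc = {k: v for k, v in r.items() if k not in categorical_cols}
        for c in categorical_cols:
            for val in cats[c]:
                enc[f"{c}_{val}"] = 1 if r[c] == val else 0
        out.append(enc)
    return out

rows = [{"Sex": "male", "Age": 22}, {"Sex": "female", "Age": 38}]
encoded = one_hot(rows, ["Sex"])
print(encoded[0])  # → {'Age': 22, 'Sex_female': 0, 'Sex_male': 1}
```

After this step, every feature is numeric, which is what XGBoost and LightBoost expect; CatBoost, as noted above, could have consumed the raw categorical column directly.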

## Train a Model.

After initialization, we can use the *XGBoost* decision tree to train a model. *Note that you can also use the LightBoost or CatBoost model, as shown in code section 3.* During the training process (running the **code** in **section 4**), it will iterate across the search space and create 250 different models (`max_eval=250`), for which the model accuracy is evaluated for the set of hyperparameters that are available for the particular decision tree model (*XGBoost in this case, code section 1*). Next, the models are ranked on their accuracy, and the top 10 best-performing models are selected (`top_cv_evals=10`). These are then further investigated using the 5-fold cross-validation scheme (`cv=5`) to score how well the models generalize. In this sense, we aim to prevent selecting an overtrained model. All tested hyperparameters for the different models are returned and can be further examined manually or by using the plot functionality. As an example, the output of *HGBoost* is depicted in **code section 5** and in more detail in **Figure 2**.

After the final model is returned, various plots can be created for a deeper examination of the model performance and the hyperparameter search space. This can help to gain better intuition and explainable results on how the model parameters behave in relation to the performance. The following plots can be created:

- A plot to investigate the hyperparameter space in more depth.
- A plot to summarize all evaluated models.
- A plot to show the performance using the cross-validation scheme.
- A plot of the results on the independent validation set.
- The decision tree plot for the best model and the best-performing features.

## Interpretation of the Hyperparameter Tuning.

To investigate the hyperparameter space in more depth, the function `.plot_params()` can be used (**Figures 3 and 4**). Let's start by investigating how the hyperparameters are tuned during the Bayesian optimization process. Both figures contain multiple histograms (or kernel density plots), where each subplot is a single parameter that is optimized during the 250 model iterations. The small bars at the bottom of the histogram depict the 250 evaluations, whereas the black dashed vertical lines depict the specific parameter values that are used across the top 10 best-performing models. The green dashed line depicts the best-performing model *without* the cross-validation approach, and the red dashed line depicts the best-performing model *with* cross-validation.

Let's have a look at **Figure 3**. In the top-left corner, there is the parameter *colsample_bytree*, for which the values are selected in the range from 0 up to 1.2. The average value for *colsample_bytree* is ~0.7, which indicates that the sample distribution optimized with the Bayesian optimization has moved to this value for this parameter. Our best-performing model has the value of exactly *colsample_bytree*=0.7 (*red dashed line*). But there is more to look at. When we now look at **Figure 4**, we also see *colsample_bytree*, for which each dot is one of the 250 models, sorted by iteration number. The horizontal axis shows the iterations and the vertical axis shows the optimized parameter values. For this parameter, there is an upward trend during the iterations of the optimization process, towards the value of 0.7. This indicates that during the iterations, the search space for *colsample_bytree* was optimized, and more models with better accuracy were seen when increasing the value of *colsample_bytree*. In this manner, all the hyperparameters can be interpreted in relation to the different models.

## Interpretation of the model performance across all evaluated models.

With the plot function `.plot()`, we can get insights into the performance (AUC in this case) of the 250 models (**Figure 4**). Here again, the green dashed line depicts the best-performing model *without* the cross-validation approach, and the red dashed line depicts the best-performing model *with* cross-validation. **Figure 5** depicts the ROC curve for the best-performing model with optimized hyperparameters compared to a model with default parameters.

## Decision tree plot of the best model.

With the decision tree plot (**Figure 6**) we can get a better understanding of how the model works. It may also give some intuition on whether such a model can generalize to other datasets. Note that the best tree is returned by default (`num_trees=0`), but many more trees are created, and these can be returned by specifying the input parameter, e.g., `.treeplot(num_trees=1)`. In addition, we can also plot the best-performing features (not shown here).

## Make Predictions.

After having the final trained model, we can now use it to make *predictions* on new data. Suppose that **X** is new data, pre-processed in the same way as during the training process; then we can use the `.predict(X)` function to make predictions. This function returns the classification probability and the prediction label. See **code section 7** for an example.

## Save and load model.

Saving and loading models can come in handy. To accomplish this, there are two functions: `.save()` and `.load()` (**code section 8**).

I demonstrated how to train a model with optimized hyperparameters by splitting the dataset into a train, test, and independent validation set. Within the train-test set, there is the inner loop for optimizing the hyperparameters using Bayesian optimization, and the outer loop to score how well the top performing models can generalize based on k-fold cross validation. In this manner, we make the best attempt to select the model that can generalize and with the best accuracy.

An important part before training any model is to follow the typical modeling workflow: *first do the Exploratory Data Analysis (EDA), then iteratively do the cleaning, feature engineering, and feature selection. The largest performance gains typically follow from these steps.*

*HGBoost* supports learning *classification models*, *regression models*, *multi-class models*, and even an *ensemble of boosted decision tree models*. For all tasks, the same procedure is applied to make sure the best-performing and most robust model is selected. Furthermore, it relies on HyperOpt to do the Bayesian optimization, which is one of the most popular libraries for hyperparameter optimization. Another, more recently developed Bayesian optimization library is called Optuna. In case you want to read more details, and a comparison between the two methods, try this blog.

This is it. I hope you enjoyed reading it! Let me know if something is not clear or if you have any other suggestions.

*Be safe. Stay frosty.*

*Cheers, E.*

## References.

1. Nan Zhu et al., *XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink*.
2. Miguel Fierro et al., *Lessons Learned From Benchmarking Fast Machine Learning Algorithms*.
3. *Welcome to LightGBM's documentation!*
4. *CatBoost is a high-performance open source library for gradient boosting on decision trees*.
5. James Bergstra et al., 2013, *Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms*.
6. Kris Wright, 2017, *Parameter Tuning with Hyperopt*.
7. E. Taskesen, 2019, *Classeval: Evaluation of supervised predictions for two-class and multi-class classifiers*.
8. Alice Zheng, *Chapter 4. Hyperparameter Tuning*.
9. E. Taskesen, 2020, *Hyperoptimized Gradient Boosting*.
10. Kaggle, *Machine Learning from Disaster*.
11. E. Taskesen, 2019, *df2onehot: Convert unstructured DataFrames into structured dataframes*.
