Gradient boosting models have gained immense popularity over the past decade. With dozens of academic papers written about them, they have become a staple solution for ML models in a variety of fields, and have been implemented by multiple leading companies in the field of machine learning. As with all other ML models, if not tuned correctly, boosting models can overfit the training data and thus perform poorly when predicting on unseen data. Fortunately, there are ways to generalize boosting models and avoid overfitting, such as the hyperparameters exposed by boosting models’ implementations.

In this article I’ll discuss some of the classical approaches to regularization in boosting, and then present an innovative regularization technique called D.A.R.T, which helped me improve my models.

Any data scientist will acknowledge that deep neural networks have become the most accurate models in many domains, *e.g.* computer vision, natural language processing and speech recognition.

The one domain where deep learning has not yet managed to completely overcome classical machine learning models is that of tabular data, also named structured data.

Tabular data is data that is organized in a table with rows that represent observations and columns, which represent attributes. Although in theory each dataset can be best tackled with a different algorithm, it is shown both in academic papers as well as empirically in Kaggle competitions that in most cases of tabular data analysis, gradient boosting models are significantly dominant.

The main idea behind the gradient boosting algorithm is an ensemble of many “weak learners”, with each such learner making a slight improvement compared to its predecessor. Although these weak learners can theoretically be any machine learning model, in practice they tend to perform optimally using decision trees, both for regression and classification purposes.

The algorithm runs for a pre-defined number of iterations.

In the first iteration, a constant prediction is made: whichever constant minimizes the loss function. In each subsequent iteration a new model is built that tries to predict the **pseudo-residuals** of the previous model — these are the negative gradient of the loss function with respect to the current predictions.

At each iteration, the prediction is the sum of the outputs of all models built so far. Similarly, in the inference stage, unlabeled samples are passed through all of the trees in the order they were built, and the final prediction is the sum of all the trees’ outputs.
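The loop described above can be sketched for the squared-error loss, where the pseudo-residuals reduce to plain residuals. This is a minimal illustration, not a library API; `gradient_boost_fit` and `gradient_boost_predict` are hypothetical names:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_iters=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting for squared-error loss.
    With L(y, F) = (y - F)^2 / 2, the pseudo-residuals are simply y - F."""
    f0 = y.mean()                  # constant minimizing squared error
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_iters):
        residuals = y - pred                       # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)    # shrinkage
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, learning_rate=0.1):
    # the prediction is the sum of all the trees' outputs plus the constant
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

Each new tree only has to model what the current ensemble still gets wrong, which is why the individual learners can stay weak.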

One of the main concerns with any machine learning model is that of overfitting. Gradient boosting models are no different, and have a higher probability of overfitting as the number of iterations increases.

There are a number of known methods for avoiding overfitting in boosting models:

- Regularization of weak learners — constraints on depth, number of leaves, number of splits, etc.
- Regularization of loss function (adding L1 or L2 penalties)
- Constraints on number of iterations — either hard coded, or defining early stopping when loss-decrease is low enough
- Random sampling of rows and/or features in each iteration
- Shrinkage — controlling contribution of each new learner with a small learning rate
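Most of these knobs can be illustrated with scikit-learn’s `GradientBoostingRegressor` (L1/L2 loss penalties are not exposed there; in XGBoost they are `reg_alpha`/`reg_lambda`, in LightGBM `lambda_l1`/`lambda_l2`). The values below are arbitrary, untuned examples:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    max_depth=3,             # weak-learner constraint: shallow trees
    max_leaf_nodes=8,        # weak-learner constraint: few leaves
    n_estimators=500,        # hard cap on the number of iterations...
    n_iter_no_change=10,     # ...plus early stopping when validation loss stalls
    validation_fraction=0.2,
    subsample=0.8,           # random row sampling in each iteration
    max_features=0.8,        # random feature sampling at each split
    learning_rate=0.05,      # shrinkage
    random_state=0,
)
model.fit(X, y)
print(model.n_estimators_)   # iterations actually run (may stop before 500)
```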

The techniques above are a good place to start when trying to avoid overfitting, but in the past few years an additional approach has been developed that should definitely not be overlooked — the D.A.R.T algorithm.

D.A.R.T — Dropouts meet Multiple Additive Regression Trees — is a method presented in a 2015 paper by K. V. Rashmi and Ran Gilad-Bachrach. The main idea is to take *dropout*, a regularization technique common in deep neural networks, and incorporate it into the gradient boosting algorithm. In neural networks, dropout means ignoring a random subset of the neurons in a layer during training. In D.A.R.T, dropout means ignoring part of the existing trees when calculating the pseudo-residuals in each iteration.

D.A.R.T requires an additional hyperparameter, which determines the proportion of existing trees to drop when calculating the pseudo-residuals in each iteration.

When building the ensemble model for a specific iteration, a random subset of trees from the previous iteration are dropped, and pseudo-residuals are calculated based on the predictions of the trees that have stayed. The new tree is then trained on these pseudo-residuals, and added to the ensemble. The dropout technique is only used in the training stage, during inference all trees are used.
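A simplified sketch of this training loop for squared-error loss follows. The weight normalization here follows the spirit of the paper but omits details that differ between the paper and the library implementations; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def dart_fit(X, y, n_iters=100, drop_rate=0.1, max_depth=3, seed=0):
    """Sketch of DART training (simplified normalization)."""
    rng = np.random.default_rng(seed)
    f0 = y.mean()
    trees, weights = [], []
    for _ in range(n_iters):
        # randomly drop a subset of the existing trees for this iteration
        keep = [i for i in range(len(trees)) if rng.random() > drop_rate]
        pred = np.full_like(y, f0, dtype=float)
        for i in keep:
            pred += weights[i] * trees[i].predict(X)
        residuals = y - pred            # pseudo-residuals w.r.t. kept trees only
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        k = len(trees) - len(keep)      # number of dropped trees
        w = 1.0 / (k + 1)               # scale the new tree down...
        for i in set(range(len(trees))) - set(keep):
            weights[i] *= k / (k + 1)   # ...and shrink the dropped trees
        trees.append(tree)
        weights.append(w)
    return f0, trees, weights

def dart_predict(f0, trees, weights, X):
    # inference uses ALL trees — dropout happens only during training
    pred = np.full(X.shape[0], f0, dtype=float)
    for w, tree in zip(weights, trees):
        pred += w * tree.predict(X)
    return pred
```

The normalization step matters: without it, the new tree and the re-added dropped trees would jointly overshoot the target they were both trying to correct.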

Why do boosting models overfit?

In each iteration the model is improved slightly; this is achieved by modifying the model’s target variable (the pseudo-residuals) between iterations. While this guarantees that the new model is different from the ones in the ensemble, it typically focuses on a small subset of the problem and hence does not have strong predictive power when used on the original target variable.

This problem, often referred to as over-specialization, increases the risk of adding models that over-fit specific instances.

The most successful approach employed to combat this problem before D.A.R.T was shrinkage, which does help reduce the impact of the first trees. However, as the size of the ensemble grows, the problem of over-specialization reappears.

When applying D.A.R.T, it was shown that the relative effect of the first trees drastically decreases, and that later trees, even after hundreds of iterations, still contribute to the predictions (Figure 1). The ability to keep creating new learners that don’t over-specialize on specific samples is the power of D.A.R.T and the reason it generalizes so well.

D.A.R.T. has been implemented both in XGBoost and in LightGBM.

In XGBoost, set the *booster* parameter to *dart*; in LightGBM, set the *boosting* parameter to *dart*.

Each implementation provides a few extra hyper-parameters when using D.A.R.T. These additional parameters control the percentage of trees that are dropped in each iteration, the random sampling method and the normalization of dropped trees.
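As a sketch based on each library’s documented parameter names, a D.A.R.T configuration in the two libraries might look like this (the values are illustrative, not recommendations):

```python
# XGBoost DART parameters
xgb_dart_params = {
    "booster": "dart",         # switch from the default "gbtree"
    "rate_drop": 0.1,          # fraction of trees dropped in each iteration
    "skip_drop": 0.5,          # probability of skipping dropout entirely
    "sample_type": "uniform",  # how dropped trees are sampled
    "normalize_type": "tree",  # how dropped trees' weight is normalized
}

# LightGBM DART parameters
lgb_dart_params = {
    "boosting": "dart",
    "drop_rate": 0.1,          # fraction of trees dropped in each iteration
    "skip_drop": 0.5,
    "max_drop": 50,            # cap on trees dropped per iteration
    "uniform_drop": False,     # set True for uniform sampling of dropped trees
}
```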

To demonstrate the power of D.A.R.T, I ran a small experiment on sklearn’s diabetes dataset, where the label is a quantitative measure of disease progression, *i.e.* a regression task. Three models were tested — Random Forest, regular boosting and D.A.R.T boosting.

The objective was to minimize the mean squared error, and some basic hyperparameters were chosen for each model, without tuning and without feature extraction. The results show that the D.A.R.T. model managed to reach a substantially lower test error and also a lower “overfit ratio” (ratio between test error and train error).
