How is that possible, when MAE is non-smooth?
When working on a model based on Gradient Boosting, a key parameter to choose is the objective. Indeed, the whole process of building the decision trees derives from the objective and its first and second derivatives.
XGBoost has recently introduced support for a new kind of objective: non-smooth objectives with no second derivative. Among them, the famous MAE (mean absolute error) is now natively supported by XGBoost.
In this post, we will detail how XGBoost has been modified to handle this kind of objective.
XGBoost, LightGBM, and CatBoost all share a common limitation: they need smooth (mathematically speaking) objectives to compute the optimal weights for the leaves of the decision trees.
This is no longer true for XGBoost, which has recently introduced support for the MAE using line search, starting with release 1.7.0.
If you want to master Gradient Boosting in detail, have a look at my book:
The core of gradient boosting-based methods is the idea of applying gradient descent in function space instead of parameter space.
As a reminder, the core of the method is to linearize the objective function around the previous prediction ŷ^(t-1), and to add a small increment that minimizes this objective. This small increment is expressed in function space: it is a new binary decision node, represented by the function f_t.
This objective combines a loss function l with a regularization function Ω:
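In the notation of the XGBoost paper, the objective at iteration t reads:

$$obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$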
Once linearized, using a second-order Taylor expansion around the previous prediction, we get the following, where g_i and h_i denote the first and second derivatives of the loss with respect to ŷ^(t-1):
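$$obj^{(t)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) + \frac{1}{2} h_i\, f_t(x_i)^2 \right] + \Omega(f_t)$$

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)$$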
Minimizing this linearized objective function boils down to removing the constant part (the loss at the previous prediction, which does not depend on f_t), i.e. minimizing:
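$$\tilde{obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \frac{1}{2} h_i\, f_t(x_i)^2 \right] + \Omega(f_t)$$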
As the new stage of the model, f_t, is a binary decision node that generates two values (its leaves), w_left and w_right, it is possible to reorganize the sum above as follows:
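$$\tilde{obj}^{(t)} = G_{left}\, w_{left} + \frac{1}{2}\left(H_{left} + \lambda\right) w_{left}^2 + G_{right}\, w_{right} + \frac{1}{2}\left(H_{right} + \lambda\right) w_{right}^2$$

where G_left and H_left are the sums of the g_i and h_i over the samples falling into the left leaf (and similarly for the right leaf), and λ comes from the regularization Ω.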
At this stage, minimizing the linearized objective simply means finding the optimal weights w_left and w_right. As each of them appears in a simple second-order polynomial, the solution is the well-known -b/(2a) expression, where b is G_left and a is ½(H_left + λ); hence, for the left node, we get:
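$$w_{left}^{*} = -\frac{G_{left}}{H_{left} + \lambda}$$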
The exact same formula holds for the right weight, using G_right and H_right.
Note the regularization parameter λ, which comes from the L2 regularization term of Ω, proportional to the square of the weights.
The issue with the Mean Absolute Error is that its second derivative is zero (almost) everywhere, hence H is zero.
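To make this concrete, for the MAE the loss is l(y, ŷ) = |y − ŷ|, so that:

$$g_i = \operatorname{sign}\left(\hat{y}_i^{(t-1)} - y_i\right), \qquad h_i = 0$$

With H = 0, the optimal-weight formula above divides by λ alone, or by zero when there is no regularization.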
Regularization
One possible option to circumvent this limitation is to regularize this function. This means substituting the MAE with another formula that has the property of being at least twice differentiable. See my article below, which shows how to do that with the logcosh:
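To give a flavor of this approach (a minimal sketch, not necessarily the exact code from that article), the log-cosh loss can be passed to XGBoost as a custom objective, since it behaves like the MAE for large errors while having well-defined first and second derivatives everywhere. The DMatrix `dtrain` in the usage comment is assumed to exist:

```python
import numpy as np
import xgboost as xgb

def logcosh_objective(preds, dtrain):
    """Custom objective: log-cosh, a smooth approximation of the MAE.

    The gradient of log(cosh(pred - label)) is tanh(residual) and its
    hessian is 1 - tanh(residual)^2, so both are defined everywhere.
    """
    residual = preds - dtrain.get_label()
    grad = np.tanh(residual)
    hess = 1.0 - grad ** 2
    return grad, hess

# Usage with a pre-built DMatrix called `dtrain` (assumed here):
# booster = xgb.train({"max_depth": 4}, dtrain, num_boost_round=100,
#                     obj=logcosh_objective)
```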
Line search
Another option, the one recently introduced by XGBoost in release 1.7.0, is to use an iterative method to find the best weight for each node.
To do so, the current XGBoost implementation uses a trick:
- First, it computes the leaf values as usual, simply forcing the second derivative to 1.0
- Then, once the whole tree is built, XGBoost updates the leaf values using an α-quantile of the residuals in each leaf
If you’re curious to see how this is implemented (and are not afraid of modern C++), the details can be found here, in UpdateTreeLeaf; more specifically, UpdateTreeLeafHost is the method of interest.
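To give an intuition of this two-step trick, here is a simplified Python sketch of one boosting round. It is not the actual implementation: `build_tree` and `set_leaf_value` are hypothetical helpers standing in for XGBoost’s internal tree builder, and details such as the learning rate are omitted:

```python
import numpy as np

def mae_boosting_round(build_tree, X, y, prediction, alpha=0.5):
    """One simplified boosting round for the MAE objective."""
    # Step 1: grow the tree as usual, with the second derivative forced to 1.0
    grad = np.sign(prediction - y)   # first derivative of |prediction - y|
    hess = np.ones_like(y)           # second derivative forced to 1.0
    tree, leaf_index = build_tree(X, grad, hess)

    # Step 2: once the tree is built, overwrite each leaf value with the
    # alpha-quantile (the median when alpha = 0.5) of the residuals of the
    # samples that fall into that leaf
    residual = y - prediction
    for leaf in np.unique(leaf_index):
        in_leaf = (leaf_index == leaf)
        tree.set_leaf_value(leaf, np.quantile(residual[in_leaf], alpha))

    return tree
```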
How to use it
It’s plain and simple: just pick a release of XGBoost that is 1.7.0 or greater and use the MAE objective, reg:absoluteerror, as parameter.
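For example, with the scikit-learn API (the dataset here is purely synthetic, for illustration only):

```python
import numpy as np
import xgboost as xgb  # requires xgboost >= 1.7.0
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# MAE objective, handled natively since release 1.7.0
model = xgb.XGBRegressor(objective="reg:absoluteerror", n_estimators=200, max_depth=4)
model.fit(X, y)

print("Train MAE:", np.mean(np.abs(model.predict(X) - y)))
```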
XGBoost has introduced a new way to cope with non-smooth objectives, like the MAE, that does not require regularizing the function.
The MAE is a very convenient metric to use, as it is easy to understand. Moreover, it does not over-penalize large errors the way the MSE does. This is handy when trying to predict large as well as small values with the same model.
Being able to use non-smooth objectives is very appealing, as it not only avoids the need for approximations but also opens the door to other non-smooth objectives like the MAPE.
Clearly, a new feature to try and follow.
More on Gradient Boosting, XGBoost, LightGBM, and CatBoost in my book: