1. num_boost_round – n_estimators
Afterwards, you have to determine the number of decision trees (often called base learners in XGBoost) to plant during training using num_boost_round. The default is 100, but that's hardly enough for today's large datasets.
Increasing the parameter plants more trees, but it also significantly increases the risk of overfitting as the model becomes more complex.
One trick I learned from Kaggle is to set a high number like 100,000 for num_boost_round and make use of early stopping rounds.
In each boosting round, XGBoost plants one more decision tree to improve the collective score of the previous ones. That's why it is called boosting. This process continues for num_boost_round rounds, regardless of whether each new round improves on the last or not.
But by using early stopping, we can stop the training, and thus the planting of unnecessary trees, when the validation score hasn't improved for the last 5, 10, 50, or any other chosen number of rounds.
With this trick, we can find the optimal number of decision trees without even tuning num_boost_round, and we save time and computation resources. Here is how it looks in code:
import xgboost as xgb

# Define the rest of the params
params = {...}

# Build the train/validation sets
dtrain_final = xgb.DMatrix(X_train, label=y_train)
dvalid_final = xgb.DMatrix(X_valid, label=y_valid)

bst_final = xgb.train(
    params,
    dtrain_final,
    num_boost_round=100000,  # Set a high number
    evals=[(dvalid_final, "validation")],
    early_stopping_rounds=50,  # Enable early stopping
    verbose_eval=False,
)
The above code tells XGBoost to build up to 100k decision trees, but because of early stopping, training stops once the validation score hasn't improved for the last 50 rounds. Usually, the number of required trees ends up well below 5,000–10,000.
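If you want to see exactly how many trees early stopping kept, recent XGBoost versions expose the best round on the trained booster. A minimal sketch, continuing from the bst_final and dvalid_final objects above:

# Inspect where the validation score peaked
print(f"Best iteration: {bst_final.best_iteration}")
print(f"Best validation score: {bst_final.best_score}")

# Predict using only the trees up to (and including) the best iteration
preds = bst_final.predict(
    dvalid_final, iteration_range=(0, bst_final.best_iteration + 1)
)

The trees planted after the best iteration are simply ignored at prediction time, so the final model behaves as if you had tuned num_boost_round to that value from the start.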
Controlling num_boost_round is also one of the biggest factors in how long the training process runs, as more trees require more resources.
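To get a feel for that cost on your own dataset, you can time two runs with different round budgets. A rough sketch, reusing the params and dtrain_final objects from above (the budgets of 100 and 1,000 are arbitrary, and the exact timings will depend on your data and hardware):

import time

for n_rounds in (100, 1000):  # two example budgets for comparison
    start = time.perf_counter()
    xgb.train(params, dtrain_final, num_boost_round=n_rounds, verbose_eval=False)
    elapsed = time.perf_counter() - start
    print(f"{n_rounds} rounds trained in {elapsed:.1f} seconds")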