
## All you need to know about this library for scalable hyperparameter tuning of machine learning models

The optimization of model hyperparameters (or model settings) is perhaps the most important step in training a machine learning algorithm, as it finds the hyperparameter values that minimize your model’s loss function. This step is also essential to building generalizable models that are not prone to overfitting.

The best-known techniques for optimizing model hyperparameters are *exhaustive grid search* and its stochastic counterpart: *random grid search*. In the first method, the search space is defined as a grid across the domain of each model hyperparameter, and the optimal hyperparameters are obtained by training the model on every point of the grid. Although grid search is very easy to implement, it becomes computationally expensive, especially when the number of variables to optimize is large. Random grid search, on the other hand, is a faster optimization method that often gives better results for the same computational budget: the best hyperparameters are obtained by training the model only on a random sample of points from the grid space.
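As a quick sketch of the two approaches (using scikit-learn, with made-up model and grid choices):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Exhaustive grid search: trains the model on every point of the grid (3 x 2 = 6 points)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 7], "n_estimators": [50, 100]},
    cv=3,
)
grid.fit(X, y)

# Random grid search: trains the model on a random sample of points only (4 draws here)
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": randint(3, 8), "n_estimators": randint(50, 150)},
    n_iter=4,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```

Note how the random version controls its cost through `n_iter`, regardless of how large the grid is.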

For a long time, both grid search algorithms were widely used by data scientists to find optimal model hyperparameters. However, these approaches usually find model hyperparameters for which the loss function is far from the global minimum.

This changed in 2013, when James Bergstra and his collaborators published a paper in which they explored a Bayesian optimization technique to find the best hyperparameters of an image classification neural network. They compared the results against those obtained from a random grid search, and the Bayesian method clearly outperformed random grid search.

But why is Bayesian optimization better than either grid search algorithm? Because it is a *guided* method that performs a smart search over the model hyperparameters instead of finding them through trial and error.

In this blog, we will dissect the Bayesian optimization method and we’ll explore one of its implementations through a relatively new Python package called *Mango*.

Before explaining what *Mango* does, we need to understand how Bayesian optimization works. If you have a good understanding of this algorithm, you can safely skip this section.

Bayesian optimization has four components:

**The objective function**: This is the true function that you want to either minimize or maximize. It can be, for instance, the root mean squared error (RMSE) in a regression problem or the log loss in classification. In the optimization of machine learning models, the objective function depends on the model hyperparameters. The objective function is also known as a *black-box* function because its shape is unknown.

**The search domain or search space**: This corresponds to the possible values each model hyperparameter can have. As a user, you need to specify the search space of your model. For instance, the search domain of a Random Forest regressor model might be:

```python
param_space = {
    'max_depth': range(3, 10),
    'min_samples_split': range(20, 2000),
    'min_samples_leaf': range(2, 20),
    'max_features': ["sqrt", "log2", "auto"],
    'n_estimators': range(100, 500)
}
```

Bayesian optimization uses the defined search space to sample points that are then evaluated with the objective function.

**The surrogate model**: Evaluating the objective function is very expensive, so in practice we know its true value only at a few points; however, we need estimates of its value everywhere else. This is where the surrogate model comes in: it is a tool for modeling the objective function. A common choice of surrogate model is the Gaussian process (GP) because of its ability to provide uncertainty estimates. Explaining Gaussian processes is out of the scope of this blog post, but I encourage you to read this outstanding article, which has plenty of visuals to help you build an intuition for this probabilistic method.

At the beginning of Bayesian optimization, the surrogate model starts with a prior function whose uncertainty is uniform along the search space.

Every time a sample point from the search space is evaluated with the objective function, the uncertainty of the surrogate model at that point collapses to zero. After many iterations, the surrogate model comes to resemble the objective function.
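To make the uncertainty-collapse idea concrete, here is a minimal sketch using scikit-learn’s `GaussianProcessRegressor` as the surrogate (the objective here is a toy stand-in, not a real training loop):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):
    # Toy stand-in for an expensive black-box objective
    return np.sin(3 * x) + x ** 2

X_obs = np.array([[0.2], [0.6], [0.9]])          # points already evaluated
y_obs = objective(X_obs).ravel()

# Fixed kernel (optimizer=None) to keep the illustration deterministic
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-6, optimizer=None)
gp.fit(X_obs, y_obs)

# Posterior uncertainty collapses to (almost) zero at the observed points...
_, std_obs = gp.predict(X_obs, return_std=True)
# ...and remains large away from them
_, std_far = gp.predict(np.array([[0.0], [0.4]]), return_std=True)
print(std_obs.max(), std_far.min())
```

The standard deviations returned by `predict` are exactly the uncertainty estimates that make GPs so convenient as surrogates.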

However, the goal of Bayesian optimization is not to model the objective function. Instead, it is to find the best model hyperparameters in as few iterations as possible. To achieve this, an acquisition function is needed.

**The acquisition function**: This function is introduced in Bayesian optimization to guide the search. The acquisition function assesses whether a point is worth evaluating based on the current surrogate model. A simple acquisition function samples the point where the mean of the surrogate function is maximized.
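A very common, slightly richer choice is the upper confidence bound (UCB), which trades off the surrogate’s mean (exploitation) against its uncertainty (exploration). A minimal sketch with made-up numbers:

```python
import numpy as np

def upper_confidence_bound(mean, std, kappa=2.0):
    # Trade off exploitation (high surrogate mean) against exploration (high uncertainty)
    return mean + kappa * std

# Made-up surrogate predictions at three candidate points
mean = np.array([0.1, 0.5, 0.4])
std = np.array([0.9, 0.05, 0.3])

scores = upper_confidence_bound(mean, std)
next_idx = int(np.argmax(scores))
print(next_idx)   # 0
```

Candidate 0 wins even though its mean is the lowest: its uncertainty is high enough to make it worth exploring. The `kappa` parameter controls how aggressive the exploration is.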

The steps of the Bayesian optimization algorithm are:

```
Select a surrogate model for the objective function and define its prior.
for i = 1, 2, ..., number of iterations:
    Given the set of evaluations of the objective, use Bayes' rule
    to obtain the posterior.
    Use an acquisition function (a function of the posterior)
    to decide the next sampling point.
    Evaluate the objective at the new point and add it to the
    set of observations.
```
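These steps can be sketched in a few lines of Python, with a Gaussian process surrogate and a UCB acquisition over a fixed candidate grid (a toy setup, not Mango’s implementation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Toy stand-in for an expensive black-box objective (maximum at x = 0.6)
    return -(x - 0.6) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))               # a few initial random evaluations
y = objective(X).ravel()
candidates = np.linspace(0, 1, 101).reshape(-1, 1)

for _ in range(10):
    # Fit the surrogate to all evaluations collected so far (the posterior)
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    # Upper-confidence-bound acquisition decides the next sampling point
    x_next = candidates[np.argmax(mean + 2.0 * std)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

best_x = float(X[np.argmax(y), 0])
print(best_x)   # should land close to 0.6
```

With only 13 evaluations of the objective, the loop homes in on the region of the maximum, which is exactly the sample efficiency that makes Bayesian optimization attractive.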

The following figure shows the Bayesian optimization for a simple one-dimensional function:

If you are interested in reading more about Bayesian optimization, I recommend you to read this great article:

Several Python packages use Bayesian optimization under the hood to obtain the best hyperparameters of your machine learning model. Some examples are Hyperopt, Optuna, Bayesian Optimization, Scikit-optimize (skopt), GPyOpt, pyGPGO, and *Mango*; the list is extensive and far from exhaustive here. For a nice summary of other packages, you can read this blog post:

Now, let’s dive into *Mango*!

In recent years, the amount of data has grown considerably. This represents a challenge for data scientists who need their machine learning pipelines to be scalable. Distributed computing might solve this issue.

Distributed computing refers to a set of computers that work on a common task while communicating with each other. This is different from parallel computing, where a task is divided into multiple subtasks which are allocated to different processors on the same computer system.

Although quite a good number of Python libraries use Bayesian optimization to tune model hyperparameters, **none of them is written to support scheduling on any distributed computing framework.** The motivation of the authors who developed *Mango* was to create an optimization algorithm capable of working in a distributed computing environment while maintaining the power of Bayesian optimization.

What is the secret of *Mango*’s architecture that makes it work well in a distributed computing environment? *Mango* was built with a modular design in which the optimizer is decoupled from the scheduler. This design allows easy scaling of machine learning pipelines that use large amounts of data. However, it poses challenges for the optimization method, because the traditional Bayesian optimization algorithm is sequential: the acquisition function provides only a single next point to evaluate.

*Mango* uses two methods to parallelize Bayesian optimization: a method called *batch Gaussian process bandits* and *k*-means clustering. In this blog, we won’t explain batch Gaussian process bandits. If you are interested in knowing more about this approach, you can read this paper.

Regarding the clustering approach, the use of *k*-means clustering to horizontally scale the Bayesian optimization process was proposed by a group of researchers at IBM in 2018 (see this paper for technical details). This approach consists of clustering the points sampled from the search domain that generate high values in the acquisition function (see figure below). In the beginning, these clusters are far from each other in the parameter search space. As the optimal regions in the surrogate function are discovered, the distance between them in the parameter space decreases. The *k*-means clustering method horizontally scales the optimization because each cluster is used to run Bayesian optimization as a separate process. This parallelization leads to finding the optimal model hyperparameters faster.
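A toy sketch of the idea (not Mango’s actual code): sample the domain, keep the samples with the highest acquisition values, and cluster them so that each cluster can drive a separate optimization process. The two-bump acquisition surface here is made up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
samples = rng.uniform(0, 1, size=(500, 2))       # points sampled from the search domain

# Hypothetical acquisition surface with two promising regions
def acquisition(x):
    return (np.exp(-50 * ((x - [0.2, 0.8]) ** 2).sum(axis=1))
            + np.exp(-50 * ((x - [0.7, 0.3]) ** 2).sum(axis=1)))

acq = acquisition(samples)
top = samples[acq > np.quantile(acq, 0.9)]       # keep the high-acquisition samples

# Each cluster can seed its own (parallel) Bayesian optimization process
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(top)
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
print(centers)   # roughly [0.2, 0.8] and [0.7, 0.3]
```

The cluster centers recover the two promising regions, which could then be explored by two independent workers.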

In addition to its capacity to work on distributed computing frameworks, *Mango* is also compatible with the scikit-learn API. This means that you can define the hyperparameter search space as a Python dictionary where the keys are the parameter names of the model and each item can be defined with any of the more than 70 distributions implemented in scipy.stats. All these unique characteristics make *Mango* an excellent alternative for data scientists who want to leverage data-driven solutions at scale.

If you are interested in knowing more about the inner workings of *Mango*, you can read the original paper or visit this nice blog written by the authors of the library:

## Simple example

Let’s now illustrate how *Mango* works on an optimization problem. First, create a Python environment and install *Mango* with the following command:

`pip install arm-mango`

For this example, we use the California housing dataset, which can be loaded directly from scikit-learn:
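A minimal loading snippet (scikit-learn downloads the dataset on first use and caches it locally; the train/test split shown is just one reasonable choice):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Download (on first call) and load the California housing data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```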
