
Mentors at instamentor.com have coached many candidates and helped them land their dream data scientist and machine learning engineer jobs over the last two years.

And we’ve summarized and categorized most of the top machine learning interview questions in a new course: **Cracking the Machine Learning Fundamentals Interview**.

Here are the top 10 most frequently asked Deep Learning basics questions and solutions.

To learn more, visit the course page.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.

The update rule to minimize the loss function is: θ := θ − α∇L(θ), where α is the learning rate and the loss function L is based on cross-entropy.
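As a concrete illustration, here is a minimal sketch of that update rule minimizing a binary cross-entropy loss for logistic regression (the toy data, learning rate, and epoch count are illustrative assumptions, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, lr=0.1, epochs=500):
    """Repeatedly apply w := w - lr * grad to minimize cross-entropy."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)                # predicted probabilities
        grad = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
        w -= lr * grad                    # the update rule from the text
    return w

# Toy 1-D data with a bias column: labels flip between x = 1 and x = 2
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = gradient_descent(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)
```

After a few hundred iterations, the learned weights separate the two classes on this toy set.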

Gradient descent tries to solve the optimization problem of finding the parameters at which the loss function reaches its minimum value.

However, the gradient descent algorithm does not always guarantee a global minimum and can get stuck at a local minimum.

The local minimum problem is hard to tackle. Usually, you can try different values of a **static learning rate** or use **annealing** to update the learning rate at each epoch; how the initial parameters are set can also play an important role in whether you get stuck at a local minimum.

You can also run the gradient descent algorithm multiple times on the same data set and select the run with the minimum loss value.

Additionally, we can use **stochastic** or **mini-batch stochastic gradient descent** to help with the local minimum problem.

Using **momentum**, which keeps past gradients involved in the parameter update, or more advanced optimizers such as **Adam** and **Adagrad**, can also help avoid getting stuck at a local minimum.
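A minimal sketch of the momentum update described above, using a simple quadratic objective purely for illustration (the learning rate, decay factor, and step count are assumptions):

```python
import numpy as np

def minimize_with_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=300):
    """SGD with momentum: the velocity accumulates past gradients,
    which helps the iterate roll through shallow minima and plateaus."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)   # velocity keeps past gradients involved
        w = w - lr * v              # parameter update uses the velocity
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_star = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), w0=[0.0])
```

The same loop with `beta = 0` recovers plain gradient descent.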

In machine learning, feature scaling is often applied before we start training the model.

One reason is that the features might be on very different scales; for neural networks specifically, it has been shown that feature scaling helps gradient descent converge much faster.

The most common techniques of feature scaling are **Normalization** and **Standardization**.

**In Normalization**, for every feature, we subtract its minimum value and divide by the range (max − min), mapping it into [0, 1].

**In Standardization**, we subtract each feature's mean and divide by its standard deviation.

It’s important to save these parameters so that at inference time the new data points are transformed using the same normalization/standardization parameters.
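A minimal sketch of this idea, assuming standardization: fit the scaling parameters on the training data once, then reuse them unchanged on any new data (the function names and toy data are illustrative):

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute and save the standardization parameters on training data."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def transform(X, mean, std):
    """Apply the SAME saved parameters to any data, including at inference."""
    return (X - mean) / std

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
mean, std = fit_standardizer(X_train)

# A new point that happens to sit at the training mean maps to all zeros
X_new = np.array([[2.0, 200.0]])
X_new_scaled = transform(X_new, mean, std)
```

Recomputing the mean and standard deviation on the new data instead would silently shift the model's inputs.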

Scaling the features helps gradient descent move more smoothly and reach the minimum faster, precisely because the original features come from different scales.

Without feature scaling, the algorithm may be biased toward moving in the directions driven by the features with larger numeric ranges, e.g., the same distance measured in inches vs. in miles.

There are multiple methods for choosing the right learning rate.

1. The basic way to find the best static learning rate is to try different values and see which performs best, e.g., starting with 0.01 and trying a few larger and smaller values: 0.05, 0.02, 0.01, and 0.001.

2. Use the **Learning Rate Annealing** method to update the learning rate dynamically.
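As an illustration of annealing, here is a simple 1/t-style schedule (the decay constant is an assumption; step decay, exponential decay, and cosine schedules are common alternatives):

```python
def annealed_lr(initial_lr, epoch, decay=0.5):
    """1/t-style annealing: the learning rate shrinks every epoch."""
    return initial_lr / (1.0 + decay * epoch)

# The schedule starts at the initial value and decays monotonically
lrs = [annealed_lr(0.1, e) for e in range(5)]
```

Larger steps early on speed up progress, while smaller steps later help the iterate settle near a minimum.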

Generally speaking, if the learning rate is too small, gradient descent takes much longer to converge, and it might also get stuck in a local minimum.

If the learning rate is too big, it might overshoot the minimum and fail to converge.

There are many ways to initialize the parameters before we start training the neural network.

1. For example, we can use a standard normal distribution (mu =0, sd =1) to assign a random value to a parameter as its initial value.

2. We can also use a standard uniform distribution (min=0, max=1) to assign a random value to a parameter as its initial value.

If we initialize all parameters to 0, every unit computes the same output and receives the same gradient, so the network fails to break symmetry and gradient descent fails to learn.

Similarly, initializing all parameters to have the same value will generate a poor result.
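The two random initializers and the zero-initialization symmetry problem can be sketched as follows (the layer sizes and input are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random initial values for a (3-in, 4-out) layer:
w_normal = rng.standard_normal((3, 4))         # standard normal: mu=0, sd=1
w_uniform = rng.uniform(0.0, 1.0, size=(3, 4)) # standard uniform: min=0, max=1

# With all-zero (or all-equal) weights, every hidden unit computes the same
# activation and receives the same gradient, so symmetry is never broken.
x = np.array([1.0, 2.0, 3.0])
w_zero = np.zeros((3, 4))
hidden = np.tanh(x @ w_zero)   # all four units output the identical value
```

With random initialization, each unit starts from a different point and can learn a different feature.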

**Sigmoid, tanh, ReLU.**

The activation function makes the neural network a non-linear model; without one, the network reduces to a linear regression model, and the number of layers becomes irrelevant, since mathematically those layers can be merged into a single layer.

The sigmoid is used for two-class logistic regression problems by transforming a score into a probability.

The softmax function is an extension of the sigmoid and can be used for multiclass logistic regression.
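This relationship can be checked directly: for two classes with scores [z, 0], the softmax probability of the first class equals sigmoid(z). A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Two-class softmax with scores [z, 0] reduces to sigmoid(z)
z = 1.3
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]
```

Algebraically, e^z / (e^z + e^0) = 1 / (1 + e^(−z)), which is exactly the sigmoid.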

Mathematically, ReLU is not differentiable at 0, since its left derivative (0) differs from its right derivative (1).

In practice, we simply use its right-hand derivative.
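A sketch of that convention (note: some frameworks instead define the gradient at 0 to be 0; the choice rarely matters in practice since inputs are exactly 0 almost never):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # ReLU is not differentiable at 0; following the text, we use the
    # right-hand derivative (1) at z == 0.
    return (z >= 0).astype(float)

grads = relu_grad(np.array([-2.0, 0.0, 3.0]))
```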

Choosing the right activation function usually depends on the data and problem, and there is no 100% scientific way to decide which one is the best, but a general rule of thumb:

1. Sigmoid works well when the neural network does not have many layers; if the network suffers from vanishing gradients, we should avoid using sigmoid and tanh.

2. ReLU is often used as the default activation function, and generally speaking it performs well, but it can only be used in hidden layers.

We often start with **ReLU** and try several different activation functions to benchmark the results.
