Variants of Gradient Descent Optimizer in Deep Learning with Simple Analogy

Manasa Noolu(Mortha)
5 min read · Jan 9, 2021


Optimizers play an essential role in deep learning. It is important to understand the underlying math in order to choose appropriate parameters and improve accuracy. There are many types of optimizers; here, I am going to explain the variants of the Gradient Descent optimizer with a simple analogy.

Sometimes, it is difficult to interpret the mathematical equations behind the algorithms in the right way. To avoid this, I have come up with a simple example to understand the equations and how the math works.

What do you understand by looking at the dartboard image above?

We know that the target is the middle point and we have to hit that point with a dart. We are given 3–5 darts and we have to hit the middle point at least once. Before reaching the target, we may throw darts at other spots on the board. However, the ultimate goal of the dart game is to reach the target (the middle point). The same analogy applies to the optimizer concept in deep learning. The main purpose of the optimizer is to reach the minimum of the loss (the middle point) by updating the parameters (weights, learning rate, etc.) and thereby minimize the loss.

Now, our aim is to update the weights (and tune the learning rate) to reduce the loss, trying out various optimization techniques. We will start with Gradient Descent.

Gradient Descent

Gradient Descent is the most widely used optimization technique in regression and classification problems. It is the basic algorithm for finding the optimal weights and biases in backpropagation. Initially, we do forward propagation to find the loss, given by an equation of the following form:
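For a linear model with mean squared error (one common choice, assumed here since the article does not fix a specific loss), the prediction and loss are:

y′ = Wx + b

L(y′, y) = (1/n) Σᵢ (y′ᵢ − yᵢ)²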

We have to find the values of W and b for which the loss L(y′, y) is at its minimum (the global minimum). The parameter values are either decreased or increased depending on the initial weight point, as shown in figure 1.

figure 1: Gradient Descent

The initial weight is on the right side, so the weights must be decreased to reach the global cost minimum point (the target point in the dart game). The question is: how do we update the weights and biases? The answer is the equations below:

Wnew = W − α (∂L/∂W)

bnew = b − α (∂L/∂b)

Here, α is the learning rate and ∂L/∂W is the gradient of the loss with respect to the weight (the slope). The learning rate decides the size of each step toward the minimum (the length of the arrows in figure 1), and ∂L/∂W tells us how the weight should change to reduce the loss.
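To make the update rule concrete, here is a minimal NumPy sketch of full-batch gradient descent on a linear model with mean squared error. The synthetic data and hyperparameters (α, the number of epochs) are illustrative assumptions, not taken from the article:

```python
import numpy as np

# Minimal sketch: full-batch gradient descent on a linear model
# with mean squared error loss. The data and hyperparameters are
# illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))          # 100 records, 1 feature
y = 3.0 * X[:, 0] + 2.0                # true relation: y = 3x + 2

W, b = 0.0, 0.0                        # initial weight and bias
alpha = 0.1                            # learning rate (α)

for epoch in range(100):               # one update per epoch: the whole dataset at once
    y_pred = W * X[:, 0] + b           # forward propagation
    error = y_pred - y
    dL_dW = 2 * np.mean(error * X[:, 0])   # ∂L/∂W for MSE
    dL_db = 2 * np.mean(error)             # ∂L/∂b for MSE
    W = W - alpha * dL_dW              # Wnew = W − α(∂L/∂W)
    b = b - alpha * dL_db              # bnew = b − α(∂L/∂b)

print(W, b)  # approaches 3.0 and 2.0
```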

The main problem with gradient descent is the size of the dataset. Gradient Descent processes the whole dataset for every single update. Suppose we have 10k records: all 10k must be processed before the parameters are updated even once. Guess how long that takes!

One complete pass of forward propagation and backward propagation over the whole dataset is called an epoch. In every epoch, the entire dataset is processed to update the weights and reduce the loss. Imagine you have 1 million records; how much computing power would that take? I can explain this with a simple analogy.

Assume we have a big bucket that we are asked to fill with water, and the only container we can pour from is another bucket that is even bigger than the first one. This is shown in figure 2.

figure 2

From figure 2, it is clear that when we try to fill the bucket with water, we may spill some of it, because the bucket being filled is smaller than the bucket in our hands. The same applies to gradient descent: handling a huge amount of data at once is computationally expensive, and we may not have enough memory (RAM) to hold all of it.

Okay, now how do we solve this drawback?

Back to the analogy!! Now we have to find another way to fill the bucket. I have an idea: why can’t we fill it with small glasses, as shown in figure 3? It will fill “for sure”, but it takes time to fill the bucket completely. Here “filling for sure” means reaching the global minimum in gradient descent. Okay! How many times do we have to repeat the pouring to fill the bucket? Each repetition is an iteration.

figure 3

The same applies to the optimization technique: in one epoch, the records are processed one at a time, with one parameter update per record. That means if we have 10k records, then 10k iterations are performed per epoch on the way to the global minimum. This process is known as Stochastic Gradient Descent (SGD). Each update is much faster than in Gradient Descent (GD). However, there is a drawback with SGD: the large number of iterations per epoch means it takes a long time to reach the minimum point, i.e. the convergence is very slow.
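A minimal sketch of the SGD loop, under the same assumptions as before (linear model, MSE, synthetic data); note that the parameters are now updated after every single record:

```python
import numpy as np

# Minimal SGD sketch: linear model and MSE loss assumed, as above.
# The parameters are updated after EVERY record, so one epoch over
# n records costs n iterations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 2.0

W, b, alpha = 0.0, 0.0, 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):      # visit records one at a time, in random order
        error = (W * X[i, 0] + b) - y[i]   # forward pass on ONE record
        W -= alpha * 2 * error * X[i, 0]   # per-record gradient step
        b -= alpha * 2 * error
```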

To overcome this we go with Mini Batch Gradient Descent. Instead of performing one iteration per record, we divide the records into batches (groups) and iterate over each batch. For example, if we have 10k records and process 1000 records per iteration, then we have only 10 iterations (10k/1000) per epoch on the way to the minimum point.

So if we compare Mini Batch GD with SGD: in one epoch, SGD performs 10k parameter updates while Mini Batch GD performs only 10. This shows that Mini Batch GD is more efficient and computationally less expensive.
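A minimal mini-batch sketch, under the same assumptions as the earlier code, using the 10k records / batch size 1000 split from the example, so each epoch performs exactly 10 parameter updates:

```python
import numpy as np

# Minimal mini-batch GD sketch: linear model and MSE loss assumed,
# as above. With 10,000 records and a batch size of 1,000 there are
# 10 parameter updates per epoch, matching the 10k/1000 example.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 1))
y = 3.0 * X[:, 0] + 2.0

W, b, alpha = 0.0, 0.0, 0.1
batch_size = 1_000

for epoch in range(20):
    order = rng.permutation(len(X))                  # shuffle record order each epoch
    for start in range(0, len(X), batch_size):       # 10 iterations per epoch
        idx = order[start:start + batch_size]
        error = (W * X[idx, 0] + b) - y[idx]         # forward pass on one batch
        W -= alpha * 2 * np.mean(error * X[idx, 0])  # batch-averaged gradient step
        b -= alpha * 2 * np.mean(error)
```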

figure 4

To map this to the analogy (figure 4), we use a cup: bigger than the glass from the SGD example, but still smaller than the oversized bucket. It fills the bucket with fewer repetitions, and so in less time. We can conclude that, among the variants of Gradient Descent optimizers, Mini Batch GD is efficient and computationally inexpensive. However, there is one drawback with Mini Batch GD: noise.

In my next post, I will talk about noise and the technique to solve it, and I will also explain other types of optimizers.

I hope you liked the article and were able to understand the variants of Gradient Descent through this analogy.
