
Stochastic Gradient Descent - Terminologies

Background

Stochastic gradient descent is an optimization algorithm from the gradient descent family that finds proper feature weights faster and with far less memory than batch gradient descent. It is used extensively across machine learning. Before going further, I recommend going through batch gradient descent first, as it is the base for understanding every other gradient descent algorithm.

Definition

In batch gradient descent, we calculated the gradient over all data points in each iteration, whereas in stochastic gradient descent (SGD) the gradient is computed on randomly selected samples from the training data in each iteration (epoch). Keep the following pseudocode in mind while working through the algorithm.

Pseudocode

Steps:
    1. Initialize random feature weights.
    2. Calculate the gradient for randomly selected records from the training data.
    3. Adjust feature weights by calculating new weights.
    4. Use the new weights for prediction and calculate the error.
    5. Repeat steps 2, 3, and 4 to keep adjusting the feature weights
       until the error no longer reduces or an iteration limit is reached.

Explanation

• Just like batch gradient descent, we start with a random initialization of the weights.
• Then we randomly select some data points from the training data.
• We calculate the gradient, compute the new weights, and update the existing feature weights.
• This entire process is repeated until the error no longer reduces or an iteration limit is reached.
• Some samples may get randomly selected multiple times; the variants of stochastic gradient descent explained below take care of this.

It is as simple as that and only slightly different from batch gradient descent. All the remaining operations work the same as in batch gradient descent (BGD). Keep in mind that the above pseudocode is just the common set of operations behind the algorithm.
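To make these steps concrete, here is a minimal Python sketch of SGD applied to simple linear regression with a squared-error cost. The synthetic data, learning rate, and stopping rule are illustrative assumptions on my part, not part of the pseudocode above.

import numpy as np

def sgd(X, y, lr=0.01, n_epochs=1000, tol=1e-9, seed=0):
    """Minimal SGD sketch for linear regression with a squared-error cost."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = rng.normal(size=n_features)              # step 1: random feature weights
    b = 0.0
    prev_error = np.inf
    for epoch in range(n_epochs):                # step 5: repeat until a stop condition
        i = rng.integers(n_samples)              # step 2: pick one random record
        x_i, y_i = X[i], y[i]
        residual = x_i @ w + b - y_i
        grad_w = 2 * residual * x_i              # gradient of the squared error w.r.t. w
        grad_b = 2 * residual
        w -= lr * grad_w                         # step 3: adjust feature weights
        b -= lr * grad_b
        error = np.mean((X @ w + b - y) ** 2)    # step 4: predict and measure the error
        if abs(prev_error - error) < tol:        # stop when the error no longer reduces
            break
        prev_error = error
    return w, b

# Illustrative usage on synthetic data
rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3
w, b = sgd(X, y)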

So, let us look at the variants of stochastic gradient descent available to us.

Variants

First

Steps:
    1. Initialize random feature weights.
    2. Randomly shuffle data points in the training data.
    3. Calculate the gradient for randomly selected records, drawn
       with replacement from the training data.
    4. Adjust feature weights by calculating new weights.
    5. Use the new weights for prediction and calculate the error.
    6. Repeat steps 3, 4, and 5 to keep adjusting the feature weights
       until the error no longer reduces or an iteration limit is reached.

The entire training data is shuffled only once, at the beginning, and samples are selected with replacement, meaning that a currently selected data point remains available for the next selection.
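A minimal sketch of what that sampling schedule could look like; only the selection logic is shown, and the gradient update is assumed to be the same as in the earlier SGD sketch.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_epochs = 200, 50

shuffled = rng.permutation(n_samples)        # step 2: shuffle the indices once, up front
for epoch in range(n_epochs):
    i = shuffled[rng.integers(n_samples)]    # step 3: draw with replacement from the shuffled order
    # ...compute the gradient on record i and update the weights as in the earlier sketch...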

Second

Steps:
    1. Initialize random feature weights.
    2. Calculate the gradient for randomly selected records, drawn
       with replacement from the training data.
    3. Randomly shuffle data points in the training data.
    4. Adjust feature weights by calculating new weights.
    5. Use the new weights for prediction and calculate the error.
    6. Repeat steps 2, 3, 4, and 5 to keep adjusting the feature weights
       until the error no longer reduces or an iteration limit is reached.

Here the difference is that the training data points are shuffled in every iteration (epoch), so the chances of drawing the same sample again are reduced.
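For comparison, a sketch of this second variant under the same assumptions; the only change is that the order is reshuffled inside every epoch.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_epochs = 200, 50

for epoch in range(n_epochs):
    shuffled = rng.permutation(n_samples)    # step 3: reshuffle the training order every epoch
    i = shuffled[rng.integers(n_samples)]    # step 2: draw with replacement from the fresh order
    # ...compute the gradient on record i and update the weights as in the earlier sketch...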

Diagram

I am keeping this diagram from the batch gradient descent post for understanding the references to local and global minima throughout this article.

Batch Gradient Descent Minima Diagram

Things To Remember

• Since it uses random samples, its noisy updates can help it escape local minima and often reach a better minimum than batch gradient descent.
• With SGD, training on a huge dataset is possible because only one sample of the training data needs to be in memory at a time.
• Because we use random samples, the cost function jumps around more compared to batch gradient descent.
• In case of an irregular cost function, SGD is useful for getting out of local minima.
• Because of the random nature of SGD, stopping exactly at the minimum is difficult, so a concept called a learning rate schedule is used.
• With a learning rate schedule, the learning rate is changed during training so that SGD is able to settle at the global minimum.
• If the learning rate is reduced too fast, SGD may get stuck at a local minimum; if it is reduced too slowly, SGD will take longer to reach the minimum or may never reach it.
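As an illustration of a learning rate schedule, one simple and common choice is inverse-time decay, where the step size shrinks as training progresses. The constants below are just illustrative.

def learning_rate(epoch, initial_lr=0.1, decay=0.01):
    """Inverse-time decay: the step size shrinks smoothly as training progresses."""
    return initial_lr / (1 + decay * epoch)

# The decayed rate would replace the fixed lr in the weight update:
#     w -= learning_rate(epoch) * grad_w
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(learning_rate(epoch), 4))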

Conclusion

• Stochastic gradient descent is a better option than batch gradient descent in terms of memory usage and training speed.
• However, processing one random sample at a time still cannot utilize the hardware properly (no vectorized batch computation), which is why mini-batch gradient descent exists.

End Quote

“A gradient measures how much the output of a function changes if you change the inputs a little bit.” -  Lex Fridman (MIT)

This post is licensed under CC BY 4.0 by the author.