18-3-9

# Gradient Descent Optimization methods from Deep Learning Perspective

By Santanu Pattanayak

It is important to understand and appreciate few key points regarding full batch Gradient Descent and Stochastic gradient descent methods along with their shortcomings so that one can appreciate the need of using different variants of gradient based optimizers in Deep Learning.

# Elliptical Contours

Figure 1. Contour plot for a quadratic cost function with elliptical contours.

The best way to get around this problem is to take larger steps in directions in which gradients are small but consistent and take smaller steps in directions with big but inconsistent gradients. This can be achieved if instead of having a fixed learning rate for all the dimensions we have separate learning rate for each dimension.

Figure 2. Gradient descent for a cost function in one variable.

In the Figure 2 above the cost function between A to C is almost linear and hence gradient descent does well. However, from the point C the curvature of the cost function takes over and hence the gradient at C is not able to keep up with the direction of the change in cost function. Based on the gradient if we take a small learning rate at C we will end up at D which is reasonable enough since it doesn’t overshoot the point of minima. However, a larger step size at C will get us to D’ which is not desirable since it's on the other side of the minima. Again, a large step size at D’ would get us to E and if the learning rate is not reduced the algorithm tends to toggle between points on either side of the minima, leading to oscillations. When this happens one way to stop this oscillation and achieve convergence is to look at the sign of the gradient  or  in successive iterations and if they are having opposite signs reduce the learning so that the oscillations are reduced. Similarly, if the successive gradients are having the same sign then the learning rate can be accordingly increased. When the cost function is a function of multiple weights then the cost function might have curvatures in some dimensions of the weights while along other dimensions the cost function might be linear. Hence for multivariate cost functions the partial derivative of the cost function with respect to each weight  can be similarly analyzed to update the learning rate for each weight or dimension of the cost function.

# Non-Convexity of Cost functions

The other big problem with neural networks is that the cost functions are mostly non-convex and hence the gradient descent method might get stuck at local minimum points leading to sub-optimal solution. The non-convex nature of the neural network is due to the hidden layer units with non-linear activation functions such as sigmoid. Full batch gradient descent uses the full dataset for the gradient computation. While the same is good for convex cost surfaces it has its own problems in case of non-convex cost functions. For non-convex cost surfaces with full batch gradient the model is going to end up in the minima in its basin of attraction.  For if the initialized parameters are in the basic of attraction of a local minima that doesn’t provide good generalization full batch gradient would give a sub-optimal solution.

With stochastic gradient descent, the noisy gradients computed may force the model out of basin of attraction of the bad local minima i.e. one that doesn’t provide good generalization and place it in a more optimal region. Stochastic gradient descent with single data points produces very random and noisy gradients. Gradients with the mini-batches tend to produce much stable estimates of gradients when compared to that of single data points but still noisier than those produced by the full batches. Ideally the mini-batch size should be carefully chosen such that the gradients are noisy enough to avoid or escape bad local minima points but stable enough to converge at global minima or a local minimum that provides good generalization.

Figure 3. Contour plot showing basins of attraction for Global and Local minima and traversal of paths for gradient descent and Stochastic gradient descent.

# Saddle points in the High Dimensional Cost functions

So, (x, y) = (0, 0) is a stationary point. The next thing to compute is the Hessian matrix and evaluate its Eigen values at (x, y) = (0, 0). The Hessian matrix H f(x, y) is as below:

So, the Hessian H f(x, y) at all points including (x, y) = (0, 0) is

The two eigen values of the H f(x, y) are 2 and -2 corresponding to the Eigen vectors  and  which are nothing but the directions along and axis. Since one Eigen value is positive and the other negative hence (x, y) = (0, 0) is a saddle point. Since the Eigen value is positive along X direction hence with respect to X axis the point (0,0) is a local minima whereas with respect to Y direction it is a local maxima.

Figure 4. Plot of f(x,y) = x2 - y2

The non-convex function f(x,y) = x2 - yis plotted in the Figure 4 above where S is the saddle point at x, y = (0, 0)

For more information kindly refer to the author’s book Pro Deep Learning with TensorFlow

Santanu Pattanayak currently works at GE, Digital as a Senior Data Scientist. He has 10 years of overall work experience with six of years of experience in the data analytics/data science field and also has a background in development and database technologies. Prior to joining GE, Santanu worked in companies such as RBS, Capgemini, and IBM. He graduated with a degree in electrical engineering from Jadavpur University, Kolkata and is an avid math enthusiast. Santanu is currently pursuing a master's degree in data science from Indian Institute of Technology (IIT), Hyderabad. He also devotes his time to data science hackathons and Kaggle competitions where he ranks within the top 500 across the globe. Santanu was born and brought up in West Bengal, India and currently resides in Bangalore, India with his wife.

For more, check out Santanu's book Pro Deep Learning with TensorFlow.