Understanding neural network parameters with TensorFlow in Python: the optimiser

Welcome to the third instalment of a series of posts introducing deep neural networks (DNN) for spectral data regression. In this tutorial, we’ll discuss the notion of the optimiser and its function in training a neural network. We’ll use the same dataset as in the first two posts, putting together a basic neural network using TensorFlow in Python.

Before starting, and if you haven’t done so already, consider taking some time to read our previous posts on this topic:

  1. An earlier post on Binary classification of spectra with a single perceptron.
  2. The first post of this short series: Deep neural networks for spectral data regression with TensorFlow.
  3. The second post: Understanding neural network parameters with TensorFlow in Python: the activation function.

The optimiser

A fully-connected neural network is generally composed of a series of perceptrons (sometimes called neurons) arranged into layers. A perceptron is a mathematical function that computes a weighted sum of its inputs and passes the result through a thresholding function called the activation function. As explained in the first posts of this series, training a neural network means finding the values of the weights of each neuron such that, for any given input, the system generates an output that minimises the error metric (also called the loss function, or cost function) on a given training/test dataset.
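As a quick refresher, here is a minimal sketch of a single perceptron in plain NumPy; the choice of ReLU as activation and all the numbers are illustrative, not taken from the actual network:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs, then a ReLU activation
    # acting as the thresholding function
    z = np.dot(w, x) + b
    return np.maximum(0.0, z)

x = np.array([0.2, 0.5, 0.1])   # inputs
w = np.array([0.4, -0.6, 1.2])  # weights, to be learned during training
b = 0.05                        # bias term

print(perceptron(x, w, b))
```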

The optimiser defines the way in which the loss function is minimised.

The basic idea is to compare the prediction of the network with the expected result on a test set, at each training step. This comparison, encapsulated in the error metric, becomes the feedback for the next step of the training. The feedback is sent back to the network to adjust the weights of each neuron in the appropriate way. This feedback mechanism is called back-propagation, because it is used to adjust the network weights in reverse order, i.e. from the output back to the input layer.

The “appropriate way” to adjust the weights is what the optimiser is all about.
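TensorFlow handles this feedback loop for us when we call model.fit, but the moving parts can be made explicit with a GradientTape. Below is a purely illustrative sketch, with a toy model and made-up data shapes:

```python
import tensorflow as tf

# Toy model and data, purely for illustration
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((32, 10))  # a batch of inputs
y = tf.random.normal((32, 1))   # the corresponding expected outputs

with tf.GradientTape() as tape:
    y_pred = model(x)          # forward pass
    loss = loss_fn(y, y_pred)  # error metric: the feedback signal

# Back-propagation: gradients of the loss with respect to the weights,
# computed from the output layer back towards the input layer
grads = tape.gradient(loss, model.trainable_variables)

# The optimiser decides how to turn those gradients into weight updates
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```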

As it turns out, we have already met the prototypical optimiser: it is none other than the gradient descent mechanism we introduced in the perceptron post. The loss function can be thought of as a landscape in the multi-dimensional space defined by the weights. Mathematically, we can say that the loss function is a function of the weights. With the gradient descent method, the direction in which to change the weights in order to minimise the loss is the one that decreases (“descends”) the gradient.

Gradient descent gives us a rule to change the network weights so that, at the next step, the loss function has decreased. This is, in a nutshell, what an optimiser does.
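As a bare-bones sketch of that rule, here is a gradient descent step in plain NumPy, applied to the toy loss L(w) = ||w||², whose gradient is 2w (the loss and the learning rate are, of course, made up):

```python
import numpy as np

def gradient_descent_step(w, grad, learning_rate=0.1):
    # Move the weights against the gradient of the loss
    return w - learning_rate * grad

w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, 2 * w)  # gradient of ||w||^2 is 2w

print(w)  # very close to the minimum at the origin
```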

Stochastic Gradient Descent (SGD)

Gradient descent is a very powerful idea, and very useful for explaining the principle of optimisation. It is, however, not very efficient, and in practice it is essentially never used as-is with modern neural networks.

The reason is that, for large networks with millions of weights, calculating the gradient of the loss function is computationally very taxing. In the perceptron post, we calculated the gradient by taking the dot product between the input vector and the error vector (the loss). With millions (or even billions) of parameters this operation is not efficient at all.

A solution to this problem was formulated with stochastic gradient descent (SGD). The SGD optimiser follows the same update rule as vanilla gradient descent, but it estimates the gradient on a randomly selected mini-batch of the training samples, instead of on the entire dataset. This keeps the computational cost in check, while the random selection ensures that, on average, the stochastic gradient stays close enough to the true gradient that the result of the optimisation is (hopefully) not much worse.

In practice, of course, the SGD optimisation is noisier than the corresponding one done with the full gradient. The computational advantage, however, is such that a noisier optimisation is not a big price to pay.
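To make the difference concrete, here is the same kind of update applied to a toy linear regression problem, with the gradient estimated on a random mini-batch of the training samples at each step (the data, batch size and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y is a noisy linear function of X
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -0.5, 2.0, 0.0, 0.7])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
learning_rate, batch_size = 0.1, 32

for step in range(500):
    # Random subsample of the *training samples*, not of the weights
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    # Gradient of the mean squared error on the mini-batch only
    grad = 2 * X_batch.T @ (X_batch @ w - y_batch) / batch_size
    w -= learning_rate * grad

print(w)  # close to w_true, despite never using the full gradient
```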

The learning rate and the Adam optimiser

Stochastic gradient descent, just like ordinary gradient descent, is designed to move the weights in the direction that decreases the loss, i.e. against the gradient. In deep learning language, the size of each step is called the learning rate. The learning rate defines how far the weights move along the negative gradient at each update.

The learning rate is a tunable parameter. If it’s too small, the algorithm will be very slow in converging towards the minimum. If it’s too large, one may inadvertently overshoot the minimum and therefore make the process unable to converge.

While one can try to optimise the learning rate, with plain SGD one is limited to a fixed value. This may be a problem when the algorithm is close to converging, since at that point a smaller learning rate would allow a finer optimisation.

A modern class of optimisers is designed to be adaptive, that is, to adjust the learning rate itself during training. This is an advantage in itself, and it also helps with the problem of vanishing gradients, which we discussed in a previous post. Without going into the mathematical details of the optimisation strategies, we mention here two popular adaptive optimisers: RMSprop and Adam (Adaptive Moment Estimation).
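In Keras, all three optimisers are available off the shelf. Here is a quick sketch of how they are instantiated; the learning rates shown are the library defaults, written out explicitly:

```python
import tensorflow as tf

# Plain SGD: a single learning rate, fixed throughout training
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# RMSprop: rescales the step for each weight using a running average
# of the squared gradients
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)

# Adam: combines RMSprop-style rescaling with a momentum-like running
# average of the gradients themselves
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
```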

Before going to the code, let me just mention the accessible introduction to the role of optimisers in deep learning by Sebastian Ruder: An overview of gradient descent optimization algorithms. TL;DR: Ruder recommends the Adam optimiser for most modern deep learning training problems.

In the next section we’ll show a comparison between SGD, RMSprop and Adam on an NIR dataset.

Training comparison between different optimisers

The first part of the code below is identical to the one we used in the post on comparing activation functions. The difference is that here we set the activation function to ReLU and cycle through the different optimisers. I therefore won’t repeat the explanation of the code details, for which I recommend reading our previous post linked above.
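What follows is a minimal sketch of that procedure. It assumes the spectra have already been loaded into a matrix X and the target variable into a vector y; the layer sizes, batch size and number of epochs are illustrative rather than tuned.

```python
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.model_selection import train_test_split

# X (NIR spectra) and y (quantity to predict) are assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

def build_model(n_features):
    # Fully-connected network with ReLU activations throughout
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1),
    ])

histories = {}
for name in ['SGD', 'RMSprop', 'Adam']:
    model = build_model(X_train.shape[1])
    # Passing the optimiser by name uses its default parameters
    model.compile(optimizer=name, loss='mse')
    h = model.fit(X_train, y_train, epochs=300, batch_size=16,
                  validation_data=(X_test, y_test), verbose=0)
    histories[name] = h.history['val_loss']

for name, val_loss in histories.items():
    plt.plot(val_loss, label=name)
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()
plt.show()
```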

This is the plot generated in one of the runs (note that, since the network is initialised with random weights, each individual outcome may vary).

[Figure: validation loss during training for the SGD, RMSprop and Adam optimisers]

In this simple example we see no noticeable difference between the three optimisers but, again, I refer you to the excellent overview by Sebastian Ruder quoted above for the advantages of adaptive methods, and for the choice that works best for your data.

Thanks for reading and until next time,
Daniel

References

For reference, here’s the full list of posts (published so far) dedicated to neural networks for spectral data processing with TensorFlow.

  1. Deep neural networks for spectral data regression with TensorFlow.
  2. Understanding neural network parameters with TensorFlow in Python: the activation function.

There are many excellent resources out there to learn more about neural networks. Relevant resources are:

  1. J. Brownlee, How to Configure the Learning Rate When Training Deep Learning Neural Networks.
  2. J. Brownlee, Gentle Introduction to the Adam Optimization Algorithm for Deep Learning.
  3. The TensorFlow Python library.
  4. The TensorFlow regression tutorial.
  5. TensorFlow – Keras optimisers.
  6. Sebastian Ruder, An overview of gradient descent optimization algorithms.
  7. F. Chollet, Deep Learning with Python, second edition. Simon and Schuster, 2021.
  8. The dataset is taken from the publicly available data associated with the paper by J. Wenjun, S. Zhou, H. Jingyi, and L. Shuo, In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near-Infrared Spectroscopy, PLOS ONE 11, e0159785.