Principal component selection with simulated annealing

Simulated annealing is a fairly common algorithm for stochastic optimisation. In this tutorial we are going to look at how to use simulated annealing for principal component selection in PCR.

In the previous post we discussed how to run a random search with a greedy algorithm, and covered the common pitfalls of such an approach. This post closely mirrors that one: we are going to use the same data and the same notation, which will make it easier to compare the results of simulated annealing with the output of the greedy algorithm we coded last time. Therefore I strongly recommend that you take a look at the previous post if you haven’t done so already.

The plan for this post is to introduce the concept of simulated annealing and how it differs from the greedy strategy. Then we are going to modify the greedy algorithm code from last time to implement simulated annealing. Finally we’ll compare the results of the two approaches.

Introduction to optimisation with simulated annealing

The main problem with a greedy algorithm is that, while it generally converges to a minimum of the cost function (assuming we are implementing a minimisation algorithm), there is no guarantee that the result is optimal. As we have seen in the previous post, a greedy algorithm aims at minimising the metric at every step. As a result, the process may get stuck in a local minimum, unable to get out of it and converge towards the global minimum.

Given that we do not generally know the shape of the cost function, we have no guarantee that the result found by the greedy strategy is the best possible result; and with a greedy strategy alone we do not have a way to find that out either.

That’s where a simulated annealing strategy comes to the rescue.

The simulated annealing class of algorithms was originally inspired by the annealing process in metallurgy. Metallurgical annealing is the process whereby a metal is heated above its re-crystallisation temperature so that its atoms can migrate across the lattice. This process can reduce the number of dislocations, thereby increasing the ductility of the metal (i.e. making it more workable) and reducing its hardness.

This is achieved by slowly cooling the metal down after heating. The final properties of the material are heavily dependent on the initial temperature and the rate of cooling. Oftentimes one doesn’t monotonically decrease the temperature but instead uses cycles of partial re-heating and cooling until the final result is obtained. The partial re-heating can help further smooth out the dislocations present in the material.

This is the principle followed in simulated annealing optimisation. Suppose that decreasing the cost function corresponds to the “cooling” process (the cost function may be thought of as the equivalent of the temperature). A greedy strategy seeks to always decrease the temperature. A simulated annealing strategy instead allows, with some probability, the system to occasionally “heat up”. In other words, it allows the cost function to occasionally increase.

The occasional (random) increase of the cost function is governed by a probability that we set as a hyper-parameter. Loosely speaking, we want to set the probability so that we get a few occurrences of the “heating” step. If the probability is too low (i.e. an increase has essentially no chance of occurring) we simply have a greedy algorithm. The opposite situation, a probability that is too high, is also bad, because it keeps the system jumping all over the place with no chance of converging.

From what we said, it must be clear that a judicious setting of this parameter is key to a successful implementation of simulated annealing. Therefore, let’s spend a few words on the strategy we are going to implement.

RMSE minimisation with simulated annealing

Our objective is to pick and choose principal components that will minimise the RMSE in cross-validation when the chosen set of principal components are used to run a PCR. As discussed in the greedy algorithm minimisation of the previous post, the strategy is to pick a random move (a random change in the set of principal components) and accept it only if it decreases the RMSE.

In simulated annealing we are going to define a “cooling” parameter c, which we set at a fraction of a percent of the minimum value of the RMSE we have attained so far. Let’s call this value \mathrm{RMSE_{min}}, and the current value \mathrm{RMSE_{cur}}. The probability is then

\exp \left[- \frac{\mathrm{RMSE_{cur}} - \mathrm{RMSE_{min}}}{c}\right]

where c is the cooling parameter we introduced above. Some of you may recognise the Boltzmann distribution of thermodynamics fame, which is just a formal way of putting what we said above: simulated annealing is inspired by the physical process of annealing and its thermodynamics.
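To make the formula concrete, here’s a minimal sketch of the acceptance probability in Python (the function name and the example numbers are mine, not part of the original code).

import numpy as np

def acceptance_probability(rmse_cur, rmse_min, c):
    """Probability of accepting a move that does not improve on the best RMSE."""
    return np.exp(-(rmse_cur - rmse_min) / c)

# Example: current RMSE 1% above the minimum, cooling parameter at 0.5% of the minimum
rmse_min = 1.2
c = 0.005 * rmse_min
print(acceptance_probability(1.01 * rmse_min, rmse_min, c))   # exp(-2), about 0.14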

Anyway, the probability defined above has the advantage of being fairly small, so as not to hinder convergence, yet not so small as to rule out the occasional jump. Moreover, since the cooling parameter is tied to the minimum RMSE, it decreases as the system converges towards a minimum. The consequence is that the system tends to “cool down” on average, and the likelihood of a jump decreases as we descend towards the minimum of the RMSE.

The presence of this non-zero probability, however, is all-important. It introduces an element of randomness that, for sufficiently long runs, prevents the process from getting stuck in a secondary minimum, thus helping to find the global minimum.

OK, this is as much as we want to say on the theory. Let’s write down some code.

Principal component selection with simulated annealing in Python

We are going to use NIR spectra from fresh peaches and corresponding brix values. The dataset is available for download at our Github repo. We’ll follow exactly the same steps we used in the previous post on greedy algorithms. For completeness we are writing the complete code here, but won’t explain the parts we have already covered.

Imports and data.
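A minimal sketch of what this looks like is below; the file name and column names are placeholders, so adapt them to the copy of the dataset you download from the repo.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

# Load the peach NIR spectra and brix values
# (file and column names are placeholders: adapt them to the file from the repo)
data = pd.read_csv("peach_spectra_brix.csv")
y = data["Brix"].values        # reference brix values
X = data.values[:, 1:]         # NIR spectra, one row per sample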

PCR functions (check also similar functions in our PCR post).
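A possible sketch of those helpers, where the names, signatures and the 10-fold cross-validation are my own choices:

def pca_scores(X, n_components):
    """Decompose the spectra with PCA and return the scores."""
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X)

def pcr_rmse_cv(Xpc, y):
    """Cross-validated RMSE and R^2 of a linear regression on a set of PC scores."""
    regr = LinearRegression()
    y_cv = cross_val_predict(regr, Xpc, y, cv=10)
    return np.sqrt(mean_squared_error(y, y_cv)), r2_score(y, y_cv)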

Basic PCR regression which uses the functions above
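A sketch of that step using the helpers above; the pool of 30 candidate components is an arbitrary cap, to be adjusted to your data.

# Decompose the spectra, keeping a generous pool of candidate components
Xpc = pca_scores(X, n_components=30)

# Find the number of leading components (taken in order) that minimises the RMSE
rmse_by_n = [pcr_rmse_cv(Xpc[:, :n], y)[0] for n in range(1, Xpc.shape[1] + 1)]
best_n = int(np.argmin(rmse_by_n)) + 1
print("Leading components:", best_n)
print("RMSE CV:", rmse_by_n[best_n - 1])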

Result

This is the basic PCR regression using the first 19 principal components: when the components are taken in their natural order, 19 is the number that minimises the RMSE. The aim of our optimisation algorithm is instead to pick different combinations of principal components, not necessarily in the natural PCA order, and use those for regression.

Now we are going to modify the iterative algorithm we saw in the last post to use simulated annealing. The most important differences from the greedy approach are:

  1. We define a cooling parameter, which we set to 0.5% of the minimum RMSE.
  2. We run the loop in almost the same way as we did in the greedy algorithm. If the new RMSE is less than the previous one, we keep the change. However, if the new RMSE happens to be larger than the previous one, we might still accept it, depending on the probability defined above. In practice we draw a random number from a uniform distribution in [0,1]; if that random number is less than the probability value, we accept the change.

In this way, the larger the probability value, the more likely we are to accept a larger RMSE. The numerical value of this probability depends on the difference between the current RMSE and the minimum RMSE (the larger the difference, the less likely the jump) and on the value of the cooling parameter (the smaller the cooling parameter, the less likely the jump).

Here is the full Python code for the simulated annealing. Note that this code assumes the PCA decomposition has already been done, as in the previous code snippet.
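The listing below is a sketch built on the helper functions above, rather than a verbatim reproduction: the random move (flipping the inclusion of a single, randomly chosen component), the iteration count and the variable names are choices of mine.

# Simulated annealing over the set of included principal components.
# Assumes Xpc (the PCA scores), y and pcr_rmse_cv() are defined as in the previous snippets.
rng = np.random.default_rng(0)          # seed chosen arbitrarily, for reproducibility

n_comp = Xpc.shape[1]
mask = np.ones(n_comp, dtype=bool)      # start with all components included
rmse_cur, _ = pcr_rmse_cv(Xpc[:, mask], y)
rmse_min = rmse_cur                     # lowest RMSE attained so far
best_mask = mask.copy()

for it in range(1000):                  # number of iterations is an arbitrary choice
    # Random move: flip the inclusion of one randomly chosen component
    new_mask = mask.copy()
    flip = rng.integers(n_comp)
    new_mask[flip] = not new_mask[flip]
    if not new_mask.any():              # never allow an empty selection
        continue

    rmse_new, _ = pcr_rmse_cv(Xpc[:, new_mask], y)

    # Cooling parameter: 0.5% of the minimum RMSE attained so far
    c = 0.005 * rmse_min

    if rmse_new < rmse_cur:
        # The move decreases the RMSE: always accept it
        mask, rmse_cur = new_mask, rmse_new
        if rmse_new < rmse_min:
            rmse_min, best_mask = rmse_new, new_mask.copy()
    elif rng.uniform() < np.exp(-(rmse_new - rmse_min) / c):
        # The move increases the RMSE: accept it with the Boltzmann-like probability
        mask, rmse_cur = new_mask, rmse_new

print("Best RMSE in cross-validation:", rmse_min)
print("Selected components:", np.where(best_mask)[0])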

Discussion

In the previous code, we set the cooling parameter to 0.5% of the minimum RMSE rather than to a fixed value, because we want it to decrease as the algorithm converges to a minimum: as the minimum RMSE decreases, so does the cooling parameter. The value of 0.5% comes from a bit of trial and error.

I noticed that lower values (say 0.1%) tend to generate very low probabilities, which prevent a jump in almost all cases. In practice, very low values of the cooling parameter make the algorithm effectively equivalent to a greedy algorithm.

Conversely, if we increase the cooling parameter, say to 1%, the algorithm has trouble converging. You can try that out for yourself: you’ll note that for values of the RMSE around 1.2 the probability becomes large enough to produce frequent jumps, making convergence very unlikely.

At around 0.5% we have quite a few jumps, which are useful to speed up convergence, but without creating instabilities.

The result we get after running the iterative code above is slightly better than (but essentially equivalent to) the greedy algorithm’s: the RMSE drops to about 1.2 with a selection of 14 principal components. Just like we’ve done in the previous post, here’s the code to run a regression with the optimised components and visualise a scatter plot.
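A sketch of that final step, assuming best_mask from the loop above holds the optimised selection:

import matplotlib.pyplot as plt

# Cross-validated predictions using only the optimised set of components
regr = LinearRegression()
y_cv = cross_val_predict(regr, Xpc[:, best_mask], y, cv=10)
print("RMSE CV (optimised):", np.sqrt(mean_squared_error(y, y_cv)))
print("R2 CV (optimised):", r2_score(y, y_cv))

# Scatter plot of predicted versus measured brix values
plt.scatter(y, y_cv, edgecolors="k")
plt.plot(y, y, color="r")      # identity line as a visual guide
plt.xlabel("Measured brix")
plt.ylabel("Predicted brix")
plt.show()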

which prints the cross-validation metrics of the optimised model and plots the scatter chart of predicted versus measured values:

Simulated annealing for principal component regression

In conclusion, the result is not too dissimilar from the greedy approach, which tells us that a greedy algorithm also does a good job of minimising the RMSE in this case. However, a simulated annealing approach gives us much more flexibility in designing the optimisation strategy, which can certainly result in improved performance in general.

Thanks for reading and until next time,

Daniel