Simulated annealing is a fairly common algorithm for random optimisation. In this tutorial we are going to look at how one can use a simulated annealing algorithm for principal component selection in PCR.

In the previous post we discussed how to run a random search with a greedy algorithm, and discussed the common pitfalls of such an algorithm. This post will closely map our previous post on greedy algorithms. We are going to use the same data and the same notations. That will make it easier to compare the results of simulated annealing with the outputs of the greedy algorithm as coded last time. Therefore I strongly recommend that you take a look at the previous post if you haven’t done that already.

The plan for this post is to introduce the concept of simulated annealing and how it differs from the greedy algorithm strategy. Then we are going to modify the greedy algorithm code we have seen last time, to implement the simulated annealing. Finally we’ll compare the results of the two approaches.

## Introduction to optimisation with simulated annealing

The main problem of a greedy algorithm is that, while it generally converges to a minimum of the cost function (assuming we are implementing a minimisation algorithm), there is no guarantee that the result is optimal. As we have seen in the previous post, a greedy algorithm will aim at minimising the metric at each step. As a result, chances are the process may get stuck in a local minimum without being able to get out of it to converge towards a global minimum.

Given that we do not generally know the shape of the cost function, we have no guarantee that the result found by the greedy strategy is the best possible result; and with a greedy strategy alone we do not have a way to find that out either.

That’s where a simulated annealing strategy comes to the rescue.

The simulated annealing class of algorithms was originally inspired by the annealing process in metallurgy. Metallurgical annealing is the process whereby a metal is heated above its re-crystallisation temperature so that its atom can migrate across the lattice. This process can reduce the number of dislocations thereby increasing the ductility of the metal (i.e. making it more workable) and reducing its hardness.

This is achieved by slowly cooling the metal down after heating. The final properties of the material are heavily dependent on the initial temperature and rate of cooling. Oftentimes one doesn’t monotonically decrease the temperature but instead uses cycles of partial re-heating and cooling until the final result is obtained. The partial re-heating can help further smooth out the dislocation present in the material.

This is the principle that is followed in the simulated annealing optimisation. Suppose that decreasing the cost function corresponds to the “cooling” process (the cost function may be thought as the equivalent of the temperature). A greedy strategy seeks to always decrease the temperature. A simulated annealing strategy would allow, with some probability, to occasionally “heat up” the system. In other words it allows for the cost function to occasionally increase.

The occasional (random) increase of the cost function is governed by a probability that we set up as a hyper-parameter. Loosely speaking we want to set the probability so that we have a few occurrences of the “heating” step. If the probability is too low (i.e. the increase has no chance to occur) we have simply a greedy algorithm. The opposite situation is when the probability is too high. This is also bad, because it will keep the system jumping all over the place, with no chance of converging.

From what we said, it must be clear that a judicious setting of this parameter is key to a successful implementation of simulated annealing. Therefore, let’s spend a few words on the strategy we are going to implement.

## RMSE minimisation with simulated annealing

Our objective is to pick and choose principal components that will minimise the RMSE in cross-validation when the chosen set of principal components are used to run a PCR. As discussed in the greedy algorithm minimisation of the previous post, the strategy is to pick a random move (a random change in the set of principal components) and accept it only if it decreases the RMSE.

In simulated annealing we are going to define a “cooling” parameter c, which we set at a fraction of a percent of the minimum value of the RMSE we have attained so far. Let’s call this value \mathrm{RMSE_{min}}, and the current value \mathrm{RMSE_{cur}}. The probability is then

\exp \left[- \frac{\mathrm{RMSE_{cur} - RMSE_{min}}}{c}\right]where *c* is the cooling parameter we introduced above. Some of you may recognise the Boltzmann distribution of thermodynamics fame, which is just a formal way of putting what we said above: simulated annealing is inspired by the physical process of annealing and its thermodynamics.

Anyway, the definition of the probability above has the advantage of generating a probability which is fairly small so as not to hinder convergence, but not too small so as to avoid the occasional jump. Moreover, as the system converges towards a minimum, the cooling parameter is also decreasing. The consequence is that the system tends to “cool down” on average, and the likelihood of a jump will decrease as we descend toward the minimum of the RMSE.

The presence of this non-zero probability however is all too important. It introduces an element of randomness that, for sufficiently long runs, will prevent the process from getting stuck in a secondary minimum, and so helping to find the global minimum.

OK, this is as much as we want to say on the theory. Let’s write down some code.

## Principal component selection with simulated annealing in Python

We are going to use NIR spectra from fresh peaches and corresponding brix values. The dataset is available for download at our Github repo. We’ll follow exactly the same steps we used in the previous post on greedy algorithms. For completeness we are writing the complete code here, but won’t explain the parts we have already covered.

Imports and data.

1 2 3 4 5 6 7 8 9 10 11 12 13 |
import numpy as np import matplotlib.pyplot as plt import pandas as pd from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler from sklearn import linear_model from sklearn.model_selection import cross_val_predict from sklearn.metrics import mean_squared_error, r2_score data = pd.read_csv('../data/peach_spectra+brixvalues.csv') y = data['Brix'].values X = data.values[:,1:] |

PCR functions (check also similar functions in our PCR post).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
def pcr(X,y, pc): # Run PCA and select the first pc components pca = PCA() Xreg = pca.fit_transform(X)[:,:pc] # Create linear regression object regr = linear_model.LinearRegression() # Fit regr.fit(Xreg, y) # Calibration y_c = regr.predict(Xreg) # Cross-validation y_cv = cross_val_predict(regr, Xreg, y, cv=5) # Calculate scores for calibration and cross-validation score_c = r2_score(y, y_c) score_cv = r2_score(y, y_cv) # Calculate mean square error for calibration and cross validation rmse_c = np.sqrt(mean_squared_error(y, y_c)) rmse_cv = np.sqrt(mean_squared_error(y, y_cv)) return(y_cv, score_c, score_cv, rmse_c, rmse_cv) def pcr_optimise_components(Xpca, y, npc): rmsecv = np.zeros(npc) for i in range(1,npc+1,1): Xreg = Xpca[:,:i] # Create linear regression object regr = linear_model.LinearRegression() # Fit regr.fit(Xreg, y) # Cross-validation y_cv = cross_val_predict(regr, Xreg, y, cv=5) # Calculate R2 score score_cv = r2_score(y, y_cv) # Calculate RMSE rmsecv[i-1] = np.sqrt(mean_squared_error(y, y_cv)) # Find the minimum of the RMSE and its location opt_comp, rmsecv_min = np.argmin(rmsecv), rmsecv[np.argmin(rmsecv)] return (opt_comp+1, rmsecv_min) |

Basic PCR regression which uses the functions above

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
pca = PCA() Xpca = pca.fit_transform(X) opt_comp, rmsecv_min = pcr_optimise_components(Xpca, y, 30) predicted, r2r, r2cv, rmser, rmscv = pcr(X,y, pc=opt_comp) print("PCR optimised values:") print(" R2: %5.3f, RMSE: %5.3f"%(r2cv, rmscv)) print(" Number of Principal Components: "+str(opt_comp)) # Regression plot z = np.polyfit(y, predicted, 1) with plt.style.context(('ggplot')): fig, ax = plt.subplots(figsize=(9, 5)) ax.scatter(y, predicted, c='red', edgecolors='k') ax.plot(y, z[1]+z[0]*y, c='blue', linewidth=1) ax.plot(y, y, color='green', linewidth=1) plt.title('Conventional PCR') plt.xlabel('Measured $^{\circ}$Brix') plt.ylabel('Predicted $^{\circ}$Brix') plt.show() |

Result

1 2 3 |
PCR optimised values: R2: 0.546, RMSE: 1.455 Number of Principal Components: 19 |

This is the basic PCR regression which takes the first 19 principal components, which is the number of components which, when taken in order, minimises the RMSE. The aim of our optimisation algorithm is instead to pick different combinations of principal components, not necessarily in the natural PCA order, and use those for regression.

Now we are going to modify the iterative algorithm we’ve seen in the last post to use simulated annealing. The most important differences from the greedy approach are:

- We are defining a cooling parameter which we set to be 0.5% of the minimum RMSE
- We run the loop in
*almost*the same way as we did in the greedy algorithm. If the new RMSE is less than the previous, we keep the last change. However, if the new RMSE happens to be larger than the previous, we might still accept it, depending on the probability defined above. In practice we extract a random number from a uniform distribution in [0,1]. If such a random number is less than the probability value, we accept the change.

In this way, the larger the probability value, the more likely it will be to accept a larger RMSE. The numerical value of such probability depends on the difference between the current RMSE and the minimum RMSE (the larger the difference, the less unlikely is the jump) and the value of the cooling parameter (the smaller the cooling parameter, the less likely the jump).

Here is the full Python code for the simulated annealing. Note this code assumes the PCA decomposition has been already done, as in the previous code snippet.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
tot_pc = 30 #total number of PCs sel_pc = 15 #number of PCs we want to select # Generate random sample to start with p = np.arange(tot_pc) np.random.shuffle(p) pc = p[:sel_pc] # Selected Principal Components notpc = p[sel_pc:] # Excluded Principal Components Xop = Xpca[:,pc] # Selected PCA-transformed data # Out of this random selection, choose the first n PCs that optimise the RMSE in CV opt_comp, rmsecv_min = pcr_optimise_components(Xop,y,tot_pc) print('start',opt_comp, rmsecv_min) # Simulated annealing for i in range(500): # Define the cooling parameter cool = 0.005*rmsecv_min # Make a copy of the original lists on PCs new_pc = np.copy(pc) new_notpc = np.copy(notpc) # Swap two elements at random between the copied lists r1, r2 = np.random.randint(sel_pc),np.random.randint(tot_pc-sel_pc) el1, el2 = new_pc[r1],new_notpc[r2] new_pc[r1] = el2 new_notpc[r2] = el1 # Select the new PCA-transformed data Xop = Xpca[:,new_pc] # Pick the first n PCs that optimise the RMSE in CV opt_comp_new, rmsecv_min_new = pcr_optimise_components(Xop,y,tot_pc) # If the new RMSE is less than the previous, keep the new selection if (rmsecv_min_new < rmsecv_min): pc = new_pc notpc = new_notpc opt_comp = opt_comp_new rmsecv_min = rmsecv_min_new print(i, opt_comp_new, rmsecv_min_new) # If the new RMSE is larger than the previous, accept the new selection # with some probability dictated by the cooling parameter if (rmsecv_min_new > rmsecv_min): prob = np.exp(-(rmsecv_min_new - rmsecv_min)/cool) if (np.random.random() < prob): pc = new_pc npc = new_npc opt_comp = opt_comp_new rmsecv_min = rmsecv_min_new print(i, opt_comp_new, rmsecv_min_new, prob) print(pc) print('Done') |

## Discussion

In the previous code, we hard-coded the cooling parameter to be 0.5% of the minimum RMSE. We avoided a fixed value because we want the cooling parameter to decrease as the algorithm converges to a minimum. In this way, as the minimum RMSE decreases, so does the cooling parameter. The value of 0.5% comes from a bit of trial and error.

I noticed that lower values (say 0.1%) tend to generate very low probability values, hence to prevent, in almost all cases, a jump. In practice, very low values of the cooling parameter make the algorithm effectively equivalent to a greedy algorithm.

Conversely, if we increase the cooling parameter, say to 1%, the algorithm has trouble converging. You can try that out for yourself: you’ll note that for values of the RMSE around 1.2 the probability becomes large enough to have frequent jumps and making convergence very unlikely.

At around 0.5% we have quite a few jumps, which are useful to speed up convergence, but without creating instabilities.

The result we get after running the iterative code above is slightly better (but essentially equivalent) to the greedy algorithm: RMSE drops to about 1.2 with a selection of 14 principal components. Just like we’ve done in the previous post, here’s the code to run a regression with the optimised components and visualise a scatter plot.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
Xop = Xpca[:,pc] Xreg = Xop[:,:opt_comp] # Create linear regression object regr = linear_model.LinearRegression() # Fit regr.fit(Xreg, y) # Cross-validation y_cv = cross_val_predict(regr, Xreg, y, cv=5) # Calculate R2 score in cross-validation score_cv = r2_score(y, y_cv) # Calculate RMSE in cross validation rmsecv = np.sqrt(mean_squared_error(y, y_cv)) print("PCR optimised values after selection by simulated annealing:") print(" R2: %5.3f, RMSE: %5.3f"%(score_cv, rmsecv)) print(" Number of Principal Components: "+str(opt_comp)) # Regression plot z = np.polyfit(y, y_cv, 1) with plt.style.context(('ggplot')): fig, ax = plt.subplots(figsize=(9, 5)) ax.scatter(y, y_cv, c='red', edgecolors='k') ax.plot(y, z[1]+z[0]*y, c='blue', linewidth=1) ax.plot(y, y, color='green', linewidth=1) plt.title('PC selection by simulated annealing') plt.xlabel('Measured $^{\circ}$Brix') plt.ylabel('Predicted $^{\circ}$Brix') plt.show() |

which prints

1 2 3 |
PCR optimised values after selection by simulated annealing: R2: 0.682, RMSE: 1.209 Number of Principal Components: 134 |

And plots

In conclusion, the result is not too dissimilar from a greedy approach, which tells us that also a greedy algorithm does a good job to minimise the RMSE in this case. The thing is however, that a simulated annealing approach gives us way more flexibility in designing the optimisation strategy, which can certainly results in improved performances in general.

Thanks for reading and until next time,

Daniel