Wavelength band selection with simulated annealing

One of the open secrets in chemometrics applied to spectroscopy is that wavelength band selection is a key ingredient, often more important than the specific model used for multivariate calibration. Band selection can significantly improve the quality of a prediction model.

Given the wide wavelength range covered by modern instruments, selecting the wavelength bands that best correlate with the quantity of interest is often a daunting task.

In this post we are going to use an optimisation technique called Simulated Annealing, which enables us to perform a random search in the space of possible band combinations. The aim is to find what we hope is an optimal solution.

Defining the problem of wavelength band selection

Let’s first define what we mean by wavelength bands for the purpose of this optimisation.

Most instruments nowadays have amazing spectral resolution. For many practical situations however, collinearity (also called multi-collinearity) is the norm. Collinearity refers to the fact that despite high spectral resolution, neighbouring wavelengths in many regions of the spectrum are correlated with one another. Hence, one doesn’t really need to retain all of these wavelengths for the data analysis, as adding correlated wavelengths doesn’t change the quality of the model.

As an aside, (discussed in previous posts here and here) collinearity is an actual problem for linear regression as it makes the underlying mathematical problem ill-posed. This is the reason why dimensionality reduction (via PCA or PLS for instance) is almost always required.

For the purpose of this discussion, we are looking at the collinearity problem from a slightly different angle. It goes like this: if neighbouring wavelengths are correlated, we can average them and, in doing so, produce a decimated spectrum which retains a relatively small fraction of the original wavelengths.

At this point we can easily ‘switch on/off’ individual bands (which are in fact single bins in the decimated spectrum) and check how a model will perform.

Note that this makes sense in the decimated spectrum, as adding/removing bands (or swapping bands, as we’ll actually do) is likely to produce a measurable effect. Doing the same thing in the original spectrum is unlikely to generate a significant difference, as many neighbouring wavelengths are equivalent.
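As a toy illustration (the array shapes and values here are made up, not the post’s data), switching bands on and off amounts to indexing the decimated spectra with a boolean mask:

```python
import numpy as np

rng = np.random.default_rng(0)
X_bands = rng.normal(size=(10, 60))   # 10 spectra, decimated to 60 bands

# A boolean mask 'switches' individual bands on or off
mask = np.zeros(60, dtype=bool)
mask[rng.choice(60, size=20, replace=False)] = True   # keep 20 bands

X_selected = X_bands[:, mask]   # submatrix fed to the regression model
print(X_selected.shape)         # (10, 20)
```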

Optimising the bands via simulated annealing

The method of simulated annealing (SA) is a clever way to produce a random optimisation. We introduced SA in a previous post already, so you can refer to that for the details.

In short, we are going to define a cost function, namely the root mean square error in cross-validation (RMSECV), and seek to minimise it. Unlike greedy algorithms, which monotonically decrease the cost function, SA is an optimisation process that can occasionally allow an increase of the cost function (based on some probability).

The net effect of allowing an occasional ‘worse’ solution is that the algorithm is less likely to get stuck in a local minimum. I won’t say more here. For more info you can check out my dedicated post on simulated annealing here.

Here’s the procedure we are going to follow.

  1. Define the number of bands we want to retain. This value, call it N, obviously has to be smaller than the total number of bands. N is a hyper-parameter: it is fixed during the optimisation, but one can try a few values across different runs and compare the results.
  2. Pick N random bands, fit a PLS model optimising the number of latent variables, and calculate the RMSECV as our cost function. This is the starting point of the optimisation.
  3. At this point we start the iterative procedure. At each step we swap three bands at random and extract the RMSECV of the optimised PLS. If the new value of the RMSECV is lower than the old one, keep the swapped bands. Otherwise keep them only with some small probability, driven by the cooling parameter. (Once again, for all the details of the simulated annealing optimisation, refer to my previous post.)
  4. Repeat for a sufficient number of iterations.

At the end of the procedure we return the selected bands, alongside the optimised number of components of the PLS model associated with such a band selection. We also return a list containing the value of the RMSECV at each iteration, which we can plot to get a better understanding of how the RMSE is converging.

Here’s the entire function (for the list of imports, see next section).

Result: band optimisation for Brix prediction in fresh plums

It’s now time to check how this band optimisation works with real data.

We are going to use NIR spectra of fresh plums to predict the sugar content (measured in degrees Brix) from the spectra.

In the following we are going to write the main part of the script. The code of all functions is copied at the end of the post and the entire Jupyter Notebook is available on Github. Let’s start with the imports.

Now let’s import and preprocess the data. The Multiplicative Scatter Correction function used here is copied at the end of the post and discussed here.
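The MSC function is not reproduced in this excerpt either. Here is a minimal sketch of Multiplicative Scatter Correction, assuming (as is common) that the reference spectrum is the mean spectrum of the data set; the function name and signature are my own.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction.
    Each spectrum is regressed against a reference spectrum (the mean
    spectrum by default) and corrected by the fitted offset and slope."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else reference
    X_msc = np.empty_like(X)
    for i in range(X.shape[0]):
        # Least-squares fit: X[i] ~ slope * ref + offset
        slope, offset = np.polyfit(ref, X[i], 1)
        X_msc[i] = (X[i] - offset) / slope
    return X_msc
```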

Now we are going to resample the wavelengths into bands. Since we have 600 wavelengths, let’s decimate it by a factor of 10, down to 60 bands.
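The resampling code is not shown in this excerpt; a sketch of the binning step, assuming the spectra sit in an array `X` of shape `(n_samples, 600)`, could look like this:

```python
import numpy as np

def bin_spectra(X, bin_size=10):
    """Average consecutive groups of `bin_size` wavelengths, reducing an
    (n_samples, n_wavelengths) array to (n_samples, n_wavelengths // bin_size)
    bands. Trailing wavelengths that don't fill a complete bin are dropped."""
    n_samples, n_wl = X.shape
    n_bands = n_wl // bin_size
    return X[:, :n_bands * bin_size].reshape(n_samples, n_bands, bin_size).mean(axis=2)
```

With 600 wavelengths and `bin_size=10`, `bin_spectra(X)` returns the 60-band decimated spectra used in the rest of the post.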

Before we begin the band selection algorithm, let’s check a basic PLS regression against which we’ll benchmark the optimised results. The three functions called here are defined at the end of the post. The first function will optimise the PLS components. The second runs a PLS regression with the optimal number of components. Finally the third function is just a utility function to plot the results.

The result is OK, but not great.

It’s time to run the Simulated Annealing algorithm and see if we can get any better. We choose 20 bands, a maximum of 20 latent variables for the PLS search and a total of 100 iterations. After the algorithm has finished we run a basic PLS with the selected bands and number of components, and we plot the results.

The result is way better than the basic PLS above. The algorithm has halved the RMSECV, with a massive improvement in the coefficient of determination. Note that the result of each individual run of the Simulated Annealing code may vary due to the randomness of the process. We expect however that the final results will differ little from one another. If that is not the case, the model is likely overfitting the data.

To gain some insight into the progress of the optimisation, we can plot the RMSECV as a function of the iteration number.
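A minimal plotting sketch, assuming the annealing run returned a list of RMSECV values (the decaying curve below is a synthetic stand-in, not the actual data):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the RMSECV history returned by the annealing run
rmsecv_history = 0.5 + np.exp(-np.linspace(0, 3, 101))

plt.figure()
plt.plot(rmsecv_history)
plt.xlabel("Iteration")
plt.ylabel("RMSECV")
plt.title("Simulated annealing convergence")
plt.tight_layout()
plt.savefig("sa_convergence.png", dpi=150)
```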

Also, we can take a look at which bands the algorithm has selected. Here’s the code to do that.
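The original code is not included in this excerpt. One way to inspect the selection (the wavelength range and the band indices below are made-up placeholders) is to map each selected band back to the centre wavelength of its bin:

```python
import numpy as np

wl = np.linspace(1100, 2300, 600)    # assumed wavelength axis, in nm
bin_size = 10
# Placeholder for the band indices returned by the annealing run
selected_bands = np.array([2, 5, 11, 18, 25, 31, 38, 44, 51, 57])

# Centre wavelength of every band in the decimated spectrum
band_centres = wl.reshape(-1, bin_size).mean(axis=1)

for b in selected_bands:
    print(f"band {b:2d}: {band_centres[b]:.1f} nm")
```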

By running the code a few times, we can easily check that most of the bands are selected every time (which relates to the stability of the solution across multiple runs, hinted at above).

That’s all for now. Thanks for reading and until next time,

Daniel

Reference of the Python functions used in this post