Wavelength selection with a genetic algorithm

Genetic algorithms are a class of optimisation and search algorithm inspired by the mechanism of evolution by natural selection. Basic functions of a genetic algorithm (loosely) mimic the corresponding mechanisms observable in natural evolution, whereby species evolve due to the combination of random genome mutations and environmental selection.

Optimisation problems are designed to find a function that maximises (or minimises) a desired metric given the constraints of the problem. Genetic algorithms approach this problem by defining:

  • A population, that is a group of potential solutions (these loosely corresponds to chromosomes).
  • A fitness function, which measures the optimality of each solution according to the problem.
  • Crossovers, which is the combination of two solutions, corresponding to the combination of the genetic material of two parents in an offspring)
  • Random mutations that may affect individual parts of each solution.

These basic ingredients are used in an iterative fashion. Suppose that each solution (chromosome) is an array. Each element of the array is a gene. A population will be a series of arrays. A fitness function will be associated with each array so that individual arrays in the series can be ranked by it. A fraction of the ‘fittest’ arrays can be crossed over according to some rule, to form ‘offspring’ arrays. Elements (genes) of offspring arrays could undergo random mutation according to some rule. The new arrays, so formed are the starting ones for the next iteration (“generation”) of the algorithm.

In this post we present a wavelength selection algorithm based on genetic optimisation. Wavelength (or band) selection is a form of variable selection, when the variables are channels in a spectrum. This is a topic we keep coming back to, for its importance in defining chemometrics models for spectroscopy. A few examples of past posts on this topic are

The foundational ideas for variable selection methods is that not all wavelengths (or bands) in a spectrum are equally relevant to predict a quantity of interests. Variable selection methods are therefore developed to search through the large space of possibilities represented by the combination of bands that produce an optimal model.

Since it’s the first time we discuss genetic algorithms at all, we’ll start with a basic example and then move on to the actual wavelength selection algorithm.

Before getting stuck in the content, I’d like to thank our wonderful Patreon backers for the support. And so you know, Patreon supporters get to enjoy additional perks, including Python code, on this and many other topics covered in the blog. You can check out our Patreon page here.

Learning the basics with a simple example

The library of choice to develop genetic algorithms in Python is PyGAD, which is an open source, versatile package, which also works with with Keras and PyTorch (even though we won’t use this functionalities here).

We’ll be using NIR data taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data is available here

In this first example, we just aim at reproducing a target NIR spectrum starting from a population of random spectra, which is updated according to the rules of the genetic algorithms. First, the imports

Next, let’s load the data and select the target spectrum

Next up, we need to define the fitness function. In this (admittedly contrived) example, the fitness function is a simple function of the distance between a solution and the target spectrum. As a distance metric, we pick the Mean Squared Error (MSE). Note that the convention is to define a fitness function that will have to be maximised, therefore we chose the inverse of the MSE as fitness function.

Now we need to define an instance of the genetic algorithm, and here it is. Each of the argument of the function is commented, explaining its meaning.

The very last line enables us to define a function that executes at every iteration. We can choose to have an output printed  to screen like this

All is left is to run the optimisation

which will go through the 10,000 generations that we requested, starting from a set of random arrays (the initial solutions). At the end, we can print the best solution and compare it to the target

Our best solution is quite close to the target, but with some additional noise. You can play around with the parameters for instance by changing the number of solutions in a population (the more you have the better genetic diversity), the number of cross-over parents, the rate of random mutations, etc.

We can also explore how the fitness function improved as the generations progressed, using a pyGAD built-in function

As you can realise by studying this example, all the action is in the definition of the fitness function, assuming a clever definition of the arrays and their constraints. More complicated optimisation problems, like the wavelength selection we are about to present, will have to be wrapped into the fitness function itself.

Wavelength selection

The first step to implement wavelength selection is to define a Boolean (or binary — I use the terms interchangeably here being aware of their difference in other contexts) array which specifies which bands need to be selected. If an element (i.e. a gene) of that array is True, than the corresponding band needs to be selected, otherwise neglected. So, to recap, genes are binary (True/False) and solutions are binary arrays.

Each Boolean array is the subject of the genetic optimisation. It is “evolved” through the algorithm sequence of operations aiming at maximising a fitness function.

We’ll need a few additional imports here.

Also, in order to speed up the process and make the solutions less sensitive to noise, we are going to decimate the spectra, by block-averaging every 4 consecutive wavelengths.

The goal is to optimise a regression model that will predict the independent variable form the spectra. Therefore, the fitness function will have to be related to RMSE in cross-validation or similar. Following this idea, let’s first define a cross-validation (CV) function for regression. We use PLS regression here, but the same approach can be generalised to many other types of regressions

The cv_regressor  function optimises the number of latent variables and returns a prediction based on n-fold (n=5 here) CV. Now, the fitness function will have to include the cv_regressor function and calculate the fitness based on it. Here’s what you could do

Again, keep in mind that the fitness function must increase, so we are going to take the inverse of the RMSE-CV.

In the simple example above, we had chosen to start from a completely random set of solutions in the initial population. We could do the same here of course, but I thought it could be useful to explore another possibility: define the initial population and pass it as a parameter to the GA instance. Here’s how you could generate an initial population with number of solutions per population and number of bands as parameters.

The function above lets us define the size of the initial population, and the number of bands that are initially selected (at random). Here’s how we use this function

We start with 25 bands and 50 solutions per population. The GA instance is defined as

I tried a few different options and the one above seems to outperform the others. As you can see, we let genes undergo random mutations (i.e. each gene can randomly change from True to False independently of the other genes) with a given rate. This will not preserve the number of allowed bands, and in fact will optimise it! I found that this worked much better than the option of keeping the number of selected bands fixed.

Note that we define the genes to be integer and restricted to sit between 0 and 1. In this way we ensure that 0 and 1 are the only possible options.

We now run the instance and plot the results.

In fact we can also compare between the optimised solution and the original

Wavelength selection with a genetic algorithm by Nirpy Research.

In case you’re wondering, the regression_plot  function was defined in this post. I’ve added the metrics at the bottom of each plot to show the improvement achieved with the genetic algorithm.

To wrap it up, the genetic optimisation has improved the metrics of the regression model by selecting bands that are better correlated to the independent variable. This was a basic example to show how the wavelength selection problem could be solved via genetic optimisation.

Well, thanks for reading until the end … and until next time!