Genetic algorithm for wavelength selection using Numpy

A genetic algorithm is a type of optimisation procedure inspired by the mechanism of evolution by natural selection. I’ve already published a version of a genetic algorithm for wavelength (band) selection:

Wavelength selection with a genetic algorithm

The material above is based on the PyGAD library. Now, this library is excellent (even though its documentation is a bit terse), but I wanted to look at a simpler version of a genetic algorithm using basic Numpy. This keeps the relevant functions as simple as possible, which is useful during the learning process.

So, here’s a basic version of a genetic algorithm for wavelength selection using Numpy only.

Review of the basics: what’s a genetic algorithm?

If you haven’t already, I recommend you take a look at our previous post on genetic algorithms for wavelength selection. In the introduction to that post, we discuss how a genetic algorithm (GA) borrows concepts from the mechanism of evolution by natural selection to optimise solutions according to a fitness function.

The basic ingredients of a GA are

  • A population, that is, a group of potential solutions (these loosely correspond to chromosomes).
  • A fitness function, which measures the optimality of each solution according to the problem.
  • Crossovers, i.e. the combination of two solutions (corresponding to the combination of the genetic material of two parents in an offspring).
  • Random mutations that may affect individual parts of each solution.

The algorithm itself is iterative.

  1. Start from an initial population (ensemble) of solutions to a given problem.
  2. Calculate the fitness value for each solution.
  3. Select the solutions with the best fitness, according to some criterion.
  4. Allow some of the solutions to cross over, that is, to mix and generate new ‘offspring’ solutions.
  5. Discard the worst-performing solutions and replace them with the newly-generated offspring to compose a new population.
  6. Iterate steps 2 to 5 as needed.

In the previous post (linked above) these steps were implemented using PyGAD. In the next section, we will do the same, this time using Numpy and scikit-learn.

Genetic algorithm using Numpy and scikit-learn in Python

Let’s begin with the relevant imports.
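A minimal set of imports for the snippets that follow might look like this (a sketch: pandas and matplotlib are assumed for data loading and for the final plot):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
```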

As in the previous post on the topic, we’ll use NIR data taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data is available here.
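A sketch of the loading step is below. The file name, the target column, and the band width of 10 are hypothetical placeholders: adapt them to your copy of the dataset.

```python
# Hypothetical file name and column layout: adjust to your copy of the dataset
data = pd.read_csv("soil_nir_data.csv")
y = data["TOC"].values                 # hypothetical target variable
X = data.values[:, 2:].astype(float)   # hypothetical spectra columns

# Block-average the spectra into bands of 10 adjacent wavelengths each
band = 10
n_bands = X.shape[1] // band
X = X[:, :n_bands * band].reshape(X.shape[0], n_bands, band).mean(axis=2)
```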

Note the block-average performed at the last step. That step is optional but, to minimise overfitting, it is generally recommended to block-average the data into larger wavelength bands rather than using individual wavelengths. Of course, you can also apply a suitable pre-processing step at this stage (see for instance examples on this blog).

We are now ready to define the procedure. Wavelength selection is approached by defining a binary array specifying which bands need to be selected. If an element of that array is True, then the corresponding band is selected; otherwise it is neglected. The binary array plays the role of a ‘mask’ which enables/disables individual elements of the spectra array.
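As a minimal illustration of the masking idea (with an arbitrary random mask):

```python
rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

# A boolean mask with one element per wavelength band
mask = rng.random(X.shape[1]) > 0.5
X_selected = X[:, mask]  # keeps only the bands where the mask is True
```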

To evaluate the fitness of each solution (i.e. each mask when applied to the spectra array), we calculate regression parameters in cross-validation. The first function we define is therefore the cv_regressor.
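A minimal sketch of cv_regressor, assuming PLS regression with an arbitrary number of components and folds:

```python
def cv_regressor(X, y, n_components=8, cv=5):
    """Return the predicted values of a PLS regression in K-fold cross-validation."""
    # Guard against masks that select fewer bands than the requested components
    pls = PLSRegression(n_components=min(n_components, X.shape[1]))
    y_cv = cross_val_predict(pls, X, y, cv=cv)
    return y_cv.ravel()
```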

In this case we’ve gone with PLS Regression, but other choices are obviously possible. Note that the function above is very similar (but not identical!) to the corresponding function we used for the genetic algorithm with PyGAD.

The function returns the predicted values in a K-fold cross-validation. These values can then be used as an input for the fitness function.

Note that, in GAs, the fitness function must be maximised. Hence, if we evaluate the quality of our model by minimising a cost function (such as the RMSE, in the example above), we can simply define the fitness function as the inverse of the cost function.
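A sketch of such a fitness function, built on cv_regressor above, with the inverse RMSE as the fitness value:

```python
def fitness(X, y, solution):
    """Fitness of one solution: the inverse of the cross-validated RMSE."""
    if not solution.any():
        return 0.0  # degenerate mask selecting no bands at all
    y_cv = cv_regressor(X[:, solution], y)
    rmse = np.sqrt(mean_squared_error(y, y_cv))
    return 1.0 / rmse
```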

Next up is the function to select the best-fit solutions or, more precisely, to discard the worst-fit ones. The number of solutions to discard is equal to the number of “parents”, since every pair of parents will generate two offspring solutions. Therefore the number of new solutions at each iteration is equal to the number of parents, and they will replace the worst-fit solutions.
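A minimal sketch of this selection step (the name select_parents is my choice, not a fixed API):

```python
def select_parents(population, fitness_values, num_parents):
    """Split the population into parents (the best solutions) and survivors."""
    order = np.argsort(fitness_values)           # worst to best
    parents = population[order[-num_parents:]]   # best-fit solutions
    survivors = population[order[num_parents:]]  # all but the worst num_parents
    return parents, survivors
```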

The next function is the “cross-over”, i.e. a function that, for any pair of parents, generates two offspring solutions.
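A sketch using a single, randomly chosen crossover point per pair:

```python
def crossover(parents, rng):
    """Each consecutive pair of parents generates two offspring solutions."""
    offspring = []
    for i in range(0, len(parents) - 1, 2):
        p1, p2 = parents[i], parents[i + 1]
        point = rng.integers(1, p1.size)  # random crossover point
        offspring.append(np.concatenate([p1[:point], p2[point:]]))
        offspring.append(np.concatenate([p2[:point], p1[point:]]))
    return np.array(offspring)
```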

Finally, we need a function to introduce random mutations in the newly-generated solutions.
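A minimal sketch, flipping each gene (band on/off) independently with a small probability:

```python
def mutate(offspring, mutation_rate, rng):
    """Flip each element of the offspring with probability mutation_rate."""
    flip = rng.random(offspring.shape) < mutation_rate
    return np.where(flip, ~offspring, offspring)
```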

That’s all we need! In the next section we use these functions in a loop to simulate the evolving generations of solutions.

Testing the algorithm

The GA requires us to specify reasonable values for the population size, the number of parents, the mutation rate, and the number of generations. This last parameter could be replaced by a target value for the fitness function, in which case the code should be modified to stop the loop once the target is reached. In this example we go with the first option. Here’s the code.
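A sketch of the main loop; all the numerical values below are assumptions to be tuned to your data (num_parents should be even, since parents are paired up):

```python
pop_size = 24          # assumed population size
num_parents = 12       # assumed (even) number of parents per generation
mutation_rate = 0.02   # assumed probability of flipping each gene
num_generations = 50   # assumed number of generations

# Initial population: random boolean masks, one gene per wavelength band
population = rng.random((pop_size, X.shape[1])) > 0.5

for generation in range(num_generations):
    # Evaluate the fitness of every solution in the population
    fitness_values = np.array([fitness(X, y, s) for s in population])
    # Keep the best solutions; the worst num_parents will be replaced
    parents, survivors = select_parents(population, fitness_values, num_parents)
    # Generate offspring by crossover, then apply random mutations
    offspring = mutate(crossover(parents, rng), mutation_rate, rng)
    # Assemble the new population: survivors plus the mutated offspring
    population = np.vstack([survivors, offspring])
    print(f"Generation {generation}: best fitness = {fitness_values.max():.4f}")
```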

After a set number of generations, we can select the best solution out of the final population.
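For instance:

```python
# Re-evaluate the final population and pick the best-fit solution
fitness_values = np.array([fitness(X, y, s) for s in population])
best_solution = population[np.argmax(fitness_values)]
print(f"Selected {best_solution.sum()} bands out of {best_solution.size}")
```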

Finally, we can run one last cross-validation with the selected features, print the output and plot the results.
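A sketch of this last step:

```python
# Cross-validation restricted to the selected bands
y_cv = cv_regressor(X[:, best_solution], y)
print(f"R2: {r2_score(y, y_cv):.3f}, "
      f"RMSE: {np.sqrt(mean_squared_error(y, y_cv)):.3f}, "
      f"MAE: {mean_absolute_error(y, y_cv):.3f}")

plt.scatter(y, y_cv, alpha=0.6)
plt.xlabel("Measured")
plt.ylabel("Predicted (cross-validation)")
plt.show()
```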

Figure: genetic algorithm optimisation.

The printed result, in my case, is R2: 0.533, RMSE: 4.538, MAE: 3.364. For comparison, the same figures at the start of the optimisation were R2: 0.070, RMSE: 6.407, MAE: 4.678, confirming that the variable selection has improved the regression model.

That’s it for today! As always, thanks for reading and until next time.

Daniel