Genetic algorithm for wavelength selection using Numpy

A genetic algorithm is a type of optimisation procedure inspired by the mechanism of evolution by natural selection. I’ve already published a version of genetic algorithm for wavelength (band) selection:

Wavelength selection with a genetic algorithm

The material above is based on the PyGAD library. Now, this library is excellent (even though its documentation is a bit terse) but I wanted to look at a simpler version of a genetic algorithm using basic Numpy. This will help simplifying the relevant functions as much as possible, which is helpful during the learning process.

So, here’s a basic version of genetic algorithm for wavelength selection using Numpy only.

Review of the basics: what’s a genetic algorithm?

If you haven’t already, I recommend you take a look at our previous post of genetic algorithms for wavelength selection. In the introduction to that post, we discuss how a genetic algorithm (GA) borrows concepts from the mechanism of evolution by natural selection, to optimise solutions according to a fitness function.

The basic ingredients of a GA are

A population, that is a group of potential solutions (these loosely corresponds to chromosomes).
A fitness function, which measures the optimality of each solution according to the problem.
Crossovers, which is the combination of two solutions, corresponding to the combination of the genetic material of two parents in an offspring).
Random mutations that may affect individual parts of each solution.

The algorithm itself is iterative.

Start from an initial population (ensemble) of solutions to a given problem.
Calculate the fitness value for each solution.
Select the solutions with the best fitness, according to some criterion.
Allow some of the solution to cross-over, that is to mix and generate other ‘offspring’ solutions.
Discard the worst performing solution and replace them with the newly-generated offsprings to compose a new population.
Iterate from 2 to 6 as needed.

In the previous post (linked above) these steps where implemented using PyGAD. In the next section, we will do the same, this time using Numpy and scikit-learn.

Genetic algorithm using Numpy and scikit-learn in Python

Let’s begin with the relevant imports

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.cross_decomposition import PLSRegression

from sklearn.model_selection import cross_val_predict, GridSearchCV

from sys import stdout

As in the previous post on the topic, we’ll use NIR data taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data is available here.

# Load data and define arrays

data = pd.read_excel("../../data/File_S1.xlsx")

X = data.values[1:,8:].astype('float32')

wl = np.linspace(350, 2500, X.shape[1])

y = data["TOC"].values[1:].astype("float32")

# Average blocks of 4 wavelengths

Xavg = X.reshape(X.shape[0], X.shape[1]//4,4).mean(axis=2)

Note the block-average performed at the last step. That is optional, but to minimise overfitting problem it is always recommended to block-average data into larger wavelength bands rather than using individual wavelengths. Of course, you can also apply a suitable pre-processing step at this stage (see for instance examples on this blog).

We are now ready to define the procedure. Wavelength selection is approached by defining a binary array specifying which bands need to be selected. If an element of that array is True, than the corresponding band is selected, otherwise neglected. The binary array plays the role of a ‘mask’ which enables/disables individual elements of the spectra array.

To evaluate the fitness of each solution (i.e. each mask when applied to the spectra array), we calculate regression parameters in cross-validation. The first function we define is therefore the cv_regressor

def cv_regressor(X,y, cv=5):

#specify parameter space

parameters_gs = {"n_components": np.arange(5,10,1)}

# Define Grid Search CV object

regressor = GridSearchCV(PLSRegression(), parameters_gs, \

cv=cv, scoring = 'neg_mean_squared_error', n_jobs=8)

# Fit to data

regressor.fit(X, y)

# Calculate a final CV with best choice of parameters

y_cv = cross_val_predict(regressor.best_estimator_, X, y, cv=5, n_jobs=8)

return y_cv

In this case we’ve gone with PLS Regression, but other choices are obviously possible. Note that the function above is very similar (but not identical!) to the corresponding function we used for the genetic algorithm with PyGAD.

The function returns the predicted values in a K-fold cross validation. These values can then be used as an input for the fitness function

def fitness_function(individual, X, y):

# Select features based on a binary mask

selected_features = X[:, np.array(individual) == 1]

# Train regression model in cross-validation

y_cv = cv_regressor(selected_features, y)

# Fitness function is defined as the inverse of the RMSE

fitness = 1.0 / np.sqrt(mean_squared_error(y,y_cv.flatten()))

return fitness

Note that, in GAs, the fitness function must be maximised. Hence, if we evaluate the quality of our model by minimising a cost function (such as the RMSE, in the example above), we can simply define the fitness function as the inverse of the cost function.

Next up is the function to select the best fitness solutions or, more precisely, to discard the worst-fit solutions. The number of solutions to discard is equal to the number of “parents”, since every pair of parents will generate two offspring solutions. Therefore the number of new solutions, at each iteration, is equal to the number of parents, and they will replace the worst-fit solutions.

def selection(population, fitnesses, number_of_parents):

# Discard the number_of_parents least-fit solutions

# These will be replaced by new solutions calculated by crossover

best_fitness = np.argsort(np.array(fitnesses))[number_of_parents:]

return np.array(population)[best_fitness,:].tolist()

The next function is the “cross-over”, i.e. a function that, for any pair of parents generate two offspring solutions

def crossover(parent1, parent2):

# This function calculate a crossover between two parents

# which generate two offsprings

crossover_point = np.random.randint(1, len(parent1) - 1)

child1 = parent1[:crossover_point] + parent2[crossover_point:]

child2 = parent2[:crossover_point] + parent1[crossover_point:]

return child1, child2

Finally, we need a function to introduce random mutation in the newly-generated solutions.

def mutation(individual, mutation_rate):

# This function randomly apply mutations to a solution

for i in range(len(individual)):

if np.random.random() < mutation_rate:

individual[i] = 1 - individual[i] #Invert the Boolean value

return individual

That’s all we need! In the next section we use these functions in a loop to simulate the evolving generations of solutions.

Testing the algorithm

The GA requires us to specify reasonable values for the population size, the number of parents, the mutation rate, and the number of generations. This last parameter could be replaced by a target value for the fitness function and the code must be modified to allow stopping the loop once the target is reached. In this example we go with the first option. Here’s the code

population_size = 30

generations = 2000

mutation_rate = 0.2

number_of_parents = 10

# Initialise population to binary values, either 0 or 1

population = [[np.random.randint(0, 2) for _ in range(Xavg.shape[1])] \

for _ in range(population_size)]

for generation in range(generations):

# Calculate fitness functions for each individual in the population

fitnesses = [fitness_function(individual, Xavg, y) for individual in population]

# Print to screen

stdout.write("\r"+str(generation))

stdout.write(" ")

stdout.write(str(np.array(fitnesses).max()))

stdout.flush()

# The new population starts with the best-fitness individuals of the previous generation

new_population = selection(population, fitnesses, number_of_parents)

new_population_size = np.array(new_population).shape[0]

# Parents are chosen at random between the best fitness individuals

parents = np.random.choice(range(new_population_size), size=number_of_parents, replace=False)

# Parents generate offsprings

children = []

for i in range(number_of_parents // 2):

parent1, parent2 = new_population[parents[i]], new_population[parents[i+number_of_parents // 2]]

child1, child2 = crossover(parent1, parent2)

# Mutations can occur

child1 = mutation(child1, mutation_rate)

child2 = mutation(child2, mutation_rate)

# Add children to pool

children.extend([child1, child2])

# The children are added to the existing population

new_population.extend(children)

# Redefine the population for the next generation

population = new_population

After a set number of generations, we can select the best solution out of the final population.

# Identify the best individual out of the population and the features

best_individual = max(population, key=lambda ind: fitness_function(ind, Xavg, y))

selected_features = [i for i, x in enumerate(best_individual) if x == 1]

Finally, we can run one last cross-validation with the selected features, print the output and plot the results.

def regression_plot(y_ref, y_pred, title = None, variable = None):

# Regression plot

z = np.polyfit(y_ref, y_pred, 1)

with plt.style.context(('seaborn-v0_8')):

fig, ax = plt.subplots(figsize=(8, 8))

ax.scatter(y_ref, y_pred, c='red', edgecolors='k', alpha=0.75)

ax.plot(y_ref, z[1]+z[0]*y_ref, c='blue', linewidth=1)

ax.plot(y_ref, y_ref, color='green', linewidth=1)

if title is not None:

plt.title(title, fontsize=20)

if variable is not None:

plt.xlabel('Measured ' + variable, fontsize=20)

plt.ylabel('Predicted ' + variable, fontsize=20)

plt.xticks(fontsize=18)

plt.yticks(fontsize=18)

plt.show()

y_cv_best = cv_regressor(Xavg[:,selected_features], y) #[:,best_solution]

rmse_best, mae_best, score_best = np.sqrt(mean_squared_error(y, y_cv_best)), \

mean_absolute_error(y, y_cv_best), \

r2_score(y, y_cv_best)

print("R2: %5.3f, RMSE: %5.3f, MAE: %5.3f" %(score_best, rmse_best, mae_best))

regression_plot(y,y_cv_best, variable = "")

The printed result, in my case, is R2: 0.533, RMSE: 4.538, MAE: 3.364 . For comparison, the same figures, at the start of the optimisation, were R2: 0.070, RMSE: 6.407, MAE: 4.678 confirming that the variable selection has improved the regression model.

That’s it for today! As always, thanks for reading and until next time.

Daniel

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Review of the basics: what’s a genetic algorithm?

Genetic algorithm using Numpy and scikit-learn in Python

Testing the algorithm

Work With Me

About The Author

Daniel Pelliccia

Review of the basics: what’s a genetic algorithm?

Genetic algorithm using Numpy and scikit-learn in Python

Testing the algorithm

Work With Me

About The Author

Daniel Pelliccia

Related Posts