A genetic algorithm is a type of optimisation procedure inspired by the mechanism of evolution by natural selection. I’ve already published a version of a genetic algorithm for wavelength (band) selection:
Wavelength selection with a genetic algorithm
The material above is based on the PyGAD library. That library is excellent (even though its documentation is a bit terse), but I wanted to look at a simpler version of a genetic algorithm using basic Numpy. This helps simplify the relevant functions as much as possible, which is useful during the learning process.
So, here’s a basic version of a genetic algorithm for wavelength selection using Numpy only.
Review of the basics: what’s a genetic algorithm?
If you haven’t already, I recommend you take a look at our previous post on genetic algorithms for wavelength selection. In the introduction to that post, we discuss how a genetic algorithm (GA) borrows concepts from the mechanism of evolution by natural selection to optimise solutions according to a fitness function.
The basic ingredients of a GA are:
- A population, that is, a group of potential solutions (these loosely correspond to chromosomes).
- A fitness function, which measures how good each solution is for the problem at hand.
- Crossovers, i.e. the combination of two solutions, corresponding to the mixing of the genetic material of two parents into an offspring.
- Random mutations that may affect individual parts of each solution.
The algorithm itself is iterative:
1. Start from an initial population (ensemble) of solutions to a given problem.
2. Calculate the fitness value of each solution.
3. Select the solutions with the best fitness, according to some criterion.
4. Allow some of the solutions to cross over, that is, to mix and generate other ‘offspring’ solutions.
5. Discard the worst-performing solutions and replace them with the newly-generated offspring to compose a new population.
6. Repeat steps 2 to 5 as needed.
In the previous post (linked above) these steps were implemented using PyGAD. In the next section, we will do the same, this time using Numpy and scikit-learn.
Genetic algorithm using Numpy and scikit-learn in Python
Let’s begin with the relevant imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sys import stdout
As in the previous post on the topic, we’ll use NIR data taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data is available here.
# Load data and define arrays
data = pd.read_excel("../../data/File_S1.xlsx")
X = data.values[1:, 8:].astype('float32')
wl = np.linspace(350, 2500, X.shape[1])
y = data["TOC"].values[1:].astype("float32")

# Average blocks of 4 wavelengths
Xavg = X.reshape(X.shape[0], X.shape[1]//4, 4).mean(axis=2)
Note the block-average performed at the last step. That step is optional, but to minimise overfitting problems it is generally recommended to block-average the data into larger wavelength bands rather than using individual wavelengths. Of course, you can also apply a suitable pre-processing step at this stage (see for instance the examples on this blog).
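Just as an illustration of such a pre-processing step (not used in the rest of this post), a standard normal variate (SNV) correction of the block-averaged spectra could be done in one line; the name Xavg_snv is mine:

# Optional: Standard Normal Variate (SNV) correction, i.e. centring and scaling
# each spectrum (row) by its own mean and standard deviation
Xavg_snv = (Xavg - Xavg.mean(axis=1, keepdims=True)) / Xavg.std(axis=1, keepdims=True)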
We are now ready to define the procedure. Wavelength selection is approached by defining a binary array specifying which bands are to be selected. If an element of that array is 1, then the corresponding band is selected; otherwise it is neglected. The binary array plays the role of a ‘mask’ which enables or disables individual elements of the spectra array.
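As a quick illustration of the masking mechanism (this snippet is not part of the algorithm itself), a random mask and the corresponding reduced spectra array could be built like this:

# Example: a random binary mask and the bands it selects
mask = np.random.randint(0, 2, size=Xavg.shape[1])   # one 0/1 entry per band
X_selected = Xavg[:, mask == 1]                      # keep only the bands where the mask is 1
print(Xavg.shape, "->", X_selected.shape)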
To evaluate the fitness of each solution (i.e. of each mask applied to the spectra array), we run a regression in cross-validation. The first function we define is therefore the cv_regressor:
def cv_regressor(X, y, cv=5):
    # Specify parameter space
    parameters_gs = {"n_components": np.arange(5, 10, 1)}

    # Define Grid Search CV object
    regressor = GridSearchCV(PLSRegression(), parameters_gs,
                             cv=cv, scoring='neg_mean_squared_error', n_jobs=8)

    # Fit to data
    regressor.fit(X, y)

    # Calculate a final CV with the best choice of parameters
    y_cv = cross_val_predict(regressor.best_estimator_, X, y, cv=5, n_jobs=8)

    return y_cv
In this case we’ve gone with PLS Regression, but other choices are obviously possible. Note that the function above is very similar (but not identical!) to the corresponding function we used for the genetic algorithm with PyGAD.
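For instance, if you wanted to try a different model, a variant of the same function based on a cross-validated ridge regression could look like the sketch below. The function name cv_regressor_ridge and the alpha grid are arbitrary choices for illustration, not something used elsewhere in this post:

from sklearn.linear_model import Ridge

def cv_regressor_ridge(X, y, cv=5):
    # Same structure as cv_regressor, but with a Ridge model and a grid over alpha
    parameters_gs = {"alpha": np.logspace(-3, 3, 7)}
    regressor = GridSearchCV(Ridge(), parameters_gs,
                             cv=cv, scoring='neg_mean_squared_error', n_jobs=8)
    regressor.fit(X, y)
    y_cv = cross_val_predict(regressor.best_estimator_, X, y, cv=cv, n_jobs=8)
    return y_cv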
The function returns the predicted values from a K-fold cross-validation. These values are then used as the input to the fitness function:
def fitness_function(individual, X, y):
    # Select features based on a binary mask
    selected_features = X[:, np.array(individual) == 1]

    # Train regression model in cross-validation
    y_cv = cv_regressor(selected_features, y)

    # Fitness function is defined as the inverse of the RMSE
    fitness = 1.0 / np.sqrt(mean_squared_error(y, y_cv.flatten()))

    return fitness
Note that, in GAs, the fitness function must be maximised. Hence, if we evaluate the quality of our model by minimising a cost function (such as the RMSE, in the example above), we can simply define the fitness function as the inverse of the cost function.
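An equivalent alternative, sketched below purely for illustration, is to use the negative RMSE as the fitness: it preserves the ordering of the solutions (larger is still better) while avoiding the division:

def fitness_function_neg_rmse(individual, X, y):
    # Alternative fitness: the negative RMSE (to be maximised, larger is still better)
    selected_features = X[:, np.array(individual) == 1]
    y_cv = cv_regressor(selected_features, y)
    return -np.sqrt(mean_squared_error(y, y_cv.flatten()))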
Next up is the function to select the best-fitness solutions or, more precisely, to discard the worst-fit ones. The number of solutions to discard is equal to the number of “parents”, since every pair of parents generates two offspring solutions. Therefore the number of new solutions at each iteration is equal to the number of parents, and these new solutions replace the worst-fit ones.
def selection(population, fitnesses, number_of_parents):
    # Discard the number_of_parents least-fit solutions
    # These will be replaced by new solutions calculated by crossover
    best_fitness = np.argsort(np.array(fitnesses))[number_of_parents:]
    return np.array(population)[best_fitness, :].tolist()
The next function is the “crossover”, i.e. a function that, for any pair of parents, generates two offspring solutions:
def crossover(parent1, parent2):
    # Single-point crossover between two parents, generating two offspring
    crossover_point = np.random.randint(1, len(parent1) - 1)
    child1 = parent1[:crossover_point] + parent2[crossover_point:]
    child2 = parent2[:crossover_point] + parent1[crossover_point:]
    return child1, child2
Finally, we need a function to introduce random mutations into the newly-generated solutions.
def mutation(individual, mutation_rate):
    # Randomly apply mutations to a solution
    for i in range(len(individual)):
        if np.random.random() < mutation_rate:
            individual[i] = 1 - individual[i]  # Flip the 0/1 value
    return individual
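To get a feel for what these two functions do, here’s a quick toy example on two hand-made ‘parents’ (the lists below are made up purely for illustration; the printed offspring will vary because the crossover point and the mutations are random):

# Toy example: crossover and mutation on two hand-made parents
parent1 = [1, 1, 1, 1, 1, 1]
parent2 = [0, 0, 0, 0, 0, 0]

child1, child2 = crossover(parent1, parent2)
print(child1, child2)    # e.g. [1, 1, 0, 0, 0, 0] [0, 0, 1, 1, 1, 1]

child1 = mutation(child1, mutation_rate=0.2)
print(child1)            # same list with, on average, about one value in five flipped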
That’s all we need! In the next section we use these functions in a loop to simulate the evolving generations of solutions.
Testing the algorithm
The GA requires us to specify reasonable values for the population size, the number of parents, the mutation rate, and the number of generations. This last parameter could be replaced by a target value for the fitness function, in which case the code must be modified to stop the loop once the target is reached. In this example we go with the first option. Here’s the code:
population_size = 30
generations = 2000
mutation_rate = 0.2
number_of_parents = 10

# Initialise population to binary values, either 0 or 1
population = [[np.random.randint(0, 2) for _ in range(Xavg.shape[1])]
              for _ in range(population_size)]

for generation in range(generations):

    # Calculate the fitness of each individual in the population
    fitnesses = [fitness_function(individual, Xavg, y) for individual in population]

    # Print progress to screen
    stdout.write("\r" + str(generation))
    stdout.write(" ")
    stdout.write(str(np.array(fitnesses).max()))
    stdout.flush()

    # The new population starts with the best-fitness individuals of the previous generation
    new_population = selection(population, fitnesses, number_of_parents)
    new_population_size = np.array(new_population).shape[0]

    # Parents are chosen at random among the best-fitness individuals
    parents = np.random.choice(range(new_population_size), size=number_of_parents, replace=False)

    # Parents generate offspring
    children = []
    for i in range(number_of_parents // 2):
        parent1, parent2 = new_population[parents[i]], new_population[parents[i + number_of_parents // 2]]
        child1, child2 = crossover(parent1, parent2)

        # Mutations can occur
        child1 = mutation(child1, mutation_rate)
        child2 = mutation(child2, mutation_rate)

        # Add children to the pool
        children.extend([child1, child2])

    # The children are added to the existing population
    new_population.extend(children)

    # Redefine the population for the next generation
    population = new_population
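As mentioned above, the fixed number of generations could be replaced by a target fitness. A minimal way to do that, sketched below with a hypothetical target_fitness value that you would have to choose yourself, is to add a check inside the loop, right after the fitness calculation:

# Hypothetical early-stopping check, to be placed inside the loop
# right after the fitness calculation (remember that fitness = 1/RMSE here)
target_fitness = 0.25    # example value only, to be chosen for your problem
if np.array(fitnesses).max() >= target_fitness:
    print("\nTarget fitness reached at generation", generation)
    break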
After a set number of generations, we can select the best solution out of the final population.
# Identify the best individual out of the final population and the corresponding features
best_individual = max(population, key=lambda ind: fitness_function(ind, Xavg, y))
selected_features = [i for i, x in enumerate(best_individual) if x == 1]
Finally, we can run one last cross-validation with the selected features, print the output and plot the results.
def regression_plot(y_ref, y_pred, title=None, variable=None):
    # Regression plot
    z = np.polyfit(y_ref, y_pred, 1)
    with plt.style.context(('seaborn-v0_8')):
        fig, ax = plt.subplots(figsize=(8, 8))
        ax.scatter(y_ref, y_pred, c='red', edgecolors='k', alpha=0.75)
        ax.plot(y_ref, z[1] + z[0]*y_ref, c='blue', linewidth=1)
        ax.plot(y_ref, y_ref, color='green', linewidth=1)
        if title is not None:
            plt.title(title, fontsize=20)
        if variable is not None:
            plt.xlabel('Measured ' + variable, fontsize=20)
            plt.ylabel('Predicted ' + variable, fontsize=20)
        plt.xticks(fontsize=18)
        plt.yticks(fontsize=18)
        plt.show()

y_cv_best = cv_regressor(Xavg[:, selected_features], y)

rmse_best = np.sqrt(mean_squared_error(y, y_cv_best))
mae_best = mean_absolute_error(y, y_cv_best)
score_best = r2_score(y, y_cv_best)

print("R2: %5.3f, RMSE: %5.3f, MAE: %5.3f" % (score_best, rmse_best, mae_best))

regression_plot(y, y_cv_best, variable="")
The printed result, in my case, is R2: 0.533, RMSE: 4.538, MAE: 3.364. For comparison, the same figures at the start of the optimisation were R2: 0.070, RMSE: 6.407, MAE: 4.678, confirming that the variable selection has improved the regression model.
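If you want a reference point to compare against, one simple option (not necessarily how the starting figures above were obtained) is to run the same cross-validation on the full set of averaged bands, with no selection at all:

# Baseline: cross-validation using all the averaged bands, with no selection
y_cv_all = cv_regressor(Xavg, y)
print("R2: %5.3f, RMSE: %5.3f, MAE: %5.3f" % (r2_score(y, y_cv_all),
                                              np.sqrt(mean_squared_error(y, y_cv_all)),
                                              mean_absolute_error(y, y_cv_all)))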
That’s it for today! As always, thanks for reading and until next time.
Daniel