Backward Variable Selection for PLS regression

Variable selection, or specifically wavelength selection, is often an important intermediate step in developing a good prediction model from spectroscopic data. If you have been reading these articles, this is not news to you: not all wavelength bands have the same predictive power for the primary parameter of interest. It is therefore better to eliminate the least useful bands, which often improves the quality and robustness of the model. Today we’ll discuss a band selection method called Backward Variable Selection (or Elimination) and we’ll apply it to PLS regression.

I have discussed a few different flavours of band selection in past articles. You can refer to those posts for more background information.

For today’s topic, you can also check the paper by Fernández Pierna et al., linked here and cited at the end of this post [1]. The data used for this example are taken from the paper by Ji Wenjun et al., linked here [2], and the associated data is available here.

Backward Selection PLS: a minimal example

The concept underlying Backward Selection PLS is fairly simple.

  1. Define your model of choice. In this case, PLS regression;
  2. Define a suitable metric. For this example, the RMSE in cross-validation (RMSE-CV);
  3. Bin the wavelengths of your data into bands. For the data used here, we start with 210 wavelengths from 310 to 2500 nm. We then bin the wavelengths into groups of 5, ending up with n=42 bands (42×5=210). The importance of this step will become clear as we go.
  4. Fit a PLS regression model to the binned dataset including all n bands and calculate the RMSE-CV. This is going to be our baseline for comparison.
  5. Loop over the bands. Remove one band at a time and fit a PLS model to the remaining n-1 bands. The band that, once eliminated, produces the sharpest decrease in the RMSE-CV is discarded.
  6. Repeat the process above with the remaining n-1 bands. Loop over them and discard the band that, once eliminated, reduces the RMSE-CV the most.
  7. Repeat, iteratively discarding bands, until you reach a point where the RMSE-CV no longer decreases. That is when you stop iterating: you have eliminated all the bands that are poorly correlated with the quantity you want to predict.

As you can see, we are discarding bands backwards, one at a time, each time getting rid of the worst-performing one.

OK, let’s see how this works in practice. For clarity, let’s just show the sequence of commands up to step 5 of the list above. In essence, we want to show how to eliminate a single variable, avoiding the complication of the double loop for the time being.

Let’s begin with the imports
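A minimal set covering everything used below is sketched here (pandas is assumed only for loading the data):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sys import stdout

from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import mean_squared_error, r2_score
```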

Now, let’s load the data, apply a Savitzky–Golay smoothing filter and plot the spectra
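Something along these lines will do; the file name and the name of the target column are placeholders, to be adapted to your copy of the data [2]:

```python
# Load the dataset. We assume a CSV file with the target soil property in a
# column named "target" and the spectra in the remaining columns; the file
# name and column name are placeholders, not those of the original dataset.
data = pd.read_csv("paddy_soil_data.csv")
y = data["target"].values
X = data.drop("target", axis=1).values

# Wavelength axis: 210 points from 310 to 2500 nm, as described above
wl = np.linspace(310, 2500, num=210)

# Savitzky-Golay smoothing. Window length and polynomial order are a
# reasonable starting choice, not necessarily the original settings.
X = savgol_filter(X, window_length=11, polyorder=3, axis=1)

# Plot the smoothed spectra
plt.plot(wl, X.T)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Reflectance (a.u.)")
plt.show()
```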

[Figure: Backward Variable Selection for PLS Regression - Spectra]

Now it’s time to bin the data into wavelength bands. As anticipated, we’ll define 42 bands, each containing 5 adjacent wavelengths of the original spectra. A neat trick with NumPy is to reshape the original array, adding one dimension, and then sum over the added dimension.
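In sketch form (Xr and wlr are the binned spectra and band positions used throughout the rest of the post):

```python
# Bin the 210 wavelengths into 42 bands of 5 adjacent wavelengths each.
# Reshaping adds a third axis of size 5, which is then summed over.
nbands, bin_width = 42, 5
Xr = X.reshape(X.shape[0], nbands, bin_width).sum(axis=2)
wlr = wl.reshape(nbands, bin_width).mean(axis=1)  # centre of each band
print(Xr.shape)  # (n_samples, 42)
```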

Finally, let’s fit a baseline PLS regression model. For this I’m going to use a few utility functions, which are defined at the end of this post. These functions perform basic PLS regression, optimise the number of PLS components used in the regression model, and plot the results.
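A sketch of the baseline fit follows. The helper pls_optimise_components() is my own naming, sketched in the section at the end of the post; it scans the number of components and returns the best one together with the corresponding RMSE-CV and R^2.

```python
# Baseline: optimise the number of PLS components using all 42 bands
ncomp, rmscv_base, r2_base = pls_optimise_components(Xr, y, max_comp=20)
print(f"Baseline: {ncomp} components, RMSE-CV = {rmscv_base:.4f}, R2 = {r2_base:.4f}")
```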

This code will print the baseline RMSE-CV and R^2, and produce this plot.

[Figure: Backward Variable Selection for PLS Regression - Initial PLS]

That is an OK starting point, but there is ample room for improving our model. Let’s write a script to perform a single variable elimination and explain it step by step. Let’s start with the code.
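What follows is a sketch built on the same hypothetical helper introduced above; the variable names are my own.

```python
# Single round of backward elimination: remove one band at a time and
# record the RMSE-CV of a model fitted on the remaining 41 bands
rmscv = []
loo = LeaveOneOut()
# loo.split() yields, at each iteration, the indices of the bands to keep
# (keep_idx) and the index of the single band left out (drop_idx)
for keep_idx, drop_idx in loo.split(wlr):
    _, rms, _ = pls_optimise_components(Xr[:, keep_idx], y, max_comp=20)
    rmscv.append(rms)
rmscv = np.array(rmscv)
```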

For simplicity, in this example we are using the entire dataset for this optimisation. In real cases you may want to split your dataset into a training and a test set, and perform the optimisation on the training set only.

Anyway, we first define a list to be populated with the RMSE values calculated at each step. Then we define a Leave-One-Out (LOO) cross-validator. Note, however, that we are not using the LOO cross-validator to cross-validate the PLS regression: in a normal cross-validation procedure, LOO would leave out one measurement at a time, not one band. Instead, we use LOO as a nifty way to iterate over the wavelength bands, discarding one at a time. At each step, the left-out index (drop_idx in the sketch above) identifies the wavelength band that is eliminated. A model is fitted and cross-validated without it, and an RMSE-CV (using standard K-fold cross-validation) is calculated.

At the end of the sequence, which here consists of 42 iterations, we can plot the result along with the baseline RMSE-CV:
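For instance, with something like this sketch:

```python
# Compare the 42 "one band removed" models against the baseline
plt.plot(rmscv, "bo", label="One band removed")
plt.axhline(rmscv_base, color="r", label="Baseline (all bands)")
plt.xlabel("Index of the removed band")
plt.ylabel("RMSE-CV")
plt.legend()
plt.show()
```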

[Figure: Backward Variable Selection for PLS Regression - Single LOO run]

Let’s recap what we have done. First, we calculated a PLS model using all bands and estimated its RMSE-CV. That is our baseline (red line in the chart above). Then we removed one band at a time and each time fitted a new PLS model with the remaining 41 bands. For each of these models we estimated the RMSE-CV, plotted as blue dots in the graph above.

It’s easy to convince yourself that some of the bands are quite deleterious to the overall model because, once removed, they produce a substantial decrease in the RMSE-CV. The last band, in the instance above, produces the largest improvement once removed: getting rid of it would improve the RMSE-CV by almost 10%. That is clearly the band we want to eliminate in our band selection procedure, but we can then iterate the process over the remaining bands, which is the topic of the next section.

Backward Variable Selection for PLS regression

Now that you understand the basic mechanism, it should be quite easy to iterate over the bands and remove one at a time. The approach used here is to define the arrays to be optimised: one array for the spectra and one for the wavelength bands. Elements will be deleted from these arrays as the optimisation progresses. Let’s call these arrays Xr_optim and wlr_optim, with obvious meaning, and define them, at the beginning, exactly as we did for the Xr and wlr arrays above. Here’s the full code.
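Again, this is a sketch consistent with the hypothetical helper functions at the end of the post, rather than a definitive implementation.

```python
# Full backward elimination loop. Xr_optim and wlr_optim start as copies of
# the binned arrays and shrink by one band at each repetition.
Xr_optim = Xr.copy()
wlr_optim = wlr.copy()

# Initialise the running minimum of the RMSE-CV to the baseline value
ncomp, rmscv_min, r2_best = pls_optimise_components(Xr_optim, y, max_comp=20)

improving, rep = True, 0
while improving and Xr_optim.shape[1] > 1:
    rmscv = np.zeros(Xr_optim.shape[1])
    r2s = np.zeros(Xr_optim.shape[1])
    loo = LeaveOneOut()
    for keep_idx, drop_idx in loo.split(wlr_optim):
        i = drop_idx[0]
        _, rmscv[i], r2s[i] = pls_optimise_components(Xr_optim[:, keep_idx], y, max_comp=20)
        # Overwrite the same line at each step to keep the printout compact
        stdout.write(f"\rBand {i}: RMSE-CV = {rmscv[i]:.4f}, R2 = {r2s[i]:.4f}")
        stdout.flush()
    best = rmscv.argmin()
    if rmscv[best] < rmscv_min:
        # Removing this band improves the model: delete it and keep going
        rmscv_min, r2_best = rmscv[best], r2s[best]
        Xr_optim = np.delete(Xr_optim, best, axis=1)
        wlr_optim = np.delete(wlr_optim, best)
        rep += 1
        print(f"\nRun {rep}: deleted band {best}, RMSE-CV = {rmscv_min:.4f}, R2 = {r2_best:.4f}")
    else:
        # No band removal decreases the RMSE-CV any further: stop
        improving = False
```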

To track the minimisation of the RMSE-CV, we initialise the relevant variable, rmscv_min, to the baseline value. At each run of the LOO process we remove one band and calculate the minimum RMSE-CV of the new model. If this value is less than the previous one, we keep going; otherwise the loop is interrupted and the optimisation process comes to an end.

While the optimisation is in progress, this code prints the current index of the band being removed, along with the RMSE-CV and R^2 of the corresponding model. That line is overwritten at each step to avoid filling up the available space (the trick is to use stdout.flush() at the end of each iteration). At the end of each LOO loop, a line is printed containing the index of the repetition, the band that has been deleted, and the best values of the RMSE-CV and R^2 attained at that stage.

Running the code above, I got the following output printed on screen.

The improvement is noticeable. If we now run the regression model with the optimised bands, this is what we obtain.
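The final fit, in sketch form (plot_regression() is the hypothetical plotting helper sketched at the end of the post):

```python
# Fit and cross-validate the final model on the optimised bands only
ncomp, rmscv_opt, r2_opt = pls_optimise_components(Xr_optim, y, max_comp=20)
y_cv = cross_val_predict(PLSRegression(n_components=ncomp), Xr_optim, y, cv=10).ravel()
plot_regression(y, y_cv, rmscv_opt, r2_opt)
```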

[Figure: Backward Variable Selection for PLS Regression - Optimised regression]

Here you go: a minimal code to run your own Backward Variable Selection for PLS Regression. Thanks for reading and until next time,

Daniel

References

  1. J. A. Fernández Pierna et al., A Backward Variable Selection method for PLS regression (BVSPLS), Analytica Chimica Acta 642, 89-93 (2009).
  2. W. Ji et al., In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near-Infrared Spectroscopy, PLOS ONE 11, e0159785 (2016).
  3. Featured image credit: Valiphoto.

Python functions used in this post
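The original utility functions go here. What follows is a minimal sketch consistent with the calls used throughout the post; the names, signatures and default values are my own choices, not a definitive implementation.

```python
def pls_optimise_components(X, y, max_comp=20, cv=10):
    """Scan 1..max_comp PLS components and return the number of components
    that minimises the RMSE in K-fold cross-validation, together with the
    corresponding RMSE-CV and R^2."""
    best_ncomp, best_rms, best_r2 = 1, np.inf, -np.inf
    # Never ask for more components than there are bands left
    for nc in range(1, min(max_comp, X.shape[1]) + 1):
        pls = PLSRegression(n_components=nc)
        y_cv = cross_val_predict(pls, X, y, cv=cv)
        rms = np.sqrt(mean_squared_error(y, y_cv))
        if rms < best_rms:
            best_ncomp, best_rms, best_r2 = nc, rms, r2_score(y, y_cv)
    return best_ncomp, best_rms, best_r2


def plot_regression(y, y_cv, rms, r2):
    """Scatter plot of measured vs cross-validation predicted values."""
    plt.scatter(y, y_cv, edgecolors="k")
    plt.plot([y.min(), y.max()], [y.min(), y.max()], "g--")  # identity line
    plt.xlabel("Measured")
    plt.ylabel("Predicted (CV)")
    plt.title(f"RMSE-CV = {rms:.4f}, R$^2$ = {r2:.4f}")
    plt.show()
```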