A variable selection method for PLS in Python

Welcome to our new technical tutorial on Python chemometrics; today we will be discussing a variable selection method for PLS in Python.

In other posts we’ve covered Principal Component Regression (PCR) and the basics of Partial Least Squares (PLS) regression. At the heart of all regression methods is the idea of correlating measured spectra of a substance to be analysed with known reference values of the same substance. In this way one can build a calibration model that ‘translates’ the spectrum into the value of interest. A robust calibration will then enable us to obtain the value (or values) of interest straight from the spectra.

PCR is quite simply a regression model built using a number of principal components derived using PCA. In our last post on PCR, we discussed how PCR is nice and simple, but limited by the fact that the decomposition does not take into account anything other than the spectral data. That is, our primary reference values are not considered when building a PCR model. That is obviously not optimal, and PLS is a way to fix it.

For an introduction to PLS take a look at this post. There we run through a tutorial for a simple PLS optimisation in cross-validation. The quality of the calibration model is evaluated using the coefficient of determination R^{2} and the mean squared error (MSE).

The typical cross-validation procedure is to divide the data set into a few groups, leave one of the groups out, and fit a PLS model on the remaining groups. The model is then used to predict the values of the left-out group. This basic process is repeated so that every sample has been predicted exactly once. The metrics are then averaged to produce cross-validation scores.
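
As a toy illustration of this procedure, here is a sketch on synthetic data (the data and model below are made up purely to show the mechanics; none of it comes from the tutorial data set):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import cross_val_predict

    # Synthetic "spectra": 50 samples, 100 variables, with a response
    # driven by the first 5 variables only
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 100))
    y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=50)

    # 5-fold cross-validation: each sample is predicted exactly once
    y_cv = cross_val_predict(PLSRegression(n_components=3), X, y, cv=5)
    print(r2_score(y, y_cv), mean_squared_error(y, y_cv))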

We have seen how a simple model might fail to have sufficient predictive power (the cross-validation metrics are not optimal). In this post we will discuss what variable selection is, and how it can be used to improve the predictive power of the calibration. Hope you’ll enjoy it.

Variable selection in chemometrics

The idea behind variable selection in chemometrics is that, when it comes to spectral measurements, not all wavelengths are created equal. In visible and NIR spectroscopy especially, it is often hard to predict in advance which wavelength bands will contain most of the signal related to the analyte we want to measure. So, as a first shot, we measure the whole range allowed by our instrument, and then figure out later which bands are more relevant for our calibration.

To say the same thing in a bit more quantitative way, we want to check which wavelength bands lead to a better quality model. Actually, in practice, we check which bands give a worse quality model, so we can get rid of them.

This seems logical enough, but I’ve deliberately left a fundamental piece of the puzzle out. How do we even start breaking our spectrum into bands and searching for the worst performing ones?

For this example we’ll take a simple approach: we’ll filter out wavelength bands based on the strength of the related regression coefficients. Let’s understand what that means.

Feature selection by filtering

Feature selection by filtering is one of the simplest ways of performing wavelength selection. For an overview of the methods that are currently used, check out this excellent review paper by T. Mehmood et al.: A review of variable selection methods in Partial Least Squares Regression.

The idea behind this method is very simple, and can be summarised in the following:

  1. Optimise the PLS regression using the full spectrum, for instance quantifying its quality by cross-validation or on a prediction set.
  2. Extract the regression coefficients from the best model. Each regression coefficient uniquely associates a wavelength with the response. A low absolute value of the regression coefficient means that specific wavelength has a low correlation with the quantity of interest.
  3. Discard the lowest-correlation wavelengths according to some rule. These wavelengths are the ones that typically worsen the quality of the calibration model; therefore, by discarding them, we expect to improve the metrics associated with our prediction or cross-validation.

One way to discard the lowest correlation wavelengths is to set a threshold and get rid of all wavelengths whose regression coefficients (in absolute value) fall below that threshold. However this method is very sensitive to the choice of threshold, which tends to be a subjective choice, or requires a trial-and-error approach.
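
For concreteness, here is a toy sketch of the thresholding rule (both the coefficients and the 0.01 threshold are made up):

    import numpy as np

    coefs = np.array([0.002, -0.5, 0.03, -0.001, 0.2])  # toy PLS coefficients
    keep = np.abs(coefs) >= 0.01                        # arbitrary threshold
    # keep -> [False, True, True, False, True]: the False columns are discarded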

Another approach, which is much less subjective, is to discard one wavelength at a time (the one with the lowest absolute value of the associated regression coefficient) and rebuild the calibration model. Choosing the MSE (mean squared error) of prediction or cross-validation as the metric, the procedure is iterated for as long as the MSE keeps decreasing. At some point, removing wavelengths will produce a worse calibration, and that is the stopping criterion for the optimisation algorithm.

Alternatively, one could simply remove a fixed number of wavelengths iteratively, and then check for which number of removed wavelengths the MSE is minimised. Either way we have a method that does not depend on the subjective choice of the threshold and can be applied without changes regardless of the data set being analysed.

One caveat of this method, at least in its simple implementation, is that it may get a bit slow for large data sets.

Extracting PLS regression coefficients: Python code

We start, as usual, with the list of imports you’ll need to run this example.
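
A plausible import list, covering everything used in the snippets below (the exact set is an assumption):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from scipy.signal import savgol_filter
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import cross_val_predict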

Next, we import the calibration data and calculate the derivatives. The data is available for download at our Github repository.
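
A minimal sketch of this step; the file name and column layout are assumptions, so adapt them to the CSV you download from the repository:

    # Load the spectra and the Brix reference values
    data = pd.read_csv("peach_spectra_brix.csv")   # hypothetical file name
    y = data["Brix"].values                        # reference values
    X = data.drop("Brix", axis=1).values           # one spectrum per row

    # First derivative via Savitzky-Golay filtering (window length and
    # polynomial order are typical choices, not prescriptions)
    X1 = savgol_filter(X, window_length=17, polyorder=2, deriv=1)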

Our aim for this tutorial is to define a function that will simultaneously optimise the number of PLS components and the sequence of wavelengths to minimise the MSE in cross-validation (take a look at our previous post on PLS regression if you want to brush up on the basic PLS code and the terminology). Before doing that however, let’s look at the basic feature-selection idea: the absolute value of the PLS coefficients.
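
Here is a minimal sketch of that step, using the first derivative data X1 and reference values y defined above:

    # Plain PLS regression with 8 components on the first derivative data
    pls = PLSRegression(n_components=8)
    pls.fit(X1, y)

    # Absolute value of the regression coefficients, one per wavelength.
    # ravel() is equivalent to the pls.coef_[:,0] mentioned in the text,
    # and is robust to the transposed coef_ layout of newer scikit-learn.
    abs_coef = np.abs(pls.coef_).ravel()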

What we have done is run a basic PLS regression with 8 components and used it to fit the first derivative data. The PLS coefficients of this fit can be accessed via pls.coef_[:,0], that is the first (and in this case only) column of the pls.coef_ array. It is worth reiterating that these are the regression coefficients that quantify the strength of the association between each wavelength and the response. We are interested only in the absolute value of these coefficients, as large positive and large negative coefficients are equally important for our regression, denoting large positive or large negative correlation with the response respectively. In other words, we want to identify and discard the regression coefficients that are close to zero: they are the ones with poor correlation with the response.

The association between each coefficient and the spectral feature can be visualised with a simple plot, sketched below.
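
A minimal sketch of that plotting step (we use the bare wavelength index on the horizontal axis; substitute your instrument's wavelength calibration):

    # Plot the absolute PLS coefficients against wavelength index
    plt.plot(abs_coef)
    plt.xlabel("Wavelength index")
    plt.ylabel("|PLS regression coefficient|")
    plt.show()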

As you can see, a number of wavelengths are associated with fairly small regression coefficients. We want to use this basic idea to identify and discard small coefficients iteratively, starting from the smallest, and optimise the PLS regression at the same time.

NIR variable selection for PLS regression

We are almost ready to write down the code for the complete function we want to use. One last thing to point out is the trick we’ll use to discard small PLS regression coefficients.
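
Sketched below, assuming the fitted pls model from the previous section:

    # Indices that sort the absolute coefficients in ascending order
    sorted_ind = np.argsort(np.abs(pls.coef_).ravel())

    # Reorder the spectral columns accordingly: the left-most columns now
    # carry the weakest coefficients and can be sliced away one at a time
    Xc = X1[:, sorted_ind]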

What we have done is to extract the sequence of indices corresponding to sorting the absolute values of the PLS regression coefficients in ascending order, and use that sequence to sort the spectra accordingly. Note that sorting the spectral components according to the strength of the associated PLS coefficients has no influence whatsoever on the PLS regression, but it enables us to easily discard one component at a time.

OK, we are now ready to reveal the entire optimisation function.
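
What follows is a sketch of such a function under the assumptions made so far (5-fold cross-validation, coefficient access as noted above); the name and return values are our own:

    def pls_variable_selection(X, y, max_comp):
        # Cross-validation MSE for each combination of number of components
        # (rows) and number of discarded wavelengths (columns)
        mse = np.zeros((max_comp, X.shape[1]))

        for i in range(max_comp):
            # Fit on the full spectrum and sort wavelengths by |coefficient|
            pls = PLSRegression(n_components=i + 1)
            pls.fit(X, y)
            sorted_ind = np.argsort(np.abs(pls.coef_).ravel())
            Xc = X[:, sorted_ind]

            # Discard one wavelength at a time, keeping enough for the fit
            for j in range(Xc.shape[1] - (i + 1)):
                pls_j = PLSRegression(n_components=i + 1)
                y_cv = cross_val_predict(pls_j, Xc[:, j:], y, cv=5)
                mse[i, j] = mean_squared_error(y, y_cv)

        # Global minimum of mse, excluding the untouched zeros
        min_mse = np.min(mse[np.nonzero(mse)])
        comp_ix, discard_ix = np.argwhere(mse == min_mse)[0]
        n_comp, n_discard = comp_ix + 1, discard_ix

        # Rebuild the sorted spectra for the optimal number of components
        pls = PLSRegression(n_components=n_comp)
        pls.fit(X, y)
        sorted_ind = np.argsort(np.abs(pls.coef_).ravel())

        return X[:, sorted_ind][:, n_discard:], n_comp, n_discard, sorted_ind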

This function works by running a PLS regression with a given number of components (up to a specified maximum), then filtering out one regression coefficient at a time up to the maximum number allowed. All results are stored in the 2D array mse. At the end of the double loop we search for the global minimum of mse, excluding the zeros. That gives us the number of components and the wavelength bands that correspond to the optimal cross-validation MSE.

To give you an idea of the huge improvement we can get with this approach, let’s compare the result obtained using the full spectrum with the one optimised by variable selection. We start by defining a function to perform simple PLS with a fixed number of components, and calculate the metrics in cross-validation.
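
A sketch of that helper (the name and the printout format are our own):

    def simple_pls_cv(X, y, n_comp):
        # Fixed-size PLS model, evaluated in 5-fold cross-validation
        pls = PLSRegression(n_components=n_comp)
        y_cv = cross_val_predict(pls, X, y, cv=5)

        r2 = r2_score(y, y_cv)
        mse = mean_squared_error(y, y_cv)
        print(f"R2: {r2:.2f}, MSE: {mse:.2f}")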

This is the output for 8 components:
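
A sketch of the call, with a printout matching the figures quoted at the end of the post:

    simple_pls_cv(X1, y, n_comp=8)
    # R2: 0.44, MSE: 2.60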

Now we run the optimisation process:
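
For instance (the cap of 15 components is an arbitrary choice):

    opt_Xc, n_comp, n_discard, sorted_ind = pls_variable_selection(X1, y, 15)
    simple_pls_cv(opt_Xc, y, n_comp)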

And these are the results:
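
The printout, reconstructed from the figures discussed next, would be roughly:

    R2: 0.64, MSE: 1.70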

The R^{2} score jumped from 0.44 to 0.64, while the cross-validation MSE decreased from 2.6 to 1.7! That amounted to discarding 452 wavelengths out of 600, that is, finding the truly significant bands for Brix calibration.
As a sanity check, we can plot the discarded bands on top of the first derivative spectra. The code is as follows.
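
A minimal sketch of that plot, reusing n_discard and sorted_ind from the optimisation run above (the wavelength axis is again assumed to be a bare index):

    # Boolean mask of the discarded wavelengths: the first n_discard
    # entries of the ascending-|coefficient| index array
    discarded = np.zeros(X1.shape[1], dtype=bool)
    discarded[sorted_ind[:n_discard]] = True

    wl = np.arange(X1.shape[1])   # substitute the real wavelength axis

    plt.plot(wl, X1.T, color="blue", alpha=0.2)
    plt.plot(wl[discarded], X1.T[discarded], "r.", markersize=2)
    plt.xlabel("Wavelength index")
    plt.ylabel("First derivative spectra")
    plt.show()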

This is the plot of the first derivative spectra with the discarded bands overlaid in red. Note how the main water peaks are actually discarded.

Thanks for reading this tutorial. I hope you enjoyed it and until next time!