Partial Least Squares (PLS) regression is one of the staples of chemometrics. It addresses the multicollinearity present in most spectroscopy data, while at the same time projecting the data onto a conveniently small set of components useful for regression. This is all very good, but sometimes some of the measured spectral bands are uncorrelated with the parameter we are trying to predict. Moving window PLS regression is a useful technique to identify and select the informative bands and improve the quality of our regression model.
In this post we will discuss a Python implementation of moving window PLS regression, along with some recommendations to make the most of it with real-world data. For related posts on PLS regression, feel free to check out the other articles on this blog.
The data used in this post are taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The associated data is available here.
Implementing the moving window: the basics
First of all: what is a window and why do we want to move it? A window is simply a defined wavelength (or wave number) range, usually narrower than the full measurement range. For instance, we may have measured a NIR spectrum from 900 nm to 1400 nm, and a window is a limited range within it – say 1000-1250 nm.
Selecting a narrower range before running PLS regression enables one to test whether there is a spectral range producing a tighter prediction. In other words, some spectral bands may be more representative than others for predicting an unknown sample parameter. A window enables us to home in on just those bands.
For this reason, after having defined a window, we actually want to move it. In general we don’t know where the representative components are located. To find out, we can slide the window through the spectral range and optimise a PLS model at each position. In this way we can quantitatively compare models obtained at different locations and identify whether an optimal location exists.
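Before getting to the full implementation, the sliding-window idea can be sketched in a few lines. This is a toy illustration on hypothetical random data, not the soil dataset used below: we generate one window position for every starting pixel that still fits, and slice the matrix at each position.

```python
import numpy as np

# Hypothetical spectra: 10 samples measured at 50 wavelength points
rng = np.random.default_rng(0)
X = rng.random((10, 50))

win_size = 20
# One window position for each starting pixel that still fits in the range
win_pos = np.arange(0, X.shape[1] - win_size)

# Slice out the window at each position; each slice keeps all samples
windows = [X[:, i:i+win_size] for i in win_pos]

print(len(windows))       # → 30 positions tested
print(windows[0].shape)   # → (10, 20): all samples, win_size wavelengths
```

In the real function below, a PLS model is cross-validated on each of these slices instead of just collecting them.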
In the quest for the ultimate prediction model, a judicious application of the moving window must definitely be part of your bag of tricks!
Standard PLS: Python code
Before looking at the moving window PLS, let’s quickly write some code for conventional PLS. This bit has been extensively discussed in our PLS post, but in the spirit of having a self-contained post, let’s quickly cover it here as well, albeit in less detail.
First thing, as always, the imports
```python
from sys import stdout

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.collections as collections
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
```
The dataset contains several labels corresponding to the same spectra. We are going to use the “TOC” label (total organic carbon, for those of you who are interested).
```python
# Import data
data = pd.read_excel('../data/File_S1.xlsx')
# Read labels
y = data["TOC"].values[1:].astype('float32')
# Read spectra
X = data.values[1:, 8:].astype('float32')
# Define the wavelength array
wl = np.linspace(350, 2500, num=X.shape[1], endpoint=True)
```
And now here’s the code snippet to run a simple PLS optimiser. We fit a PLS regressor to the data and locate the minimum root mean square error (RMSE) in cross-validation as a function of the number of PLS components.
```python
rmse_simple = np.zeros(30)
for nc in range(30):
    # Simple PLS
    pls_simple = PLSRegression(n_components=nc+1)
    # Fit
    pls_simple.fit(X, y)
    # Cross-validation
    y_cv = cross_val_predict(pls_simple, X, y, cv=10)
    # Calculate scores
    score = r2_score(y, y_cv)
    rmse_simple[nc] = np.sqrt(mean_squared_error(y, y_cv))

print(np.min(rmse_simple))

with plt.style.context(('ggplot')):
    plt.plot(rmse_simple, 'o-')
    plt.show()
```
The minimum RMSE = 5.14 is attained with 13 components.
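Rather than reading the optimum off the plot, the best number of components can be extracted from the `rmse_simple` array with `np.argmin`. The array below is a made-up stand-in with the same shape, not the actual values from the fit:

```python
import numpy as np

# Stand-in for the rmse_simple array computed in the loop above (made-up values)
rmse_simple = np.array([7.2, 6.5, 6.1, 5.8, 5.5, 5.3, 5.2, 5.18, 5.16, 5.15,
                        5.15, 5.14, 5.14, 5.14, 5.15, 5.16, 5.2, 5.3, 5.4, 5.5])

# Index of the first minimum; +1 because n_components = index + 1
best_nc = np.argmin(rmse_simple) + 1
print(best_nc, rmse_simple[best_nc - 1])   # → 12 5.14
```

On the real data this picks out the 13-component minimum quoted above.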
We can now run a fit with 13 components and check the cross-validation results with a scatter plot.
```python
pls_simple = PLSRegression(n_components=13)
# Fit
pls_simple.fit(X, y)
# Cross-validation
y_cv_simple = cross_val_predict(pls_simple, X, y, cv=10)
# Calculate scores
score_simple = r2_score(y, y_cv_simple)
rmse_simple = np.sqrt(mean_squared_error(y, y_cv_simple))
# Fit a line to the CV predictions vs the response
z = np.polyfit(y, y_cv_simple, 1)

with plt.style.context(('ggplot')):
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.scatter(y_cv_simple, y, c='red', edgecolors='k')
    # Plot the best-fit line
    ax.plot(np.polyval(z, y), y, c='blue', linewidth=1)
    # Plot the ideal 1:1 line
    ax.plot(y, y, color='green', linewidth=1)
    plt.title('$R^{2}$ (CV): %5.2f' % score_simple)
    plt.xlabel('Predicted value')
    plt.ylabel('Measured value')
    plt.show()
```
Well, it’s clear that we still have some work to do. It’s time to put our moving window to work.
Moving window PLS: Python code
Firstly, let’s write down the function we’ll be using. We’ll comment the relevant lines below.
```python
def mwpls(X, y, win_size, plot=False):
    win_pos = np.arange(0, X.shape[1] - win_size)
    rmse = np.zeros((X.shape[1] - win_size, min(20, win_size)))

    for i in win_pos:
        for j in range(min(20, win_size)):
            # The number of PLS components is limited by the window size
            pls = PLSRegression(n_components=j+1)
            # Fit
            pls.fit(X[:, i:i+win_size], y)
            # Cross-validation
            y_cv = cross_val_predict(pls, X[:, i:i+win_size], y, cv=10)
            rmse[i, j] = np.sqrt(mean_squared_error(y_cv, y))
        comp = 100*(i+1)/(X.shape[1] - win_size)
        stdout.write("\r%d%% completed" % comp)
        stdout.flush()
    stdout.write("\n")

    # Calculate and print the position of the minimum RMSE
    rmseminx, rmseminy = np.where(rmse == np.min(rmse[np.nonzero(rmse)]))
    print("Suggested window position: ", rmseminx[0], rmseminy[0], rmse[rmseminx, rmseminy])

    if plot is True:
        plt.imshow(rmse, interpolation=None, aspect=0.1)
        plt.show()

    return (rmseminx[0], rmseminy[0], rmse[rmseminx, rmseminy])
```
In order to define a window we need two parameters: its size and its position. Both parameters will have to be adjusted to find the best result. The function above varies only the position parameter, while the size parameter is tuned in a separate loop. You can modify the function to loop over the window size as well, if you wish.
I haven’t done that for a specific reason: the function already contains a nested loop, over the window position and over the number of PLS components. Adding a third loop would have made the code a bit too convoluted, and harder to explain.
Anyway, the spirit of the mwpls function is simple: we define a window and we slide it over the allowed range. At each position we optimise a PLS regression using cross-validation. More specifically we define the RMSE metric and we seek the two parameters (window position and number of components) that yield the minimum RMSE.
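As a reminder, the RMSE metric used here is just the square root of the mean squared residuals. The snippet below checks, on a few made-up numbers, that computing it by hand agrees with the scikit-learn call used in `mwpls`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up measured values and cross-validated predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# RMSE by hand: square the residuals, average, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred)**2))
# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual)   # → 0.6123724356957945
```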
The first thing to note is the definition of the rmse array, which is directly affected by the range in the number of components we decide to loop over. The reason is straightforward: we can’t choose a number of components larger than the window size. If the window is 20 pixels wide, we can’t use more than 20 components.
Remember, PLS is a linear transformation of coordinates. If we start from 20 coordinates (the wavelengths contained in the window), we can’t use more components than that to transform our data. That is the reason we write min(20, win_size): we loop over up to 20 PLS components, unless the window is narrower than that.
The other thing to note is the position of the window, which is defined as the position of its first point. The window runs from pixel i to pixel i+win_size.
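These two conventions are easy to sanity-check on a toy array (the shapes below are made up for illustration):

```python
import numpy as np

X = np.zeros((5, 100))   # toy data: 5 samples, 100 wavelength points
win_size = 12
i = 40                   # window position = index of its first point

# The window spans pixels i to i+win_size (exclusive on the right)
window = X[:, i:i+win_size]
print(window.shape)      # → (5, 12)

# Cap the number of PLS components at the window width (or 20, whichever is smaller)
max_components = min(20, win_size)
print(max_components)    # → 12
```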
The rest is standard Python. We populate the 2D rmse array in the nested loop, and we search for its minimum once the loop is over. You can optionally display the RMSE map in this parameter space, which may give further insight into the optimisation process.
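The minimum search is worth a closer look: masking with np.nonzero guards against zero entries left over from the array initialisation, and np.where returns the row and column indices of the minimum. Here it is on a tiny made-up RMSE map:

```python
import numpy as np

# Toy RMSE map: rows = window positions, columns = (number of components - 1)
rmse = np.array([[5.0, 4.2, 4.8],
                 [4.9, 3.7, 4.1],
                 [5.2, 4.4, 4.6]])

# np.nonzero masks out any zero entries left over from initialisation
rmseminx, rmseminy = np.where(rmse == np.min(rmse[np.nonzero(rmse)]))
print(rmseminx[0], rmseminy[0])   # → 1 1: window position 1, component index 1
```

If the array is guaranteed to be fully populated, np.unravel_index(np.argmin(rmse), rmse.shape) gives the same answer in one call.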
Optimising the window size
The function mwpls requires the window size to be specified beforehand. As discussed, we chose to do it this way to avoid another level in the nested for loop, which would make the function unnecessarily convoluted. The other reason is that we usually don’t need a fine grid of window sizes. A few values will generally be enough.
For instance, this is the typical code you can run
```python
win_size = [20, 40, 60, 80, 100]
for winsize in win_size:
    win_pos, pls_comp, rmse = mwpls(X, y, winsize, plot=False)
```
We just specified 5 values for the window size, in steps of 20 pixels. Looking at the code output you should get something like this (the output might differ slightly depending on the initialisation of the random seed for cross-validation):
```
100% completed
Suggested window position:  25 7 [5.3609496]
100% completed
Suggested window position:  134 10 [4.56977221]
100% completed
Suggested window position:  124 13 [4.26532923]
100% completed
Suggested window position:  104 16 [3.95596715]
100% completed
Suggested window position:  84 17 [4.36353938]
```
This immediately gives you insight into what is going on. A window of 80 pixels starting at position 104 produces a minimum RMSE of 3.96 with 17 PLS components (the printed value, 16, is the component index, so n_components = 16 + 1). This is a much better result than the full-spectrum PLS: there the RMSE was above 5, so we gained more than 20% just by selecting the right portion of the spectrum.
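Instead of eyeballing the printout, the loop above can be extended to track the best combination programmatically. To keep this sketch self-contained, `fake_mwpls` below is a hypothetical stand-in that returns the (position, component index, RMSE) triples printed above; in practice you would call the real mwpls(X, y, winsize) instead.

```python
# Stand-in for mwpls, returning the results printed above for each window size
def fake_mwpls(X, y, winsize):
    results = {20: (25, 7, 5.36), 40: (134, 10, 4.57), 60: (124, 13, 4.27),
               80: (104, 16, 3.96), 100: (84, 17, 4.36)}
    return results[winsize]

best = None
for winsize in [20, 40, 60, 80, 100]:
    pos, ncomp, rmse = fake_mwpls(None, None, winsize)
    # Keep the (window size, position, component index, RMSE) with the lowest RMSE
    if best is None or rmse < best[3]:
        best = (winsize, pos, ncomp, rmse)

print(best)   # → (80, 104, 16, 3.96)
```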
Now, let’s take the optimised values, run the PLS and visualise the results.
```python
rmseminy = 16
rmseminx = 104
win_size = 80

pls = PLSRegression(n_components=rmseminy+1)
# Fit
pls.fit(X[:, rmseminx:rmseminx+win_size], y)
# Cross-validation
y_cv = cross_val_predict(pls, X[:, rmseminx:rmseminx+win_size], y, cv=10)
# Calculate scores
score_mw = r2_score(y, y_cv)
rmse_mw = np.sqrt(mean_squared_error(y, y_cv))
# Fit a line to the CV predictions vs the response
z = np.polyfit(y, y_cv, 1)

with plt.style.context(('ggplot')):
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.scatter(y_cv, y, c='red', edgecolors='k')
    # Plot the best-fit line
    ax.plot(np.polyval(z, y), y, c='blue', linewidth=1)
    # Plot the ideal 1:1 line
    ax.plot(y, y, color='green', linewidth=1)
    plt.title('$R^{2}$ (CV): %5.2f' % score_mw)
    plt.xlabel('Predicted value')
    plt.ylabel('Measured value')
    plt.show()
```
While there is still some work to do, the result is undoubtedly much better than the previous one. One might try applying derivatives or other preprocessing tricks to improve the result further. For now, I hope you got the basic idea of implementing a moving window PLS in your data analysis pipeline.
One last thing. As a bonus, you can highlight the portion of the spectra that gives the best results. This can help you assess which features in the spectra are most useful for prediction, and perhaps gain some intuition about the process at work. Here’s how you can do that.
```python
with plt.style.context(('ggplot')):
    fig, ax = plt.subplots(figsize=(9, 6))
    plt.plot(wl, X.T)
    plt.xlabel("Wavelength (nm)")
    # Shade the optimal window found by the moving-window search
    collection = collections.BrokenBarHCollection.span_where(
        wl, ymin=0, ymax=1,
        where=np.logical_and(wl >= wl[rmseminx], wl <= wl[rmseminx+win_size]),
        facecolor='green', alpha=0.5)
    ax.add_collection(collection)
    plt.show()
```
That’s it for now. Thanks for reading and see you next time.