Updates and additions to the PLS Regression code

The original post on PLS Regression in Python is still one of the most popular posts on the entire blog, despite being a few years old by now (and despite the presence of many other very good tutorials out there!). Over the years, I’ve received a number of requests for further explanations and for new or improved content, so I finally decided to publish this companion post.

Partial Least Squares (PLS) Regression is one of the mainstays of statistical learning applied to chemometrics. In spite of the new and improved methods developed over the years, which can often claim better performance, PLS regression is still pretty much the benchmark against which other approaches are compared.

That is, I believe, the reason why my old post is still going strong, and why I’ve decided to add new material here to complement it. Keep in mind that some of the material presented here already exists elsewhere on this blog (or in the internet ocean), and that large chunks of it can also be applied to other regression techniques. I believe it is useful, however, to have all of this content collected in one place in this post update.

Needless to say: if you haven’t read the original post, feel free to do so now, as here we assume that content is known. The sections below contain some updated code (compared to the original version) and some additional material, in no particular order. The post is a work in progress, and I will surely add content to it as new ideas come about.

Optimising the number of latent variables

The original post contains a function, called optimise_pls_cv, that finds the optimal number of latent variables (LV, or components) for a given dataset, where optimal means the number of LVs that minimises the root mean squared error (RMSE) in cross-validation. This function can be replaced by more compact code based on scikit-learn’s GridSearchCV. Here’s how it works.

First, we write the imports.
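This is a minimal set of imports covering all the steps below; it assumes that numpy, scipy, pandas, scikit-learn and matplotlib are installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict
```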

Then we load the data and apply the required pre-processing (just second derivative smoothing here, but you can get fancier).
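Here’s a sketch of this step. The file name, the name of the target column and the Savitzky-Golay parameters are placeholders: adapt them to your own dataset.

```python
# Load the dataset ("data.csv" and the "target" column are placeholders).
data = pd.read_csv("data.csv")
y = data["target"].values                    # quantity to be predicted
X = data.drop(columns=["target"]).values     # spectra, one row per sample

# Second-derivative Savitzky-Golay smoothing along the wavelength axis
# (window length and polynomial order are just example values).
X2 = savgol_filter(X, window_length=17, polyorder=2, deriv=2, axis=1)
```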

Finally, we use the grid-search procedure to find the optimal number of components with 10-fold cross-validation.
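A compact version of the search could look like this. The range of 1 to 20 latent variables is just an example, and we also compute the cross-validated predictions of the best model, which will be used for plotting later.

```python
# Grid of candidate numbers of latent variables.
parameters = {"n_components": np.arange(1, 21)}

# Score each candidate by (negative) RMSE in 10-fold cross-validation.
pls = GridSearchCV(PLSRegression(), parameters,
                   scoring="neg_root_mean_squared_error", cv=10)
pls.fit(X2, y)

print("Optimal number of latent variables:", pls.best_params_["n_components"])
print("Cross-validation RMSE:", -pls.best_score_)

# Cross-validated predictions of the best model, used for plotting below.
y_cv = cross_val_predict(pls.best_estimator_, X2, y, cv=10)
```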

Plotting the results

Endless options are available to display the results, but here’s a utility function that produces a scatter plot of measured versus predicted values, plots a linear fit, and adds a 45-degree line for reference.
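A possible version of such a function (the styling is of course up to you) is the following.

```python
def plot_predictions(y_true, y_pred):
    """Scatter plot of measured vs predicted values, with a linear fit
    and a 45-degree line for reference."""
    # Linear fit of predicted versus measured values.
    z = np.polyfit(y_true, y_pred, 1)
    # Two points spanning the measured range, used to draw the straight lines.
    span = np.array([y_true.min(), y_true.max()])

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, c="red", edgecolors="k", label="Data")
    ax.plot(span, np.polyval(z, span), c="blue", linewidth=1, label="Linear fit")
    ax.plot(span, span, c="green", linewidth=1, label="45-degree line")
    ax.set_xlabel("Measured")
    ax.set_ylabel("Predicted")
    ax.legend()
    plt.show()
```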

The function can be applied to the cross-validation data above.
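For instance, using the measured values y and the cross-validated predictions y_cv computed in the grid-search step:

```python
plot_predictions(y, y_cv)
```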

For a more interactive option, Plotly is an excellent place to start.
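As a starting point, here’s a minimal Plotly sketch of the same measured-versus-predicted plot. It assumes Plotly is installed, and the OLS trendline also requires statsmodels.

```python
import plotly.express as px

# Interactive scatter of measured vs predicted values with an OLS fit line.
fig = px.scatter(x=y, y=y_cv,
                 labels={"x": "Measured", "y": "Predicted"},
                 trendline="ols")
# Add a dashed 45-degree reference line.
fig.add_shape(type="line",
              x0=y.min(), y0=y.min(), x1=y.max(), y1=y.max(),
              line=dict(dash="dash"))
fig.show()
```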

Exporting the data

One option to export the results of the cross-validation process (or any predicted versus actual data, in fact) in readable form is to combine the arrays into a pandas DataFrame and then export it to a CSV file.
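For example, with the y and y_cv arrays from before (the output file name is just a placeholder):

```python
# Combine measured and cross-validated predicted values into a DataFrame
# and write it to a CSV file.
results = pd.DataFrame({"measured": y, "predicted": y_cv})
results.to_csv("predictions.csv", index=False)
```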

This will give you the option of exporting the data for further processing with different software packages.

Conclusions… but it’s a WIP

As mentioned in the opening, I consider this post a work in progress. I’m sure there will be many more ideas that I will come up with, or that you, dear reader, will be able to suggest. They will be continually added here.

But for now, thanks for reading and until next time,
Daniel