Updates and additions to the PLS Regression code

The original post on PLS Regression in Python is still one of the most popular posts on the entire blog, despite being a few years old by now (and despite the presence of many other very good tutorials out there!). Over the years, I’ve received a number of requests for further explanations and for new or improved content, so I finally decided to publish this companion post.

Partial Least Squares (PLS) Regression is one of the mainstays of statistical learning applied to chemometrics. In spite of the new and improved methods developed over the years, which can often claim better performance, PLS regression is still pretty much the benchmark against which other approaches are compared.

That is, I believe, the reason why my old post is still going strong, and why I’ve decided to add new material here to complement it. Keep in mind that some of the material presented here already exists elsewhere on this blog (or in the internet ocean), and that large chunks of it can also be applied to other regression techniques. I believe it is useful, however, to have all of this content collected in one place in this post update.

Needless to say: if you haven’t read the original post, feel free to do so now, as here we assume that content is known. The sections below contain some updated code (compared to the original version) and some additional material, in no particular order. The post is a work in progress, and I will surely add content to it as new ideas come about.

Optimising the number of latent variables

The original post contains a function, called optimise_pls_cv, that finds the optimal number of latent variables (LV, or components) for a given dataset, where optimal means the number of LVs that minimises the root mean squared error (RMSE) in cross-validation. This function can be replaced by more compact code based on scikit-learn’s GridSearchCV. Here’s how it works.

First, we write the imports.
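This is a minimal set of imports covering all the steps below; it assumes that numpy, scipy, pandas, scikit-learn and matplotlib are installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict
```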

Then we load the data and apply the required pre-processing (just second derivative smoothing here, but you can get fancier).
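Here’s a sketch of this step. The file name, the name of the target column and the Savitzky-Golay parameters are placeholders: adapt them to your own dataset.

```python
# Load the dataset ("data.csv" and the "target" column are placeholders).
data = pd.read_csv("data.csv")
y = data["target"].values                    # quantity to be predicted
X = data.drop(columns=["target"]).values     # spectra, one row per sample

# Second-derivative Savitzky-Golay smoothing along the wavelength axis
# (window length and polynomial order are just example values).
X2 = savgol_filter(X, window_length=17, polyorder=2, deriv=2, axis=1)
```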

Finally, we use the grid-search procedure to find the optimal number of components with 10-fold cross-validation.
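A compact version of the search could look like this. The range of 1 to 20 latent variables is just an example, and we also compute the cross-validated predictions of the best model, which will be used for plotting later.

```python
# Grid of candidate numbers of latent variables.
parameters = {"n_components": np.arange(1, 21)}

# Score each candidate by (negative) RMSE in 10-fold cross-validation.
pls = GridSearchCV(PLSRegression(), parameters,
                   scoring="neg_root_mean_squared_error", cv=10)
pls.fit(X2, y)

print("Optimal number of latent variables:", pls.best_params_["n_components"])
print("Cross-validation RMSE:", -pls.best_score_)

# Cross-validated predictions of the best model, used for plotting below.
y_cv = cross_val_predict(pls.best_estimator_, X2, y, cv=10)
```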

Plotting the results

Endless options are available to display the results, but here’s a utility function that produces a scatter plot of measured versus predicted values, plots a linear fit, and adds a 45-degree line for reference.
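A possible version of such a function (the styling is of course up to you) is the following.

```python
def plot_predictions(y_true, y_pred):
    """Scatter plot of measured vs predicted values, with a linear fit
    and a 45-degree line for reference."""
    # Linear fit of predicted versus measured values.
    z = np.polyfit(y_true, y_pred, 1)
    # Two points spanning the measured range, used to draw the straight lines.
    span = np.array([y_true.min(), y_true.max()])

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, c="red", edgecolors="k", label="Data")
    ax.plot(span, np.polyval(z, span), c="blue", linewidth=1, label="Linear fit")
    ax.plot(span, span, c="green", linewidth=1, label="45-degree line")
    ax.set_xlabel("Measured")
    ax.set_ylabel("Predicted")
    ax.legend()
    plt.show()
```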

The function can be applied to the cross-validation data above.
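For instance, using the measured values y and the cross-validated predictions y_cv computed in the grid-search step:

```python
plot_predictions(y, y_cv)
```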

For a more interactive option, Plotly is an excellent place to start.
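As a starting point, here’s a minimal Plotly sketch of the same measured-versus-predicted plot. It assumes Plotly is installed, and the OLS trendline also requires statsmodels.

```python
import plotly.express as px

# Interactive scatter of measured vs predicted values with an OLS fit line.
fig = px.scatter(x=y, y=y_cv,
                 labels={"x": "Measured", "y": "Predicted"},
                 trendline="ols")
# Add a dashed 45-degree reference line.
fig.add_shape(type="line",
              x0=y.min(), y0=y.min(), x1=y.max(), y1=y.max(),
              line=dict(dash="dash"))
fig.show()
```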

Exporting the data

One option to export the results of the cross-validation process (or any predicted versus actual data, in fact) in readable form is to combine the arrays into a pandas DataFrame and then export it to a CSV file.
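For example, with the y and y_cv arrays from before (the output file name is just a placeholder):

```python
# Combine measured and cross-validated predicted values into a DataFrame
# and write it to a CSV file.
results = pd.DataFrame({"measured": y, "predicted": y_cv})
results.to_csv("predictions.csv", index=False)
```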

This will give you the option of exporting the data for further processing with different software packages.

Conclusions… but it’s a WIP

As mentioned in the opening, I consider this post a work in progress. I’m sure there will be many more ideas that I will come up with, or that you, dear reader, will be able to suggest. They will be continually added here.

But for now, thanks for reading and until next time,
Daniel