The process of developing and optimising a regression model almost invariably requires a sequence of steps. These steps can be combined into a single predictor using the Pipeline class of scikit-learn.
In this short post, we are going to discuss two simple examples of applying a Pipeline for the optimisation of common regression models used in spectroscopy:

- Principal Component Regression (PCR)
- Partial Least Squares Regression (PLSR)

## Pipeline PCR

Principal Component Regression is the combination of two steps:

- Principal Component Analysis (PCA), which is the decomposition of the spectra into a new set of variables chosen so as to maximise the variance of the data. An introduction to PCA is available here, and many resources are available on the web.
- Linear regression on the selected number of principal components.

First, let's import the required packages:

```python
import pandas as pd
import numpy as np
from scipy.signal import savgol_filter
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
```
Second, let's import the data and preprocess it. The data is taken directly from our GitHub repo:

```python
# Read data
url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/peach_spectra_brix.csv'
data = pd.read_csv(url)

# Define arrays
X = data.values[:, 1:].astype("float32")
y = data["Brix"].values.astype("float32")

# Calculate the first derivative with Savitzky-Golay smoothing
X1 = savgol_filter(X, 21, polyorder=2, deriv=1)
```
Note that, in the last line, we calculate the first derivative and smooth the data with the Savitzky-Golay method. In principle, this step could also be included in the pipeline. That would require writing a custom transformer class, which is beyond the scope of this introductory tutorial.
Now we can define the pipeline as follows:

```python
pcr_pipe = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA()),
                     ('linear_regression', LinearRegression())])
```
The pipeline is composed of three steps: 1) standard scaling, 2) PCA decomposition and 3) linear regression. The pipeline so defined acts like any other predictor, with the steps applied in the order in which they appear in the list.
To achieve the best performance with PCR, however, we would like to find the optimal number of principal components to retain. PCA decomposes the data into a reduced array whose size is given by the number of samples or the number of wavelength bands, whichever is smaller. Usually, however, only a fairly limited number of principal components carries useful information for the model we want to develop, so we can discard the components beyond a certain optimal number.
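To make the dimensionality statement concrete, here is a quick sketch with synthetic data (the array sizes are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 50 spectra, each with 600 wavelength bands
X = np.random.rand(50, 600)

# With no n_components given, PCA keeps min(n_samples, n_bands) components
scores = PCA().fit_transform(X)
print(scores.shape)  # (50, 50)
```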
To find this optimal value we can run a cross-validation experiment, varying the relevant parameter in this way:

```python
# Define the parameter to vary: the number of principal components
parameters = {'pca__n_components': np.arange(1, 31, 1)}

# Use GridSearchCV to find the optimal parameter(s) in cross-validation
pcr = GridSearchCV(pcr_pipe, parameters, scoring='neg_mean_squared_error')

# Fit the instance
pcr.fit(X1, y)

# Print the result
print(pcr.best_estimator_)
```
The 'scoring' parameter in GridSearchCV specifies the metric we want to optimise. In this case, we chose the usual Mean Squared Error (MSE); note the 'neg_mean_squared_error' spelling, since scikit-learn maximises scores. Fitting the estimator means calculating the MSE in cross-validation for each value in the parameter grid (i.e. for a number of principal components ranging from 1 to 30 in steps of 1). Once the estimator is fitted we can print the result. In our case the output reads:

```
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=6)),
                ('linear_regression', LinearRegression())])
```
The cross-validation procedure has established that 6 is the number of principal components that minimises the MSE. Now, to fit the estimator to the training set (note: here I'm using the same set of data for the sake of illustration, but you get the drift), you simply have to type:

```python
pcr.best_estimator_.fit(X1, y)
```
That's it. The pcr.best_estimator_ instance (i.e. the optimised pipeline) works just like any other estimator. In particular, it has the "fit" and "predict" methods we know and love.
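As a self-contained sketch of that point (synthetic data stands in for the spectra here, so the numbers are purely illustrative), the refitted pipeline can predict and score like any scikit-learn estimator:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spectra: 50 samples x 100 bands
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 100))
y = rng.normal(size=50)

pcr_pipe = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA()),
                     ('linear_regression', LinearRegression())])
pcr = GridSearchCV(pcr_pipe, {'pca__n_components': np.arange(1, 11)},
                   scoring='neg_mean_squared_error')
pcr.fit(X1, y)

best = pcr.best_estimator_.fit(X1, y)   # refit on the full set
y_pred = best.predict(X1)               # predictions, shape (50,)
r2 = best.score(X1, y)                  # R^2 on the same data
```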

## Pipeline PLSR

The previous example can be replicated to define a Partial Least Squares Regression model and optimise the number of latent variables used by the regressor. The code is:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import MinMaxScaler

# Define the parameter to vary: the number of latent variables
parameters_pls = {'pls__n_components': np.arange(1, 16, 1)}

plsr_pipe = Pipeline([('scaler', MinMaxScaler()),
                      ('pls', PLSRegression())])

plsr = GridSearchCV(plsr_pipe, parameters_pls, scoring='neg_mean_squared_error')
plsr.fit(X1, y)
print(plsr.best_estimator_)
```

In this case we are chaining a MinMaxScaler and a PLSRegression instance. (Note that a PLSRegression instance in scikit-learn applies standard scaling by default, hence adding a StandardScaler to the pipeline would have been redundant.) All the other commands are the same as in the previous case. Here, the GridSearchCV fit will return the optimal number of latent variables for the PLS regression.
That’s it for today. As always, thanks for reading and don’t forget to share this content with friends and colleagues!
Until next time.
Daniel
