If you have been in and around statistical learning for some time, you likely came across the expression bias-variance trade-off. In the good books – and the good Wikipedia articles like this one – the bias-variance trade-off (which we are going to call BVT from now on) is explained in terms of the error of prediction once a statistical model is developed.

That is undeniably correct, but, as is the style of this blog, we would like to take some time to understand how the BVT works in practice, and specifically how it affects our PLS regression models.

We are going to write a gentle introduction, trying to understand the issue using the analogy of a physical measurement of a quantity. Then we’ll describe the practical effects of the BVT on a PLS model, and how it drives the choice of the number of latent variables for an optimal model. We’ll also briefly muse on how this affects the choice of hyper-parameters in the cross-validation procedure.

## Bias-Variance trade-off: the measurement analogy

To conceptualise the problem, I found it useful to think of a prediction done by a statistical model as a measurement of sorts. Think about a scale, a voltmeter, a temperature probe, etc. What is the error in each measurement operation?

Quoting Wikipedia again: the observational error (or measurement error) is the difference between a measured value of a quantity and its true value. We don’t know the true value; we want to estimate it through measurement. Given the inherent variability of the process, our estimate is affected by an error.

Measurement errors are divided into systematic errors and random errors.

Understanding the difference between systematic errors and random errors is key, in my opinion, to get a grip on the BVT.

So let’s begin: what is a systematic error? That is an offset present in the measurement instrument that systematically biases (oh no, I’m giving it away already!) the measurement. A systematic error is inherent in the measurement system and doesn’t decrease with repeated measurement.

The random error, as its name suggests, is a stochastic variability of the measurement outcome due to all possible factors that influence the measurement and are unaccounted for. Typically the random error depends on the environment or on accidental factors affecting the measurement. This does indeed decrease by taking multiple readings, as the random variations tend to cancel out by averaging.
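This is easy to see in a quick simulation (all numbers here are made up for illustration, not taken from any real instrument): averaging many readings cancels the random error but leaves the systematic offset untouched.

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 10.0        # the quantity we want to measure (hypothetical)
systematic_offset = 0.5  # instrument bias: does not average out
random_sigma = 2.0       # random error: averages out over repeated readings

# Simulate 10,000 repeated readings from the same instrument
readings = true_value + systematic_offset + rng.normal(0.0, random_sigma, 10_000)

# The mean of many readings converges to true_value + systematic_offset,
# not to true_value: the random error cancels, the bias stays
print(readings.mean())  # close to 10.5, not 10.0
```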

OK, here’s the next step. Let’s think of a statistical model as a sort of calibrated instrument, which we can use to estimate the value of a parameter. Sure enough, we can associate an error with this type of measurement process (a prediction).

The error can be divided into a systematic and a random component, just as we did before (in fact there is a third component, which we’ll get to later). We take the systematic error and we give it the name bias. We take the random error and call it variance.

I hope we were able to build some intuition there. So let’s go ahead and try to be a tiny bit more rigorous.

## Bias-Variance trade-off: the maths

Suppose we have a target variable y and a vector of inputs X. For instance X is the collection of spectra and y is the variable we are trying to model.

Now suppose also that a true relation between those variables exists. By that I mean that there exists a function f(X) such that y = f(X) + e. The term e represents the error on the target variable. This is related to the third term we hinted at before, and comes from the fact that the error in the predictions of a calibrated model must also depend on the error in the variables used for training it: not only X but also y.

Suppose we use our training data to fit a model. Let’s call this model \hat{f}: an approximation to the true relation f. Once trained, we use \hat{f} to make predictions on new spectra. We call these \hat{y} = \hat{f}(X).

OK, so what is the error on \hat{y}? If you follow the proof on the Wikipedia page you find out that the error is

\mathrm{Error} = \mathrm{Bias}^{2} + \mathrm{Variance} + \sigma^{2}

\sigma^{2} is the third term we mentioned already. It is called the irreducible error, because it comes from the noise on the target variable itself and cannot be reduced by any choice of model.

The bias is the systematic error, i.e. the difference between the average prediction of the trained model \hat{f} and the true relation f(X). Just like the systematic error of an instrument, it does not decrease by taking more predictions.

The variance is the spread of the predictions: it measures how much \hat{f} (and therefore \hat{y}) would change if we trained the model on a different data set. Just like the random error of an instrument, it tends to average out over many realisations of the training process.
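We can check this decomposition numerically with a toy setup (nothing to do with the spectral data: a known true relation, a straight-line model, all parameters chosen just for illustration). We fit the model to many independent training sets and verify that, at a fixed point, the expected squared error matches bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)  # the "true relation" in this toy example

sigma = 0.3   # noise on the target variable -> irreducible error sigma^2
x0 = 1.0      # fixed point where we evaluate the decomposition

# Fit a straight line to many independent training sets and predict at x0
preds = []
for _ in range(5000):
    x_train = rng.uniform(0, 2, 30)
    y_train = f(x_train) + rng.normal(0.0, sigma, 30)
    coeffs = np.polyfit(x_train, y_train, 1)
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

# Bias^2 and variance of the model at x0
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()

# Expected squared prediction error at x0, estimated on fresh noisy targets
y_new = f(x0) + rng.normal(0.0, sigma, 5000)
mse = np.mean((y_new - preds) ** 2)

# The two numbers agree closely: mse ~ bias2 + variance + sigma^2
print(mse, bias2 + variance + sigma ** 2)
```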

So there you go: here’s a simple framework to understand the prediction error of a statistical model.

## Hang on, where is the trade-off then?

I’m glad you asked.

This is the point where the analogy between a statistical model and a measurement instrument becomes a bit shaky. Bias and variance of a statistical model are not fixed: they depend on how that model is trained. Even if we keep the training data fixed, the way we set the hyper-parameters of the model affects both bias and variance.

Importantly, bias and variance do not vary independently: a decrease in bias generally corresponds to an increase in variance, and vice versa.

To be more specific: to decrease the bias, we need to increase the model complexity so that it fits the training data better. In doing so we run the risk of overfitting. A model that describes every nook and cranny of the training data, including the noise, certainly has low bias, but is expected to generalise poorly to new data. That is, it is expected to have high variance.
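Here is a little sketch of that effect outside of PLS (synthetic data and polynomial models, chosen purely for illustration): as the polynomial degree grows, the training error keeps shrinking while the error on fresh data eventually grows again.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: noisy samples of a smooth function
x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

results = {}
for degree in (1, 3, 15):
    # Higher degree = more complex model = lower bias...
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    # ...but past a point the error on fresh data rises: high variance
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_err, test_err)
    print(degree, train_err, test_err)
```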

Building a model that is consistent with the training data and still generalises well means finding this famous trade-off.

PLS regression can help us to appreciate this point in practice.

## Bias-Variance trade-off in PLS regression

For this tutorial we will use a freely available Vis-NIR data set from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data set is available at this link.

The purpose of this exercise is to build PLS models with an increasing number of latent variables and keep track of the root mean square error in prediction and cross validation.

The calibration error gives us a measure of the bias: it is the error we measure when predicting values for the same data used for training. The cross-validation error, on the other hand, is a better measure of the variance, i.e. of the ability of the model to generalise to data not used for training.

Let’s begin with the required imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score

from scipy.signal import savgol_filter
```

We now define a simple PLS regression function, which fits a model on train and test set and returns the corresponding predictions.

```python
def base_pls(X_train, y_train, X_test, n_components):

    # Define PLS model and fit it to the train set
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_train, y_train)

    # Prediction on train set
    y_c = pls.predict(X_train)

    # Prediction on test set
    y_p = pls.predict(X_test)

    return (y_c, y_p)
```

Let’s import the data and perform some pre-processing

```python
raw = pd.read_excel("../data/File_S1.xlsx")

X = raw.values[1:, 13:].astype('float32')
y = raw.values[1:, 1]

wl = np.linspace(350, 2500, num=X.shape[1], endpoint=True)

X = savgol_filter(X, 11, polyorder=2, deriv=2)
```

and take a look at the data.

```python
with plt.style.context(('seaborn')):
    plt.plot(wl, X.T, linewidth=1)
    plt.xlabel('Wavelength (nm)')
    plt.ylabel('Second Derivative Reflectance')
    plt.show()
```

Now we are ready to write the key part of the script, which essentially amounts to a cross-validation procedure from which we extract the root mean square error for the training and the test set for every split. At the end we average the results over the number of folds, so to obtain an estimate of the prediction and cross-validation error as a function of the number of latent variables.

```python
# Define k-fold cross-validation with 5 splits
cval = KFold(n_splits=5, shuffle=True)
cval.get_n_splits(X, y)

# Max number of latent variables to consider
max_lv = 50

# Define prediction arrays
y_c = np.zeros_like(y)
y_cv = np.zeros_like(y)

# Define RMSE arrays
rmse_c = np.zeros(max_lv).astype('float32')
rmse_cv = np.zeros(max_lv).astype('float32')

# Loop over the latent variables
for comp in range(max_lv):

    rmsec = []
    rmsecv = []

    # Cross-validation
    for train, test in cval.split(X):

        # Run basic PLS
        y_t, y_p = base_pls(X[train, :], y[train], X[test, :], comp + 1)

        # Calculate metrics corresponding to this split
        rmsec.append(np.sqrt(mean_squared_error(y[train], y_t[:, 0])))
        rmsecv.append(np.sqrt(mean_squared_error(y[test], y_p[:, 0])))

    # Average the RMSE over the splits for this number of latent variables
    rmse_c[comp] = np.mean(np.array(rmsec))
    rmse_cv[comp] = np.mean(np.array(rmsecv))
```

Finally, let’s plot the results.

```python
with plt.style.context(('seaborn')):
    plt.plot(rmse_c, 'o-r', label="RMSE Calibration")
    plt.plot(rmse_cv, 'o-b', label="RMSE Cross Validation")
    plt.ylabel('RMSE')
    plt.xlabel('Number of LV included')
    plt.legend()
    plt.show()
```

The previous plot encapsulates the topic of this post. As we include more and more latent variables the model is adapting better and better to the training data. This is demonstrated by the RMSE calibration getting smaller and smaller. The bias is decreasing with increasing number of latent variables.

The cross-validation RMSE, on the other hand, decreases for the first 18 latent variables and then tends to increase again. This will not surprise those of you familiar with model development through cross-validation. The increase in the cross-validation RMSE corresponds to an increase in variance, which is another way of saying that the model is overfitting. It describes the training data very well, but generalises poorly to data outside of the training set.

The trade-off has to be found somewhere in the region where the cross-validation RMSE is minimal. That is the best we can do given the training data and the other hyper-parameters of the model. Obviously one could build models by varying some of these other hyper-parameters, for instance trying different preprocessing strategies or playing with band selection. Regardless, the rationale for choosing the model is the same: we want our model to predict future data well, so we want to keep the variance to a minimum.
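In code, the choice boils down to locating the minimum of the cross-validation curve. Here's a minimal sketch on a made-up RMSE curve (not the rmse_cv array computed from the soil data above):

```python
import numpy as np

# Made-up cross-validation RMSE values for 1, 2, ..., 7 latent variables
rmse_cv = np.array([2.10, 1.60, 1.30, 1.10, 1.05, 1.08, 1.20])

# The optimal number of latent variables sits at the minimum of the curve
# (+1 because array indices start at zero)
opt_lv = np.argmin(rmse_cv) + 1
print(opt_lv)  # → 5
```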

That’s us for today. Thanks for reading and until next time,

Daniel