Imagine you have fitted a regression model (say a PLS regression model) to your data. Now you'd like to evaluate how good the model is. Generally, you may look at metrics such as the coefficient of determination R^2, the root mean squared error (RMSE), the mean absolute error (MAE), or other suitable metrics. In addition, if you have a validation data set (you should!), you can evaluate the performance of the model on it.

While all of the above operations will give you a good idea about the quality of your model, there are more diagnostics that you can run on your model, to evaluate whether the assumptions behind the choice of model were justified.

In this post, I’ll describe a few diagnostic plots for linear regression, how to build them and how to interpret them. I’ll compare diagnostic plots of regression models trained on two datasets:

- NIR data taken from the paper *In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy* by Ji Wenjun *et al*. The data is available here. The data contains NIR spectra of soil samples and the related Total Organic Content (TOC).
- "peach_spectra+brixvalues.csv", available at our GitHub repository and containing the NIR spectra and related brix values of fresh peaches.

## Residual plot

The simplest diagnostic operation is to calculate the *residuals* and generate a so-called *residual plot*.

A residual value is the difference between the training value (observed value) and the corresponding prediction from the model. A residual plot is obtained by plotting the residuals versus the predicted values.
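As a minimal sketch, with hypothetical observed and predicted values standing in for a fitted model, the residuals and a residual plot can be produced like this:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the example also runs without a display
import matplotlib.pyplot as plt

# Hypothetical observed and predicted values, standing in for a fitted model
y_obs = np.array([10.2, 11.5, 9.8, 12.1, 10.9])
y_pred = np.array([10.0, 11.8, 10.1, 11.9, 11.2])

# A residual is observed minus predicted
residuals = y_obs - y_pred

# Residual plot: residuals versus predicted values
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color="k", linewidth=0.8)  # zero-residual reference line
ax.set_xlabel("Predicted values")
ax.set_ylabel("Residuals")
fig.savefig("residual_plot.png")
```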

This is a quick way to ascertain if any trend is present in the residuals, which may hint at deeper issues with the model. When we fit a linear model, we (implicitly) make the assumption that a linear relation is present between the observed variable and a suitable linear transformation (such as PLS) of our data.

Mathematically, we assume that the i-th observed value y_i is connected to the M predictors via

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_M x_{i,M} + e_i

The beta parameters are the coefficients of the regression, and the final term e_i is the error term. The linearity assumption means that we exclude any dependence that is quadratic (or higher order) in the predictors. Moreover, we assume that the errors are normally distributed, have zero mean, and constant standard deviation sigma (i.e. independent of i).
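To make these assumptions concrete, here's a tiny simulation with hypothetical coefficients: the errors e_i are drawn with zero mean and a constant standard deviation, independent of i.

```python
import numpy as np

rng = np.random.default_rng(42)
n, M = 500, 3

# Hypothetical predictors and coefficients
X = rng.normal(size=(n, M))
beta0 = 1.0
beta = np.array([0.5, -2.0, 0.3])

# Errors: normally distributed, zero mean, constant sigma (independent of i)
sigma = 0.1
e = rng.normal(0.0, sigma, size=n)

# The linear model: y_i = beta_0 + sum_j beta_j x_{i,j} + e_i
y = beta0 + X @ beta + e
```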

A residual plot will help us evaluate (visually, at least) if all of these conditions are met, and here’s a worked example.

Let’s start with the imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
```

Now we can load the datasets

```python
# Read data from Github
base_url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/'

data1 = pd.read_excel("../../data/File_S1.xlsx")
X1 = data1.values[1:, 8:].astype('float32')
y1 = data1["TOC"].values[1:].astype("float32")

data2 = pd.read_csv(base_url + 'peach_spectra_brix.csv')
y2 = data2['Brix'].values
X2 = data2.drop(['Brix'], axis=1).values

X = [X1, X2]
y = [y1, y2]
# Labels used in the plots below
dataset_name = ["TOC in Soil", "Brix in Peaches"]
```

Now let's train a basic PLS regression on each dataset and calculate cross-validated predictions. For each regression, we also calculate the residuals

```python
parameters = {'n_components': np.arange(1, 20, 1)}

y_cv = []
residuals = []
for i in range(2):
    pls = GridSearchCV(PLSRegression(), parameters,
                       scoring='neg_mean_squared_error', verbose=0, cv=10)
    pls.fit(X[i], y[i])
    print(i, pls.best_estimator_)
    # Flatten the (n, 1) predictions so the residuals stay one-dimensional
    y_cv.append(cross_val_predict(pls.best_estimator_, X[i], y[i], cv=10).ravel())
    residuals.append(y[i] - y_cv[i])
```

These are the "predicted-vs-reference" scatter plots for the two cases

And these are the corresponding residual plots, which are obtained by plotting the residuals versus the model-predicted values

At first glance, the distribution of the residuals looks fairly random in the second case, while there are perhaps trends visible in the plot on the left. That is obviously related to the fact that a simple PLS model, as developed above, fails to capture the data variability at either end of the distribution.

In addition, there also seems to be an issue with the assumption of constant variance (sigma) of the residuals. It seems that the spread of the residuals is higher for higher predicted values (plot on the left).

If the spread of the residuals were indeed not constant, we would be in the presence of so-called heteroscedasticity (the opposite of which is called homoscedasticity). This (unfortunate) name just means that the variance is not the same across the samples. In this case, it would mean that the prediction would (on average) be off more often at the higher end of the distribution, indicating perhaps a deviation from the normality assumption (i.e. the variance is functionally dependent on the mean).
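A quick synthetic sketch (all values hypothetical) shows what heteroscedasticity looks like in practice: the noise standard deviation is made to grow with the predictor, and the residual spread of a straight-line fit grows accordingly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 400)

# Noise whose standard deviation grows with x: heteroscedastic by construction
y = 2.0 * x + rng.normal(0.0, 0.2 * x)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# The residual spread is larger at the high end of the range
low_spread = residuals[x < 5].std()
high_spread = residuals[x >= 5].std()
```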

## Q-Q plots

This brings us to the second type of diagnostic plots, called Q-Q plots (Quantile-Quantile plots).

Q-Q plots are a type of probability plot that compares the theoretical distribution of normally distributed points with the specific dataset we'd like to evaluate. In our case, we compare the residuals with what would be expected if these residuals were normally distributed.

As the name suggests, the Q-Q plot is constructed by charting the quantiles of the empirical distribution of the data versus the theoretical quantiles of a normal distribution. If the two variables are identically distributed (i.e. if the residuals are normally distributed), the Q-Q plot is a line with slope 1 passing through the origin. If the two variables are linearly related, the Q-Q plot is still a line, but its slope and intercept may change.

Any deviation from the linear behaviour is an indication that the normality assumption on the residuals (hence the linearity assumption on the model) may not hold.
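The construction can be sketched by hand (using a hypothetical normal sample in place of the residuals): sort the data and chart it against the normal quantiles at the same plotting positions; for normally distributed data the points fall close to a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # stand-in for the residuals

# Empirical quantiles: just the sorted sample
empirical_q = np.sort(sample)

# Theoretical quantiles of a standard normal at the same plotting positions
positions = (np.arange(1, sample.size + 1) - 0.5) / sample.size
theoretical_q = stats.norm.ppf(positions)

# For normal data, empirical vs theoretical quantiles lie on a near-perfect line
slope, intercept = np.polyfit(theoretical_q, empirical_q, 1)
r = np.corrcoef(theoretical_q, empirical_q)[0, 1]
```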

Q-Q plots can be directly generated using the relevant function from the statsmodels library. For the first set of residuals, this looks like this

```python
with plt.style.context(('seaborn-v0_8')):
    fig, ax = plt.subplots(figsize=(5, 5))
    sm.qqplot(residuals[0], fit=True, line='r', ax=ax)
    ax.set_title("TOC in Soil")
    plt.show()
```

and similarly for the second residuals. Here’s the result

Now, it’s easy to visualise that, in the first case, the deviation from linear behaviour is quite significant at the lower end of the distribution. In the second case, the distribution is closer to linear, even though there may be a deviation in the range from 0 to 0.5.

## Scale-location plots

This brings us to the last diagnostic plot we are covering today. The scale-location plot is the scatter plot of the predicted values versus the square root of the absolute value of the standardised residuals, and it is used primarily to test for heteroscedasticity issues.

To produce this plot, we first need to calculate the standardised residuals. These are obtained by dividing the residuals by an **estimate** of their standard deviation. This concept is however a bit tricky, and you can find any number of incorrect estimations out there.

The naive approach would be to take the residuals array and divide it by its own standard deviation. This is incorrect!

The correct calculation is based on the fact that each residual is interpreted (independently of any other residual) as a particular random sample extracted from a statistical distribution (which we don't know, in general). The point here is that we seek an **estimate** of the standard deviation of this (unknown) distribution, and we use this value to normalise each residual.

The correct estimation of the standardised residuals t_i is

t_i = \frac{ e_i }{\sigma \sqrt{1-h_i}}

where h_i is called the leverage which, for linear regression, can be calculated as (see page 99 of Dunn & Smyth)

h_i = \frac{1}{n} + \frac{\left(e_i - \bar{e}\right)^2}{ss_e}.

Finally, ss_e is the total sum of squares of the residuals array,

ss_e = \sum_{i=1}^{N} \left(e_i - \bar{e}\right)^2,

with \bar{e} being the average of the residual array.
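A quick sanity check of these formulas on a toy residual array (hypothetical numbers): with the leverage defined as above, the n terms 1/n sum to 1 and the squared-deviation terms sum to ss_e/ss_e = 1, so the leverages always add up to 2.

```python
import numpy as np

# Toy residual array (hypothetical values)
e = np.array([0.3, -0.1, 0.4, -0.6, 0.0])

# Total sum of squares of the residuals
ss_e = np.sum((e - e.mean())**2)

# Leverage of each point: 1/n plus its share of the squared deviations
h = 1.0 / e.size + (e - e.mean())**2 / ss_e

# Standardised residuals t_i = e_i / (sigma * sqrt(1 - h_i))
t = e / (e.std() * np.sqrt(1.0 - h))
```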

Here’s the code to calculate the standardised residuals for the two datasets:

```python
leverage = []
standardised_residuals = []
for i in range(2):
    sse = np.sum((residuals[i] - np.mean(residuals[i]))**2)
    lvrg = 1. / residuals[i].shape[0] + (residuals[i] - np.mean(residuals[i]))**2 / sse
    leverage.append(lvrg)
    std_res = residuals[i] / (np.std(residuals[i]) * np.sqrt(1 - lvrg))
    standardised_residuals.append(std_res)
```

And here’s the code to generate the scale-location plot:

```python
with plt.style.context(('seaborn-v0_8')):
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    for i in range(2):
        ax[i].scatter(y_cv[i], np.sqrt(np.abs(standardised_residuals[i])))
        ax[i].set_xlabel('Predicted values')
        ax[i].set_ylabel(r'$\sqrt{|\mathrm{Standardised\ residuals}|}$')
        ax[i].set_title(dataset_name[i])
    fig.suptitle('Scale-location plot', fontsize=16)
    plt.show()
```

In the scatter plot on the right, the standardised residuals seem to have a fairly uniform spread, perhaps with the exclusion of the points around 20 brix. On the other hand, the plot on the left shows that the spread of the standardised residuals is larger in the middle of the range, compared to the extremes. In addition, the larger spread is mostly due to some outlying data points with standardised residuals above 1.25.

From this analysis, one may conclude that the Soil Carbon dataset contains at least two groups. While the linearity assumption may hold for each individual group, when they are put together this is no longer true. Therefore, a single regression model may not be the best choice for this specific dataset.

This concludes my brief survey of diagnostic plots on real datasets. I leave you with a few reading suggestions, and until next time.

Daniel

## See also

- Regression validation on Wikipedia
- STAT 462 course material by Penn State
- Nicolas Christou’s (UCLA) handouts on standardised residuals [PDF]

## Credits

Cover photo by Egor Kamelev on Pexels