Diagnostic plots for linear regression

Imagine you have fitted a regression model (say a PLS regression model) to your data. Now you’d like to evaluate how good is the model. Generally, you may look at metrics such as the coefficient of determination R^2, the root mean squared error RMSE, the mean absolute error MAE, or other suitable metrics. In addition, if you have a validation data set (you should!), you can evaluate the performance of the model on it.

While all of the above operations will give you a good idea about the quality of your model, there are more diagnostics that you can run on your model, to evaluate whether the assumptions behind the choice of model were justified.

In this post, I’ll describe a few diagnostic plots for linear regression, how to build them and how to interpret them. I’ll compare diagnostic plots of regression models trained on two datasets:

NIR data taken from the paper In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy by Ji Wenjun et al. The data is available here. The data contains NIR spectra of soil samples and related Total Organic Content (TOC).
“peach_spectra+brixvalues.csv”, available at our GitHub repository and containing the NIR spectra and related brix values of fresh peaches.

Residual plot

The simplest diagnostic operation is to calculate the residuals and generate a so-called residual plot.

A residual value is the difference between the training value (observed value) and the corresponding prediction from the model. A residual plot is obtained by plotting the residuals versus the predicted values.

This is a quick way to ascertain if any trend is present in the residual, which may hint at deeper issues with the model. When we fit a linear model, we (implicitly) make the assumption that a linear relation is present between the observed variable and a suitable linear transformation (such as PLS) of our data.

Mathematically, we assume that the $i$ -th observed value $y_i$ is connected to the M predictors via

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots \beta_M x_{i,M} + e_i

The beta parameters are the coefficients of the regression, and the final term $e_i$ is the error term. The linearity assumption means that we exclude any dependence that is quadratic (or higher order) in the predictors. Moreover, we assume that the errors are normally distributed, have zero mean, and constant sigma (i.e. independent of $i$ ).

A residual plot will help us evaluate (visually, at least) if all of these conditions are met, and here’s a worked example.

Let’s start with the imports

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from scipy.signal import savgol_filter

from sklearn.cross_decomposition import PLSRegression

from sklearn.model_selection import cross_val_predict, GridSearchCV

from sklearn.metrics import mean_squared_error, r2_score

Now we can load the datasets

# Read data from Github

base_url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/'

data1 = pd.read_excel("../../data/File_S1.xlsx")

X1 = data1.values[1:,8:].astype('float32')

y1 = data1["TOC"].values[1:].astype("float32")

data2 = pd.read_csv(base_url+'peach_spectra_brix.csv')

y2 = data2['Brix'].values

X2 = data2.drop(['Brix'], axis=1).values

X = [X1, X2]

y = [y1, y2]

Now let’s train a basic PLS regression on each dataset and calculate a cross-validation result. For each regression, we calculate the residuals

parameters = {'n_components':np.arange(1,20,1)}

y_cv = []

for i in range(2):

pls = GridSearchCV(PLSRegression(), parameters, scoring = 'neg_mean_squared_error',

verbose=0, cv=10)

pls.fit(X[i], y[i])

print(i, pls.best_estimator_)

y_cv.append(cross_val_predict(pls.best_estimator_, X[i], y[i], cv=10))

These are the “predicted-v-reference” scatter plot for the two cases

And these are the corresponding residual plots, which are obtained by plotting the residuals versus the model-predicted values

At a first glance, the distribution of the residuals looks fairly random in the second case, while there are perhaps trends visible in the plot on the left. That is obviously related to the fact that a simple PLS models, as developed above, fails to capture the data variability at either end of the distribution.

In addition, there seems to be also an issue with the assumption of constant variance (sigma) of the residuals. It seems that the spread of the residuals is higher for higher predicted values (plot of the left).

If the spread of residual were indeed not constant, we are in the presence of the so-called heteroscedasticity (and the opposite of that is called homoscedasticity). This (unfortunate) name just means that the distribution of variance is not the same across the samples. In this case, it would mean that the prediction would (on average) be off more often at the higher end of the distribution, indicating perhaps a deviation from the normality assumption (i.e. the variance is functionally dependent form the mean).

Q-Q plots

This brings us to the second type of diagnostic plots, called Q-Q plots (Quantile=-Quantile plots).

Q-Q plots are a type of probabilistic plots that compare the theoretical distribution of points that are normally distributed with our specific dataset which we’d like to evaluate. In our case, we compare the residuals with what would be expected if these residual were normally distributed.

As the name suggests, the Q-Q plot is constructed by charting the quantiles of the empirical distribution of the data versus the theoretical quantiles of a normal distribution. If the two variables are identically distributed (i.e. if the residuals are normally distributed), the Q-Q plot is a line with slope 1 passing through the origin. If the two variable are linearly related, the Q-Q plot is still a line, but its slope and intercept may change.

Any deviation from the linear behaviour is an indication that the normality assumption on the residuals (hence the linearity assumption on the model) may not hold.

Q-Q plots can be directly generated using the relevant function from the statsmodel library. For the first residuals, this looks like this

with plt.style.context(('seaborn-v0_8')):

fig, ax = plt.subplots(figsize=(5, 5))

sm.qqplot(residuals[0], fit=True, line='r', ax=ax)

ax.set_title("TOC in Soil")

plt.show()

and similarly for the second residuals. Here’s the result

Now, it’s easy to visualise that, in the first case, the deviation from linear behaviour is quite significant at the lower end of the distribution. In the second case, the distribution is closer to linear, even though there may be a deviation in the range from 0 to 0.5.

Scale-location plots

This brings us to the last diagnostic plots we are covering today. The scale-location plot is defined as the scatter plot of the predictions versus the standardised residuals, and it is used primarily to test for heteroscedasticity issues.

To produce this plot, we first need to calculated the standardised residuals. This is obtained by dividing the residuals by an estimate of their standard deviation. This concept is however a bit tricky, and you can find any number of incorrect estimations out there.

The incorrect estimation would be to take the residuals array and divide it by its standard deviation. This is incorrect!

The correct calculation is based on the fact that each residual is interpreted (independently from any other residual) as a particular random sample extracted from a statistical distribution (which we don’t know, in general). The point here is that we seek an estimate of the standard deviation of this (unknown) distribution, and we use this value to normalise each residual.

The correct estimation of the standardised residuals $t_i$ is

t_i = \frac{ e_i }{\sigma \sqrt{1-h_i}}

where $h_{i}$ is called the leverage which, for linear regression, can be calculated as (see page 99 of Dunn & Smyth)

$h_i = \frac{1}{n} + \frac{\left(e_i - \bar{e}\right)^2}{ss_e}$ .

Finally $ss_{e}$ is the sum of squares total of the residuals array,

$ss_e = \sum_{i=1}^{N} \left(e_i - \bar{e}\right)^2$ ,

with $\bar{e}$ being the average of the residual array.

Here’s the code to calculate the standardised residuals for the two datasets:

leverage = []

standardised_residuals = []

for i in range(2):

sse = np.sum( (residuals[i] - np.mean(residuals[i]))**2 )

lvrg = 1./residuals[i].shape[0] + (residuals[i] - np.mean(residuals[i]))**2 / sse

leverage.append(lvrg)

std_res = residuals[i] / (np.std(residuals[i]) * np.sqrt(1-lvrg) )

standardised_residuals.append(std_res)

And here’s the code to generate the scale-location plot:

with plt.style.context(('seaborn-v0_8')):

fig, ax = plt.subplots(1, 2, figsize=(10,5))

for i in range(2):

ax[i].scatter(y_cv[i], np.sqrt(np.abs(standardised_residuals[i])))

ax[i].set_xlabel('Predicted values')

ax[i].set_ylabel('Residuals')

ax[i].set_title(dataset_name[i])

fig.suptitle('Scale-location plot', fontsize=16)

plt.show()

In the scatter plot on the right, the standardised residuals seems to have a fairly uniform spread, perhaps with the exclusion of the points around 20 brix. On the other hand, the plot on the left shows that the spread of the standardised residuals is larger in the middle of the range, compared to the extremes. In addition, the larger spread is mostly due to some outlying data points with standardise residuals above 1.25.

From this analysis, one may conclude that the Soil Carbon dataset contains at least two groups. While the linearity assumption may hold for each individual group, when they are put together this is no longer true. Therefore, a single regression model may not be the best choice for this specific dataset.

This concludes my brief survey on diagnostic plots on real datasets. I leave you with a few reading suggestions, and until next time.

Daniel

Credits

Cover photo by Egor Kamelev on Pexels

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Residual plot

Q-Q plots

Scale-location plots

See also

Credits

About The Author

Daniel Pelliccia

Residual plot

Q-Q plots

Scale-location plots

See also

Credits

About The Author

Daniel Pelliccia

Related Posts