The Akaike Information Criterion for model selection

Model selection is a topic we discuss in many examples on this blog. It consists in choosing the optimal processing pipeline and/or selecting wavelength bands which, once selected (or discarded), improve the accuracy of a model. The accuracy, in turn, is generally quantified by defining a cost function – for instance the RMSE – and applying a cross-validation procedure (or, if you have enough data, a train-test-validation split) to minimise it. In other words, we are assuming that the best model is obtained by discarding those wavelengths whose removal minimises the RMSE.

One potential problem with this type of procedure is that it compares different models based only on their performance on the test data set, and not on their complexity. The Akaike Information Criterion (AIC) is an alternative procedure for model selection that weighs model performance and complexity in a single metric.

In this post we are going to discuss the basics of the Akaike information criterion and apply it to a principal component regression (PCR) problem.

Variable selection and model comparison

Before getting into the weeds of the information criteria, let’s make sure we understand the conceptual argument at their roots. You might have noticed that in the introduction I somehow managed to use the terms “variable selection” and “model selection” almost interchangeably, which is a bit inaccurate and needs to be better qualified.

What we really do when we run a variable selection method is to build individual models for every choice of variables. For instance, we discard one wavelength band at a time and build a (brand new) PLS model with the remaining bands. To establish the effect of the wavelength band in question, we compare the models with and without it. If, by removing that band, the model improves (its RMSE in cross-validation decreases, for instance), we happily discard that band. Therefore, to select variables we need to compare models based on different selections of variables; in this sense variable selection is a form of model selection, and we can use the two terms almost interchangeably. The term “model selection” is more appropriate, however, when we are comparing models on the basis of some other parameters (such as pre-processing pipelines, number of components, etc.).

With this clarification out of the way, we are finally ready to talk about the information criteria and their meaning.

The Akaike information criterion

The Akaike information criterion (from now on AIC for short) was first proposed by the Japanese statistician Hirotugu Akaike in the early 1970s (according to Akaike, as quoted here, the acronym AIC stands simply for “an information criterion”).

The AIC addresses the problem of finding the optimal model given the data and a collection of alternative models. In other words, the AIC can score the models in the set relative to each other. The preferred model, given the data, is the one that minimises the AIC. For least squares regression models, which are the ones we are interested in here, the formula for the AIC is
AIC=n \log(\sigma^2) + 2k
where \sigma^2 is the MSE (mean square error), n is the total number of samples and k is the total number of parameters estimated in the model.
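
As a quick illustration with made-up numbers (not taken from any real data set): for n=50 samples, an MSE of \sigma^2 = 0.25 and k=6 parameters we would get
AIC = 50 \log(0.25) + 2 \times 6 \approx -69.3 + 12 \approx -57.3
Note that a negative AIC is perfectly fine: as we discuss below, only differences between AIC values matter.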

As an aside, for more information and a formal derivation of the formulas discussed here, see the book Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach by K.P. Burnham and D.R. Anderson (link at the end).

As you may have noticed, the first term in the AIC decreases as the MSE decreases: it is a measure of the model accuracy, and more accurate models tend to lower the AIC, which is what we’re after. There is however a second term in the formula, which depends on the total number of parameters estimated by the model. Using too many parameters (which, as you know, leads to overfitting) has the opposite effect of increasing the AIC. In other words, the total number of parameters acts as a penalty, so that the AIC penalises models with too many parameters. The AIC metric therefore tries to strike a balance between model accuracy (low MSE) and model parsimony (a low number of parameters).

Please be reminded however that the AIC makes no claim as to whether a model correctly describes the data or not. The AIC can only compare different alternative models. This observation has one important practical implication: for any given data set, the absolute value of the AIC is immaterial. The only thing that counts is the comparison (or the difference) between the AIC values of different models.

Small data sets

One possible shortcoming of the AIC (see the Model Selection and Multimodel Inference book) is that it may perform poorly if the number of estimated parameters is large compared to the sample size. Fortunately for us, a second-order version of the AIC, valid for small sample sizes, has also been derived. It is called the AICc:

AIC_c=AIC + \frac{2k (k+1)}{n-k-1}

Note how the second term in the formula above tends to vanish when n \gg k, so that the AICc tends to the AIC in these conditions and all is well. These arguments set the scene for the last key concept we want to discuss here: using the AIC to perform model selection.

Number of components optimisation in PCR with the Akaike information criterion

From what we have discussed, it’s clear that we should be able to use the AIC in place of (or together with) the RMSE as the metric to minimise when performing model selection. The only quantity in the AIC formula that is specific to PCR is the number of parameters k. In PCR, k is the number of principal components plus one, the extra parameter being the intercept (if we decide to fit it).

OK, with this in mind, we are able to write some code. We start with the relevant imports
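
A minimal sketch of what those imports might look like, assuming we rely on numpy, pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error
```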

Now we define the AIC function. For the sake of comparison, we calculate both AIC and AICc.
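
A possible implementation, following the two formulas above (the MSE computed by scikit-learn plays the role of \sigma^2):

```python
def aic(y, y_pred, k):
    """Return (AIC, AICc) for a least-squares model.

    y      : measured values
    y_pred : predicted values (e.g. from cross-validation)
    k      : number of parameters estimated by the model
    """
    n = y.shape[0]
    mse = mean_squared_error(y, y_pred)                        # sigma^2 in the AIC formula
    aic_score = n * np.log(mse) + 2 * k                        # AIC = n log(sigma^2) + 2k
    aicc_score = aic_score + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    return aic_score, aicc_score
```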

Let’s load a data set from our GitHub repo and assign spectra and primary values to X and y respectively
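
Something along these lines should work. Note that the URL below is only a placeholder for the actual file in the repo, and we assume the CSV has the primary values in the first column and the spectra in the remaining columns:

```python
# Placeholder path: substitute the actual CSV file from the GitHub repo.
url = "https://raw.githubusercontent.com/<user>/<repo>/master/data/nir_data.csv"
data = pd.read_csv(url)

y = data.values[:, 0]     # primary (reference) values
X = data.values[:, 1:]    # NIR spectra
```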

These are 50 NIR scans, so that n=50 in the AIC formula. We are now ready to build two different PCR models on the data. In this simple example, we are just changing the number of principal components and running a cross-validation procedure
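
A sketch of what this might look like, using a small hypothetical helper pcr_cv_predict that regresses y onto the PCA scores (for simplicity the PCA is fitted on the whole data set):

```python
def pcr_cv_predict(X, y, n_components, cv=10):
    """Cross-validated predictions for a simple PCR model (PCA scores + linear regression)."""
    scores = PCA(n_components=n_components).fit_transform(X)
    return cross_val_predict(LinearRegression(), scores, y, cv=cv)

# Two PCR models with a different number of principal components
for n_comp in (5, 20):
    y_cv = pcr_cv_predict(X, y, n_comp)
    k = n_comp + 1                          # principal components + intercept
    aic_score, aicc_score = aic(y, y_cv, k)
    print(f"PCs: {n_comp}, AIC: {aic_score:.2f}, AICc: {aicc_score:.2f}")
```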

In my case, the script prints

Let’s look at these values and understand their meaning. In the first model we use 5 principal components, therefore k=6 and n/k \approx 8. In the second case, k=21 and n/k \approx 2.4. When the ratio n/k is large enough, the AICc is not too different from the AIC. In the second case, however, the ratio is fairly small, so there is a larger discrepancy between AIC and AICc. In such cases it’s recommended to use the AICc. (As another aside, the AICc formula depends on some underlying assumptions about the statistical nature of the data, and different formulas have been derived for different types of model. More on that in the book in Ref. 1.)

Another observation is that, as expected, the AIC for the larger model (the one with more PCs) is larger. On this basis alone, we could use the AIC to optimise the number of principal components in a PCR model. A simple script, containing a loop over the number of principal components, would be this one:
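
Here is a sketch of that loop, reusing the aic and pcr_cv_predict functions defined above:

```python
# Compute AIC and AICc for PCR models with an increasing number of components.
# The preferred model is the one with the lowest (corrected) AIC.
results = []
for n_comp in range(1, 21):
    y_cv = pcr_cv_predict(X, y, n_comp)
    k = n_comp + 1                          # components + intercept
    aic_score, aicc_score = aic(y, y_cv, k)
    results.append((n_comp, aic_score, aicc_score))
    print(f"PCs: {n_comp:2d}  AIC: {aic_score:8.2f}  AICc: {aicc_score:8.2f}")

best = min(results, key=lambda r: r[2])     # minimise the AICc
print(f"Optimal number of principal components (by AICc): {best[0]}")
```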

And here’s the result

This procedure informs us that the optimal number of principal components is 5. With fewer components than that, the model is underfitting, which means that the MSE term in the AIC formula is dominant and large. Above 5 components, the k term (number of parameters) in the AIC formula becomes dominant: the model starts to overfit. Also, as noted before, AIC and AICc tend to diverge as the number of components increases.

In conclusion, the AIC can be used as an additional metric to evaluate the quality and robustness of your prediction model. In particular, it can be used for variable selection, and I plan to work through an example of wavelength selection with the AIC in a future post.

As usual, thanks for reading and until next time,
Daniel

P.S. I initially wrote this post showing an example application in PLS regression, where the AIC/AICc was (erroneously) calculated using the number of latent variables in PLS. Soon after, I was made aware that the PLS case is actually more complicated, as explained in Ref [4] below. I am indebted to Dr Pierre Dardenne for bringing this to my attention.

References

  1. Kenneth P. Burnham and David R. Anderson, Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer (2002).
  2. A discussion of the AIC for model selection in PLS is in B. Li, J. Morris, and E.B. Martin. Model selection for partial least squares regression. Chemometrics and Intelligent Laboratory Systems 64, 79-89 (2002).
  3. For another example of AIC in linear regression — and a comparison with other information criteria — see this post.
  4. AIC applied to PLS is a bit more complicated: Matthieu Lesnoff, Jean-Michel Roger, and Douglas N. Rutledge, Monte Carlo methods for estimating Mallows’s Cp and AIC criteria for PLSR models. Illustration on agronomic spectroscopic NIR data. Journal of Chemometrics e3369 (2021).