Training a chemometrics (or machine learning) model involves dividing a dataset into at least two groups: a calibration subset and a validation subset. The calibration set is used to fit the model, while the validation set is used to check that the trained model works as intended on a set of unknown data. The split into calibration and validation can be done randomly or according to some predefined criterion. The Kennard-Stone algorithm is one such criteria.

In this post, I’ll introduce the basics of the Kennard-Stone (KS) algorithm and use a Python implementation of it on a real dataset. Let’s get started!

UPDATE: Since the publication of this post, I was made aware of a new Python library with a much faster implementation of the Kennard-Stone algorithm. I’ve added a section at the end to discuss the new implementation.

## The principle of the Kennard-Stone algorithm

Let’s introduce the KS algorithm as we might apply it to a spectroscopy experiment. The method is more general (it can be applied to data of any kind of data, in a generic machine learning model development) but we’ll ignore other cases to keep with the spirit of this blog.

Suppose you have collected N spectra and you want to select a subset n < N spectra for the calibration set. The KS algorithm provides a principled way to make this choice.

The **first principle** is that the spectra are assigned to the test set sequentially.

The **second principle** is that we require the n spectra to be “uniformly” spaced over the entire dataset, including having spectra at the boundary of this dataset. The term “uniformly spaced” requires defining the notion of a **distance between spectra**, so that the principle of uniformly spaced spectra is actually meaningful.

### Defining the distance metric

The distance metric chosen in the KS algorithm is the Euclidean distance (we talked about Euclidean distance in this blog before, for instance when defining the PCA correlation circle, or in comparison with the Mahalanobis distance for outliers detection). The squared Euclidean distance between the spectrum i and the spectrum j is

D^{2}_{ij} = \sum_{k=1}^{K} \left( x_{ik} - x_{jk} \right)^{2}

where k = 1, ..., K is an index denoting the wavelength band, K is the maximum number of wavelength in our spectra, and the symbol \sum indicates sum over the wavelength bands. The distance is usually squared so that we will be dealing with positive quantities and the order in which we chose the spectra doesn’t matter (in other words D^{2}_{ij} = D^{2}_{ji}).

### Kennard-Stone algorithm sequence

Now that the important definitions are explained, we are ready to describe the algorithm sequence.

- The first step is to choose two starting point for the test set. This is accomplished by choosing the two spectra which have the largest distance between them (out of all possible pairs of spectra). Note that this step ensures we choose spectra at the boundary of the dataset.
- Any subsequent spectrum is added by computing the distance of a candidate spectrum for the spectra already selected, and requiring this distance to be the largest. In other words the next spectrum in the test set is the one that has the largest separation from the spectra already selected. This is the part that is usually a bit complicated to explain: the distance of a spectrum from a set of spectra is the minimal distance from each of the spectra in the set. Therefore the next spectrum is chosen as the one whose minimal distance from the already-chosen spectra is the largest available.

This latest point is probably better explained with a diagram.

Suppose the two red points have been already selected in the test set. The next point to be added has to be chosen between two candidate points: A and B. For each of the candidate points, we calculate the minimal distance between the point and the test set. The two minimal distances are marked with an arrow. Out of the two, we chose the largest, which in this case is the one marked in red. Therefore the next point to be added to the test set is A.

The sequential process of adding point is stopped when we reach a set number of points (i.e. a specified size of the test set) decided beforehand.

This is the gist of the Kennard-Stone algorithm.

## Python implementation of the Kennard-Stone algorithm

For the Python implementation of the Kennard-Stone algorithm, we turn to an existing, open-source library called (you guessed it) kennard-stone, also available at the Github repository of **yu9824**. This library is extremely handy, as its syntax is self-explaining if you know scikit-learn. The data used for this example are taken from the paper by Ji Wenjun *et al*. linked here, and the associated data is available here. It consists of NIR spectra of soils with associated Total Organic Carbon (TOC) labels.

Let’s start with the imports.

1 2 3 4 5 6 |
import numpy as np import matplotlib.pyplot as plt import pandas as pd import kennard_stone as ks from sklearn.cross_decomposition import PLSRegression from sklearn.model_selection import cross_validate |

Now, let’s load the data and extract the quantities of interest, i.e. spectra and primary values.

1 2 3 4 5 |
data = pd.read_excel("../../data/File_S1.xlsx") X = data.values[1:,8:].astype('float32') #spectra wl = np.linspace(350, 2500, X.shape[1]) # wavelengths y = data["TOC"].values[1:].astype("float32") #labels |

### Train & Test split

Splitting the data into train and test set is now as easy as

1 |
X_train, X_test, y_train, y_test = ks.train_test_split(X, y, test_size = 0.2) |

We can check the result by plotting how spectra and primary values have been divided by the KS method. Here’s the code.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
with plt.style.context(('seaborn')): f, axs = plt.subplots(2,1,figsize=(8,10)) n, bins, patches = axs[0].hist(y_train, bins = 20, alpha=0.75, label="Training set") axs[0].hist(y_test, bins = bins, alpha=0.75, label="Test set") axs[0].legend() axs[0].set_xlabel("TOC content") axs[0].set_ylabel("Number of samples") train_legend = np.repeat("_Training set", X_train.shape[0]) train_legend[0] = "Training set" test_legend = np.repeat("_Test set", X_test.shape[0]) test_legend[0] = "Test set" axs[1].plot(wl, X_train.T, 'b', alpha=0.5, label = train_legend) axs[1].plot(wl, X_test.T, 'g', alpha=0.5, label = test_legend) axs[1].legend() axs[1].set_xlabel("Wavelength (nm)") axs[1].set_ylabel("NIR reflectance") plt.tight_layout() plt.show() |

And here’s the result.

Note how, in this implementation, the KS algorithm has acted on the independent variable y to generate a test set that is uniformly distributed across the parameter space, including points at the boundary.

If you want to divide the data on the basis of the spectra, you’ll have to invert the order of the variables as

1 |
y_train, y_test, X_train, X_test = ks.train_test_split(y.reshape(-1, 1), X ,test_size = 0.2) |

where the reshape operation is required as the variable y (interpreted in this case as predictor) has a single feature.

### Cross-validation

The Kennard-Stone algorithm can be used to generate a KFold cross-validation. For instance, here’s how we can print the number of folds (check our post on K-fold, Montecarlo and Bootstrap methods for more info)

1 2 3 |
kf = ks.KFold(n_splits=5) for train, test in kf.split(X): print("Train:", y[train]," Test:", y[test]) |

The method can be used in cross validation, for instance, here’s a minimal implementation of a PLS regression algorithm.

1 2 |
pls = PLSRegression(n_components=10) print(cross_validate(pls, X, y, scoring = 'neg_mean_squared_error', cv = kf)) |

## A faster implementation of the Kennard-Stone algorithm

Recently, Jackson Burns (currently Chem Eng PhD student at MIT) alerted me of a new library called astartes, developed by him and coworkers as a drop-in replacement for the test_train_split routine of scikit-learn. Astartes implements a much faster version of the Kennard-Stone algorithm. The library can be installed using pip. The basic code looks familiar

1 2 |
import astartes as at X_train, X_test, y_train, y_test = at.train_test_split(X, y ,test_size = 0.2) |

With the new library, I get a massive 430x speed-up compared to the K-S algorithm as described above. Given the iterative nature of the K-S algorithm, this is a massive improvement which is going to make the K-S algorithm very applicable to larger datasets.

## A final observation and wrap up

You might have noticed that, unlike other sampling strategies based on randomness, the KS methods don’t have a “random_state” option. For what we have explained above, the Kennard-Stone sampling is completely determined by the dataset and, for all intents and purposes, unique. I say “for all intents and purposes” because the KS split may not be unique in the mathematical sense (there may be two or more spectra with the same distance from the selected set), but noise and other vagaries make sure that it is unique in practice.

To conclude, the Kennard-Stone algorithm is a way to perform a split between training and test set based on a distance metric between data points, spectra or labels. It can be used if one requires to have a test set that is roughly uniformly spaced and therefore consistent with the training set.

That’s it for today and thanks for reading!