Principal component selection with a greedy algorithm

Feature selection is one of those topics that come up over and over again when dealing with chemometrics for spectroscopy – and rightly so. In many cases “feature” selection amounts to selecting wavelengths (or wavelength bands) prior to the regression, to boost the performance of the model. The interesting question is how to search for the best features. Today we discuss a widely used class of algorithms for random search optimisation, called greedy algorithms. To make things interesting, we are going to discuss a kind of feature selection that you might not see very often: principal component selection. The basic idea is to calculate principal components in the usual way, then use a greedy algorithm to find the combination of principal components that yields the best principal component regression model.

For those interested in the topic, we discussed wavelength selection in two worked examples in previous posts.

Unlike our previous forays into the topic of feature selection, in this and a subsequent post we’d like to take a more general look at algorithmic ways to select features. Feature selection algorithms are a special case of optimisation algorithms based on a (more or less) random search over the feature space (technically, a ‘biased’ random walk). It therefore makes sense to discuss in some detail the basic algorithms used to run such a random search so as to optimise the performance of a model.

Everything will become clearer as we progress, but for now let’s define what we mean:

  1. Optimising the performance of a model means minimising an aptly chosen metric (also called a cost, or loss, function). In regression problems the Root Mean Squared Error (RMSE) in prediction or cross-validation is usually a valid metric (the RMSE is written out just after this list).
  2. An optimisation algorithm is a search algorithm over the possible combinations of features that, when used in a regression problem, minimises the cost function.
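
For reference, with $N$ samples, measured values $y_i$ and cross-validation (or test) predictions $\hat{y}_i$, the RMSE is defined as

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} $$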

A straightforward example, which we discussed in a previous post, is to exclude those wavelengths that are poorly correlated with the parameter of interest, thus increasing the overall performance of the model. The concept is however a lot more general. In fact, we are going to show an example of an optimisation algorithm used not on the primary wavelengths, but on principal components derived from those wavelengths. But let’s not rush ahead: we’ll go step by step.

The plan for this post is as follows:

    1. We’ll give a gentle but general introduction to the concept of optimisation problems and introduce a basic class of random search algorithms known as greedy algorithms
    2. We’ll talk specifically of how we can implement a greedy algorithm for principal component selection in NIR spectroscopy
    3. We’ll discuss common pitfalls of greedy algorithms, which will pave the way to a subsequent post on random search algorithms

Optimisation Problems and Greedy Algorithms

A greedy algorithm is a common approach to solving an optimisation problem. An optimisation problem is a very wide notion encompassing all those situations that can be recast as the maximisation or minimisation of some metric, or cost function. If you are looking for the best deal on a pair of sneakers, you are trying to solve an optimisation problem. The problem will look different depending on your choice of cost function: for instance, you may want to find the cheapest sneakers on the market, or, given some constraint such as a specific brand, the best product that satisfies your requirements.

The example we have just described illustrates the two general characteristics of an optimisation problem:

  1. Having a defined function to maximise or minimise. A cost (or loss) function is usually something you want to minimise (in line with the meaning of “cost” as something you pay)
  2. Possibly, but not necessarily, having a set of constraints.

Our prototypical situation is to minimise the RMSE of a regression model in a way that may or may not be subject to constraints. In what follows we assume the optimisation problem is in fact a minimisation problem of a cost function. Everything we’ll discuss can obviously be applied to a maximisation problem just as well, with trivial adjustments.

Very well. Now that the definition of optimisation problem is out of the way, let’s discuss greedy algorithms as a class of approaches to iteratively solve the optimisation problem.

First of all, why is the algorithm called greedy? As bad as it might sound, that is just a technical name describing the defining feature of this approach. A greedy algorithm is an iterative optimisation procedure that aims at decreasing the cost function at every step. It essentially amounts to an iterative process in which, at every step, we randomly change the input parameters and evaluate the cost function. If the cost function has decreased we keep the new parameters, otherwise we reset the parameters to their previous values.
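As a minimal sketch of the idea (not the code we will use for the regression below), a greedy loop can be written as follows. The functions cost and perturb are hypothetical placeholders standing in for a problem-specific cost evaluation and a random change of the parameters.

```python
import numpy as np

def greedy_minimise(cost, perturb, params, n_iter=500, seed=None):
    """Generic greedy minimisation: keep a random change only if it lowers the cost."""
    rng = np.random.default_rng(seed)
    best_cost = cost(params)
    for _ in range(n_iter):
        candidate = perturb(params, rng)   # randomly change the input parameters
        candidate_cost = cost(candidate)   # evaluate the cost function
        if candidate_cost < best_cost:     # keep the change only if the cost decreased
            params, best_cost = candidate, candidate_cost
    return params, best_cost
```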

Before looking at the details of how to write a greedy algorithm in Python, let’s discuss the regression problem at hand.

Components selection in Principal Components Regression

Principal Components Regression (PCR) is a linear regression performed on the principal components of a primary dataset. As we discussed in a dedicated post on the topic, a simple linear regression can’t be applied to spectral data because of multicollinearity problems. That is, the signals corresponding to different wavelengths may be highly correlated, and this creates serious issues for the least squares regression, which is based on matrix inversion.

The solution to this problem is to run the linear regression on suitable components extracted from the primary wavelengths. Principal components are an obvious example. Another example – possibly the most widely used of all – is Partial Least Squares (PLS) regression, where the regression is done on the latent variables.

We discussed the advantages of PLS over PCR in this post. This however is not the end of the story. Developing algorithms based on PCR can afford us some flexibility that is not available with PLS. In conventional PCR we just select the first few principal components, and the number of principal components to retain is decided on the basis of a cross-validation process. Nothing, however, prevents us from picking a different set of principal components, not necessarily in order.

Remember that the principal components are calculated from the spectra only. The primary values (y) do not enter the PCA algorithm. Therefore the leading principal components are not necessarily the ones that correlate best with the primary values. As a consequence, picking and choosing principal components makes the algorithm a lot more flexible and generally improves the result.
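A quick way to convince yourself of this on your own data is to correlate each principal component score with y. The snippet below is a hypothetical helper (it assumes a spectra array X and reference values y are already loaded); the ranking of the components by explained variance rarely matches their ranking by correlation with y.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pc_correlations(X, y, n_components=10):
    """Correlation of each principal component score with the reference values y."""
    scores = PCA(n_components=n_components).fit_transform(StandardScaler().fit_transform(X))
    return np.array([np.corrcoef(scores[:, i], y)[0, 1] for i in range(n_components)])
```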

That’s all good. The next question, however, is: how do we pick the principal components that give the best regression performance? Here’s where we can use an optimisation algorithm based on a random search.

Time to write some code.

Baseline PCR

We are going to use NIR spectra from fresh peaches and corresponding brix values. The dataset is available for download at our Github repo.

We will use two basic functions for PCR called (of course) pcr and pcr_optimise_components. Similar functions have been discussed in our PCR post.

The function pcr is a simple cross-validated PCR with a set number of principal components. It makes use of the cross_val_predict function of scikit-learn to return the predicted values and a few metrics.
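The original function is not reproduced here; a minimal sketch consistent with that description (assuming the PCA scores matrix Xpc has already been computed) could look like this.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

def pcr(Xpc, y, pc, cv=10):
    """Cross-validated PCR on the first `pc` columns of the PCA scores matrix Xpc."""
    y_cv = cross_val_predict(LinearRegression(), Xpc[:, :pc], y, cv=cv)
    rmse = np.sqrt(mean_squared_error(y, y_cv))
    return y_cv, r2_score(y, y_cv), rmse
```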

The function pcr_optimise_components is instead used to find the optimal number of principal components by minimising the RMSE in cross-validation. This is done by adding principal components one by one and calculating the corresponding RMSE.
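Again as a sketch rather than the original code, and reusing the pcr function above, pcr_optimise_components could be written like this.

```python
import numpy as np

def pcr_optimise_components(Xpc, y, max_pc):
    """Add components one by one and return the number minimising the cross-validation RMSE."""
    rmse = np.zeros(max_pc)
    for i in range(1, max_pc + 1):
        _, _, rmse[i - 1] = pcr(Xpc, y, i)
    opt_pc = np.argmin(rmse) + 1   # optimal number of components
    return opt_pc, rmse[opt_pc - 1]
```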

The following code is a basic PCR regression making use of the two functions above.
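The original snippet is not reproduced here; the sketch below follows the same logic, reusing the two function sketches above, with the file name and column layout of the peach dataset assumed (adapt them to the actual csv in the repo).

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the data (file name and column name are assumptions)
data = pd.read_csv("peach_spectra_brix.csv")
y = data["Brix"].values
X = data.drop("Brix", axis=1).values

# PCA decomposition of the standardised spectra
n_components = 30
Xpc = PCA(n_components=n_components).fit_transform(StandardScaler().fit_transform(X))

# Baseline PCR: optimise the number of leading components by cross-validation
opt_pc, _ = pcr_optimise_components(Xpc, y, n_components)
y_cv, r2, rmse = pcr(Xpc, y, opt_pc)
print("Components: %d, R2: %.3f, RMSE: %.3f" % (opt_pc, r2, rmse))

# Scatter plot of the cross-validation predictions versus the measured values
plt.scatter(y, y_cv, s=8)
plt.plot([y.min(), y.max()], [y.min(), y.max()], "r--")
plt.xlabel("Measured Brix")
plt.ylabel("Predicted Brix")
plt.show()
```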

The code prints the optimal number of components and the corresponding cross-validation metrics, and produces a scatter plot of the predicted versus measured Brix values.

Principal component selection with a greedy algorithm

The result of the conventional PCR is not exciting. The simple algorithm written in the pcr_optimise_components function does not allow for a free selection of principal components. It only allows selecting the first n principal components in the order they are produced by PCA.

Suppose instead we’d like to find the optimal choice of 15 principal components out of 30. Optimal here means the combination of principal components that minimises the RMSE in cross-validation. Recasting this problem as a greedy optimisation can be done in the following way:

  1. Start with a random selection of 15 principal components out of 30
  2. Create two lists: one with the selected principal components and one with the excluded principal components
  3. Use pcr_optimise_components to find the best RMSE for this configuration
  4. Swap a random element of the first list with a random element of the second list
  5. Find the best RMSE in the new configuration
  6. If the new RMSE is smaller than the previous, keep the new configuration
  7. Repeat 4-6 a few hundred times or until convergence

Here is the Python code implementing the algorithm above. Note that this code assumes the PCA decomposition has already been done (see the previous code snippet).
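The original code is not shown here; the following sketch implements steps 1-7, reusing the Xpc scores, the variable n_components and the pcr_optimise_components sketch from above (the seed and the number of iterations are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n_sel = 15

# Steps 1-2: random split of the components into a selected and an excluded list
perm = [int(i) for i in rng.permutation(n_components)]
selected, excluded = perm[:n_sel], perm[n_sel:]

# Step 3: best RMSE of the starting configuration
_, best_rmse = pcr_optimise_components(Xpc[:, selected], y, n_sel)

for _ in range(500):
    # Step 4: swap a random selected component with a random excluded one
    i, j = rng.integers(n_sel), rng.integers(len(excluded))
    new_selected, new_excluded = selected.copy(), excluded.copy()
    new_selected[i], new_excluded[j] = excluded[j], selected[i]

    # Step 5: best RMSE of the new configuration
    _, new_rmse = pcr_optimise_components(Xpc[:, new_selected], y, n_sel)

    # Step 6: keep the new configuration only if the RMSE decreased
    if new_rmse < best_rmse:
        selected, excluded, best_rmse = new_selected, new_excluded, new_rmse

print("Selected components:", sorted(selected), "- best RMSE: %.3f" % best_rmse)
```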

After running the iterative optimisation above, we can use the following code to evaluate the result of the PCR regression with the selected principal components.
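Again this is a sketch rather than the original code, reusing the selected list from the greedy loop and the two PCR helpers sketched earlier.

```python
import matplotlib.pyplot as plt

# Evaluate a PCR model restricted to the greedily selected principal components
opt_pc_sel, _ = pcr_optimise_components(Xpc[:, selected], y, n_sel)
y_cv_sel, r2_sel, rmse_sel = pcr(Xpc[:, selected], y, opt_pc_sel)
print("Components used: %d, R2: %.3f, RMSE: %.3f" % (opt_pc_sel, r2_sel, rmse_sel))

# Scatter plot of the optimised model
plt.scatter(y, y_cv_sel, s=8)
plt.plot([y.min(), y.max()], [y.min(), y.max()], "r--")
plt.xlabel("Measured Brix")
plt.ylabel("Predicted Brix")
plt.show()
```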

This prints the cross-validation metrics of the optimised model and plots the corresponding predicted versus measured values.

Note that your result may differ from this one, since each run of the algorithm starts from a different random selection of principal components. In fact, I have decided to use 500 repetitions to make sure that different runs produce reasonably similar results.

As you can see, the improvement is noticeable. We’ve got a nearly 20% improvement in the RMSE, and this result is achieved using a smaller number of principal components overall.

So this is certainly a very good result. The next question however is: can we do any better? In other words, how do we know if the result found by a greedy algorithm is optimal? (Hint: we don’t). This leads us to the last section of this post.

Properties and pitfalls of greedy algorithms

As mentioned above, a greedy algorithm is such that only changes that decrease the cost function are retained. Since we generally don’t know the cost function landscape, we are never sure that the solution found by a greedy algorithm is optimal.

This is best explained using a visual example. Suppose you have a function like the one plotted below, which has two minima. The lowest one is what we call a ‘global minimum’, i.e. the overall minimum point of the function. There is, however, also a ‘local minimum’, which is a point that sits below all of its neighbours within a defined distance.

[Figure: greedy algorithm – simple parameter space visualisation, showing a cost function with a global and a local minimum.]

Now, suppose we would like to find the minimum of this function using a greedy algorithm. What we would do is choose a random starting point and then move one step at a time in a random direction. At each point we evaluate the function: if we find a lower value we stay in the new position, otherwise we step back to the previous position (this is the sense in which the random walk is ‘biased’).
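As a toy illustration of this biased random walk (the double-well function and the step size below are arbitrary choices), the following snippet ends up in one of the two minima depending on where it happens to start.

```python
import numpy as np

def f(x):
    """A double-well function: global minimum near x = -1, local minimum near x = +1."""
    return x**4 - 2*x**2 + 0.3*x

rng = np.random.default_rng()
x = rng.uniform(-2, 2)                 # random starting point
for _ in range(1000):
    x_new = x + rng.normal(scale=0.1)  # one step in a random direction
    if f(x_new) < f(x):                # greedy rule: accept only downhill moves
        x = x_new
print("Greedy walk ended at x = %.2f, f(x) = %.2f" % (x, f(x)))
```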

This process may well find the global minimum, but chances are it gets stuck in the secondary minimum instead. It is conceivable that the random starting point happens to be close to the secondary minimum, and the greedy algorithm will take us all the way down to the local minimum. Once there, there is no chance of getting out, as any step will yield a higher function value and will thus be rejected.

Remember: the algorithm doesn’t know the specific shape of the function, so it blindly takes steps trying to minimise its value. Therefore, we are never sure the solution found by the greedy algorithm is optimal.

In this simple example we are using a function of two variables only, which allows us to explain the problem visually. In regression problems such as the one discussed above, the RMSE is a function of a large number of principal components. Given the complexity of that function, there will likely be a large number of local minima (not to mention the presence of noise). As a consequence, each run of the algorithm will likely produce a different result, and there is no guarantee that the result is the global minimum. Basically, we can hope to have found the global minimum, or a place sufficiently close to it, allowing for the unavoidable noise in the data.

The take-home message however is that we can’t be certain of hitting a global minimum using a greedy algorithm. In the next post we’ll discuss a modification of the basic greedy approach which enables jumping out of local minima.

Thanks for reading and until next time,

Daniel