Robust PCA

Robust PCA  (PCA = Principal Component Analysis) refers to an implementation of the PCA algorithm that is robust against outliers in the dataset.

We have discussed methods to detect and remove outliers in spectral data using the Mahalanobis distance or the PLS decomposition. Here we discuss an example of neglecting outliers, using a robust implementation of a known algorithm such as PCA.

In writing this post, I’ve been inspired by Chapter 3 of Varmuza & Filzmoser, Introduction to Multivariate Statistical Analysis in Chemometrics (2008).

Back to basics: the PCA algorithm

If you check the Wikipedia page on PCA or Section 3.6 of Varmuza & Filzmoser (2008), you’ll learn that PCA decomposition of a dataset X can be calculated in two ways:

Incidentally, the first approach is explained in detailed in the very first post on this blog (see also our post on PCA and Kernel PCA). On the contrary, the scikit-learn implementation of the PCA algorithm uses the singular value decomposition approach, which is faster and better behaved, especially for larger datasets. The presence of outliers however can produce deviations in the calculated components that cannot be easily eliminated.

The approach based on the covariance matrix can help with that, since one can implement a robust estimation of the covariance matrix which is less sensitive to outliers.

In the next section, we will describe an example of a robust PCA implementation using existing functions in scikit-learn in Python.

Robust PCA using scikit-learn in Python

Let’s begin with the relevant imports

and import the data from our GitHub repository. The data consists of NIR spectra of milk.

Now we create a copy of the dataset and we modify a single spectrum (the first in the array) to turn it into an outlier

The idea here is that, if even a single spectrum in a dataset is significantly different from the rest, this can introduce measurable differences in the PCA decomposition.

To verify this statement, let’s build two identical pipelines, consisting of mean-centring and PCA decomposition. We fit the first pipeline to the original data, and the second pipeline to the data with one outlier

Let’s compare the first components of each decomposition

The two components are markedly different. That single outlier is enough to distort the PCA calculation.

We can circumvent this issue by calculating the eigenvectors of the covariance matrix. The original calculation of the PCA with this approach is described in this post. Now we replace the conventional estimation of the covariance matrix, with a robust estimator, available in scikit-learn. Here’s the code

Let’s compare the components now.

As anticipated, the first component, calculated with the robust version of the PCA algorithm, is essentially identical to the original first component, the one calculated without outliers. The robust implementation of the covariance matrix is unaffected by the outliers, so that in turns the PCA decomposition is unaffected by the outliers.

A caveat, and a possible workaround

As you have probably noticed, in the code above we centred the data but we avoided the actual scaling, i.e. we didn’t normalise the data by the standard deviation. The reason is that the estimation of the standard deviation itself is not robust. One large outlier can easily distort it. You can easily verify that, if you run the code above by setting with_std=False , the result will not be correct.

If one needs to calculate the standard deviation in a robust way, the solution is again the MinCovDet function used above. The variance of a dataset is the diagonal of the covariance matrix. Therefore, we can go for a robust estimation of the covariance matrix, extract the variance and take the square root of it to get the standard deviation. Here’s the comparison

The two estimations are identical, which means we could use the robust standard deviation estimated through the MinCovDet to standardise the data.

Note that, the simple application of the Robust Scaler (which sounds like the one we would need) does not actually work in this case, as its calculation of the standard deviation is still affected by the large outlier.

As always, thanks for reading until the end and until next time.