Suppose you measure a spectrum from a sample containing multiple components and you want to extract the contribution to the spectrum from each component. Multivariate Curve Resolution (MCR) is a group of techniques that were developed to address this problem.

The multi-component sample is often called a mixture, and the idea behind MCR is that one has limited or no information about the individual components, except for what’s in the measured spectrum. As a terminology note, the MCR class of techniques is also known as *self-modelling mixture analysis*, which is perhaps more self-explanatory.

As usual for most of the topics covered in this blog, the literature on the subject is rich and a few references are given at the end of the post. Our aim here is to provide a practical understanding of the problem and a few ideas on how to approach its solution. We’ll use a toy dataset made of synthetic mixtures of Raman spectra of minerals, as a way to play with the code and use the known reference materials to guide us in the learning process.

In the process, we’ll apply concepts that were already covered in this blog, such as the gradient descent and the Kennard-Stone algorithm, and learn a few new ones.

I hope that at the end of this post you’ll be able to appreciate the issue of mixture analysis and have a few ideas that could be applied to your real-world spectroscopy data.

## The spectra of synthetic mixtures

The data used in this post is available at our GitHub repository. The spectral mixtures were generated from data available on the website of the RRUFF Project, which contains Raman spectra of a large number of minerals.

We chose three minerals (Quartz, Orthoclase, and Albite) and wrote a script that generates a set of mixture spectra with random concentrations of these three pure components. The only constraint is that the (normalised) concentrations must add up to one. The script to generate the synthetic spectra is for now only available to our amazing Patreon supporters.

The script generates a CSV file (which is the one available on GitHub) of 300 random mixes and adds the three pure spectra at the end, for a total of 303 spectra. Let’s take a look at it:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load data
data = pd.read_csv("Raman_spectra_mixtures.csv")

# Read the concentrations arrays
conc = data[["Albite", "Quartz", "Orthoclase"]].values[:-3, :]

# Read spectra arrays of the mixtures
spectra = data.values[:-3, 4:1004]

# Read pure spectra
pure_spectra = data.values[-3:, 4:1004]

# Read wavenumbers
wn = data.keys()[4:1004].astype('float32')

with plt.style.context(('seaborn-whitegrid')):
    fig, ax = plt.subplots(figsize=(8, 5))
    for i, j in enumerate(["Albite", "Quartz", "Orthoclase"]):
        plt.plot(wn, pure_spectra[i, :], lw=2, label=j)
    plt.xlabel("Wavenumber (cm$^{-1}$)", fontsize=14)
    plt.ylabel("Raman spectra", fontsize=14)
    plt.legend(fontsize=14)
    plt.show()
```

The three components have substantially different spectra, but with a few overlapping features.

Let’s now work our way towards the MCR problem, by first addressing simpler problems.

- Assuming that we know the concentration of each mixture, can we derive the pure spectra?
- Assuming that we know the pure spectra, can we derive the concentrations?

The full MCR approach then will try to solve the problem of finding both concentrations and pure spectra starting from the spectra of the mixtures alone.

## Finding pure spectra from concentrations

I must admit that this case is a bit contrived, but nevertheless useful to understand the logic behind our approach. Suppose the concentrations are known. To find the pure spectra, we can use an approach that we encountered a few times on this blog, namely the gradient descent. We discussed the gradient descent in some detail in a post describing classification with a perceptron. Feel free to take a look at that post to refresh the concept, before reading on.

The idea is to iteratively attain the right pure spectra by starting from a guess. The guess can be random or a zero array, or any other really. Once the guess is chosen, we can use the known concentration to generate ‘guessed’ mixtures. We can then compare the guessed mixtures with the known mixtures and estimate the error we made.
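In matrix notation, with $X$ the matrix of measured mixture spectra, $C$ the known concentration matrix and $S$ the guessed pure spectra, the quantity we minimise and its gradient are (up to constant factors, which get absorbed into the learning rate):

$$
E(S) = \lVert X - CS \rVert^{2}, \qquad \frac{\partial E}{\partial S} = -\,C^{\top}\left(X - CS\right),
$$

and each iteration updates the guess as $S \leftarrow S - \eta \,\partial E / \partial S$, where $\eta$ is the learning rate.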

Gradient descent is the process of decreasing the error by changing the guessed pure spectra in the direction that decreases the gradient. The amount of change (i.e. the step size) is defined by the learning rate. Let’s define our modified version of the gradient descent in a dedicated function

```python
def gradient_descent(learning_rate, spectra, concentrations,
                     n_iter=1000, pure_spectra_guess=None):

    # Initialise the guess: random by default, zeros if requested
    if pure_spectra_guess is None:
        pure_spectra_guess = np.random.random((concentrations.shape[1], spectra.shape[1]))
    elif isinstance(pure_spectra_guess, str) and pure_spectra_guess == 'zero':
        pure_spectra_guess = np.zeros((concentrations.shape[1], spectra.shape[1]))

    error = []  # history of the squared reconstruction error
    for l in range(n_iter):

        # Calculate estimated spectra from the current guess
        est_spectra = np.dot(concentrations, pure_spectra_guess)

        # Calculate residuals and the squared error
        e = spectra - est_spectra
        error.append((e**2).sum())
        # print(l, error[-1])  # optionally monitor convergence

        # Calculate gradient vector
        g = -np.dot(concentrations.T, e)/spectra.shape[0]

        # Gradient descent -> update the guess
        pure_spectra_guess -= learning_rate*g
        pure_spectra_guess = np.abs(pure_spectra_guess)  # positivity constraint

    return (pure_spectra_guess, error)
```

Note that, in addition to the conventional gradient descent approach, in the last line of the previous function we calculate the absolute value of the updated function. This is a quick and dirty way to impose a non-negativity constraint to the solution we seek. Keep in mind that a non-negativity constraint only makes sense when you are sure it applies to the functions you are trying to estimate, as a way to speed up the convergence process. Get rid of it in the general case.

OK, let’s run this function and plot the result:

```python
eta = 0.1
pure_spectra_gd, error = \
    gradient_descent(eta, spectra, conc, n_iter=5000, pure_spectra_guess=None)

with plt.style.context(('seaborn-whitegrid')):
    fig, ax = plt.subplots(figsize=(8, 5))
    plt.plot(wn, pure_spectra[0, :], lw=4, label='Original pure spectrum')
    plt.plot(wn, pure_spectra_gd[0, :], 'r', label='Guessed spectrum')
    plt.xlabel("Wavenumber (cm$^{-1}$)", fontsize=14)
    plt.ylabel("Raman spectra", fontsize=14)
    plt.legend(fontsize=14)
    plt.show()
```

Given a sufficient number of iterations, the guessed pure spectra converge to the right shape, as we can verify by comparing them to the actual pure spectra.
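A quick way to convince yourself of this convergence is to track the squared error on a toy problem. Here's a self-contained sketch with made-up shapes and noiseless synthetic data (no non-negativity constraint is needed in this idealised case):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((50, 3))        # known concentrations: 50 mixtures, 3 components
S_true = rng.random((3, 100))  # "pure spectra" we want to recover
X = C @ S_true                 # noiseless synthetic mixtures

S = np.zeros_like(S_true)      # zero initial guess
errors = []
for _ in range(5000):
    e = X - C @ S                    # residuals of the current guess
    errors.append((e**2).sum())      # squared reconstruction error
    S += 0.1 * C.T @ e / X.shape[0]  # gradient descent step
```

After the loop, `errors` decreases by many orders of magnitude and `S` matches `S_true` closely, mirroring what happens with the real spectra above.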

As mentioned, this case is a bit contrived, but it was helpful to think about the optimisation process and how it can be designed for problems of this kind.

## Finding concentrations from pure spectra

Let’s now work on the opposite type of problem. Knowing the measured spectra of mixtures, and the pure spectra of the components, we want to derive the relative concentration of the pure components in each sample.

This problem is, in essence, a calibration. If the pure spectra are sufficiently distinct, solving for the concentrations can be done independently for each mixture spectrum with a simple linear regression.

```python
from sklearn.linear_model import LinearRegression

est_conc = np.zeros_like(conc)
for i in range(conc.shape[0]):
    # No intercept: each mixture is a purely linear combination of the pure spectra
    lr = LinearRegression(fit_intercept=False)
    lr.fit(pure_spectra.T, spectra[i, :])
    est_conc[i, :] = lr.coef_
```

The concentrations are just the coefficients of the linear regressions, one for each sample. You can verify, by comparing the arrays, that the estimated concentrations are identical, or very close, to the actual concentrations.
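The same idea in a self-contained sketch, with made-up "spectra" and plain numpy instead of scikit-learn, so the arithmetic is explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
pure = rng.random((3, 40))             # three hypothetical pure spectra
true_conc = np.array([0.2, 0.5, 0.3])  # concentrations summing to one
mixture = true_conc @ pure             # noiseless "measured" mixture spectrum

# Least-squares solve: find c minimising ||pure.T @ c - mixture||
est, *_ = np.linalg.lstsq(pure.T, mixture, rcond=None)
print(est)  # close to [0.2, 0.5, 0.3]
```

On noiseless data the system is exactly consistent, so least squares recovers the concentrations essentially to machine precision.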

## Multivariate curve resolution

This is it. If you made it this far, here’s the reward. Let’s estimate both concentrations and pure spectra starting from the spectral mixtures alone.

From what we have learned so far, you may guess that the process we want to set up is an iterative one. You can imagine a process whereby you start from a guess of the pure spectra, calibrate for the guessed concentrations using linear regression (possibly applying some constraints), then estimate a refined guess of the pure spectra using gradient descent, and keep alternating linear regression and gradient descent until (hopefully) the solution converges.

This is the philosophy behind the Alternating Least Squares (ALS) approach. Now, that sounds simple enough but, as usual, reality is more complicated. ALS approaches rely on the judicious use of constraints to guide the solution towards convergence, a process that is complicated by measurement noise and other real-world artefacts.
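Before reaching for a library, the alternating idea can be sketched in a few lines on synthetic data. This is a rough illustration, not pyMCR's actual implementation: shapes and seeds are arbitrary, non-negativity is enforced with scipy's `nnls`, and the normalisation constraint is a simple rescaling:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
C_true = rng.random((60, 3))
C_true /= C_true.sum(axis=1, keepdims=True)  # concentrations sum to one
S_true = rng.random((3, 80))                 # hypothetical pure spectra
X = C_true @ S_true                          # noiseless synthetic mixtures

S = rng.random((3, 80))                      # random initial guess for the spectra
for _ in range(50):
    # Non-negative least squares for each mixture's concentrations
    C = np.array([nnls(S.T, x)[0] for x in X])
    C /= C.sum(axis=1, keepdims=True) + 1e-12   # normalisation constraint
    # Non-negative least squares for each wavenumber channel of the spectra
    S = np.array([nnls(C, x)[0] for x in X.T]).T
```

After a few dozen iterations `C @ S` reproduces `X` closely, although the recovered components may come out in a different order than the true ones, a well-known ambiguity of this kind of factorisation.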

Lucky for us, there is a handy Python library which we can tap to solve this problem, called pyMCR. In the rest of this section, I describe a method to solve the MCR problem with our toy dataset, using one of the functions of pyMCR. We need a few imports, which we’ll describe in a minute

```python
from pymcr.mcr import McrAR
from pymcr.regressors import NNLS
from pymcr.constraints import ConstraintNonneg, ConstraintNorm
import kennard_stone as ks
```

To start off, we would like to make a guess for the pure spectra. In the absence of any other information, and under the assumption the pure spectra are sufficiently distinct from each other, we could find the three most different spectra in our dataset and hope that these are close enough to being the three pure spectra.

The problem of finding the three ‘most different’ spectra is one we have already solved when we discussed the Kennard-Stone algorithm. The KS algorithm is specifically designed to find a number of samples in a dataset that are the farthest away from each other, for some definition of ‘farthest away’, i.e. for some choice of distance metric. These details are discussed in the post linked above, so here we just go ahead and apply the Kennard-Stone split to our dataset, to extract the three spectra we’re after

```python
# Estimate pure spectra by picking the three most different spectra in the set
est_pure_spectra, X_test = ks.train_test_split(spectra, train_size=3)
```
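If you'd rather avoid the extra dependency, a greedy farthest-point selection in plain numpy/scipy captures the same idea: start from the two most distant samples, then repeatedly add the sample farthest from the set already chosen. A minimal sketch (the function name is ours, not part of any library):

```python
import numpy as np
from scipy.spatial.distance import cdist

def farthest_points(X, n):
    """Greedy Kennard-Stone-style pick of n mutually distant rows of X."""
    d = cdist(X, X)                            # pairwise Euclidean distances
    i, j = np.unravel_index(d.argmax(), d.shape)
    chosen = [i, j]                            # the two most distant samples
    while len(chosen) < n:
        nearest = d[:, chosen].min(axis=1)     # distance to the closest chosen sample
        nearest[chosen] = -1.0                 # never re-pick a selected sample
        chosen.append(int(nearest.argmax()))   # add the sample farthest from the set
    return X[chosen]
```

For a few hundred spectra the full distance matrix is cheap; for much larger datasets the dedicated package is the better choice.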

We now use the function McrAR, following the examples given on this page.

```python
mcrar = McrAR(max_iter=1000, st_regr='NNLS', c_regr='OLS',
              c_constraints=[ConstraintNonneg(), ConstraintNorm()])
mcrar.fit(spectra, ST=est_pure_spectra)
```

The function requires an estimate of either the concentrations or the pure spectra. We go with the pure spectra guessed via the KS algorithm.

Furthermore, the function requires two regression procedures: one for the pure spectra (st_regr) and one for the concentrations (c_regr). We go with Non-Negative Least Squares (NNLS) and Ordinary Least Squares (OLS) respectively (incidentally, the OLS approach is identical to the one we used above to calibrate for the concentrations).

Finally, we specify constraints for the concentrations, which in our case are the non-negativity (the concentrations are positive quantities) and the normalisation (the concentrations add up to one).
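In numpy terms, the effect of these two constraints on a concentration matrix (one row per mixture) amounts to clipping and rescaling. A rough illustration with made-up numbers, not pyMCR's internal code:

```python
import numpy as np

C = np.array([[0.7, -0.1, 0.6],   # a row with a spurious negative entry
              [0.2,  0.2, 0.4]])

C = np.clip(C, 0, None)               # non-negativity: negative entries set to zero
C = C / C.sum(axis=1, keepdims=True)  # normalisation: each row rescaled to sum to one

print(C.sum(axis=1))  # rows now sum to one
```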

After that we fit the spectral mixtures.

Once the algorithm has finished running, we can compare the reconstructed pure spectra with the original ones

```python
with plt.style.context(('seaborn-whitegrid')):
    fig, ax = plt.subplots(figsize=(8, 5))
    plt.plot(wn, mcrar.ST_.T, 'k', lw=3)
    plt.plot(wn, pure_spectra.T, 'b')
    plt.xlabel("Wavenumber (cm$^{-1}$)", fontsize=14)
    plt.ylabel("Raman spectra", fontsize=14)
    plt.show()
```

The pure spectra recovered through the MCR approach are shown in black, while the original ones are in blue. In addition, the reconstructed concentrations are available in the array mcrar.C_, which can be compared to the actual concentrations.

Following the procedure described here, the MCR algorithm converges without too many problems. This is, of course, a simplified problem, and real-life results may not come as easily. Nevertheless, I hope this introduction will be useful to start your journey into Multivariate Curve Resolution.

Thanks for reading and until next time,

Daniel

## References

- S. C. Rutan, A. de Juan, and R. Tauler, “Introduction to multivariate curve resolution”, in Comprehensive Chemometrics, 2nd edition, Elsevier, 2020.
- Charles H. Camp Jr., “pyMCR: A Python Library for Multivariate Curve Resolution Analysis with Alternating Regression (MCR-AR)”.
- pyMCR: Multivariate Curve Resolution in Python.