What is the first thing you usually do when analysing a complicated dataset (NIR or otherwise)? Chances are that you start by taking a look at tabulated data, or by plotting some of the scans. In other words, you start with some exploratory analysis of your data before delving into more advanced processing. In this post we are going to describe a way to produce NIR data correlograms with Seaborn in Python.
Correlograms, or correlation plots, are simply scatter plots of one variable against another. This is a handy way to explore the existence of correlations between those variables. The typical scatterplots we produced to analyse the results of a Principal Components decomposition, or a Linear Discriminant Analysis, are in fact correlograms.
Seaborn is a graphical package for Python, built on top of Matplotlib, that makes it easier to produce statistical graphs such as correlograms for exploratory analysis. Let’s see why that may be useful.
Suppose, for the sake of argument, that you have performed a PCA on your NIR data and extracted the first five principal components. It is certainly possible to produce scatterplots of any two (or three) of those variables to look for correlations. Seaborn however makes this operation a breeze.
In this post we will show a few examples of exploratory analysis that can be done on NIR data, before launching ourselves into some serious regression (or classification) work. We are going to use freely available Vis-NIR data from the paper "In Situ Measurement of Some Soil Properties in Paddy Soil Using Visible and Near Infrared Spectroscopy" by Ji Wenjun et al. The associated data is available here.
The aim of this paper was to explore the predictive ability of Vis-NIR reflectance spectroscopy for some of the properties of paddy soils, such as organic matter, total organic carbon (TOC, which we’ll be using here), total nitrogen, and others. As an aside, the paper gives an example of least-squares Support Vector Machine regression, which we will be dealing with in the near future.
For the sake of our post, we are going to use the data to produce exploratory plots of principal components, without getting into the regression analysis at all.
NIR data correlograms of principal components
The first example is producing correlograms of principal components using Seaborn with Python. Here’s the list of the imports.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from scipy.signal import savgol_filter
```
All data are gathered in an Excel spreadsheet, which can be imported into Python using Pandas. After importing we extract the spectra X, the reference values of the total organic carbon y_toc, and the wavelengths.
```python
raw = pd.read_excel("File_S1.xlsx")

X = raw.values[1:,13:].astype('float32')  # spectra
y_toc = raw.values[1:,1]  # TOC reference values

wl = np.linspace(400,2500, num=X.shape[1], endpoint=True)  # wavelength axis
```
Here’s a plot of the raw data.
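A plot of this kind takes only a couple of Matplotlib calls. Here is a minimal sketch that uses synthetic stand-in spectra so that it runs without the spreadsheet. The random curves, the nanometre unit and the "Reflectance" axis label are assumptions made for illustration, not values taken from the paper's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the real spectra: 20 smooth curves over 210 wavelengths
rng = np.random.default_rng(0)
wl = np.linspace(400, 2500, num=210, endpoint=True)
X = 0.3 + 0.1 * np.sin(wl / 400.0) + 0.05 * rng.random((20, 1))

fig, ax = plt.subplots(figsize=(8, 5))
lines = ax.plot(wl, X.T)  # transposing gives one line per spectrum
ax.set_xlabel("Wavelength (nm)")  # assumed unit
ax.set_ylabel("Reflectance")  # assumed label
# In an interactive session, call plt.show() here to display the figure
```

With the real data, the same ax.plot(wl, X.T) call applies to the X and wl arrays extracted above.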
Now we take the second derivative, and run a PCA algorithm to extract the first five principal components.
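```python
X2 = savgol_filter(X, 11, polyorder=2, deriv=2)  # second derivative, 11-point window
pca = PCA(n_components=5)
Xs = StandardScaler().fit_transform(X2)  # scale each wavelength to zero mean, unit variance
Xpca = pca.fit_transform(Xs)  # scores of the first five principal components
```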
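Before plotting the components, it can be worth checking how much variance each of them captures. The sketch below repeats the same preprocessing steps on synthetic stand-in data (the random curves are an assumption, used only so that the block runs on its own) and reads the explained_variance_ratio_ attribute that scikit-learn's PCA exposes after fitting.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spectra: 50 random-walk curves over 210 points
rng = np.random.default_rng(0)
X = rng.random((50, 210)).cumsum(axis=1)

X2 = savgol_filter(X, 11, polyorder=2, deriv=2)  # second derivative along each spectrum
Xs = StandardScaler().fit_transform(X2)
pca = PCA(n_components=5)
Xpca = pca.fit_transform(Xs)

ratio = pca.explained_variance_ratio_  # fraction of variance per component, in decreasing order
print("Explained variance ratio:", ratio)
print("Cumulative:", ratio.sum())
```

A low cumulative value would suggest that five components miss a good part of the structure in the data, which is useful to know before reading too much into the scatterplots.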
To produce a correlogram, Seaborn requires us to put the data into a Pandas dataframe, which will be directly interpreted to build a matrix of correlation plots between any two of the elements of the dataframe. An introductory example is available at the relevant Seaborn documentation page.
```python
df = pd.DataFrame(Xpca, columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
sns.pairplot(df)
plt.show()
```
Let’s spend some time to understand what we see. The correlogram is an array of scatterplots, one for each pair of principal components. The array has as many rows and columns as there are columns in the dataframe, five in our case.
Along the diagonal Seaborn plots by default the histogram of the relevant variable, in our case the distribution of values of the five principal components.
The PC1 vs PC2 scatterplots show two clusters of data, which presumably correspond to the two types of measurements that the authors of the paper performed: in-situ and in the lab, using different instruments.
The other scatterplots do not show obvious clusters or trends. To learn some more about the data, let’s use the additional information about the labels associated with the spectra.
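A side note on why linear trends between components are not expected in these plots: principal component scores are linearly uncorrelated with each other by construction. What the correlogram can reveal among the components is therefore clusters or non-linear structure. The sketch below checks this on synthetic stand-in data (the random matrix is an assumption for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 samples of 30 mutually correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30)) @ rng.normal(size=(30, 30))

Xpca = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# Correlation matrix between the five score columns
corr = np.corrcoef(Xpca, rowvar=False)
off_diag = corr - np.eye(5)
print("Largest off-diagonal correlation:", np.abs(off_diag).max())
```

The printed value is zero up to floating-point error, even though the input features were strongly correlated.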
Producing a labelled correlogram with Seaborn
The basic idea behind a labelled correlogram is to use the labels y_toc to colour-code the scatters. This simply amounts to adding another column to the dataframe, which contains the labels.
Now, in practice this procedure would work best when there are just a handful of labels, relative to a limited number of classes. In other words this would be the case for classification problems, where one would associate each spectrum to the relevant class. On the contrary, we are dealing here with a regression problem, where the labels (the total organic carbon values) are continuously distributed within a certain range.
To circumvent this problem, and to simplify the colour-coding process, we round the labels to the nearest integers. Once again, we are not trying to actually fit a regression model here, just trying to spot potential trends in the data.
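There is more than one way to turn continuous labels into a handful of colour classes. The sketch below compares three options on made-up TOC values (the numbers and the "low", "mid", "high" bin names are assumptions for illustration). Note that NumPy's astype(int) truncates toward zero, whereas np.rint rounds to the nearest integer.

```python
import numpy as np
import pandas as pd

# Made-up TOC values, for illustration only
y_toc = np.array([0.4, 0.6, 1.5, 2.49, 2.51, 3.9])

truncated = y_toc.astype(int)       # drops the fractional part
rounded = np.rint(y_toc).astype(int)  # rounds to nearest (exact halves go to the even integer)

# Alternative: three equal-width bins with readable names
binned = pd.cut(y_toc, bins=3, labels=["low", "mid", "high"])

print(truncated)     # → [0 0 1 2 2 3]
print(rounded)       # → [0 1 2 2 3 4]
print(list(binned))  # → ['low', 'low', 'low', 'mid', 'mid', 'high']
```

Fewer classes give a cleaner legend, at the price of a coarser picture of the trend.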
The bit of Python code to do what we described is here below.
```python
# Add a column to the dataframe, with the TOC rounded to the nearest integer
# (a plain astype("int") would truncate instead of rounding)
df["TOC"] = np.rint(y_toc.astype("float")).astype("int")

sns.pairplot(df, hue='TOC', palette='OrRd')
plt.show()
```
The total organic carbon (TOC) content is now coded as shades of red. Darker reds correspond to higher TOC values. With this additional information at hand we can immediately spot that not all principal components are equally correlated with the TOC.
For instance, the first two principal components are useful to distinguish between lab and in-situ data, but are only mildly correlated with the TOC. The third principal component, on the other hand, correlates a lot better with the TOC, as visible in the third plot of the last column. That piece of information can be useful when we start working with different regression models. And all it takes is a few lines of code using the Seaborn package in Python.
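The visual impression can be backed up with numbers by computing the correlation of each component with the TOC. The sketch below does this on synthetic data in which the made-up TOC is built to depend mostly on the third component. Both the scores and the TOC formula are assumptions for illustration, not results from the paper's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the five principal component scores
scores = rng.normal(size=(n, 5))
df = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3", "PC4", "PC5"])

# Made-up TOC: driven mainly by PC3, weakly by PC1, plus noise
df["TOC"] = 2.0 + 0.8 * scores[:, 2] + 0.1 * scores[:, 0] + 0.2 * rng.normal(size=n)

# Pearson correlation of each component with the TOC
corr_with_toc = df.corr()["TOC"].drop("TOC")
print(corr_with_toc.abs().sort_values(ascending=False))
```

With the real data, the same df.corr() call on the dataframe holding the scores and the unrounded TOC values would rank the components by how strongly they track the TOC.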
I’m working on yet another example of correlation plots using the same set of data. That will be available in the near future. Until then, thanks for tuning in.