Inspired by an apt question from one of our reader, I decided to write this short post to discuss the use of parallel computation in cross-validation analysis. One of the main objectives when developing a model, is to select the best parameter (or set of parameters), whereby “best” means the parameter that minimises a cost function in cross-validation and/or prediction.

This type of search usually involves looping on a series of values of the parameter (or parameters) in question. As a simple example, which we discussed multiple times on this blog, finding the optimal number of latent values in PLS regression involves looping over the number of latent variables, run a PLS regression study in cross-validation, and finally choosing the optimal number by looking at the minimum of the RMSE or MAE (Mean Absolute Error). Depending on the size of the data, or the size of the parameter space that need to be explored, this type of optimisation can be quite lengthy and resource-intensive, if done sequentially.

As we shall discuss here however, improved performances (i.e. a computation speedup) can be achieved by using parallel computation, in a fairly straightforward way.

The type of loop we described, cycles through the number of latent variables in sequence. Each of the steps in the sequence however is independent from any other step. They could be performed in any order or, in fact, simultaneously. Back to our PLS example, to calculate the RMSECV for 10 latent variables, we don’t need to know the model with 9 latent variables. They can be done independently of one another.

This is what, in computer science, is called “embarrassingly parallel” problem, that is a problem that, by its very nature, can be parallelised straight away. In this post we will show a worked example of how a cross-validation loop in can be parallelised. We will use PLS regression in Python as an example. We will use custom functions (which were introduced in past posts) and modify them for parallel computation. Finally we’ll also show how the same thing can be achieved using the built-in function GridSearchCV in scikit-learn.

## Optimising the number of components in PLS regression

We have written about PLS regression in many guises in this blog. In the following I will assume you are familiar with the concept. If not, please take some time to read our introductory posts here and here.

We start with the imports

1 2 3 4 5 6 7 8 9 |
import numpy as np import matplotlib.pyplot as plt import pandas as pd from joblib import Parallel, delayed from sklearn.metrics import mean_squared_error, r2_score from sklearn.cross_decomposition import PLSRegression from sklearn.model_selection import cross_val_predict, GridSearchCV import time |

After that, we write a basic function that performs cross-validation analysis using PLS regression

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
def base_pls(X,y,n_components, return_model=False): pls_simple = PLSRegression(n_components=n_components) pls_simple.fit(X, y) y_cv = cross_val_predict(pls_simple, X, y, cv=10) # Calculate scores score = r2_score(y, y_cv) rmsecv = np.sqrt(mean_squared_error(y, y_cv)) if return_model == False: return(y_cv, score, rmsecv) else: return(y_cv, score, rmsecv, pls_simple) |

The function above will calculate and return R^{2} and RMSE in a 10-fold cross-validation for a PLS regression with a fixed number of latent variables.

If we want to evaluate the metrics for any number of components, we just insert the above function in a loop and then select the number of latent variables that minimises the RMSE.

1 2 3 4 5 6 7 8 9 10 11 12 |
def pls_optimise_components(X, y, npc, cv =10): rmsecv = np.zeros(npc) score = np.zeros(npc) for i in range(1,npc+1,1): y_cv, score[i-1], rmsecv[i-1] = base_pls(X, y,n_components=i) opt_comp, score_max, rmsecv_min = np.argmin(rmsecv),\ score[np.argmin(rmsecv)],rmsecv[np.argmin(rmsecv)] return (opt_comp+1, score_max, rmsecv_min) |

This is the basic code. No parallelisation applied yet, just a conventional loop. It’s now time to introduce parallel computation.

## Parallelising loops with Joblib in Python

The Joblib library offers an easy and transparent way to parallelise loops in Python, with minimal modifications of the code required. Citing from their documentation, “Joblib addresses these problems while leaving your code and your flow control as unmodified as possible (no framework, no new paradigms)”.

Let’s show this straight away, by writing a modified version of the pls_optimise_components function as follows

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
def pls_optimise_components_parallel(X, y, npc, cv =10): out = Parallel(n_jobs=6, verbose=1)\ (delayed(base_pls)(X, y, n_components=i) for i in range(1,npc+1,1)) rmsecv = np.zeros(npc) score = np.zeros(npc) for i,j in enumerate(out): y_cv, score[i], rmsecv[i] = j[0],j[1],j[2] opt_comp, score_max, rmsecv_min = np.argmin(rmsecv),\ score[np.argmin(rmsecv)],rmsecv[np.argmin(rmsecv)] return (opt_comp+1, score_max, rmsecv_min) |

As you can see, the for loop has been replaced with a call to the Parallel function of Joblib, that is used to parallelise the loop. The keyword n_jobs is the number of jobs running at the same time. In practice this means the number of CPUs that are used for the computation. By contrast, in a conventional loop, a single CPU is used. The verbose keyword control the descriptive output that is printed on screen during the execution of the operation.

The second part of the statement contains the function to be parallelised, in this case the base_pls function defined above. For those of you who are familiar with Python list comprehension, this second part of the statement is exactly that, with the addition of the function delayed, which allows us to define the list to be computed without actually doing it until the Parallel computation is executed (you can check this Stack Overflow question for some more details).

The output of the function is indeed a list (as it would be in a list comprehension). Each item in the list contains the output of one run of the function, conveniently sorted according to the way we defined the loop inside the list comprehension. That is why, in the second part of the function, we extract these items back, before finally returning them.

Let’s compare the performance of the conventional loop and its parallelised version, using the following code. We import a sample data set from our GitHub repository and run a cross validation PLS regression with up to 30 latent variables.

1 2 3 4 5 6 7 8 9 10 11 12 13 |
url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/peach_spectra_brix.csv' data = pd.read_csv(url) X = data.values[:,1:] y = data["Brix"].values npc=30 t1 = time.time() opt_comp_1, score_1, rmsecv_1 = pls_optimise_components(X,y,30) t2 = time.time() opt_comp_2, score_2, rmsecv_2 = pls_optimise_components_parallel(X,y,30) t3 = time.time() print("Conventional loop, execution time %1.2f sec." %(t2-t1)) print("Parallelised loop, execution time %1.2f sec." %(t3-t2)) |

In my case (*very* old 3.2GHz Xeon processor) this code prints

1 2 |
Conventional loop, execution time 6.20 sec. Parallelised loop, execution time 1.77 sec. |

with a 3.5X increase in speed using 6 processors. If I were to use all of my available 12 CPUs, and run a loop up to 50 latent variables (I know, I know) the speedup would be about 9X, which certainly makes a lot of difference for large datasets.

## Parallel computation using the built-in GridSearchCV function

Now, if you followed up to here, I have bad news and good news for you. The bad news is that you can forget all we have said so far, because — and here’s the good news! — a built-in GridSearchCV function is available in scikit-learn, which can aptly replace our lengthy cross-validation functions. What’s an even better news, the GridSearchCV function supports joblib out of the box, so we don’t even have to worry about writing the parallelised list comprehension.

Here’s how the entire workflow can be written

1 2 3 4 5 6 |
parameters = {'n_components':np.arange(1,npc+1,1)} pls = PLSRegression() mod = GridSearchCV(pls, parameters, scoring = 'neg_mean_absolute_error', cv=10, n_jobs=6) mod.fit(X,y) t4 = time.time() print("Built-in function, execution time %1.2f sec." %(t4-t3)) |

which, in my case prints

1 |
Built-in function, execution time 1.70 sec. |

using 6 processors and a maximum of 30 latent variables. That is consistent with our hand-written parallelised loop.

Now, while GridSearchCV is absolutely the way to go to run cross-validation studies, the template used above can be used to parallelise any for loop that can be written as the serial execution of a single function.

To wrap up, we have discussed the general way of parallelising a loop using parallel computing with Joblib, and a specific example of invoking Joblib from within the GridSearchCV function of scikit-learn. In both cases we can achieve a significant speedup in executing a cross-validation study.

Thanks for reading and until next time,

Daniel

Featured image credit: Deepak Rautela on Unsplash.