Parallel computation of loops for cross-validation analysis

Inspired by an apt question from one of our reader, I decided to write this short post to discuss the use of parallel computation in cross-validation analysis. One of the main objectives when developing a model, is to select the best parameter (or set of parameters), whereby “best” means the parameter that minimises a cost function in cross-validation and/or prediction.

This type of search usually involves looping on a series of values of the parameter (or parameters) in question. As a simple example, which we discussed multiple times on this blog, finding the optimal number of latent values in PLS regression involves looping over the number of latent variables, run a PLS regression study in cross-validation, and finally choosing the optimal number by looking at the minimum of the RMSE or MAE (Mean Absolute Error). Depending on the size of the data, or the size of the parameter space that need to be explored, this type of optimisation can be quite lengthy and resource-intensive, if done sequentially.

As we shall discuss here however, improved performances (i.e. a computation speedup) can be achieved by using parallel computation, in a fairly straightforward way.

The type of loop we described, cycles through the number of latent variables in sequence. Each of the steps in the sequence however is independent from any other step. They could be performed in any order or, in fact, simultaneously. Back to our PLS example, to calculate the RMSECV for 10 latent variables, we don’t need to know the model with 9 latent variables. They can be done independently of one another.

This is what, in computer science, is called “embarrassingly parallel” problem, that is a problem that, by its very nature, can be parallelised straight away. In this post we will show a worked example of how a cross-validation loop in can be parallelised. We will use PLS regression in Python as an example. We will use custom functions (which were introduced in past posts) and modify them for parallel computation. Finally we’ll also show how the same thing can be achieved using the built-in function GridSearchCV in scikit-learn.

Optimising the number of components in PLS regression

We have written about PLS regression in many guises in this blog. In the following I will assume you are familiar with the concept. If not, please take some time to read our introductory posts here and here.

We start with the imports

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

from joblib import Parallel, delayed

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.cross_decomposition import PLSRegression

from sklearn.model_selection import cross_val_predict, GridSearchCV

import time

After that, we write a basic function that performs cross-validation analysis using PLS regression

def base_pls(X,y,n_components, return_model=False):

pls_simple = PLSRegression(n_components=n_components)

pls_simple.fit(X, y)

y_cv = cross_val_predict(pls_simple, X, y, cv=10)

# Calculate scores

score = r2_score(y, y_cv)

rmsecv = np.sqrt(mean_squared_error(y, y_cv))

if return_model == False:

return(y_cv, score, rmsecv)

else:

return(y_cv, score, rmsecv, pls_simple)

The function above will calculate and return $R^{2}$ and RMSE in a 10-fold cross-validation for a PLS regression with a fixed number of latent variables.

If we want to evaluate the metrics for any number of components, we just insert the above function in a loop and then select the number of latent variables that minimises the RMSE.

def pls_optimise_components(X, y, npc, cv =10):

rmsecv = np.zeros(npc)

score = np.zeros(npc)

for i in range(1,npc+1,1):

y_cv, score[i-1], rmsecv[i-1] = base_pls(X, y,n_components=i)

opt_comp, score_max, rmsecv_min = np.argmin(rmsecv),\

score[np.argmin(rmsecv)],rmsecv[np.argmin(rmsecv)]

return (opt_comp+1, score_max, rmsecv_min)

This is the basic code. No parallelisation applied yet, just a conventional loop. It’s now time to introduce parallel computation.

Parallelising loops with Joblib in Python

The Joblib library offers an easy and transparent way to parallelise loops in Python, with minimal modifications of the code required. Citing from their documentation, “Joblib addresses these problems while leaving your code and your flow control as unmodified as possible (no framework, no new paradigms)”.

Let’s show this straight away, by writing a modified version of the pls_optimise_components function as follows

def pls_optimise_components_parallel(X, y, npc, cv =10):

out = Parallel(n_jobs=6, verbose=1)\

(delayed(base_pls)(X, y, n_components=i) for i in range(1,npc+1,1))

rmsecv = np.zeros(npc)

score = np.zeros(npc)

for i,j in enumerate(out):

y_cv, score[i], rmsecv[i] = j[0],j[1],j[2]

opt_comp, score_max, rmsecv_min = np.argmin(rmsecv),\

score[np.argmin(rmsecv)],rmsecv[np.argmin(rmsecv)]

return (opt_comp+1, score_max, rmsecv_min)

As you can see, the for loop has been replaced with a call to the Parallel function of Joblib, that is used to parallelise the loop. The keyword n_jobs is the number of jobs running at the same time. In practice this means the number of CPUs that are used for the computation. By contrast, in a conventional loop, a single CPU is used. The verbose keyword control the descriptive output that is printed on screen during the execution of the operation.

The second part of the statement contains the function to be parallelised, in this case the base_pls function defined above. For those of you who are familiar with Python list comprehension, this second part of the statement is exactly that, with the addition of the function delayed, which allows us to define the list to be computed without actually doing it until the Parallel computation is executed (you can check this Stack Overflow question for some more details).

The output of the function is indeed a list (as it would be in a list comprehension). Each item in the list contains the output of one run of the function, conveniently sorted according to the way we defined the loop inside the list comprehension. That is why, in the second part of the function, we extract these items back, before finally returning them.

Let’s compare the performance of the conventional loop and its parallelised version, using the following code. We import a sample data set from our GitHub repository and run a cross validation PLS regression with up to 30 latent variables.

url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/peach_spectra_brix.csv'

data = pd.read_csv(url)

X = data.values[:,1:]

y = data["Brix"].values

npc=30

t1 = time.time()

opt_comp_1, score_1, rmsecv_1 = pls_optimise_components(X,y,30)

t2 = time.time()

opt_comp_2, score_2, rmsecv_2 = pls_optimise_components_parallel(X,y,30)

t3 = time.time()

print("Conventional loop, execution time %1.2f sec." %(t2-t1))

print("Parallelised loop, execution time %1.2f sec." %(t3-t2))

In my case (*very* old 3.2GHz Xeon processor) this code prints

1 2	Conventional loop, execution time 6.20 sec. Parallelised loop, execution time 1.77 sec.

with a 3.5X increase in speed using 6 processors. If I were to use all of my available 12 CPUs, and run a loop up to 50 latent variables (I know, I know) the speedup would be about 9X, which certainly makes a lot of difference for large datasets.

Parallel computation using the built-in GridSearchCV function

Now, if you followed up to here, I have bad news and good news for you. The bad news is that you can forget all we have said so far, because — and here’s the good news! — a built-in GridSearchCV function is available in scikit-learn, which can aptly replace our lengthy cross-validation functions. What’s an even better news, the GridSearchCV function supports joblib out of the box, so we don’t even have to worry about writing the parallelised list comprehension.

Here’s how the entire workflow can be written

parameters = {'n_components':np.arange(1,npc+1,1)}

pls = PLSRegression()

mod = GridSearchCV(pls, parameters, scoring = 'neg_mean_absolute_error', cv=10, n_jobs=6)

mod.fit(X,y)

t4 = time.time()

print("Built-in function, execution time %1.2f sec." %(t4-t3))

which, in my case prints

1	Built-in function, execution time 1.70 sec.

using 6 processors and a maximum of 30 latent variables. That is consistent with our hand-written parallelised loop.

Now, while GridSearchCV is absolutely the way to go to run cross-validation studies, the template used above can be used to parallelise any for loop that can be written as the serial execution of a single function.

To wrap up, we have discussed the general way of parallelising a loop using parallel computing with Joblib, and a specific example of invoking Joblib from within the GridSearchCV function of scikit-learn. In both cases we can achieve a significant speedup in executing a cross-validation study.

Thanks for reading and until next time,
Daniel

Featured image credit: Deepak Rautela on Unsplash.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Optimising the number of components in PLS regression

Parallelising loops with Joblib in Python

Parallel computation using the built-in GridSearchCV function

About The Author

Daniel Pelliccia

Optimising the number of components in PLS regression

Parallelising loops with Joblib in Python

Parallel computation using the built-in GridSearchCV function

About The Author

Daniel Pelliccia

Related Posts