Minimal prediction models for linear regression

There’s a question I’m often asked: how do you save a scikit-learn model after training and separately reload it to make predictions? That is not an idle question: fitting a model takes time, while prediction is generally faster. So, if you only need the use the model to make predictions—not evaluate the model quality, and that’s true even for relatively simple models—you don’t need to fit it all over again each time. I proposed a couple of approaches to answer this question in a previous post on saving and loading regression models.

In that article, I discussed how the model is, at its very essence, a Python dictionary. Once the model is trained, the dictionary gets populated with all sorts of information related to the various quantities that describe the model. These are the ones that can be exported, and (eventually) imported in a separate session to populate a newly-defined model for prediction.

The fact is however that, if your aim is only to make predictions—and by that I mean that you are not looking at evaluating the model performance, accuracy, etc., assume you have already done that—than only a small part of the information in the dictionary is required. In these cases you can get away with a bare-bone version of your model. This is the topic of this post, and we’ll see an example of such a minimal prediction model for linear regression.

Linear regression refresher

Now, here’s the catch. In order to simplify the model down to its core functions, we need to understand how the regression model works. The fit/predict functions of most scikit-learn models are exceedingly useful for consistency and usage practice, but come at the risk of obscuring the inner mechanisms.

Our aim is to eventually circumvent the “predict” function and write down a simplified version of it, which only needs a handful of parameters to work (at least for linear regression). For that we need to understand the very basic mechanism of linear regression.

All linear regression problems can be written in mathematical form as a linear system of equations, which usually is cast in matrix notation

\mathbf{y}= \mathbf{X} \mathbf{b} + \mathbf{e}

The matrix \mathbf{X} contains the explanatory variables. Its dimensions are m \times n, where m is the number of variables and n is the number of samples. For instance, in the case of spectra, m would be the number of wavelengths and n the number of samples measured.

The vector \mathbf{y} is the response variable. It has n elements, equal to the number of samples. The vector \mathbf{b} is the one that is trained when the regression model fitting is done. It contains the information to “transform” the explanatory variables into to the response variable (you can probably already see where this is going). It has m components, one for each variable.

Finally \mathbf{e} is a residual vector. You can think of it as an error (which is ideally very small if the model is good) that remains after the model is fitted due the unavoidable noise or other experimental accidents of data collection. You should know it’s there (and you should check it when evaluating the model performances), but we are going to ignore it for the rest of the article.

The way \mathbf{b} is calculated depends on the specific model we are employing. PLS is different from PCA, etc. But once defined it you can always refer to the matrix equation aboce, as all linear models work in the same way. So, the bare-bone approach consists in exporting just the vector \mathbf{b} and a (small) bunch of other numbers out of the scikit-learn model object.

The only other information required are the mean value and standard deviation of \mathbf{X} and the mean value of \mathbf{y}, which are generally (but not always!) required to scale the data before fitting.

OK, let’s move one and look at some code.

Minimal prediction model with PLS regression

Let’s begin with the imports. Here’s the bare minimum

Now, let’s load the data, in this case our NIR spectra of fresh peaches available at our Github repo. The data can be loaded directly from the repository using the bit of code below

The response variable is the sugar content in the peach, measured in degree Brix. (But that’s not terribly important for this example.) Now, let’s split the data into training and test set and fit the model on the training examples

Normally, once the model is optimised, you would then go ahead and predict the value of the test set, using the “predict” function associated with the fitted model:

Now here comes the linear algebra bit. You can do the same prediction by using the matrix equation above, and by taking care that the values of the vectors are suitably rescaled

In words, we have first “reduced” the test sample array by subtracting its mean from it and dividing by its standard deviation. This is the same procedure of the Standard Scaler in scikit-learn which is applied by default (though the option can be changed) by the PLS regression routine.

The vector \mathbf{b} we discussed above is none other that the regression coefficients, pls.coef_. We need to select its first column as its dimensions by default are m \times 1 when the prediction is done on a single response variable. The product \mathbf{X} \mathbf{b} is implemented with the numpy.dot function and, as anticipated, we neglected the residuals.

You can go ahead and verify that the results of the operations above (the pls.predict and our two-liner) are identical. The advantage of the second option is that, if we want to export this model, we have just a few arrays to take care of and not the entire dictionary built in the pls object.

Exporting the minimal prediction model

We covered the procedure of exporting a model in this post. Here we follow exactly the same process, but with a much simplified model. As explained in that article, we are going to create a dictionary, populate it with the required arrays, and dump it into a JSON object. The JSON object is then saved to file.

As discussed in that earlier post, in order to handle the different data types that may be present in our dictionary (especially NumPy arrays), it’s useful to have a little class to do that. I found the code for such a class in this stackoverflow thread. The code, reproduced below, is a modifier to the custom JSON serialiser, that handle NumPy integers, floats and arrays.

We can now create our dictionary, dump it to JSON and save the file

That’s all is required to save the model.

Suppose you have now a new session and want to predict the result on X_test without having to fit the model again. The only thing you need to do is to load the dictionary, extract the various quantities from it, and run the dot product. Here’s the code (assume you are running this in a new session)

Go ahead, check it out and verify that the “prediction_new” array is identical to the corresponding predictions array written above.

That’s it. This is how you create and save a minimal prediction model, developed for PLS regression, and how you can use it to fit new data. Thanks for reading and until next time,

Daniel

References

  1. Exporting NIR regression models built in Python
  2. Featured image credit: Glen Carrie on Unsplash.