Hi everyone, and thanks for tuning in to our new post on exporting NIR regression models built in Python.
One of the readers of this blog asked me this question: “How can we export a model that we just built, so that we can use it over and over again without having to fit the training data every time?” I must admit I didn’t have the answer straight away; it’s a very good question. Once the training part is completed, it would be good to export the model to file, store it, and retrieve it at a later time.
If you’d like to get started with building your calibration models in Python, take a look at some of our previous posts.
If you’re already up to speed, let’s see how to export (and load back) your NIR regression models using Python. We’ll discuss three methods:
- Using the pickle module: easy to use, but with possible compatibility and security issues
- Using the joblib module: same as pickle, but optimised for large arrays
- Exporting to JSON: arguably a bit more involved, but safe and language-independent
The pickle module
A calibration model is not simply an array that can be exported or saved to disk. It comprises a number of attribute values that have been set by the fitting process. For instance, let’s use the following bare-bones PLS regression model as an example. The data is available for download at our Github repository.
```python
import pandas as pd
import numpy as np
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression

url = 'https://raw.githubusercontent.com/nevernervous78/nirpyresearch/master/data/peach_spectra_brix.csv'
data = pd.read_csv(url)

X = data.values[:, 1:]
y = data['Brix']

# Calculate second derivative
X2 = savgol_filter(X, 13, polyorder=2, deriv=2)

# Define the PLS regression object
pls = PLSRegression(n_components=8)

# Fit data
pls.fit(X2, y)
```
In our code snippet above, pls is a Python object. To export a Python object to disk we have to perform what is called ‘serialisation’, that is, transform the object into a format that can be stored and read back (‘deserialised’) at a later time. The serialisation process in Python is called ‘pickling’ (and deserialisation is called ‘unpickling’). The whole thing can be done using an out-of-the-box Python module called (of course) pickle.
```python
import pickle

# save the model to disk
filename = 'nir_pls_model.pkl'
pickle.dump(pls, open(filename, 'wb'))
```
With the three lines of code above, the calibration model and all its attributes can be safely stored to disk. When, some time later, we would like to retrieve the model to run some new predictions, all we have to do is
```python
# load the model from disk
imported_pls = pickle.load(open('nir_pls_model.pkl', 'rb'))
```
It’s that easy.
Now, I’ve given the object a different name – imported_pls as opposed to pls – so that you can verify, within a single script, that the two objects are in fact identical. This can be done, for instance, by running a prediction with the imported object, i.e. by calling imported_pls.predict(X). In actual applications you’d probably have two different scripts, one for calibration and one for analysis, so you would not need to bother with different names.
The joblib option
The pickle library is a general-purpose serialisation package that works with all types of objects. If you are primarily dealing with numerical arrays, or more specifically with NumPy data structures, you have the option of a different package called joblib.
Now, joblib is pretty much the same thing as pickle, but it has been optimised for large NumPy arrays, so it is going to be more efficient than pickle in those situations.
For relatively small data sets this is unlikely to make much of a difference, but it may become important if you are working with large calibration sets.
At any rate, exporting your model using joblib is going to look something like this.
```python
import joblib  # on old scikit-learn versions: from sklearn.externals import joblib

filename = 'nir_pls_model.sav'
joblib.dump(pls, filename)
```
And to load it back later simply type
```python
imported_pls = joblib.load('nir_pls_model.sav')
```
As you can see, the commands are in fact identical (in this simple example at least) to those used with pickle.
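One extra feature worth knowing: joblib can also compress the dump on the fly via its compress parameter, which can noticeably shrink files containing large, structured arrays. A minimal sketch, with a hypothetical array and file names standing in for a real calibration set:

```python
import os
import numpy as np
import joblib

# Hypothetical stand-in for a large, structured calibration set
big_array = np.tile(np.linspace(0.0, 1.0, 600), (2000, 1))

# Plain dump vs compressed dump (compress accepts levels 0-9)
joblib.dump(big_array, 'big_raw.joblib')
joblib.dump(big_array, 'big_compressed.joblib', compress=3)

print(os.path.getsize('big_raw.joblib'), os.path.getsize('big_compressed.joblib'))
```

For highly repetitive data like the array above, the compressed file is dramatically smaller; for noisy spectra the gain will be more modest.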
A word of caution, and a third alternative
Both pickle and joblib are quick and easy ways to store calibration models in Python. If you plan to work on your models on a single machine, without leaving the Python environment, then that’s probably the best way to go.
Be aware however of two main limitations of both methods.
- Both serialisation methods can in principle be used to store malicious content. As the pickle Python documentation warns: “The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.” This is true also for joblib, and it means that it is not a good idea to make a practice of sharing serialised calibration files in this way.
- Data serialised with pickle and joblib can’t be used outside Python. If you need to transfer your calibrations across systems and languages, neither of these options will work.
- Sometimes you may have problems in unpickling data if the underlying Python libraries have been updated. If the update affected the way a Python object is structured, the pickled object using the old library will be incompatible with the new library.
These are all serious limitations of both modules, but luckily there is an alternative. It is a bit more involved technically, but it guarantees security and interoperability across systems. The alternative is called JSON.
JSON data interchange format
JSON is an acronym for JavaScript Object Notation. It is a data-interchange format designed to be lightweight and human-readable, but also easy for a machine to generate and parse.
Python supports JSON natively, which means that a JSON encoder/decoder is built into your Python installation. The principle is the same as discussed above: a Python object is serialised by breaking it down into its constituents, which are then stored.
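Here’s a minimal round-trip with the built-in json module, using a toy parameter dict (the keys are illustrative, echoing those we’ll meet in the PLS object below):

```python
import json

# A toy parameter dictionary
params = {'n_components': 8, 'scale': True}

# Serialise to a JSON string, then decode it back
text = json.dumps(params)
restored = json.loads(text)

assert restored == params
```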
However, while pickle and joblib create binary files, which are not human-readable and can be used to store arbitrary or malicious code, JSON outputs a text file. The advantages of this approach are:
- The JSON output is completely language-independent and can be transferred across software and programming languages.
- It can’t be used to execute malicious code. In fact, it can’t be used to execute any code.
The shortcoming is an added layer of complication when it comes to serialising/deserialising our calibration objects. Here I’m going to show a method I came up with to store a PLS calibration into JSON, then load it back to run a different prediction. Along the way I’ll explain the underlying principle of a PLS calibration object, so that this method can be extended to other (more complicated) Python calibration objects.
Before diving into the code, let’s take a moment to understand how a PLS object is structured. Suppose we have read our data (using the code provided in the first section above) and now we want to define a generic PLS object. By now we know how to do that.
```python
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression()
```
By using the empty brackets, we in fact defined a PLS object with the default parameters. To check these parameters just call
```python
pls.__dict__
```
When the PLS object hasn’t been fitted yet, the call above outputs the following
```python
{'algorithm': 'nipals',
 'copy': True,
 'deflation_mode': 'regression',
 'max_iter': 500,
 'mode': 'A',
 'n_components': 2,
 'norm_y_weights': False,
 'scale': True,
 'tol': 1e-06}
```
This is a Python dict object which contains a bunch of parameters identified by their keys (e.g. ‘max_iter’) and the corresponding value (e.g. 500). The list of keys can be obtained by typing
```python
pls.__dict__.keys()
```
which outputs the following
```python
dict_keys(['mode', 'algorithm', 'tol', 'deflation_mode', 'n_components', 'max_iter', 'scale', 'copy', 'norm_y_weights'])
```
After fitting the regressor to the data, the pls object will contain the whole set of fitted parameters required in a PLS regression
```python
pls.fit(X2, y)
pls.__dict__.keys()
```
There are many more keys now. The new keys are used to store numerical values related to the regression, such as score, weights, etc.
```python
dict_keys(['y_mean_', 'mode', 'y_scores_', 'norm_y_weights', 'max_iter', 'x_weights_', 'deflation_mode', 'y_loadings_', 'x_loadings_', 'copy', 'y_weights_', 'n_components', 'n_iter_', 'x_scores_', 'y_rotations_', 'algorithm', 'y_std_', 'tol', 'x_std_', 'x_rotations_', 'scale', 'x_mean_', 'coef_'])
```
To store a calibration in JSON we will dump the content of pls.__dict__ into a JSON (text) file. To read it back, we will import the file, define a default pls object, then re-assign all the keys (from file) to the new pls object. Once we have done that, the newly defined pls object will in fact contain all the information it needs to make predictions on new data.
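The principle can be demonstrated on a toy class before we tackle the real thing. A minimal sketch, with a hypothetical Model class standing in for the PLS object:

```python
import json

# Hypothetical toy class, standing in for a fitted regression object
class Model:
    def __init__(self):
        self.scale = True
        self.coef_ = None

m = Model()
m.coef_ = [0.1, 0.2]  # pretend this was set by fitting

# Serialise the instance dictionary ...
text = json.dumps(m.__dict__)

# ... then rebuild a fresh object and re-assign the stored attributes
m2 = Model()
m2.__dict__.update(json.loads(text))

assert m2.coef_ == [0.1, 0.2]
```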
There are (as always) a few technical complications to overcome to do what we have described, and that’s the topic of the next and final section.
Exporting NIR calibration to JSON in Python
Congratulations for making it this far along the post. Here’s the last stride.
The major problem I have encountered in dumping the content of a pls object into a JSON file is that a standard NumPy array (a numpy.ndarray object) is not JSON-serialisable. You can verify this yourself by running the following code
```python
import numpy as np
import json

arr = np.array([1, 2, 3])
json_txt = json.dumps(arr)
```
This code will throw an error (at least as of Python 3.6.7, with which I ran this code)
```python
TypeError: Object of type 'ndarray' is not JSON serializable
```
This is a problem for us, since most of the values in pls.__dict__ are NumPy arrays.
The workaround to this problem is to transform all arrays to lists. Happily, lists can be handled by the Python JSON encoder without problems. In the simple example above this amounts to doing
```python
import numpy as np
import json

arr = np.array([1, 2, 3])
json_txt = json.dumps(arr.tolist())
```
A small change to the last line transforms the array into a list and ensures the code executes without errors.
This is the principle we will follow for our PLS, but since our pls object contains several arrays and other data types in its dict, we want to handle all these cases automatically without having to transform each individual array. For that I found a most useful code snippet in this stackoverflow thread. The code, reproduced below, is a custom JSON encoder that handles NumPy integers, floats and arrays.
```python
class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(MyEncoder, self).default(obj)
```
Put this class at the beginning of your script, after the imports. With that you can export your model to JSON with a few more lines.
```python
# Create json and save to file
filename = 'nir_pls_model.json'
json_txt = json.dumps(pls.__dict__, cls=MyEncoder)
with open(filename, 'w') as file:
    file.write(json_txt)
```
There you go. You should now have the file ‘nir_pls_model.json’ stored on your drive. The nice thing about this method is that the JSON file is readable in any text editor, it can’t be used to store malicious executable code, and it can in principle be exported to and used with any other software that supports JSON.
In Python, here’s how you import the JSON calibration back.
```python
# Load data from file
with open('nir_pls_model.json') as f:
    load_dict = json.loads(f.read())

# Define a new generic pls object
pls1 = PLSRegression()

# Assign all the dict keys of the loaded calibration to the new pls object.
# The if/else ensures that lists are transformed back into NumPy arrays.
for key, value in load_dict.items():
    if isinstance(value, list):
        pls1.__dict__[key] = np.array(value)
    else:
        pls1.__dict__[key] = value

# The new pls object is ready for prediction
pls1.predict(X2)
```
Et voilà, I’ve shown you a few ways to save and load your NIR calibration in Python. As always, feel free to get in touch if you have any queries about this post, or if you’d like to professionally engage with me on your project.
Thanks for reading and catch up next time!