[How-To] Machine Learning in Practice: Consistent Preprocessing

Preprocessing and data transformation are among the most important parts of any machine learning pipeline. No matter what type of model you use, if the preprocessing pipeline is buggy, your model will deliver wrong predictions. This remains true for deep learning. Furthermore, every trained model is tied to a specific preprocessing pipeline. Any change in the preprocessing can turn the model into a useless piece of complex calculation if the model is not adjusted to the new circumstances.

Prototyping and production workflows

For learning purposes in academia and tutorials, these pipelines are often interactive Jupyter notebooks, which allow fast prototyping. In practice, the pipeline and the model have to be persisted and deployed together, so that they form one consistent, deployable piece of software.

Typical parts of preprocessing are:

  • Data extraction from domain objects (getting vectors)
  • Scaling/normalization of features
  • Dimensionality reduction or transformation
  • Indexing
  • Filtering

Many of these preprocessing steps are themselves fitted to the training data. A scaler can make the whole pipeline useless when it is not the same for training and prediction. Indexed features, such as dictionaries from NLP feature extraction, will almost certainly make all predictions wrong if the word indexes don’t match. Unlike in ordinary software development, changes like this do not break the code.

Without monitoring you don’t even notice that something is not working anymore.
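To make the scaler problem concrete, here is a minimal sketch in plain Java. FittedScaler and ScalerDemo are hypothetical names invented for this illustration (they are not part of Smile), and the age/fare values are made up. The sketch assumes a simple standard scaler that learns mean and standard deviation from the data it is fitted on.

```java
import java.util.Arrays;

/** Hypothetical standard scaler: learns mean and standard deviation per feature. */
final class FittedScaler {
    private final double[] mean;
    private final double[] std;

    private FittedScaler(double[] mean, double[] std) {
        this.mean = mean;
        this.std = std;
    }

    /** Fits the statistics on the given rows; they are frozen afterwards. */
    static FittedScaler fit(double[][] rows) {
        int d = rows[0].length;
        double[] mean = new double[d];
        double[] std = new double[d];
        for (double[] row : rows)
            for (int j = 0; j < d; j++) mean[j] += row[j] / rows.length;
        for (double[] row : rows)
            for (int j = 0; j < d; j++) std[j] += Math.pow(row[j] - mean[j], 2) / rows.length;
        // fall back to 1.0 for constant features to avoid division by zero
        for (int j = 0; j < d; j++) std[j] = std[j] > 0 ? Math.sqrt(std[j]) : 1.0;
        return new FittedScaler(mean, std);
    }

    /** Applies the frozen statistics to a single feature vector. */
    double[] transform(double[] row) {
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) out[j] = (row[j] - mean[j]) / std[j];
        return out;
    }
}

final class ScalerDemo {
    public static void main(String[] args) {
        double[][] train = {{20.0, 7.25}, {40.0, 71.28}, {60.0, 8.05}}; // age, fare (made-up)
        double[][] laterBatch = {{30.0, 10.0}, {50.0, 500.0}};           // different distribution

        FittedScaler trainScaler = FittedScaler.fit(train);      // what the model was trained with
        FittedScaler refitScaler = FittedScaler.fit(laterBatch); // accidentally refitted later

        double[] passenger = {40.0, 71.28};
        // Same passenger, two different feature vectors. No exception is thrown.
        System.out.println(Arrays.toString(trainScaler.transform(passenger)));
        System.out.println(Arrays.toString(refitScaler.transform(passenger)));
    }
}
```

With the training statistics the fare ends up above the mean, with the refitted statistics below it. A model trained on the first representation would receive the second one without any complaint from the compiler or the runtime.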

In Python, persisting pipelines is often just a matter of custom serialization. Everything is quite flexible, and you usually don’t manage many models or many different versions. You just put them together as Python files plus serialized models.

In production environments based on compiled languages like Java, artifacts must be rebuilt and redeployed for every change in the code. Furthermore, replacing a serialized model with a new version is only possible if the preprocessing didn’t change. In every other case the whole preprocessing must be updated as well.

Example in Java with the Smile framework

public void train(){
	List<Passenger> passengers = passgengerProvider.getAll();
	// preprocessing: turn each domain object into a feature vector
	double[][] x = passengers.stream().map(FeatureExtractor::extractDoubleArrayFeaures).toArray(double[][]::new);
	int[] y = passengers.stream().mapToInt(Passenger::getSurvied).toArray();
	MyClassifier clf = new MyClassifier(x, y);
}

public int predict(Passenger passenger){
	MyClassifier deserialized = MyClassifier.deserialize("classifier.model");
	// must be exactly the preprocessing that was used in train()
	double[] x = FeatureExtractor.extractDoubleArrayFeaures(passenger);
	return deserialized.predict(x);
}

Let’s assume we are using the Titanic dataset and want to predict whether a passenger died. The example is deliberately simple, just to make the point. We have a data source (PassengerProvider), which gives us the domain objects, and an object that holds our preprocessing (FeatureExtractor). The model is trained and serialized. In the prediction case the model is deserialized and applied to the features extracted from the passenger.


This works as long as we don’t change anything. Once the code is deployed and we replace the model with a new one without updating the preprocessing (FeatureExtractor), the model and the preprocessing are out of sync and all predictions may be wrong. To prevent this, the FeatureExtractor object must also be serialized and used in the prediction case. Exchanging an incompatible model without redeploying the code, or vice versa, either causes a runtime exception or, worse, produces wrong predictions without any error being shown!
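What "serializing the preprocessing" can mean is sketched below for an index-based step. CategoryIndex is a hypothetical class made up for this illustration, and the sketch assumes plain Java serialization (java.io.Serializable) as the storage format. Any other format would do, as long as the fitted state travels with the model.

```java
import java.io.*;
import java.util.*;

/** Hypothetical fitted preprocessing step: maps category values to one-hot columns. */
final class CategoryIndex implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Integer> index = new LinkedHashMap<>();

    /** Learns the column position of each value from the order in which it is first seen. */
    static CategoryIndex fit(List<String> values) {
        CategoryIndex ci = new CategoryIndex();
        for (String v : values) ci.index.putIfAbsent(v, ci.index.size());
        return ci;
    }

    double[] encode(String value) {
        double[] out = new double[index.size()];
        Integer pos = index.get(value);
        if (pos != null) out[pos] = 1.0; // unseen values stay all zeros
        return out;
    }

    /** Serializes the fitted state so it can be stored next to the model. */
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static CategoryIndex fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (CategoryIndex) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

final class CategoryIndexDemo {
    public static void main(String[] args) {
        // Embarkation ports as seen during training (made-up order)
        CategoryIndex trained = CategoryIndex.fit(Arrays.asList("S", "C", "Q"));
        // Rebuilding the index from other data silently changes the column order
        CategoryIndex rebuilt = CategoryIndex.fit(Arrays.asList("Q", "S", "C"));
        // Restoring the serialized index keeps the column order of training
        CategoryIndex restored = CategoryIndex.fromBytes(trained.toBytes());

        System.out.println(Arrays.toString(trained.encode("S")));
        System.out.println(Arrays.toString(rebuilt.encode("S")));
        System.out.println(Arrays.toString(restored.encode("S")));
    }
}
```

The rebuilt index puts "S" into a different column than the trained one, which is exactly the kind of mismatch that yields wrong predictions without an error. The restored index behaves like the trained one.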

Consistent artifacts, hide details

To avoid such cases, the ETL part and the model are stored together in one single artifact. This could look like the following:

public static void train(){
	List<Passenger> passengers = passgengerProvider.getAll();
	int[] y = passengers.stream().mapToInt(Passenger::getSurvied).toArray();
	// the classifier receives the domain objects and does the preprocessing internally
	MyBetterClassifier clf = new MyBetterClassifier(passengers, y);
}

Developers working with the trained model can now just load the model, including all the preprocessing, and it just works. We want a classifier that can be used directly on the domain objects, not on the bare feature vectors. For the developer it should be transparent which steps are involved. It should be possible to access and change every step, but it should not be necessary to wire all the steps together manually. Many Stack Overflow questions about machine learning are actually about such issues. Consistent, persisted models that include the data wrangling are much easier to use. The only breaking change is a change of the domain object (the Passenger class in this case). Such a change is noticed much earlier and does not produce wrong predictions.
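How such a bundled classifier could be structured is sketched below. All names (DemoPassenger, FeatureStep, VectorModel, BundledClassifier, AgeFare, FareThreshold) are hypothetical and chosen only for this illustration. The "model" is a toy threshold rule standing in for a real trained classifier, and DemoPassenger is a stand-in for the real domain class with assumed fields.

```java
import java.io.Serializable;

/** Stand-in for the domain class; the two fields are assumptions for this sketch. */
final class DemoPassenger {
    final double age;
    final double fare;

    DemoPassenger(double age, double fare) {
        this.age = age;
        this.fare = fare;
    }
}

/** A preprocessing step that turns a domain object into a feature vector. */
interface FeatureStep<T> extends Serializable {
    double[] extract(T domainObject);
}

/** A model that works on bare feature vectors. */
interface VectorModel extends Serializable {
    int predict(double[] features);
}

/** One artifact: preprocessing and model are kept, stored, and replaced together. */
final class BundledClassifier<T> implements Serializable {
    private static final long serialVersionUID = 1L;
    private final FeatureStep<T> featureStep;
    private final VectorModel model;

    BundledClassifier(FeatureStep<T> featureStep, VectorModel model) {
        this.featureStep = featureStep;
        this.model = model;
    }

    /** Callers pass the domain object; the wiring of the steps is hidden. */
    int predict(T domainObject) {
        return model.predict(featureStep.extract(domainObject));
    }

    // The individual steps stay accessible for inspection or replacement.
    FeatureStep<T> featureStep() { return featureStep; }
    VectorModel model() { return model; }
}

/** Toy preprocessing: age and fare as a two-element vector. */
final class AgeFare implements FeatureStep<DemoPassenger> {
    private static final long serialVersionUID = 1L;
    public double[] extract(DemoPassenger p) { return new double[] {p.age, p.fare}; }
}

/** Toy model: a fixed threshold on the fare, standing in for a trained classifier. */
final class FareThreshold implements VectorModel {
    private static final long serialVersionUID = 1L;
    public int predict(double[] x) { return x[1] > 50.0 ? 1 : 0; }
}

final class BundleDemo {
    public static void main(String[] args) {
        BundledClassifier<DemoPassenger> clf =
                new BundledClassifier<>(new AgeFare(), new FareThreshold());
        System.out.println(clf.predict(new DemoPassenger(40.0, 71.28)));
    }
}
```

Because the bundle is generic in the domain type, a change to the domain class surfaces as a compile error in the feature step instead of as a silent shift in the feature vector.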

Webservice or API

It is clear that predictions for the end user should be accessible through a web service or another API that hides all these internals. Still, someone must develop and maintain these services, and they are the target audience of this post.

