[How-To] Machine Learning in Practice: Consistent Preprocessing

Preprocessing and data transformation are among the most important parts of every machine learning pipeline. No matter what type of model you use, if the preprocessing pipeline is buggy, your model will deliver wrong predictions. This remains true for deep learning as well. Furthermore, every trained model is tied to a specific preprocessing pipeline. Any change to the preprocessing can turn the model into a useless piece of complex calculation if the model is not adjusted to the new circumstances.

Prototyping and production workflows

For learning purposes in academia and tutorials, these pipelines are often interactive Jupyter notebooks, which allow fast prototyping. In practice, it is necessary to persist and deploy the pipeline and the model together to have one consistent, working, deployable artifact.

Typical preprocessing steps (in a FeatureExtractor) are:

  • Data extraction from domain objects (getting vectors)
  • Scaling of features
  • Dimensionality reduction or transformation
  • Indexing
  • Filtering
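
The first step in that list, data extraction, can be sketched as follows. This is a minimal illustration, not the post's actual code: the `Passenger` fields and the `extract` method shown here are assumptions chosen for the example.

```java
import java.util.Arrays;

// Minimal sketch of the "data extraction" step: turning a domain object
// into a raw feature vector. Field names are illustrative only.
public class FeatureDemo {

    record Passenger(double age, double fare, int pclass) {}

    // Maps a domain object to the feature vector the model expects.
    // The order of the features is part of the implicit contract
    // between extractor and model.
    static double[] extract(Passenger p) {
        return new double[] { p.age(), p.fare(), p.pclass() };
    }

    public static void main(String[] args) {
        Passenger p = new Passenger(29.0, 71.28, 1);
        System.out.println(Arrays.toString(extract(p))); // [29.0, 71.28, 1.0]
    }
}
```

Note that nothing in the type system enforces the feature order: swapping `age` and `fare` compiles fine, which is exactly the class of silent error discussed below.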

Many of these preprocessing steps are themselves trained and therefore tied directly to the training data. A scaler can make the whole pipeline useless when it is not consistent. Indexed features, such as dictionaries from NLP feature extraction, will almost certainly make all predictions wrong if the word indexes don't match. In contrast to usual software development, changes like this do not break the code. Without monitoring, you don't even notice that something is no longer working.
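
The scaler problem can be made concrete with a small sketch. The `StandardScaler` class and the numbers below are illustrative assumptions, not taken from the post; the point is that a scaler re-fitted on different data produces different values for the same input without any exception.

```java
// Sketch: why a re-fitted scaler silently breaks predictions.
public class ScalerDemo {

    static class StandardScaler {
        final double mean;
        final double std;

        // "Training" the scaler: statistics are derived from the data,
        // so the scaler is tied to exactly that data.
        StandardScaler(double[] data) {
            double m = 0;
            for (double d : data) m += d;
            m /= data.length;
            double v = 0;
            for (double d : data) v += (d - m) * (d - m);
            this.mean = m;
            this.std = Math.sqrt(v / data.length);
        }

        double transform(double x) {
            return (x - mean) / std;
        }
    }

    public static void main(String[] args) {
        // Scaler fitted on the original training data ...
        StandardScaler trained = new StandardScaler(new double[] {10, 20, 30});
        // ... versus a scaler accidentally re-fitted on different data.
        StandardScaler refitted = new StandardScaler(new double[] {100, 200, 300});

        double x = 25.0;
        // Same input, very different scaled values: no error is thrown,
        // but every downstream prediction is now wrong.
        System.out.println(trained.transform(x));  // ~0.61
        System.out.println(refitted.transform(x)); // ~-2.14
    }
}
```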

In Python, persisting pipelines is often just a matter of custom serialization. Everything is quite flexible; you usually don't manage a bunch of models and many different versions. You just bundle them as Python files plus serialized models.

In production environments based on compiled languages, code must be deployed for almost every change. Replacing the serialized model with a new one is only possible if the inputs (ETL processes) are compatible.

Java example with the Smile framework

The Smile framework is a powerful and modern machine learning framework with a lot of features. Using such a framework involves a lot of ETL code to process the data until you have a set of features (X) and a corresponding set of labels (Y). Smile has basically two different data types, dense and sparse arrays, which is true for most other frameworks, too. These details should not be exposed. Besides enabling future replacement of these internals, information hiding leads to more reliable ML applications.

Fast, easy, and with runtime errors

The fastest way to get started is to follow the examples in the documentation. Let's assume we are using the Titanic dataset and we want to predict whether a passenger died. We have the Passenger domain object and we define some feature extraction. The model is trained and serialized; for prediction, the model is deserialized.

    public void train() {
        List<Passenger> passengers = passengerProvider.getAll();
        double[][] x = passengers.stream()
                .map(FeatureExtractor::extractDoubleArrayFeatures)
                .toArray(double[][]::new);
        int[] y = passengers.stream()
                .mapToInt(Passenger::getSurvived)
                .toArray();
        MyClassifier clf = new MyClassifier(x, y);
        clf.serialize("classifier.model");
    }

    public int predict(Passenger passenger) {
        MyClassifier deserialized = MyClassifier.deserialize("classifier.model");
        double[] x = FeatureExtractor.extractDoubleArrayFeatures(passenger);
        return deserialized.predict(x);
    }

This works as long as we don't change anything in the code, especially in the FeatureExtractor part. If there is any data-dependent preprocessing, the FeatureExtractor must also be serialized. Every change requires deploying the application and the model together. Exchanging a non-compatible model without deploying the code, or vice versa, causes a runtime exception. Or worse, it may produce wrong predictions without any error shown!
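
One way to turn the silent failure mode into a loud one is to persist a feature-schema version next to the model and verify it at load time. The following is a sketch under assumptions: `ModelBundle`, `SCHEMA_VERSION`, and the placeholder `weights` field are invented names, not part of Smile or the post's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of a fail-fast compatibility check between model and code.
public class ModelBundle implements Serializable {

    // Bump this constant whenever the FeatureExtractor changes incompatibly.
    static final String SCHEMA_VERSION = "passenger-features-v2";

    final String schemaVersion = SCHEMA_VERSION;
    final double[] weights; // stands in for the real model parameters

    ModelBundle(double[] weights) {
        this.weights = weights;
    }

    static ModelBundle load(byte[] bytes) throws Exception {
        ModelBundle m = (ModelBundle) new ObjectInputStream(
                new ByteArrayInputStream(bytes)).readObject();
        if (!SCHEMA_VERSION.equals(m.schemaVersion)) {
            // Fail loudly at startup instead of predicting garbage.
            throw new IllegalStateException("Model built for schema "
                    + m.schemaVersion + ", code expects " + SCHEMA_VERSION);
        }
        return m;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new ModelBundle(new double[] {0.1, 0.2}));
        oos.flush();
        System.out.println(load(bos.toByteArray()).schemaVersion); // passenger-features-v2
    }
}
```

This does not remove the need to deploy model and code together, but a version mismatch now fails at load time rather than surfacing as wrong predictions in production.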

Consistent artifacts, hide details

To avoid such cases, the ETL part and the model are stored together in one single artifact. This could look like the following:

    public static void train() {
        List<Passenger> passengers = passengerProvider.getAll();
        int[] y = passengers.stream()
                .mapToInt(Passenger::getSurvived)
                .toArray();
        MyBetterClassifier clf = new MyBetterClassifier(passengers, y);
        clf.serialize("classifier.model");
    }

Developers working with the trained model should not have to assemble the ML pipeline steps manually. We want a classifier that can be used directly with the domain objects, not with the bare feature vectors. For a developer, it should be transparent which steps are involved; it should be possible to access and change all of them, but it should not be required to wire them together by hand. Many Stack Overflow questions about machine learning are actually about such issues. Consistent, persisted models that include the data wrangling are much easier to use. The only breaking change left is a change to the domain object (the Passenger class in this case). Such a change is noticed much earlier (via the serialVersionUID) and does not produce wrong predictions.
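
A possible shape for such a classifier is sketched below. The extractor function and the trivial threshold "model" are placeholders invented for illustration; the point is the interface: callers pass domain objects, and the feature extraction travels inside the serialized artifact.

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Function;

// Sketch of a classifier that consumes domain objects directly and keeps
// the feature extraction inside the persisted artifact.
public class MyBetterClassifier<T> implements Serializable {

    // Persisted together with the model; in practice the extractor must
    // itself be Serializable (e.g. a serializable lambda or a class).
    private final Function<T, double[]> featureExtractor;

    private final double threshold; // stands in for real model parameters

    public MyBetterClassifier(List<T> samples, int[] labels,
                              Function<T, double[]> featureExtractor) {
        this.featureExtractor = featureExtractor;
        // "Training" is a placeholder here: a fixed decision rule.
        this.threshold = 0.5;
    }

    // Callers never see double[]: they pass the domain object.
    public int predict(T sample) {
        double[] x = featureExtractor.apply(sample);
        return x[0] > threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        MyBetterClassifier<Double> clf = new MyBetterClassifier<>(
                List.of(1.0), new int[] {1}, d -> new double[] { d });
        System.out.println(clf.predict(0.9)); // 1
        System.out.println(clf.predict(0.1)); // 0
    }
}
```

Because extractor and model are one unit, they can never drift apart between training and prediction; only a change to the domain type itself can break the artifact.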

Webservice or API

It is clear that predictions for the end user should be accessible through a web service or some other API that hides all these internals. Still, someone must develop and maintain those services, and that is the target audience of this post.

Inspired from here
