[How-To] Machine Learning in Practice: Consistent Preprocessing

Preprocessing and data transformation are among the most important parts of any machine learning pipeline. No matter what type of model you use, if the preprocessing sucks, your model sucks too. This remains true even if you are dealing with deep learning. Furthermore, every trained model is tied to a specific preprocessing pipeline; any change in the preprocessing can turn the model into a useless piece of complex calculation.

Prototyping and production workflows

For learning purposes in academia and in tutorials, these pipelines are often interactive Jupyter notebooks that allow fast prototyping. In practice, it is necessary to persist and deploy the pipeline and the model together to get a consistent, working, deployable artifact.

Typical parts of preprocessing (FeatureExtractor) are:

  • Data extraction from domain objects (getting vectors)
  • Scaling of features
  • Dimensionality reduction or transformation
  • Indexing
  • Filtering
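As a rough sketch, such a FeatureExtractor could be modelled as a small interface in Java; the name and the methods below are illustrative and not taken from any library:

    import java.io.Serializable;
    import java.util.List;

    // Illustrative interface: turns a domain object into a numeric feature vector.
    // Concrete implementations cover extraction, scaling, indexing and filtering.
    public interface FeatureExtractor<T> extends Serializable {

        // Learn data-dependent parameters (scaling ranges, dictionaries, ...) from the training set.
        void fit(List<T> trainingData);

        // Apply the learned transformation to a single domain object.
        double[] extract(T domainObject);
    }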

Many of these preprocessing steps are themselves trained and therefore tied directly to the training data. A scaler that is not identical for training and test data can make the whole pipeline useless. Indexed features, such as dictionaries from NLP feature extraction, will almost certainly make all predictions wrong if they don't match. In contrast to usual software development, changes like this do not break the code; without monitoring you don't even notice that something is no longer working.
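To make this concrete, here is a minimal sketch of such a data-dependent step: a standard scaler whose mean and standard deviation are learned from the training data and must be reused unchanged at prediction time (the class is illustrative, not part of Smile):

    import java.io.Serializable;

    // Illustrative data-dependent step: the parameters below come from the training
    // data; using different ones at prediction time silently ruins the results.
    public class StandardScaler implements Serializable {
        private static final long serialVersionUID = 1L;

        private double[] mean;
        private double[] std;

        public void fit(double[][] trainingData) {
            int cols = trainingData[0].length;
            mean = new double[cols];
            std = new double[cols];
            for (double[] row : trainingData) {
                for (int j = 0; j < cols; j++) mean[j] += row[j];
            }
            for (int j = 0; j < cols; j++) mean[j] /= trainingData.length;
            for (double[] row : trainingData) {
                for (int j = 0; j < cols; j++) std[j] += (row[j] - mean[j]) * (row[j] - mean[j]);
            }
            for (int j = 0; j < cols; j++) std[j] = Math.sqrt(std[j] / trainingData.length);
        }

        public double[] transform(double[] features) {
            double[] scaled = new double[features.length];
            for (int j = 0; j < features.length; j++) {
                // Guard against constant features to avoid division by zero.
                scaled[j] = std[j] == 0 ? 0 : (features[j] - mean[j]) / std[j];
            }
            return scaled;
        }
    }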

In Python, persisting pipelines is often just a matter of custom serialization. Everything is quite flexible: you usually don't manage many models and many different versions, and if you do, you simply bundle them as Python files plus serialized models.

In production environments based on compiled languages, almost every change in the code requires a deployment. Replacing the serialized model with a new one is only possible if the inputs (the ETL) are compatible.

Java example with the Smile framework

Smile is a powerful, modern machine learning framework with a lot of features. Using such a framework involves a lot of ETL code to process the data until you end up with a set of features (X) and a corresponding set of labels (Y). In Smile there are basically two different data types, dense and sparse arrays, which is true for most other frameworks as well. In my opinion, these details should not be exposed: besides making it easier to replace these internals later, information hiding leads to more reliable ML applications.

Fast, easy and with runtime errors

The fastest way to get started is to follow the examples in the documentation. Let's assume we are using the Titanic dataset and want to predict whether a passenger died. We have a Passenger domain object and define some feature extraction. The model is trained and serialized; at prediction time the model is deserialized.
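A compressed sketch of that naive flow, assuming a hypothetical Passenger domain class, a hypothetical TitanicData loader and a hand-written extract() method; the LogisticRegression.fit(double[][], int[]) call follows the Smile 2.x API, and Smile models can be persisted with plain Java serialization:

    import java.io.FileOutputStream;
    import java.io.ObjectOutputStream;
    import java.util.List;
    import smile.classification.LogisticRegression;

    public class NaiveTitanicTraining {

        // Hypothetical feature extraction, hard-wired in the application code.
        static double[] extract(Passenger p) {
            return new double[] { p.getAge(), p.getFare(), p.getPclass() };
        }

        public static void main(String[] args) throws Exception {
            List<Passenger> passengers = TitanicData.load();   // hypothetical loader
            double[][] x = passengers.stream().map(NaiveTitanicTraining::extract).toArray(double[][]::new);
            int[] y = passengers.stream().mapToInt(p -> p.hasDied() ? 1 : 0).toArray();

            LogisticRegression model = LogisticRegression.fit(x, y);

            // Only the model is serialized; the extraction logic stays in the code.
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.bin"))) {
                out.writeObject(model);
            }
            // At prediction time the model is deserialized and extract() is called again --
            // if extract() has changed in the meantime, predictions silently go wrong.
        }
    }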

This works as long as we don't change anything in the code, especially in the FeatureExtractor part. If there is any data-dependent preprocessing, the FeatureExtractor must also be serialized. Every change requires deploying the application and the model together. Exchanging an incompatible model without deploying the code, or vice versa, causes a RuntimeException at best, or even produces wrong predictions without any error being shown!

Consistent artifacts, hide details

To avoid such cases, the ETL part and the model are stored together in one single artifact. This could look like the following:
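Here is one possible shape for such an artifact, sketched with hypothetical names (Passenger, FeatureExtractor, PassengerClassifier); only the Classifier interface comes from Smile:

    import java.io.Serializable;
    import smile.classification.Classifier;

    // The fitted preprocessing and the fitted model travel as one serializable object.
    public class PassengerClassifier implements Serializable {
        private static final long serialVersionUID = 1L;

        private final FeatureExtractor<Passenger> featureExtractor;  // fitted ETL steps
        private final Classifier<double[]> model;                    // fitted Smile model

        public PassengerClassifier(FeatureExtractor<Passenger> featureExtractor,
                                   Classifier<double[]> model) {
            this.featureExtractor = featureExtractor;
            this.model = model;
        }

        // The only public entry point: prediction directly from the domain object.
        public int predict(Passenger passenger) {
            return model.predict(featureExtractor.extract(passenger));
        }
    }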

A developer working with the trained model should not have to deal with all the ETL details; what we want is a classifier that can be used directly on the domain objects. As a developer, it should be transparent which steps are involved, and it should be possible to access and change every step, but it should not be necessary to wire all the steps together manually. With this solution, the only breaking change is a change to the Passenger class, and that will likely be noticed much earlier (via the serialVersionUID) instead of producing wrong predictions.
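On the prediction side, usage then boils down to one deserialization and one call on the domain object (again using the hypothetical types from the sketch above):

    import java.io.FileInputStream;
    import java.io.ObjectInputStream;

    public class PredictionExample {

        // One artifact in, one prediction out; no arrays, scalers or dictionaries exposed.
        public static int predictDied(Passenger passenger) throws Exception {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("titanic-classifier.bin"))) {
                PassengerClassifier classifier = (PassengerClassifier) in.readObject();
                return classifier.predict(passenger);
            }
        }
    }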

Webservice or API

It is clear that predictions for the end user should be accessible through a web service or some other API that hides all these internals. Still, someone has to develop and maintain these services, and that is the target audience of this post.

Inspired from here
