Three ways of deploying deep learning models into production

After weeks of training and optimizing a neural net, at some point it might be ready for production. Most deep learning projects never reach this point; for the rest, it’s time to think about frameworks and the technology stack. There are several frameworks which promise easy productionizing of deep neural nets. Some might be a good fit, others are not. This post is not about these paid solutions; it is about how to bring your deep learning model into production without third-party tools. It shows an example with Flask as a RESTful API and an example of deploying a Keras model into the Java enterprise ecosystem for batch predictions. This post focuses on deep neural networks; for traditional ML, take a look at this series.

Out-of-the-box solutions

First let’s look at some tools which might be feasible for most projects. The straightforward solution for TensorFlow models is TensorFlow Serving; basically, you use the Google Cloud ML Engine for model deployment, which offers some convenience tools and functions. The cloud solution has some limitations: the maximum model size is 250MB (afaik), it’s quite expensive, and furthermore it serves models via a REST API, which introduces network overhead. The same is true for Amazon SageMaker and Azure.
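
For reference, serving via such a REST API typically means posting JSON to an endpoint like the one TensorFlow Serving exposes. Here is a minimal sketch; host, port, model name and input shape are placeholders:

import requests

# TensorFlow Serving exposes its REST API on port 8501 by default
url = "http://localhost:8501/v1/models/mymodel:predict"

# "instances" is the payload key expected by the TF Serving REST API
payload = {"instances": [[0.1, 0.2, 0.3]]}

response = requests.post(url, json=payload)
print(response.json()["predictions"])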

The simplest way of serving predictions from a trained model is using the Jupyter Kernel Gateway, which allows headless access to Jupyter notebooks. But it is obviously not really meant for production usage.
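
Roughly, the Kernel Gateway's notebook-http mode turns annotated notebook cells into HTTP endpoints. The cell below and the launch command are illustrative, not a tested setup; the payload format is an assumption:

# A cell in predict.ipynb, annotated so Kernel Gateway exposes it as POST /predict
# POST /predict
import json

req = json.loads(REQUEST)               # REQUEST is injected by Kernel Gateway
instances = req['body']['instances']    # assuming a JSON body like {"instances": [...]}
preds = model.predict(instances)        # model loaded in an earlier cell
print(json.dumps({"predictions": preds.tolist()}))

# Launched with something like:
# jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.notebook_http \
#                       --KernelGatewayApp.seed_uri=predict.ipynb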

Serving with Flask as a RESTful API

The most general solution is to use any Python web server and wrap the model in a REST API. The straightforward solution with Flask does not need much explanation: it is a simple HTTP service for on-demand predictions. Here is a very simple example web app, which can then be deployed on cloud services like AWS. Here is a good tutorial.

import os

# use only CPU
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import flask
import numpy as np
from flask import Flask
from tensorflow.python.keras.models import load_model

init = None
model = None
app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    init_if_necessary()
    data = {}
    if flask.request.method == "POST":
        # read the input from the JSON request body
        image = np.array(flask.request.get_json()["instances"])
        preds = model.predict(image)
        data["predictions"] = preds.tolist()
        data["label"] = ['Class A', 'Class B', ...]

    return flask.jsonify(data)


def init_if_necessary():
    # lazily load the model on the first request
    global init, model
    if not init:
        print('init')
        model = load_model('mymodel.model')
        init = True
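
To test the service, you can post a payload to the endpoint, e.g. with requests. The payload shape below matches the assumptions made in the sketch above:

import requests

# hypothetical input; the endpoint above expects a JSON body with "instances"
payload = {"instances": [[0.1, 0.2, 0.3]]}
response = requests.post("http://localhost:5000/predict", json=payload)
print(response.json())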

Serving models directly in Java

All these options involve third-party frameworks, HTTP or some other network communication. In applications where you predict millions of examples, it’s better to do the predictions directly on the local machine.

There is deeplearning4j, which integrates directly into the Java world. Unfortunately, I cannot really recommend it: it moves the ML code to Java, and honestly ML in Java is no fun, data wrangling is too verbose, plus you must rewrite the code. Furthermore, it does not support important layers like GRU or TimeDistributed. For simple ANNs it is ok, but researchers do not use it, so there are no state-of-the-art models available. Be aware of that before moving to such a framework!

There is also a Java API for TensorFlow, which can be used to load SavedModels. But once you are using custom models with lots of preprocessing, it gets complicated, because that code must be moved from Python to Java. It might be easier to just call a Python script.
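
If you do want to go down that road, the model first has to be exported in the SavedModel format on the Python side. A minimal sketch, assuming TensorFlow 2.x with the Keras API (paths are placeholders):

import tensorflow as tf
from tensorflow.keras.models import load_model

# load the trained Keras model (placeholder path)
model = load_model('mymodel.model')

# export it as a SavedModel, the format the Java API and TensorFlow Serving load
tf.saved_model.save(model, 'export/mymodel/1')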

Batch predictions

Before looking at how to make predictions locally, let’s think about what deep learning frameworks actually do. What is TensorFlow doing? It is basically a library for parallel computing, and it can utilize GPUs through CUDA but also SSE, AVX, etc. on CPUs. Python is just the API to access the C++ core.

The Java API is experimental and very small; it does not offer a lot of functionality. However, it is not difficult to call Python from Java directly. In this case we can write everything related to the ML part in Python and just invoke the script from Java. Moreover, we can use Java’s performance for preprocessing. We keep all the ML stuff in Python and stay in the Java ecosystem for everything else.

Calling Python from Java

Example: a Java backend is running batch jobs and we want to predict a lot of examples at once, in this case applying a deep learning model to classify text. Invoking an HTTP service for every example is not feasible. The only thing we have to do before running Python scripts from Java is to install Python and all the required libraries on the machine where the Java application is running.

Calling a Python script from Java is straightforward. In a batch processing scenario, we export the preprocessed data to CSV, JSON or Avro. This data is then used by Python for the predictions, and we call the Python script either directly or via a bash script.

public void run(Path predictionsFilePath) {
    log.info("Starting prediction");
    // target file data.json
    Path dataFile = Paths.get("data.json");
    predictionsFilePath.toFile().deleteOnExit();

    // Java does the text preprocessing and exports a JSON file with the data
    preProcessor.exportJson(dataFile);
    log.info("Executing Python based prediction");

    // Run the bash script, which in turn calls the Python script, passing the data file,
    // the file we read the predictions from and the model to be used
    ProcessBuilder pr = new ProcessBuilder("run.sh",
            String.format("predict --modelpath %s --datafile %s --targetfile %s",
                    "latest_model", dataFile.toAbsolutePath(), predictionsFilePath.toAbsolutePath()));
    pr.directory(new File("/mypath"));
    pr.redirectOutput(ProcessBuilder.Redirect.INHERIT);
    pr.redirectError(ProcessBuilder.Redirect.INHERIT);

    try {
        Process start = pr.start();
        // Timeout for the process: we assume something went wrong
        // if the predictions take longer than 15 minutes
        start.waitFor(15, TimeUnit.MINUTES);
        processPredictions(predictionsFilePath);
    } catch (IOException | InterruptedException e) {
        throw new RuntimeException("Execution failed", e);
    }
}

public void processPredictions(Path predictionsFilePath) throws IOException {
    CsvToBean<MyType> csvToBean = new CsvToBeanBuilder<MyType>(Files.newBufferedReader(predictionsFilePath))
            .withType(MyType.class)
            .withIgnoreLeadingWhiteSpace(true)
            .build();

    List<MyType> predictions = csvToBean.parse();
    for (MyType bean : predictions) {
        // do something with the predictions
    }
    log.info(String.format("Successfully processed %s predictions", predictions.size()));
}

The Java code calls this shell script, where we can set up additional things and finally call the Python script, passing the parameters:

#!/bin/bash
python "main.py" predict $@

In our Python script we can parameterize the prediction and training scenario, so both can share the same codebase. Here is a short example of the idea. We load an NLP model, which consists of a TensorFlow model plus a dictionary (latest_model.model, latest_model.dict).

if self.opts.operation == 'predict':
    print('----------- Running Prediction ---------')
    print('Using given model: {}'.format(self.opts.modelname))

    # load the trained Keras model and the word dictionary belonging to it
    model = load_model(self.opts.modelname + '.model',
                       custom_objects={})
    with open(self.opts.modelname + '.dict', 'rb') as fp:
        word_dict = pickle.load(fp)

    predictor = Predictor(self.opts.targetfile,
                          model, word_dict, self.opts.datafile)
    predictor.predict()
elif self.opts.operation == 'train':
    print('----------- Running Training ---------')
    trainer = Trainer()
    trainer.train(self.get_training_data())
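
The self.opts object above simply holds the parsed command-line arguments. Here is a minimal sketch of how they could be parsed with argparse, matching the flags passed from the Java side; --modelpath is mapped to the modelname attribute used above, and the exact names are assumptions:

import argparse

def parse_opts():
    parser = argparse.ArgumentParser()
    # positional argument selecting the scenario, passed first by run.sh
    parser.add_argument('operation', choices=['predict', 'train'])
    # flags passed from the Java side; --modelpath is stored as opts.modelname
    parser.add_argument('--modelpath', dest='modelname', default='latest_model')
    parser.add_argument('--datafile')
    parser.add_argument('--targetfile')
    return parser.parse_args()

# usage: opts = parse_opts(); then opts.operation, opts.modelname, opts.datafile, opts.targetfile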
