[How-To] Workload-aware deployment of deep learning models

After weeks of training and optimizing a neural net, at some point it might be ready for production. Most deep learning projects never reach this point, and for the rest it’s time to think about frameworks and the technology stack. In recent years, many fully integrated platforms have been created to solve this problem. Some might be a good fit, most are not. This post is not about these paid solutions; it is about how to bring your deep learning model into production without them, and it shows different approaches depending on the use case. The focus is on deep neural networks; for traditional ML, take a look at this series.

Batch or API

In my opinion there are only two different usage patterns for machine learning models. If you have users directly interacting with services and you need on-demand predictions from your model, you need an API approach, with low latency as your main goal.

On the other hand, you might have ETL systems or scheduled cron jobs interacting with your model, where you predict millions of examples. Here your main concern is probably throughput.

API solutions

First let’s look at API solutions, as they can also cover the batch scenarios, even if that is not the best way to do it. We will start with the simplest, non-scalable options and work up to the “real” production solutions.

The hacky notebook server

The simplest way of serving predictions from a trained model is using Jupyter Kernel Gateway, which allows headless access to Jupyter notebooks. But it is obviously not really meant for production usage, just for demonstration.
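As a rough sketch of what this can look like: in the gateway's notebook-http mode, a notebook cell becomes an HTTP handler when its first line is a comment naming the verb and path, and the request arrives as a JSON string in a variable called REQUEST. The route, the payload layout with a "features" key and the fake_model_predict function are assumptions made up for illustration; the fallback value for REQUEST is only there so the cell also runs in a plain notebook.

```python
# POST /predict
import json


def fake_model_predict(features):
    # Hypothetical stand-in for a real model.predict(...) call
    return [sum(features)]


def handle(request_json, predict_fn=fake_model_predict):
    """Turn the gateway's request string into a JSON response string."""
    req = json.loads(request_json)
    features = req["body"]["features"]  # assumed payload: {"features": [...]}
    return json.dumps({"predictions": predict_fn(features)})


# Kernel Gateway injects REQUEST; the fallback is only for running the cell by hand
REQUEST = globals().get("REQUEST", json.dumps({"body": {"features": [1.0, 2.0]}}))
print(handle(REQUEST))  # whatever the cell prints becomes the response body
```

Keeping the logic in a small function like handle makes it easy to try out in the notebook before exposing it through the gateway.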

Serving with Flask as a RESTful API

A common solution is to use a Python web server and wrap the model in a REST API. Flask does not need much explanation: it gives you a simple HTTP service for on-demand predictions.

The code below shows a very simple example web app, which can then be deployed on servers and cloud services like AWS Elastic Beanstalk (tutorial). However, this approach is not really production-ready, nor is it scalable. It should still be mentioned here, as it is pretty easy to set up.

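import numpy as np
from flask import Flask, jsonify, request
from tensorflow.keras.models import load_model

# Load the trained model once at startup, not on every request
model = load_model('mymodel.h5')
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Assumption: the client sends JSON like {"image": [...]} holding one
    # preprocessed input in the shape the model expects
    image = np.asarray(request.get_json()["image"], dtype="float32")
    preds = model.predict(image[np.newaxis, ...])  # add the batch dimension

    data = {
        "predictions": preds.tolist(),  # NumPy arrays are not JSON serializable
        "label": ['Class A', 'Class B'],  # ... one entry per class
    }
    return jsonify(data)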

TensorFlow Extended (TFX) and model servers

Now we reach the point where we can talk about production-readiness. With TensorFlow Serving, the serving component of TensorFlow Extended (TFX), we have a powerful model server written in C++. There are also others like TensorRT Inference Server, Clipper, MLflow and DeepDetect.

The TensorFlow model server offers many features. It serves multiple models at the same time while keeping the overhead to a minimum. It lets you version your models and deploy a new version without downtime, while still being able to use the old version. In addition to the gRPC API, it has an optional REST API endpoint. The throughput is orders of magnitude higher than with a Flask API, as it is written in C++ and uses multithreading. You can also enable batching, where the server groups multiple single predictions into one batch for very high load settings. And finally, you can put it into a Docker container and scale even more.
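To give an idea of the client side, here is a minimal sketch of calling the REST endpoint from Python using only the standard library. It assumes a model served under the name mymodel on localhost, with the REST API on port 8501 (the port used in the TensorFlow Serving Docker examples); the host, model name, version number and input values are placeholders, not something from a real deployment.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Build URL and JSON body for TensorFlow Serving's REST predict call."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"  # pin a version; omit it to get the latest
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body


def predict(host, model_name, instances, version=None, timeout=5.0):
    """Send the request and return the "predictions" list from the reply."""
    url, body = build_predict_request(host, model_name, instances, version)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["predictions"]


# Only builds the request, so this runs without a model server
url, body = build_predict_request("localhost", "mymodel", [[1.0, 2.0]], version=2)
print(url)
```

Pinning the version in the URL is what lets old and new model versions be used side by side during a rollout; a call like predict("localhost", "mymodel", [[1.0, 2.0]]) would hit whatever version the server currently treats as the latest.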

Batch scenarios

All these options involve web servers and HTTP or some other form of network communication. In applications where you predict millions of examples, it’s better to do the predictions directly on a local machine.

In a batch processing scenario, we export data to CSV, JSON, Avro or any other format and process it. The processing is usually scheduled at a specific time or triggered by an event.
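A minimal sketch of such a batch job, assuming a headerless CSV of numeric features: it streams the file in fixed-size chunks so that millions of rows never have to fit into memory at once. The predict_batch function is a hypothetical stand-in for a real model.predict call, and the file layout is an assumption for illustration.

```python
import csv
from itertools import islice


def predict_batch(rows):
    # Hypothetical stand-in for model.predict on a whole batch of rows
    return [sum(float(v) for v in row) for row in rows]


def predict_file(in_path, out_path, batch_size=1024, predict_fn=predict_batch):
    """Stream a headerless numeric CSV through the model in fixed-size batches."""
    n_rows = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        while True:
            batch = list(islice(reader, batch_size))  # next chunk of rows
            if not batch:
                break
            for row, pred in zip(batch, predict_fn(batch)):
                writer.writerow(row + [pred])  # input columns plus prediction
            n_rows += len(batch)
    return n_rows
```

Predicting on whole chunks instead of single rows is where the throughput comes from; the batch size is a knob to tune against the available memory.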

Bash and Crontab

If our application is running on a Linux machine, we already have a powerful tool to schedule workflows, and we can even trigger an execution right away via bash.

We create a bash file that runs our Python classifier and passes all arguments on to the Python script.

python "main.py" predict "$@"

To invoke the script directly from our application, we can now simply call this bash script, e.g.: bash_script --input_file="$HOME/data.csv" (written with $HOME because the shell does not expand ~ in the middle of an argument).
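On the Python side, main.py then has to understand the predict command and the --input_file argument. A minimal sketch with argparse could look like this; only the command and the argument name come from the call above, the rest (the help texts and what happens inside predict) is assumed for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    predict = sub.add_parser("predict", help="run batch predictions on a file")
    predict.add_argument("--input_file", required=True, help="path to the input data")
    return parser


def main(argv=None):
    # argv=None makes argparse read the real command line (sys.argv)
    args = build_parser().parse_args(argv)
    if args.command == "predict":
        # Hypothetical: load the model and score args.input_file here
        print(f"predicting on {args.input_file}")
    return args


# Example call with explicit arguments; a real main.py would call main()
# without arguments so that the values passed by the bash script are used
main(["predict", "--input_file=data.csv"])
```

Using a subcommand leaves room for other entry points such as train or evaluate in the same script later on.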

For scheduled workflows on a local Linux server we can use crontab, which executes a script automatically. To do so, we add a line via crontab -e:

0 0 * * * bash_script --input_file="$HOME/data.csv" >~/log.txt

This line makes cron execute our script every day at midnight, writing its log output to log.txt in our home folder (and overwriting the log of the previous run).


Java deserves its own chapter, as it is one of the most widely used languages.


There is deeplearning4j, which integrates directly into the Java world. Unfortunately, I cannot really recommend it, because it moves the ML code to Java, and honestly ML in Java is no fun: it is awkward for data wrangling and too verbose, plus you must rewrite your code. Furthermore, it does not support important layers like GRU and TimeDistributed (as of 2019). For simple ANNs it’s OK, but researchers do not use it, so there are no state-of-the-art models available.

TensorFlow Java API

There is also a Java API for TensorFlow, which can be used to load SavedModels. Before looking at the Java API, let’s think about deep learning frameworks. What is TensorFlow actually doing? It is basically a library for parallel computing: it can utilize GPUs through CUDA, but also SSE, AVX, etc. on CPUs. Python is just the API to access the C++ core; in the end it’s using highly optimized binaries in the backend.

If you use Java, the same is true. The Java API needs to ship all these binaries. It introduces a huge 145 MB dependency called tensorflow-jni (JNI, the Java Native Interface, is how Java calls native C/C++ libraries). We don’t want a 145 MB binary package in our application, or a 350 MB package with GPU support!

Furthermore, using TensorFlow directly in Java introduces off-heap memory, which means the JVM is no longer in control of that memory and memory leaks can occur. You don’t want to use such libraries on production servers.

Airflow and other schedulers

If you need more control, have multiple machines and complex workflows, you should take a look at tools like Airflow or Rundeck. These tools need a lot of explanation, so they are beyond the scope of this post. There are tons of tutorials on how to use them, so I prefer to just mention them.

Just be aware that such tools also need administration and as long as you don’t need their power, you can stay with simple solutions.

Thanks for the useful info. I also note that you can mix Java and Python, and call native Java deep learning packages using DataMelt https://jwork.org/dmelt/

Nice text, when you tell that the DeepLearning4J move the code to java is it right? because if it generated on python side, the topology of network should be the same right?

Hi Rafael, yes if you import a e.g. keras model, you have the same topology, but you cannot use any custom layers or not supported layers. With “move ml code to java” I was thinking about the code around the model, ETL things, preprocessing and so on, which is much better suited to python than to java. Cheers