While most articles about deep learning focus on the modeling part, only a few cover how to deploy such models to production. Some of them (especially on towardsdatascience) say “production”, but they often simply take the unoptimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.
The “production” approach
If you search for how to deploy TensorFlow, Keras or PyTorch models into production, you find a lot of good tutorials, but sometimes you also come across very simple examples that claim to be production ready. These examples typically take the plain Keras model, wrap it in a Flask web server and containerize everything into a docker container, serving predictions with plain Python. The code for these “production” Flask webservers looks like this:
    from flask import Flask, jsonify, request
    from tensorflow import keras

    app = Flask(__name__)
    model = keras.models.load_model("model.h5")

    @app.route("/", methods=["POST"])
    def index():
        data = request.json
        # preprocess() is a placeholder for whatever input preprocessing the model needs
        prediction = model.predict(preprocess(data))
        return jsonify({"prediction": str(prediction)})
Furthermore, they often show how to containerize the Flask server together with your model into a docker image. These approaches also claim that they scale easily by simply increasing the number of docker instances.
Now let us recap what happens here and why it is not “production” grade.
Not optimizing models
First, the model is usually used as it is, meaning the Keras model from the example was simply exported with model.save(). Such a model still carries the training-only overhead (for example the optimizer state) that is needed to continue training but not for inference. Also, the model is neither pruned nor quantized. As a result, unoptimized models have higher latency, need more memory and compute, and are larger in terms of file size.
Example with EfficientNet-B5:
- H5 Keras model: 454 MByte
- Optimized TensorFlow model (no quantization): 222 MByte
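A minimal sketch of the kind of re-export behind the second number (file names and the “efficientnet” model name are placeholders; the numbered subfolder is the version layout the TensorFlow model server, introduced below, expects):

    from tensorflow import keras

    # Load the full training checkpoint (H5 file incl. optimizer state).
    model = keras.models.load_model("model.h5")

    # Re-export for inference only: SavedModel format, without the optimizer state.
    # The numbered subfolder ("1") is the version layout TensorFlow Serving expects.
    model.save("export/efficientnet/1", include_optimizer=False, save_format="tf")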
Using Flask and the Python API
The next problem is that plain Python and Flask are used to load the model and serve predictions, which brings a whole set of problems with it.
First, let’s look at the worst thing you can possibly do: loading the model for each request. In the code example above, the model is loaded once when the script starts, but other tutorials move this part into the predict function. That means the model is loaded from disk every single time you make a prediction. Please do not do that.
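To make the difference concrete, here is a minimal sketch of both variants side by side (preprocess() stands in for whatever input preprocessing your model needs):

    from flask import Flask, jsonify, request
    from tensorflow import keras

    app = Flask(__name__)

    # Good: load the model once, when the process starts.
    model = keras.models.load_model("model.h5")

    @app.route("/bad", methods=["POST"])
    def predict_bad():
        # Anti-pattern: reads hundreds of MByte from disk and rebuilds the model
        # on every single request.
        local_model = keras.models.load_model("model.h5")
        # preprocess() is a placeholder for your input preprocessing.
        return jsonify({"prediction": str(local_model.predict(preprocess(request.json)))})

    @app.route("/good", methods=["POST"])
    def predict_good():
        # Reuses the model that was loaded at startup.
        return jsonify({"prediction": str(model.predict(preprocess(request.json)))})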
That being said, let’s look at Flask. Flask includes a powerful and easy-to-use webserver for development. On the official website, you can read the following:
While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.
That said, they also say you can run Flask as a WSGI app on e.g. Google App Engine. However, many tutorials use neither Google App Engine nor NGINX; they just take the development server as it is and put it into a docker container. And even when they do use NGINX or another webserver, they usually turn off multithreading completely.
Let’s look a bit deeper into the problem here. TensorFlow manages the compute resources (CPU, GPU) for you. When you load a model and call predict, TensorFlow claims these resources to make the prediction; while this happens, the resource is in use, aka locked. As long as your webserver serves only a single request at a time, you are fine: the model was loaded in this thread and predict is called from the same thread. But once you allow more than one request at a time, your webserver stops working, because you simply cannot access a TensorFlow model from different threads while the resource is already in use. So in this setup you cannot process more than one request at once. Doesn’t really sound scalable, right?
Example:
- Flask development web server: 1 simultaneous request
- TensorFlow model server: parallelism configurable
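You can observe this yourself by firing a handful of concurrent requests at the Flask endpoint from above and watching the latencies pile up. A rough sketch (URL and payload are placeholders for your setup):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:5000/"   # the Flask endpoint from above
    PAYLOAD = {"image": "..."}       # placeholder for your actual request body

    def timed_request(_):
        start = time.time()
        requests.post(URL, json=PAYLOAD)
        return time.time() - start

    # 8 clients hitting the server at the same time, 32 requests in total.
    with ThreadPoolExecutor(max_workers=8) as pool:
        latencies = list(pool.map(timed_request, range(32)))

    print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")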
Scaling “low-load” instances with docker
Ok, so the webserver does not scale, but what about scaling the number of webservers? In a lot of examples this is presented as the solution to the scaling problem of single instances. There is not much to say about it: sure, it works. But scaling this way wastes money, resources and energy. It’s like having a truck, putting a single parcel in it and, as soon as there are more parcels, getting another truck instead of using the existing truck smarter.
Example latency:
- Flask serving as shown above: ~1 s per image
- TensorFlow model server (no batching, no GPU): ~250 ms per image
- TensorFlow model server (no batching, GPU): ~120 ms per image
Not using GPUs/TPUs
GPUs made deep learning possible because they can run operations massively in parallel. Yet when docker containers are used to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is orders of magnitude slower on CPU machines, so latency becomes a big problem. Even with powerful CPU instances you will not achieve results comparable to a small GPU instance.
BTW: In general it is possible to use GPUs in docker, if the host has the correct driver installed.
Example costs:
- 2 CPU instances (16 cores, 32 GByte, a1.4xlarge): $0.816/h
- 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): $0.75/h
It’s already solved
As you can see, loading a trained model and putting it into a Flask docker container is not an elegant solution. If you want deep learning in production, start from the model, then think about the webserver and finally about scaling the instances.
Optimize the model
Unfortunately, optimizing a model for inference is not that straightforward, but it can easily reduce inference time by multiples, so it is worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but requires you to convert your model into either an estimator or a TensorFlow graph (SavedModel format) if you come from a Keras model. TensorFlow itself has a tutorial for this. To optimize further, the next step is to apply model pruning and quantization, which removes insignificant weights and reduces the model size.
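As a concrete example of the quantization step, TensorFlow’s post-training quantization via the TFLite converter is probably the easiest entry point. Note that this produces a TFLite model, which targets CPU and edge deployments rather than the model server discussed below, so treat it as one option among several (paths are placeholders, continuing from the export sketch above):

    import tensorflow as tf

    # Start from the inference-only SavedModel exported earlier.
    converter = tf.lite.TFLiteConverter.from_saved_model("export/efficientnet/1")

    # Post-training quantization: weights are stored in lower precision,
    # which shrinks the file and usually speeds up CPU inference.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model_quantized.tflite", "wb") as f:
        f.write(tflite_model)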
Use model servers
When you have an optimized model, you can look at the different model servers that are meant for deep learning models in production. For TensorFlow and Keras, TFX offers the TensorFlow model server (TensorFlow Serving). There are also others like TensorRT, Clipper, MLFlow and DeepDetect.
The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models and to deploy a new version without downtime, while still being able to use the old one. Besides the gRPC API it also offers an optional REST API endpoint. The throughput is orders of magnitude higher than with a Flask API, as it is written in C++ and uses multithreading. Additionally, you can enable batching, where the server combines multiple single predictions into one batch for very high load settings. And finally, you can put it into a docker container and scale even further.
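Once the model server is running, getting predictions over the REST API takes only a few lines in any language. A small Python sketch (host, port and the “efficientnet” model name depend on how you started the server; 8501 is the default REST port):

    import numpy as np
    import requests

    # Dummy input; replace with your real preprocessing
    # (EfficientNet-B5 expects 456x456 RGB images).
    batch = np.zeros((1, 456, 456, 3), dtype=np.float32)

    # The REST API expects a JSON body with an "instances" (or "inputs") field.
    response = requests.post(
        "http://localhost:8501/v1/models/efficientnet:predict",
        json={"instances": batch.tolist()},
    )
    response.raise_for_status()
    predictions = response.json()["predictions"]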
Hint: tensorflow_model_server is available on every AWS Deep Learning AMI; with TensorFlow 2 it is called tensorflow2_model_server.
Use GPU instances
And lastly, I would recommend using GPUs or TPUs in inference environments. Latency is much lower and throughput much higher with such accelerators, while saving energy and money. Note that they are only utilized if your software stack can make use of them (optimized model + model server). On AWS you can look into Elastic Inference or just use a g3s.xlarge instance.