TensorFlow Serving

Alex Egg,

In the general case, if you want to serve your TensorFlow model (or any ML model or software) in production, you just need to wrap it in some type of web API. However, presumably to unify the community, Google has released TensorFlow Serving, which may provide some standards for the deployment and serving process.

TF Serving is web-server software which serves TensorFlow models via a gRPC API. If you export your TF model in a certain way, TF Serving can load it and serve it to inference requests in a managed way. The process has 3 main steps:

  1. Model Export/Serialization
  2. TF Serving Server Setup
  3. Client Setup

Serialization

Serving knows how to deserialize TensorFlow models exported using the SavedModelBuilder API. Once you are done training your model and have a graph and variables in memory, you must export the model by serializing it to disk using the add_meta_graph_and_variables method. Additionally, you must tell Serving which tensors/ops to expose to the client. This requires two steps:

  1. Create a SignatureDef for your model
  2. Use SavedModelBuilder to save your model with the SignatureDef from step 1

See example below:

def export(self, path):
    import os
    from datetime import datetime

    from tensorflow.python.saved_model import utils
    from tensorflow.python.saved_model import signature_constants
    from tensorflow.python.saved_model import signature_def_utils
    from tensorflow.python.saved_model import builder

    # use the current unix timestamp as the model version
    version = int(datetime.now().strftime("%s"))
    export_path = os.path.join(tf.compat.as_bytes(path), tf.compat.as_bytes(str(version)))
    print('Exporting trained model to', export_path)

    _builder = builder.SavedModelBuilder(export_path)

    # Build the signature_def_map: expose the filename input tensor and the
    # fc6 embedding output tensor to clients.
    filenames = self.input_batch
    fc6 = self.end_points['vgg_16/fc6']

    prediction_inputs = utils.build_tensor_info(filenames)
    prediction_outputs = utils.build_tensor_info(fc6)

    # these constants are literally the strings "inputs" and "outputs";
    # a multi-headed model could expose several outputs here, e.g. fc6 and softmax
    inputs_map = {signature_constants.PREDICT_INPUTS: prediction_inputs}
    outputs_map = {signature_constants.PREDICT_OUTPUTS: prediction_outputs}

    prediction_signature = signature_def_utils.build_signature_def(
        inputs=inputs_map,
        outputs=outputs_map,
        method_name=signature_constants.PREDICT_METHOD_NAME)

    # add_meta_graph_and_variables needs a live session so it can read the
    # current variable values, hence the `with` block
    with self.session as sess:
        _builder.add_meta_graph_and_variables(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: prediction_signature},
            clear_devices=True)

    _builder.save(as_text=True)
    print('Done exporting!')
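
With this method on whatever class wraps your graph, exporting is then a one-liner (the model object below is hypothetical, standing in for an instance that has the input_batch, end_points, and session attributes used above):

model.export("model/export")  # writes model/export/<unix-timestamp>/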

This will export the graph and session variables to a [version] folder under the export path, e.g. the model below has two versions exported:

$ tree model/export/
model/export/
├── 1509295157
│   ├── saved_model.pbtxt
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
└── 1509295446
    ├── saved_model.pbtxt
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index
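
Before pointing Serving at the directory, it can be worth sanity-checking an export by loading it back into a fresh session. This is a minimal sketch, assuming a TF 1.x install and using one of the version folders from the tree above:

import tensorflow as tf

export_dir = "model/export/1509295157"  # one of the version folders above

with tf.Session(graph=tf.Graph()) as sess:
    # load the graph and variables that were tagged SERVING at export time
    meta_graph = tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], export_dir)
    # print the signature registered under the default serving key
    print(meta_graph.signature_def[
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY])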

Server

If you point TF Serving at this export directory, it will load the newest model version into memory. You start a server by running the Serving binary, tensorflow_model_server, and passing in the respective parameters:

tensorflow_model_server --port=9000 --model_name=image-search --model_base_path=/model/export/

At this point the server is listening on port 9000 for gRPC calls from the client.

Client

Unfortunately, to hit the server you need a client that speaks gRPC. The rest of the world would use JSON/REST; however, we're dealing with a Google product :), so the client is a lot more complicated than I would hope.

from __future__ import print_function
from grpc.beta import implementations
import tensorflow as tf

from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

host, port = "localhost:9000".split(':')
channel = implementations.insecure_channel(host, int(port))
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

# Send request -- see prediction_service.proto for gRPC request/response details.
image_path = "model/tmp/alexprofile.jpeg"
request = predict_pb2.PredictRequest()
request.model_spec.name = 'image-search'  # must match --model_name on the server
request.model_spec.signature_name = 'serving_default'  # DEFAULT_SERVING_SIGNATURE_DEF_KEY used at export
# the exported signature takes filenames as input, so send the image path under the "inputs" key
request.inputs['inputs'].CopyFrom(tf.contrib.util.make_tensor_proto(image_path, shape=[1]))
result = stub.Predict(request, 10.0)  # 10 second timeout
print(result)
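
The result is a PredictResponse protobuf whose outputs map is keyed by the names from the export signature ("outputs" here, i.e. PREDICT_OUTPUTS). To work with the fc6 embedding as an array, you can convert the returned TensorProto; a small sketch, assuming tf.contrib.util.make_ndarray is available in your TF 1.x install:

# 'outputs' is the PREDICT_OUTPUTS key used in the export signature above
fc6_embedding = tf.contrib.util.make_ndarray(result.outputs['outputs'])
print(fc6_embedding.shape)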

Architecture

A more formal study of Serving is well served by an understanding of its internal API.

Figure 1: TF Serving API

I would like to highlight the yellow portions of Figure 1, which are extensible plugins. For example, there is a File System Source plugin and a TensorFlow Servable plugin. These come with Serving and allow it to load serialized TF models from disk. Other plugins could be an S3 Source plugin or a Scikit-Learn Servable plugin.

Extensions

To run your model on Serving you need to define a Servable plugin. Serving, of course, comes with a TensorFlow plugin out-of-the-box. So, in theory, you could write a Servable plugin for any ML model (Scikit-Learn, XGBoost, etc.) and serve it using TF Serving's infrastructure and best practices. A possible usage scenario could be a low-latency deployment of a trained XGBoost model where TensorFlow Serving uses a custom servable:

xgboost-servable

In the figure above, the XGBoost Servable is something the developer would have to write themselves: TensorFlow Serving allows custom servables and XGBoost offers a C++ API, but in practice this hasn't been done yet. A sketch of the contract such a servable would have to fulfil follows below.
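
TF Serving's real extension points (Servable, Loader, SourceAdapter) are C++ interfaces, so the Python below is a purely hypothetical sketch of the contract an XGBoost servable would have to fulfil, not actual TF Serving API:

# HYPOTHETICAL sketch: illustrates the load/unload/predict contract a custom
# XGBoost servable would provide; the real plugin would be written in C++.
import xgboost as xgb

class XGBoostServable(object):
    """Wraps one version of an XGBoost model, the way a Loader wraps a Servable."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.booster = None

    def load(self):
        # called by the manager when this version should be brought into memory
        self.booster = xgb.Booster()
        self.booster.load_model(self.model_path)

    def unload(self):
        # called when a newer version supersedes this one
        self.booster = None

    def predict(self, features):
        # called per inference request, analogous to a TF session run
        return self.booster.predict(xgb.DMatrix(features))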
