Setting up a server for NLP models in production

by abas

In machine learning, serving a trained model means making it available so that people can send it their data and get predictions back. It is a fundamental step in bringing any NLP research outcome to production.

Here we will see how to set up a high-performance inference server capable of running models saved in different formats. We will be using the TensorRT Inference Server (TRTIS from now on), developed by NVIDIA. I’ll show how to deploy a model created with TensorFlow Keras, but TRTIS supports many popular ML model serialization formats, such as ONNX, PyTorch, and Caffe2.

Here is what we’ll be doing:

- train a sentiment analysis model and serialize it to disk (a minimal sketch of this step follows the list)
- set up TRTIS using Docker
- deploy the sentiment analysis model
- send a few requests via HTTP to the inference server and get back the sentiment predictions
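
To make the first step concrete, here is a minimal sketch of what training and serializing such a model can look like. The toy architecture, the training call, and the repository path `models/sentiment/1/model.savedmodel` are illustrative assumptions rather than the exact setup used later in the post; the general idea is that a TensorFlow SavedModel placed in a versioned directory can be picked up by TRTIS from its model repository.

```python
# Minimal sketch; architecture and paths are illustrative assumptions.
import tensorflow as tf

# Tiny sentiment classifier: padded sequences of token ids in,
# probability of positive sentiment out.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=3)  # train on your labelled reviews

# Serialize as a TensorFlow SavedModel. TRTIS loads models from a
# repository laid out as <model-repository>/<model-name>/<version>/,
# so this directory can be dropped straight into the repository.
tf.saved_model.save(model, "models/sentiment/1/model.savedmodel")
```

We will go through each of these steps in detail below.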