Quantization of TensorFlow Object Detection API Models

In this tutorial, we will examine various TensorFlow tools for quantizing object detection models. We start off by giving a brief overview of quantization in deep neural networks, followed by explaining different approaches to quantization and discussing the advantages and disadvantages of using each approach. We will then introduce TensorFlow tools to train a custom object detection model and convert it into a lightweight, quantized model with TFLiteConverter and TOCOConverter. Finally, as a use case example, we will examine the performance of different quantization approaches on the Coral Edge TPU.

Quantization in Neural Networks: The Concept

Quantization, in general, refers to the process of reducing the number of bits that represent a number. Deep neural networks usually have tens or hundreds of millions of weights, represented by high-precision numerical values. Working with these numbers requires significant computational power, bandwidth, and memory. Model quantization addresses this by representing model parameters with low-precision data types, such as int8 and float16, usually without incurring a significant accuracy loss. Storing model parameters with low-precision data types not only saves bandwidth and storage but also results in faster calculations.

The concept of quantization

Image source

Quantization Brings Efficiency to Neural Networks

Quantization improves overall efficiency in several ways. It reduces the memory footprint by storing parameters in 8 or 16 bits instead of the standard 32-bit representation format. For instance, quantizing the AlexNet model to 8 bits shrinks the model size by 75%, from 200MB to only 50MB.

Quantized neural networks consume less memory bandwidth. Fetching numbers in the 8-bit format from RAM requires only 25% of the bandwidth of the standard 32-bit format. Moreover, quantizing neural networks results in 2x to 4x speedup during inference.

Faster arithmetic can be another benefit of quantizing neural networks in some cases, depending on factors such as the hardware architecture. As an example, 8-bit addition is almost 2x faster than 64-bit addition on an Intel Core i7 4770 processor.

These benefits make quantization valuable, especially for edge devices that have modest compute and memory but are required to perform AI tasks in real-time.

Quantizing Neural Networks is a Win-win

By reducing the number of bits that represent a parameter, some information is lost. However, this loss of information incurs little to no degradation in the accuracy of neural networks for two main reasons:

  1. This reduction in the number of bits acts like adding some noise to the network. Since a well-trained neural network is noise-robust, i.e., it can make valid predictions in the presence of unwanted noise, the added noise will not degrade the model accuracy significantly.

  2. There are millions of weight and activation parameters in a neural network that are distributed in a relatively small range of values. Since these numbers are densely spread, quantizing them does not result in losing too much precision.

To give you a better understanding of quantization, we next provide a brief explanation of how numbers are represented in a computer.

Computer Representation of Numbers

Computers have limited memory to store numbers. There are only discrete possibilities to represent the continuous spectrum of real numbers in the representation system of a computer. The limited memory only allows a fixed number of values to be stored and represented, which is determined by the number of bits the representation system works with. Therefore, representing real numbers in a computer involves an approximation and a potential loss of significant digits.

There are two main approaches to store and represent real numbers in modern computers:

  1. Floating-point representation: The floating-point representation of numbers consists of a mantissa and an exponent. In this system, a number is represented in the form of mantissa * base^exponent, where base is a fixed number. In this representation system, the position of the decimal point is specified by the exponent value. Thus, this system can represent both very small values and very large numbers.

  2. Fixed-point representation: In this representation format, the position of the decimal point is fixed. The numbers share the exponent, and they vary in the mantissa portion only.

floating-point and fixed-point representations

Image source

The amount of memory required for the fixed-point format is much less than the floating-point format since the exponent is shared between different numbers in the former. However, the floating-point representation system can represent a wider range of numbers compared to the fixed-point format.
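As a rough illustration of the two formats, the sketch below uses Python's math.frexp to split a float into its mantissa and exponent, and then stores a few values as integers that all share one exponent. The helper names and the choice of 8 fractional bits are assumptions made for this example only.

```python
import math

# Floating-point view: each number carries its own exponent.
# math.frexp returns (m, e) such that x == m * 2**e, with 0.5 <= |m| < 1.
mantissa, exponent = math.frexp(6.5)

# Fixed-point view: every number shares the same exponent (here 2**-8),
# so only an integer mantissa has to be stored per value.
FRACTIONAL_BITS = 8  # assumed for this example
SCALE = 2 ** -FRACTIONAL_BITS


def to_fixed(x):
    """Return the integer mantissa that approximates x with the shared scale."""
    return round(x / SCALE)


def from_fixed(m):
    """Recover the approximate real value from an integer mantissa."""
    return m * SCALE


values = [0.1, 1.5, 3.14159]
fixed = [to_fixed(v) for v in values]          # integers only
restored = [from_fixed(m) for m in fixed]      # close to, but not always equal to, values
```

Values that are not multiples of the shared step, such as 0.1, come back slightly changed, which is the approximation error discussed above.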

The Precision of Computer Numbers

The precision of a representation system depends on the number of values it can represent precisely, which is 2^b, where b is the number of bits. For example, an 8-bit binary system can represent 2^8 = 256 numbers precisely. The rest of the numbers are rounded to the nearest of these 256 values. Thus, the more bits we can use, the more precise our numbers will be.

It is worth mentioning that the 8-bit representation system in the previous example is not limited to representing integer values from 1 to 256. This system can represent 256 pieces of information in any arbitrary range of numbers.

How to Quantize Numbers in a Representation System

To determine the representable numbers in a representation system with b bits, we subtract the minimum value from the maximum one to calculate r, the range of values. Then, we divide r by 2^b to find u, the smallest unit in this format. The representable numbers in this format are in the form of k * u, where k = 0, 1, ..., 2^b - 1 (that is, 0 to 255 for b = 8). We can think of this as a mapping from the integers 0 to 255 to values in the range r, where k -> k * u. In this representation system, any value between k * u and (k + 1) * u cannot be represented precisely and is approximated by the closest quantized value. However, when quantizing neural networks, it is critical to represent the 0 value precisely (without any approximation error), as explained in this paper.
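The mapping can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not the scheme of any particular library: it uses the common affine form real = scale * (q - zero_point), and it divides the range by 2^b - 1 rather than 2^b so that both ends of the range are representable. The function names are ours, and the range is assumed to be non-empty (x_min < x_max).

```python
def choose_params(x_min, x_max, bits=8):
    """Pick the step size (scale) and the integer that represents real 0.0."""
    # Widen the range if needed so that it always contains 0.0.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    levels = 2 ** bits - 1
    scale = (x_max - x_min) / levels
    zero_point = round(-x_min / scale)  # integer, so 0.0 has no rounding error
    return scale, zero_point


def quantize(x, scale, zero_point, bits=8):
    """Map a real value to the nearest integer level, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to the real value it stands for."""
    return scale * (q - zero_point)


scale, zero_point = choose_params(-1.0, 3.0)
q_zero = quantize(0.0, scale, zero_point)
approx = dequantize(quantize(1.23, scale, zero_point), scale, zero_point)
```

Because the zero point is itself an integer level, dequantizing q_zero returns exactly 0.0, while a value such as 1.23 comes back with an error of at most half a step.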

how to quantize numbers in a representation system

Image source

In the next section, we will explain how we can calculate the range of parameters in a neural network in order to quantize them.

How to Quantize Neural Networks

Quantization changes the current representation format of numbers to a lower-precision format by reducing the number of bits used to represent them. In machine learning, we usually use the floating-point format to represent numbers. By applying quantization, we can change the representation to the fixed-point format and down-sample these values. In most cases, we convert the 32-bit floating-point format to the 8-bit fixed-point format, which gives an almost 4x reduction in memory utilization.
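The memory saving is easy to check with Python's standard array module, which stores numbers in fixed-width machine types. The toy weights and the symmetric scale of 1/127 below are assumptions for illustration. In a real model the reduction is slightly under 4x, because the scale (and zero point) must be stored as well.

```python
from array import array

# Toy float weights in [-1, 1); stand-ins for real model weights.
weights = [0.02 * i - 1.0 for i in range(100)]

float_storage = array("f", weights)  # 32-bit floats, 4 bytes each

scale = 1.0 / 127  # assumed symmetric int8 scale for the range [-1, 1]
int8_values = [max(-127, min(127, round(w / scale))) for w in weights]
int8_storage = array("b", int8_values)  # signed 8-bit integers, 1 byte each

float_bytes = float_storage.itemsize * len(float_storage)
int8_bytes = int8_storage.itemsize * len(int8_storage)
reduction = float_bytes / int8_bytes
```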

There are at least two sets of numerical parameters in each neural network: the set of weights, which are constant numbers (at inference time) learned by the network during the training phase, and the set of activations, which are the output values of activation functions in each layer. By quantizing neural networks, we mean quantizing these two sets of parameters.

As we saw in the previous section, to quantize each set of parameters, we need to know the range of values each set holds and then quantize each number within that range to a representable value in our representation system. While finding the range of weights is straight-forward, calculating the range of activations can be challenging. As we will see in the following sections, each quantization approach deals with this challenge in its own way.

Most of the quantization techniques are applied to inference but not training. The reason is that in each backpropagation step of the training phase, parameters are updated with changes that are too small to be tracked by a low-precision data-type. Therefore, we train a neural network with high-precision numbers and then quantize the weight values.
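A small numeric experiment makes this concrete. The sketch below assumes weights in the range [-1, 1] stored on an 8-bit grid and a gradient step of 1e-4; both numbers are illustrative choices, not values from a real training run.

```python
STEP = 2.0 / 255  # spacing between neighboring 8-bit levels on [-1, 1] (assumed range)


def snap(x):
    """Round x to the nearest representable 8-bit level."""
    return round(x / STEP) * STEP


weight = snap(0.5)
update = 1e-4  # a small gradient step (assumed value)

# Low precision: the update is smaller than half a step, so it is rounded away every time.
low_precision = weight
for _ in range(100):
    low_precision = snap(low_precision + update)

# High precision: the same 100 updates accumulate as expected.
high_precision = weight
for _ in range(100):
    high_precision += update
```

After 100 steps the low-precision weight has not moved at all, while the floating-point weight has shifted by about 0.01.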

Types of Neural Network Quantization

There are two common approaches to neural network quantization: 1) post-training quantization, and 2) quantization-aware training. We will next explain each method in more detail and discuss the advantages and disadvantages of each technique.

Post-training Quantization

The post-training quantization approach is the most commonly used form of quantization. In this approach, quantization takes place only after the model has finished training.

To perform post-training quantization, we first need to know the range of each parameter, i.e., the range of weights and activations. Finding the range of weights is straight-forward since weights remain constant after training has been finished. However, the range of activations is challenging to determine because activation values vary based on the input tensor. Thus, we need to estimate the range of activations. To do so, we provide a dataset that represents the inference data to the quantization engine (the module that performs quantization). The quantization engine calculates all the activations for each data point in the representative dataset and estimates the range of activations. After calculating the range of both parameters, the quantization engine converts all the values within those ranges to lower bit numbers.
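The calibration step can be sketched with a toy one-neuron "network". Everything here (the layer, the weights, and the three samples) is made up for illustration; a real quantization engine does the same bookkeeping for every tensor in the model.

```python
def relu_layer(inputs, weights, bias):
    """Toy layer: ReLU of a weighted sum. Stands in for a real network layer."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)


def estimate_activation_range(samples, weights, bias):
    """Run every representative sample through the layer and track min/max."""
    lo, hi = float("inf"), float("-inf")
    for sample in samples:
        activation = relu_layer(sample, weights, bias)
        lo, hi = min(lo, activation), max(hi, activation)
    return lo, hi


weights, bias = [0.5, -1.0, 2.0], 0.1

# The weight range is known directly from the trained model.
weight_range = (min(weights), max(weights))

# The activation range depends on the inputs, so it is estimated from data.
representative_samples = [[1.0, 2.0, 0.5], [0.0, 0.0, 1.0], [2.0, 1.0, 1.0]]
activation_range = estimate_activation_range(representative_samples, weights, bias)
```

If the representative samples do not resemble the inference data, the estimated range will be off, and activations outside it will be clipped.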

The main advantage of using this technique is that it does not require any model training or fine-tuning. You can apply 8-bit quantization on any existing pre-trained floating-point model without using many resources. However, this approach comes at the cost of losing some accuracy because the pre-trained network was trained regardless of the fact that the parameters will be quantized to 8-bit values after training has been finished, and quantization adds some noise to the input of the model at inference time.

Quantization-Aware Training

As we explained in the previous section, in the post-training quantization approach, training is performed in floating-point precision regardless of the fact that the parameters will later be quantized to lower-bit values. The difference in precision that originates from quantizing weights and activations introduces some error into the network, which propagates through the network by multiplications and additions.

In quantization-aware training, however, we deliberately introduce this quantization error into the model during training to make the model robust to it. Note that, similar to post-training quantization, in quantization-aware training, backpropagation is still performed on floating-point weights to capture the small changes.

In this method, extra nodes that are responsible for simulating the quantization effect will be added. These nodes quantize the weights to lower precision and convert them back to the floating-point in each forward pass and are deactivated during back propagation. This approach will add quantization noise to the model during training while performing backpropagation in floating-point format. Since these nodes quantize weights and activations during training, calculating the ranges of weights and activations is automatic during training. Therefore, there is no need to provide a representative dataset to estimate the range of parameters.
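A fake-quantization node can be sketched as a quantize-then-dequantize function. This is a simplified illustration, not TensorFlow's implementation: the range [x_min, x_max] is passed in by hand here, whereas during training it is tracked automatically, and the backward pass is reduced to passing the gradient through unchanged (an approach often called a straight-through estimator).

```python
def fake_quantize(x, x_min, x_max, bits=8):
    """Forward pass: quantize, then immediately convert back to float.

    The result is still a float, but it only takes values on the low-precision grid.
    """
    levels = 2 ** bits - 1
    scale = (x_max - x_min) / levels
    clipped = max(x_min, min(x_max, x))
    q = round((clipped - x_min) / scale)
    return x_min + q * scale


def fake_quantize_grad(upstream_grad):
    """Backward pass: the node is skipped, so the gradient passes through as-is."""
    return upstream_grad


w = 0.3
w_seen_by_network = fake_quantize(w, -1.0, 1.0)  # slightly perturbed copy of w
grad_for_w = fake_quantize_grad(0.7)             # the float weight gets the full gradient
```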

Quantization-aware training

Image source

Quantization-aware training gives less accuracy drop compared to post-training quantization and allows us to recover most of the accuracy loss introduced by quantization. Moreover, it does not require a representative dataset to estimate the range of activations. The main disadvantage of quantization-aware training is that it requires retraining of the model.

Here you can see benchmarks of various models with and without quantization.

Model Quantization with TensorFlow

So far, we have described the purpose behind quantization and reviewed different quantization approaches. In this section, we will dive deep into the TensorFlow Object Detection API and explain how to perform post-training quantization and quantization-aware training.

TensorFlow Object Detection API

The TensorFlow Object Detection API is a framework for training object detection models that offers a lot of flexibility. You can quickly train an object detector in three steps:

STEP 1: Change the format of your training dataset to tfrecord format.
STEP 2: Download a pre-trained model from TensorFlow model zoo.
STEP 3: Customize a config file according to your model architecture.

You can learn more about each step in the TensorFlow Object Detection API GitHub repo.

TensorFlow Object Detection API

Image source

This tool provides developers with a large number of pre-trained models that are trained on different datasets such as COCO. Therefore, you do not need to start from scratch to train a new model; you can simply retrain the pre-trained models for your specific needs.

The Object Detection API offers various object detection model architectures, such as SSD and Faster R-CNN. We trained an SSD Lite MobileNet V2 model using the TensorFlow Object Detection API on the Oxford Town Centre dataset to build a pedestrian detection model for the Smart Social Distancing application. We picked the SSD architecture to be able to run this application in real-time on different edge devices such as NVIDIA Jetson Nano and Coral Edge TPU. We used the ssdlite_mobilenet_v2_coco.config sample config file for this purpose. You can find the available config files here.

Note that TensorFlow 1.12 or higher is required for this API, and this API does not support TensorFlow 2.

Installing TensorFlow Object Detection API with Docker

Installing Object Detection API can be time-consuming. Instead, you can use Neuralet's Docker container to get TensorFlow Object Detection API installed with minimal effort.

This Docker container will install the TensorFlow Object Detection API and its dependencies in the /models/research/object_detection directory. You can build the Docker container from source or pull the container from Docker Hub. See the instructions below to run the container.

1- Run with CPU support:

  • Build the container from source:
# 1- Clone the repository
git clone https://github.com/neuralet/neuralet
cd neuralet/training/tf_object_detection_api

# 2- Build the container
docker build -f tools-tf-object-detection-api-training.Dockerfile -t "neuralet/tools-tf-object-detection-api-training" .

# 3- Run the container
docker run -it -v [PATH TO EXPERIMENT DIRECTORY]:/work neuralet/tools-tf-object-detection-api-training
  • Pull the container from Docker Hub:
docker run -it -v [PATH TO EXPERIMENT DIRECTORY]:/work neuralet/tools-tf-object-detection-api-training

2- Run with GPU support:

You should have the Nvidia Docker Toolkit installed to be able to run the docker container with GPU support.

  • Build the container from source:
# 1- Clone the repository
git clone https://github.com/neuralet/neuralet
cd neuralet/training/tf_object_detection_api

# 2- Build the container
docker build -f tools-tf-object-detection-api-training.Dockerfile -t "neuralet/tools-tf-object-detection-api-training" .

# 3- Run the container
docker run -it --gpus all -v [PATH TO EXPERIMENT DIRECTORY]:/work neuralet/tools-tf-object-detection-api-training
  • Pull the container from Docker Hub:
docker run -it --gpus all -v [PATH TO EXPERIMENT DIRECTORY]:/work neuralet/tools-tf-object-detection-api-training

Exporting the Model to a Frozen Graph

After training the model, you can find the trained checkpoints, i.e., .ckpt files, placed in the model directory. To perform quantization or inference, you need to export these trained checkpoints to a protobuf file by freezing the model's computational graph. In general, you can use the export_inference_graph.py script to do so. However, if you are using an SSD model that you want to convert to a tflite file later, you should run the export_tflite_ssd_graph.py script instead as follows:

python3 object_detection/export_tflite_ssd_graph.py \
--pipeline_config_path=$CONFIG_FILE \
--trained_checkpoint_prefix=$CHECKPOINT_PATH \
--output_directory=$OUTPUT_DIR \
--add_postprocessing_op=true

Running this script will create a .pb file in the $OUTPUT_DIR directory. We will use this file in the next steps to perform quantization.

Post-training Quantization with TFlite Converter

As described earlier, post-training quantization allows you to convert a model trained with floating-point numbers to a quantized model. You can apply post-training quantization using TFlite Converter to convert a TensorFlow model into a TensorFlow Lite model that is suitable for on-device inference.

This API provides three options to quantize a floating-point 32-bit model to lower precisions:

  1. quantize only weights to 8-bit precision
  2. quantize both weights and activations to 8-bit precision
  3. quantize only weights to floating-point 16-bit precision

We will investigate the first two approaches in this tutorial. Quantizing to floating-point 16-bit precision is beyond the scope of this article. Read this guide for more detail.

Weight Quantization of a Retrained SSD MobileNet V2

After exporting the model to a frozen graph, you can quantize the model weights by running the following python script:

[1]  import tensorflow as tf
[2]  frozen_graph_file = "/path/to/frozen_graph.pb"  # placeholder: path to frozen graph (.pb file)
[3]  input_arrays = ["normalized_input_image_tensor"]
[4]  output_arrays = ['TFLite_Detection_PostProcess',
[5]                   'TFLite_Detection_PostProcess:1',
[6]                   'TFLite_Detection_PostProcess:2',
[7]                   'TFLite_Detection_PostProcess:3']
[8]  input_shapes = {"normalized_input_image_tensor" : [1, 300, 300, 3]}
[9]  tflite_model_quant_file = "/path/to/model_quant.tflite"  # placeholder: output file path
[10] converter = tf.lite.TFLiteConverter.from_frozen_graph(frozen_graph_file,
[11]                                                       input_arrays=input_arrays,
[12]                                                       output_arrays=output_arrays,
[13]                                                       input_shapes=input_shapes)
[14] converter.allow_custom_ops = True
[15] converter.optimizations = [tf.lite.Optimize.DEFAULT]
[16] tflite_model_quant = converter.convert()
[17] with open(tflite_model_quant_file, "wb") as tflite_file:
[18]     tflite_file.write(tflite_model_quant)

You only need to set the paths to the frozen graph file and the output file and change the input shape. You can leave the rest of the code as it is.

In line 2, you should specify the exported frozen graph (.pb) file.

In lines 3-8, the model's input/output names and the input shape are defined.

In lines 10-13, a TFlite Converter is created by specifying the model's frozen graph file, input/output names, and the input shape.

Line 14 is critical for converting object detection models that contain custom operations. Some operations, such as non-maximum suppression, are not supported by TensorFlow Lite as built-in operations and are registered as custom operations in the TensorFlow Object Detection API. By setting the allow_custom_ops flag in line 14, you tell the TFLite Converter to accept those registered custom operations instead of rejecting the model; without this flag, the conversion raises an error when it encounters them. Read more on custom operations and how to register them here.

In line 15, a list of model optimizations that the converter should perform is provided.

Finally, in lines 16-18, the model is converted to a quantized model and saved to a .tflite file.

Note that in this method, in addition to weight quantization, TensorFlow Lite quantizes some of the activations dynamically at inference time to improve model latency.

Full Integer Quantization of a Retrained SSD MobileNet V2

We now explain how to quantize the full network, including weights, activations, inputs, and outputs to 8-bit numbers.

Run the following script to perform full 8-bit quantization:

import tensorflow as tf

frozen_graph_file = "/path/to/frozen_graph.pb"  # placeholder: path to frozen graph
tflite_model_quant_file = "/path/to/model_quant.tflite"  # placeholder: output file path
input_arrays = ["normalized_input_image_tensor"]
output_arrays = ['TFLite_Detection_PostProcess',
                 'TFLite_Detection_PostProcess:1',
                 'TFLite_Detection_PostProcess:2',
                 'TFLite_Detection_PostProcess:3']
input_shapes = {"normalized_input_image_tensor" : [1, 300, 300, 3]}
converter = tf.lite.TFLiteConverter.from_frozen_graph(frozen_graph_file, input_arrays,
                                                      output_arrays, input_shapes)
converter.allow_custom_ops = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# The generator must be defined before this line; it feeds sample inputs to the converter
converter.representative_dataset = _representative_dataset_gen

tflite_model_quant = converter.convert()
with open(tflite_model_quant_file, "wb") as tflite_file:
    tflite_file.write(tflite_model_quant)

This script is similar to the last one, except that a representative_dataset generator is introduced to the converter. As we mentioned earlier, the representative dataset allows the TFLite Converter to estimate the range of the activations. An example of this generator is as follows:

import cv2
import numpy as np
from imutils import paths

def _representative_dataset_gen():
    images_path = None  # set this to the path of the representative dataset directory
    if images_path is None:
        raise Exception(
            "Image directory is None, full integer quantization requires images directory!"
        )
    image_paths = list(paths.list_images(images_path))
    for p in image_paths:
        image = cv2.imread(p)  # OpenCV loads images in BGR channel order
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = cv2.resize(image, (300, 300))
        # Add the batch dimension: (300, 300, 3) -> (1, 300, 300, 3)
        image = np.expand_dims(image, axis=0)
        # The converter expects a list with one float32 array per model input
        yield [image.astype("float32")]

As you can see in this example, you should specify the path to sample images that represent the input data used at inference time. Based on our experience, a dataset of ~100 images is enough for the TFLite Converter to reach an accurate estimate of the range of the activations.
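If your image directory is much larger than that, one option is to pick a random subset before calibration. The helper below is a hypothetical utility of ours, written with the standard library only; the function name, the list of extensions, and the fixed seed are all assumptions.

```python
import random
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set of image formats


def pick_calibration_images(image_dir, count=100, seed=0):
    """Return up to `count` randomly chosen image paths from `image_dir`."""
    candidates = sorted(
        p for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    if len(candidates) <= count:
        return candidates
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(candidates, count)
```

The returned paths can then be read and preprocessed one by one inside the representative dataset generator.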

Important Notes

  1. Those operations that are not supported by the TFLite Converter remain in floating-point after post-training quantization. If you want the converter to throw an error if an operation does not quantize, you should add the following line of code to your script:

     converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

     However, adding this line to your code will abort the quantization of object detection models since some of the operations cannot be quantized using the current version of the TFLite Converter.

  2. If you want the converter to quantize inputs and outputs of the model, add the following lines to the code:

     converter.inference_input_type = tf.uint8
     converter.inference_output_type = tf.uint8

     Note that if you are quantizing your object detection model using the current version of the TensorFlow Lite Converter, adding these two lines of code will fail the quantization due to some compatibility issues.

Quantization-aware Training with TensorFlow Object Detection API

You can use the TensorFlow Model Optimization Toolkit to perform quantization-aware training for Keras-based models. You can use this toolkit in either of two ways: 1) make only some layers quantization-aware, or 2) make the whole model quantization-aware. You can install the toolkit by following the installation guide here.

However, if you are using the TensorFlow Object Detection API to train your model, you cannot use the TensorFlow Model Optimization Toolkit for quantization-aware training. This is because the current version of the object detection API requires TensorFlow 1.x, which is not compatible with the model optimization toolkit. To apply quantization-aware training for object detection models that are trained using the object detection API, you need to make some config changes.

If you take a look at the sample config files of the object detection API, you will notice some files that contain the quantized keyword in their name, such as the ssd_mobilenet_v2_quantized_300x300_coco.config file here. These files are written similar to the ordinary config files except the last few lines with the graph_rewriter config that look like this:

graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}

By adding these lines to your config file, you tell TensorFlow that you want to perform quantization-aware training. The delay parameter specifies the number of training iterations after which the fake quantization nodes are added to the computational graph. It is recommended not to add these nodes at the beginning of training since doing so may cause numerical instabilities and poor training results.

The other two parameters specify the number of bits that weights and activations will be quantized to. Only 8-bit quantization is supported by TensorFlow at this time.
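If you maintain several config files, adding this section by hand gets repetitive. The sketch below is a hypothetical helper of ours that appends the graph_rewriter section to a config given as text. It treats the config as a plain string rather than parsing it, and the toy base config is a stand-in for a real one.

```python
def graph_rewriter_block(delay, weight_bits=8, activation_bits=8):
    """Return the graph_rewriter section as config text."""
    return (
        "graph_rewriter {\n"
        "  quantization {\n"
        f"    delay: {delay}\n"
        f"    weight_bits: {weight_bits}\n"
        f"    activation_bits: {activation_bits}\n"
        "  }\n"
        "}\n"
    )


def enable_quantization_aware_training(config_text, delay=48000):
    """Append the graph_rewriter section unless the config already has one."""
    if "graph_rewriter" in config_text:
        return config_text
    return config_text.rstrip("\n") + "\n\n" + graph_rewriter_block(delay)


base_config = "model {\n  ssd {\n    num_classes: 1\n  }\n}\n"  # toy stand-in for a real config
quantized_config = enable_quantization_aware_training(base_config)
```

Calling the helper a second time on the result leaves it unchanged, so it is safe to run over files that are already set up for quantization-aware training.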

You can start quantization-aware training from a quantized or non-quantized pre-trained model checkpoint. See the object detection model zoo to explore object detection models and their checkpoints. For the pedestrian detection task, we used the ssd_mobilenet_v2_quantized_300x300_coco.config file and fine-tuned the model using the ssd_mobilenet_v2_coco checkpoints from the model zoo.

After training has been finished, we can freeze the model and export the frozen graph by running the export_tflite_ssd_graph.py script.

TOCO

So far, we have trained a floating-point model by simulating the quantization effect in the training process, but we have not quantized the model yet. TensorFlow offers another tool that quantizes a model and exports it to a tflite file, called TOCO.

Based on TensorFlow documentation, to quantize your object detection model with TOCO, you need to build TensorFlow from source. This can be a daunting procedure since it is time-consuming and may lead to environment inconsistencies that fail the build after a long process. To overcome this issue, we created an easy-to-use docker container. This container takes in the frozen graph file path and some other specifications as parameters and generates the tflite model.

Model Quantization Using TOCO Docker Container

To quantize your model using Neuralet's TOCO Docker container, you can either build the container from source or pull the container from Docker Hub.

  • Build the container from source:
# 1- Clone the repository
git clone https://github.com/neuralet/neuralet
cd neuralet/training/tf_object_detection_api

# 2- Build the container
docker build -f tools-toco.Dockerfile -t "neuralet/tools-toco" .

# 3- Run the container
docker run -v [PATH_TO_FROZEN_GRAPH_DIRECTORY]:/model_dir neuralet/tools-toco --graph_def_file=[frozen graph file]
  • Pull the container from Docker Hub:
docker run -v [PATH_TO_FROZEN_GRAPH_DIRECTORY]:/model_dir neuralet/tools-toco --graph_def_file=[frozen graph file]

After running the container, you can find the quantized object detection model named detect.tflite in the FROZEN_GRAPH_DIRECTORY folder.

You can also customize other parameters when running the docker container. For example, you can override the default input shape and inference type by giving --input_shapes=[DEFAULT:1,300,300,3] and --inference_type=[DEFAULT:QUANTIZED_UINT8] values.
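When the conversion is part of a larger pipeline, it can be convenient to assemble this command in Python. The function below is a hypothetical wrapper of ours that only builds the argument list from the flags mentioned above; it does not run Docker itself, and the example paths are placeholders.

```python
def toco_docker_command(model_dir, graph_file, input_shapes=None, inference_type=None):
    """Assemble the `docker run` argument list for the TOCO container."""
    command = [
        "docker", "run",
        "-v", f"{model_dir}:/model_dir",
        "neuralet/tools-toco",
        f"--graph_def_file={graph_file}",
    ]
    # Optional overrides; when omitted, the container's defaults apply.
    if input_shapes is not None:
        command.append(f"--input_shapes={input_shapes}")
    if inference_type is not None:
        command.append(f"--inference_type={inference_type}")
    return command


command = toco_docker_command(
    "/path/to/frozen_graph_dir",  # placeholder directory
    "frozen_graph.pb",            # placeholder file name
    input_shapes="1,300,300,3",
    inference_type="QUANTIZED_UINT8",
)
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Docker is installed.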

Quantization Example: Coral Edge TPU

In this section, we deploy an object detection model on a Coral Edge TPU device to illustrate one of the applications of model quantization.

Edge TPU only supports 8-bit weights and activations; thus, we first need to quantize our model to 8-bit precision to be able to work with the device. We have described three strategies for quantizing an SSDlite MobileNet V2 model:

  1. post-training quantization of weights
  2. post-training quantization of weights and activations
  3. quantization-aware training and quantization of weights and activations

Since Edge TPU requires 8-bit quantized parameters, the first strategy does not apply to these devices because activations remain in floating-point following this approach. We will now explain model quantization using the next two methods on an Edge TPU device.

model quantization on edge tpu

Image source

To deploy a model on an Edge TPU device, you need to compile the quantized tflite model into a file that is compatible with the Edge TPU using the Edge TPU Compiler. Running the Edge TPU Compiler creates a log file for the compilation process. The post-training quantization log file looks like this:

Operator                       Count      Status

CUSTOM                         1          Operation is working on an unsupported data type
ADD                            10         Mapped to Edge TPU
CONCATENATION                  2          Mapped to Edge TPU
QUANTIZE                       11         Mapped to Edge TPU
CONV_2D                        55         Mapped to Edge TPU
DEPTHWISE_CONV_2D              33         Mapped to Edge TPU
DEQUANTIZE                     2          Operation is working on an unsupported data type
RESHAPE                        13         Mapped to Edge TPU
LOGISTIC                       1          Mapped to Edge TPU

For each operator, this log displays the operator name, the number of times the operator appears in the model, and the operator status, which indicates whether the operator is mapped to the Edge TPU or will run on the CPU.
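For larger models, reading this table by eye becomes tedious. The short parser below is a sketch of ours that totals the mapped operators and lists the rest. It assumes the three-column layout shown above and the exact status text "Mapped to Edge TPU"; the log text is copied from the post-training quantization example.

```python
LOG = """\
Operator                       Count      Status

CUSTOM                         1          Operation is working on an unsupported data type
ADD                            10         Mapped to Edge TPU
CONCATENATION                  2          Mapped to Edge TPU
QUANTIZE                       11         Mapped to Edge TPU
CONV_2D                        55         Mapped to Edge TPU
DEPTHWISE_CONV_2D              33         Mapped to Edge TPU
DEQUANTIZE                     2          Operation is working on an unsupported data type
RESHAPE                        13         Mapped to Edge TPU
LOGISTIC                       1          Mapped to Edge TPU
"""


def summarize(log_text):
    """Return (number of ops mapped to the Edge TPU, {op name: count} for the rest)."""
    mapped, unmapped = 0, {}
    for line in log_text.splitlines():
        parts = line.split(None, 2)  # operator, count, status
        if len(parts) < 3 or not parts[1].isdigit():
            continue  # skip the header and blank lines
        name, count, status = parts[0], int(parts[1]), parts[2].strip()
        if status == "Mapped to Edge TPU":
            mapped += count
        else:
            unmapped[name] = count
    return mapped, unmapped


mapped_count, cpu_ops = summarize(LOG)
```

For this log, 125 operator instances are mapped to the Edge TPU, and the CUSTOM and DEQUANTIZE operators are left for the CPU.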

The log file for quantization-aware training is as follows:

Operator                       Count      Status

CUSTOM                         1          Operation is working on an unsupported data type
ADD                            10         Mapped to Edge TPU
CONCATENATION                  2          Mapped to Edge TPU
CONV_2D                        55         Mapped to Edge TPU
DEPTHWISE_CONV_2D              33         Mapped to Edge TPU
RESHAPE                        13         Mapped to Edge TPU
LOGISTIC                       1          Mapped to Edge TPU

As you can see, the two DEQUANTIZE operators and the 11 QUANTIZE operators present in the post-training quantization log do not appear here.

Now we feed the Oxford Town Centre dataset to the compiled models and compute the latency and frame rate on the Coral Dev Board. The results are as follows:

Quantization Approach          Inference Time (ms)    FPS

post-training quantization     6.6                    152
quantization-aware training    6.1                    164
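The two columns are two views of the same measurement. Assuming the inference times are per-frame latencies in milliseconds (an assumption of ours, though it is consistent with the reported frame rates), the FPS values follow directly:

```python
def fps_from_latency(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


# Per-frame inference times from the table above (assumed to be in ms).
latency_ms = {
    "post-training quantization": 6.6,
    "quantization-aware training": 6.1,
}

fps = {name: round(fps_from_latency(ms)) for name, ms in latency_ms.items()}

# Relative speedup of quantization-aware training over post-training quantization.
speedup = latency_ms["post-training quantization"] / latency_ms["quantization-aware training"]
```

The rounded results match the table (152 and 164 FPS), and the model from quantization-aware training is roughly 8% faster in this test.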

Visit Neuralet's GitHub repository for more examples of Edge TPU inferencing.

Conclusion

Quantization allows us to convert object detection models trained in floating-point numbers to lightweight models with lower-bit precisions. Quantized models accelerate calculations and consume less memory; therefore, they are ideal for edge computing applications.

In the Smart Social Distancing application, we have applied quantization to our pedestrian detection model to increase the speed of model inference and be able to run our application on different edge devices in real-time.

Visit Neuralet's GitHub repository for more projects. You can also reach us by email.

Further Readings

  1. https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd
  2. https://github.com/tensorflow/models/blob/master/research/object_detection/README.md
  3. https://blog.tensorflow.org/2020/04/quantization-aware-training-with-tensorflow-model-optimization-toolkit.html
  4. https://blog.tensorflow.org/2019/06/tensorflow-integer-quantization.html
  5. https://arxiv.org/abs/1712.05877