Figure 1. Pose estimation output on NVIDIA Jetson TX2 using OpenPifPaf.
Pose estimation is a computer vision technique that detects body pose, i.e., the human body’s spatial configuration, in videos or images. Pose estimation algorithms estimate body pose using key points that indicate key body joints, such as elbows, knees, and ankles.
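To make the idea of key points concrete, here is a minimal sketch of how one detected pose could be represented in code. The 17 keypoint names follow the COCO convention, which is an assumption on our part since this post does not state the layout. The `Keypoint` class and the `visible_keypoints` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the 17-keypoint COCO layout; the post does not name the layout used.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


@dataclass
class Keypoint:
    """One body joint: pixel coordinates plus a detection confidence in [0, 1]."""
    name: str
    x: float
    y: float
    confidence: float


def visible_keypoints(pose: List[Keypoint], threshold: float = 0.5) -> List[Keypoint]:
    """Keep only the joints whose confidence reaches the threshold."""
    return [kp for kp in pose if kp.confidence >= threshold]


# A made-up partial pose: the left ankle is occluded, so its confidence is low.
example_pose = [
    Keypoint("left_elbow", 120.0, 210.5, 0.91),
    Keypoint("left_knee", 118.0, 340.0, 0.78),
    Keypoint("left_ankle", 117.0, 410.0, 0.12),
]
confident = visible_keypoints(example_pose)
```

Filtering by confidence like this is one simple way to drop occluded joints before drawing or analysing a pose.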
In this post, we will walk through the steps to run pose estimation on NVIDIA Jetson platforms. Jetson devices are small, low-power AI accelerators that can run machine learning algorithms in real time. However, deploying complex deep learning models on such devices, which have limited memory, is challenging. We need inference optimization tools, such as TensorRT, to run deep learning models on these edge platforms.
In this work, we generated a TensorRT inference engine from a PyTorch pose estimation model to run pose estimation on Jetson platforms in real time. Unlike the existing models we tested, our model works well on real-world CCTV data.
Pose estimation on Jetson devices: where to start?
To run pose estimation, we searched for and deployed different pre-trained pose estimation models on Jetson devices. Several open-source models were available for pose estimation to experiment with. Let us explain more about a few of them:
TensorRT pose estimation
Since the models are accelerated with TensorRT, it was straightforward to deploy them on Jetson devices. We tested both pre-trained models on different sets of data on Jetson Nano, and the densenet121_baseline_att_256x256_B model achieved the best performance with a frame rate of 9 FPS.
As you can see in the example images below, the model worked well on images where people were standing close to the camera (Figure 2). However, it failed to generalize well to real-world CCTV camera images, where people occupy only a small portion of the image and partially occlude each other (Figure 3).
Since we wanted to run inference on real-world CCTV camera images and there was no TensorRT model available that could work properly with CCTV data, we had to create one from scratch. So, we moved to the next approach.
OpenPifPaf is the official implementation of a paper titled “PifPaf: Composite Fields for Human Pose Estimation” by researchers at the EPFL VITA lab. According to the paper, it “uses a Part Intensity Field (PIF) to localize body parts and a Part Association Field (PAF) to associate body parts with each other to form full human poses.” Here is a sample image of how OpenPifPaf works:
Since OpenPifPaf is optimized for crowded street scenes, it works well on CCTV frames (Figure 1) as well as images captured from a close distance (Figure 5). Therefore, we continued working with OpenPifPaf to run pose estimation on real-world CCTV data.
Deploying OpenPifPaf pose estimator on Jetson platforms
To begin working with the OpenPifPaf model, we first created a Dockerfile to install OpenPifPaf and PyTorch on our Jetson device. The Dockerfile is available for use here. Support for Jetson Nano is coming soon.
Next, we needed to export an ONNX model and generate a TensorRT inference engine from the OpenPifPaf model.
Technical note: input size matters
In OpenPifPaf version 0.10.0, the default network input size for exporting an ONNX model is set to (97, 129) in the source code. This input size was too small for our use case, so we changed it and exported two ONNX models: a smaller one with an input size of (193, 257) and a larger one with an input size of (321, 481).
The network with the smaller input size worked well on images containing larger objects and was the faster of the two, but it failed to generalize to CCTV-like data.
The network with the larger input size worked well on images containing smaller objects, such as CCTV images with small faces, and it also did well on images with larger objects. However, it was slower than the smaller network, with an inference speed of 1.5 FPS on the same device.
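One detail worth noticing is that every size quoted above (97, 129, 193, 257, 321, 481) has the form 16k + 1. The sketch below snaps an arbitrary target size to the nearest value of that form. Reading 16 as a network stride is our assumption, inferred only from these numbers and not from the OpenPifPaf source, and both helper names are hypothetical.

```python
from typing import Tuple

# Assumption: inferred from the sizes quoted in this post, not from the OpenPifPaf source.
STRIDE = 16


def snap_to_stride(size: int, stride: int = STRIDE) -> int:
    """Return the value of the form stride * k + 1 (with k >= 1) closest to `size`."""
    k = max(1, round((size - 1) / stride))
    return stride * k + 1


def snap_input_size(first: int, second: int, stride: int = STRIDE) -> Tuple[int, int]:
    """Snap both dimensions of an input size, keeping the order they were given in."""
    return snap_to_stride(first, stride), snap_to_stride(second, stride)


# The sizes used in this post are already of the form 16k + 1, so they are unchanged.
small = snap_input_size(193, 257)
# A round target such as (320, 480) lands on the larger size used in this post.
large = snap_input_size(320, 480)
```

A helper like this can be handy when experimenting with other input sizes, since it keeps each candidate consistent with the pattern of the sizes that worked for us.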
We continued working with both models to implement each one for its specific use case.
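Frame rates such as the ones quoted in this post can be measured with a small timing harness. The sketch below is one simple way to do it and is not the exact benchmark code we used. `measure_fps` and `fake_infer` are hypothetical names, and `fake_infer` only stands in for a real inference call.

```python
import time
from typing import Callable, Iterable


def measure_fps(infer: Callable[[object], object],
                frames: Iterable[object],
                warmup: int = 2) -> float:
    """Return the average frames per second of `infer`, excluding warm-up frames."""
    frames = list(frames)
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are executed but not timed
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


def fake_infer(frame: object) -> object:
    """Placeholder for a real model call; sleeps 10 ms to simulate inference work."""
    time.sleep(0.01)
    return frame


fps = measure_fps(fake_infer, range(7), warmup=2)
```

Skipping the first few frames matters on devices like the Jetson, because the first calls often include one-off setup costs that would otherwise lower the average.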
Generating a TensorRT engine and running inference
The next step was to generate a TensorRT-based inference engine from the ONNX model. We used the 6.0-full-dims tag from this repo to generate the TensorRT inference engine with Jetpack 4.3 (TensorRT 6.0.1) installed on the Jetson device.
After building the TensorRT inference engine, we prepared the inference code. The inference code consists of a pre-processing module, an inference module, and a post-processing module. We pre-processed the data by applying normalization, ran inference using this inference module, and customized the OpenPifPaf post-processing module to decode the model output. The result is demonstrated in Figure 5.
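The normalization step in the pre-processing module can be sketched as follows. This is a minimal example, assuming the ImageNet mean and standard deviation commonly used with PyTorch backbones. The exact constants, the NCHW layout, and the `preprocess` name are our illustrative assumptions. The frame is also assumed to be already resized to the engine's input size.

```python
import numpy as np

# Assumption: ImageNet statistics, a common choice for PyTorch models.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Convert a uint8 RGB image of shape (H, W, 3) to a normalized float32 (1, 3, H, W) batch."""
    if image_rgb.ndim != 3 or image_rgb.shape[2] != 3:
        raise ValueError("expected an image of shape (H, W, 3)")
    scaled = image_rgb.astype(np.float32) / 255.0           # scale pixels to [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD     # per-channel normalization
    chw = normalized.transpose(2, 0, 1)                      # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis, ...])        # add the batch dimension


# A black dummy frame at the smaller input size mentioned in this post.
frame = np.zeros((193, 257, 3), dtype=np.uint8)
batch = preprocess(frame)
```

The contiguous copy at the end is there because inference runtimes typically expect a densely packed input buffer, which a transposed view is not.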
As you can see, this model works well both on images captured close to the subject and on real-world CCTV images.
To our knowledge, this is the first time that an OpenPifPaf model, which is a complex and heavy model, has been deployed on Jetson devices.
Deploying complex deep learning models on edge devices with limited memory is challenging. In this post, we explained the steps to run pose estimation on Jetson platforms by building a TensorRT inference engine from an OpenPifPaf PyTorch model. In future work, we aim to optimize the pose estimation model for higher inference speed, so that it can run pose estimation on input videos in real time.