Human Pose Estimation: Ultimate Guide [2023 edition]
Want to learn more about Human Pose Estimation? This guide is just what you need.
Introduction to Human Pose Estimation
Have you ever used an AI-powered mobile app to correct your posture while working out at the gym?
If so, you have interacted with an application of human pose estimation.
Human pose estimation (HPE) is a computer vision task that predicts the positions of a person’s joints. When fed an image of a person, a human pose estimation model can gauge the subject’s body position by identifying key points at their joints.
These algorithms have widespread applications in fields like security, e-commerce, autonomous driving, and AI fitness.
In surveillance videos, for instance, human pose estimation models can be used to indicate whether an individual is stealing or inciting physical violence based on their body position.
How Does Human Pose Estimation Work?
To correctly estimate a person’s body position, human pose estimation models must first identify the human in the image. They also need to detect points of interest at the person’s joints, including the knees, ankles, elbows, and shoulders.
There are two approaches to achieving this:
1. Top-Down Approach: Here, the model first identifies the person in the image before detecting the key points on their body. A person detector runs first, followed by the pose estimation for each detection.
2. Bottom-Up Approach: In this technique, the model first detects points of interest in the image. Then, these coordinates are consolidated to group together body parts that belong to each individual.
After identifying the person and detecting key points on their body, the human pose estimation model uses these coordinates to form a 2D or 3D representation of their position.
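To make the two pipelines concrete, here is a minimal Python sketch. The `detect_people`, `estimate_keypoints`, `detect_all_keypoints`, and `group_keypoints` functions are hypothetical placeholders for a real detector and pose network, not part of any specific library:

```python
def top_down(image, detect_people, estimate_keypoints):
    """Detect each person first, then run a single-person pose model per crop."""
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):    # step 1: person detection
        crop = image[y0:y1, x0:x1]                   # crop out one person
        poses.append(estimate_keypoints(crop))       # step 2: per-person key points
    return poses

def bottom_up(image, detect_all_keypoints, group_keypoints):
    """Detect every key point in the frame first, then group them by person."""
    keypoints = detect_all_keypoints(image)          # step 1: all joints at once
    return group_keypoints(keypoints)                # step 2: assemble per-person poses
```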
2D vs 3D Human Pose Estimation
2D Human Pose Estimation
2D human pose estimation detects body parts in images or videos and estimates the position of those parts in a 2-dimensional space. The joint position is represented using X and Y coordinates, and this approach works well for detecting basic positions like sitting, standing, and walking.
The biggest advantage of 2D pose estimation is that the technique is easy to implement and computationally efficient. It also handles variations in lighting and background well, making this approach an ideal choice for real-time pose estimation applications.
One drawback of 2D pose estimation, however, is that it can only represent the human body in a 2-dimensional space. These algorithms cannot capture a person’s complete range of motion, so more complex body positions will not be detected correctly.
This disadvantage can be mitigated using 3D human pose estimation models.
3D Human Pose Estimation
3D pose estimation detects key points on the human body and represents them in a 3-dimensional space, where the joint position is represented on the X, Y, and Z axes.
This technique is used for applications such as detecting a pedestrian’s body language in autonomous vehicles, recognizing hand gestures, limb rehabilitation training, and sports guidance.
The biggest advantage of 3D pose estimation is that it captures a large range of motion and can model more complex body positions. However, this approach is time-consuming and requires more processing power than 2D pose estimation.
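The difference between the two is easy to see in code: a 2D estimator outputs an (x, y) pair per joint, while a 3D estimator adds a depth coordinate. A minimal illustration with NumPy, assuming a 17-joint key point set such as COCO’s:

```python
import numpy as np

NUM_JOINTS = 17  # e.g. the COCO key point set

# 2D pose: one (x, y) pixel coordinate per joint.
pose_2d = np.zeros((NUM_JOINTS, 2))

# 3D pose: one (x, y, z) coordinate per joint, typically expressed
# relative to a root joint such as the pelvis.
pose_3d = np.zeros((NUM_JOINTS, 3))
```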
Types of Human Pose Estimation Models
There are three main methods for human pose estimation:
Skeleton-Based Model
Skeleton-Based Human Pose Estimation Model
(Source: https://arxiv.org/pdf/2006.01423.pdf)
A skeleton-based pose estimation model identifies a person’s skeleton in an image or video. Using a prior model of the human skeleton, it classifies the detected body parts and defines a set of joints and limb orientations on the person’s body.
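In code, a skeleton-based representation boils down to a list of named joints plus a fixed set of limb connections between them. A sketch using the COCO key point convention (the limb list below is abbreviated for illustration):

```python
# The 17 key points of the COCO convention.
COCO_JOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Limbs as index pairs into COCO_JOINTS (abbreviated; a full skeleton
# also connects the legs and the head key points).
LIMBS = [
    (5, 7), (7, 9),    # left shoulder -> elbow -> wrist
    (6, 8), (8, 10),   # right shoulder -> elbow -> wrist
    (5, 6), (11, 12),  # shoulders, hips
    (5, 11), (6, 12),  # torso sides
]
```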
Contour-Based Model
Contour-Based Human Pose Estimation Model
(Source: https://arxiv.org/pdf/2006.01423.pdf)
Contour-based models represent the human body as a set of contours, capturing the rough shape and width of the torso and limbs, information that skeleton-based models cannot express. These models are used for 2D pose estimation.
Volume-Based Model
Volume-Based Human Pose Estimation Model
(Source: https://arxiv.org/pdf/2006.01423.pdf)
A volume-based model is more advanced than the previous methods and illustrates the human body as a 3D volume. It creates a realistic representation of body poses and consists of geometric shapes such as cylinders and conics.
Classical vs. Deep-Learning-Based Approaches
Human pose estimation is a computer vision problem and can be solved using both traditional and deep-learning-based approaches. While deep-learning techniques like CNNs have pushed the limits of what was possible in the image processing field, both approaches come with their own sets of advantages and drawbacks.
In this section, we will explore the difference between classical and deep-learning-based approaches for human pose estimation and which one you should use.
Classical machine learning methods for human pose estimation
Traditional human pose estimation techniques involve the use of classical machine learning algorithms and typically follow the workflow depicted below:
Classical Machine Learning Workflow For Human Pose Estimation
These algorithms need to be trained on carefully prepared datasets, which means that hand-crafted features must be extracted before the machine-learning model is built. This makes the process more complicated and time-consuming for model builders since traditional algorithms cannot uncover hidden representations in the data on their own. Furthermore, classical machine learning models often lack accuracy and generalize poorly on computer vision tasks since they cannot emulate the way the human brain processes images.
However, one advantage of classical machine learning methods is that they take up less computational power and runtime than deep learning techniques.
An example of applying traditional machine learning techniques for human pose estimation can be seen in a research paper published by Gregory Rogez et al. The authors of this paper formulated pose estimation as a classification problem and created an algorithm based on random forests to predict body positions in video sequences.
To improve the model’s accuracy, the authors performed image alignment to make the feature selection process easier. They also had to define classes to detect poses using a classifier.
Image alignment and class definition were both time-consuming tasks, and one of the biggest challenges faced by the authors when defining classes was the decision as to where a class ended and where the next one started. The classification process can become extremely cumbersome since it is up to the modeler’s judgment to decide which features best describe different objects.
The authors then performed variable selection and built their feature vector using HOG descriptors. HOG, or Histogram of Oriented Gradients, is a descriptor that extracts features from image data.
Finally, they built a random forest-type classifier to detect the pose at each frame. The model was computationally efficient and only took two hours to run, which is much faster than deep learning approaches.
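For a feel of this recipe, here is a minimal sketch of HOG features feeding a random forest, using scikit-image and scikit-learn. The training data below is random placeholder data standing in for the aligned video frames used in the paper:

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

def hog_features(gray_image):
    """Extract a flattened HOG descriptor from a grayscale image."""
    return hog(gray_image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Placeholder data: 100 aligned 128x64 grayscale crops, each with one
# of 5 hypothetical pose-class labels.
train_images = [np.random.rand(128, 64) for _ in range(100)]
train_labels = np.random.randint(0, 5, size=100)

X = np.array([hog_features(img) for img in train_images])
clf = RandomForestClassifier(n_estimators=100).fit(X, train_labels)

# Predict the pose class of a new frame.
pred = clf.predict([hog_features(np.random.rand(128, 64))])
```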
Deep learning methods for human pose estimation
Deep learning algorithms are based on artificial neural networks and are designed to simulate the behavior of the human brain. They can ingest an input image, assign learnable weights to the different objects in it, and distinguish those objects from one another. These models can also uncover patterns in datasets with little to no data pre-processing and feature selection.
Deep learning models often outperform traditional machine learning techniques for computer vision tasks like human pose estimation since they can extract hidden representations from body poses with high precision and accuracy.
Here is a diagram depicting the workflow of deep learning architectures:
Deep Learning Workflow For Human Pose Estimation
Mask R-CNN, a state-of-the-art model for instance segmentation, was released in 2017 by researchers at Facebook. The model supports various computer vision tasks, including instance segmentation, bounding box detection, and key point detection.
For human pose estimation, Mask R-CNN follows a top-down approach: it first performs person detection before identifying key points on each detected body. The creators trained the algorithm on the popular COCO image dataset, and the model ended up outperforming the COCO 2016 key point detection challenge winner while also being simpler and faster.
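Pretrained models from this family are available off the shelf. Here is a minimal sketch using torchvision’s Keypoint R-CNN, a close relative of Mask R-CNN that follows the same top-down recipe and predicts COCO’s 17 body key points (the weights argument requires torchvision 0.13 or newer):

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Load a Keypoint R-CNN pretrained on COCO and switch to inference mode.
model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Dummy input: one RGB image as a CHW float tensor with values in [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    output = model([image])[0]

# Each detected person gets a bounding box, a confidence score, and a
# (17, 3) key point tensor of (x, y, visibility) triples.
print(output["boxes"].shape, output["keypoints"].shape)
```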
Open-Source Frameworks for Human Pose Estimation
Training a computer vision model from scratch can be tedious and time-consuming. Luckily, there are many open-source systems available that you can use to implement a human pose estimation model. Most open-source frameworks have been built using neural networks since these architectures tend to outperform traditional machine-learning models for computer vision.
Here are some deep-learning frameworks that can help you get started with building your own pose estimator:
AlphaPose
AlphaPose is a real-time multi-person pose estimation system. Multi-person pose estimation is the task of detecting the body position of multiple individuals in a single frame.
The framework was created to identify poses accurately even in the presence of inaccurate bounding boxes. To achieve this, the creators of AlphaPose applied a single-person pose estimator (SPPE) based on the Stacked Hourglass model, a convolutional network architecture for pose estimation.
However, there were two issues that they found when doing this:
1. Bounding box localization errors
Example of Bounding Box Localization Errors
(Source: https://arxiv.org/pdf/1612.00137.pdf)
The image above displays a bounding box localization error. The red bounding boxes represent the ground truth, while the yellow boxes are the detected ones.
The heat maps display the outputs of the SPPE model. Notice how the SPPE model detects body parts incorrectly when it is given the yellow bounding box.
2. Redundant human detections
Example of Redundant Human Detections
(Source: https://arxiv.org/pdf/1612.00137.pdf)
The SPPE model also suffered from redundant human detections. The image on the left contains multiple bounding boxes around the same people, and the model produces an independent output for each of them.
This gives rise to many pose estimates for a single person.
To address the localization error, the creators of AlphaPose attached a symmetric spatial transformer network (SSTN) to the SPPE to extract a high-quality single-person region from each bounding box.
They also used parametric pose non-maximum suppression (NMS) to solve the redundant detection issue. Pose NMS selects the most confident pose as a reference and eliminates the poses close to it according to an elimination criterion.
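A heavily simplified version of that elimination loop is sketched below. The real AlphaPose criterion combines key point distances with confidence scores; this sketch uses a plain mean key point distance as the similarity measure:

```python
import numpy as np

def pose_nms(poses, scores, dist_threshold=20.0):
    """Greedy pose NMS: keep the most confident pose, drop near-duplicates.

    poses:  array of shape (num_poses, num_joints, 2)
    scores: array of shape (num_poses,)
    """
    order = np.argsort(scores)[::-1]  # most confident pose first
    kept = []
    while len(order) > 0:
        ref = order[0]
        kept.append(ref)
        # Mean per-joint distance between the reference pose and the rest.
        dists = np.linalg.norm(poses[order[1:]] - poses[ref], axis=-1).mean(axis=-1)
        # Keep only poses that are sufficiently different from the reference.
        order = order[1:][dists > dist_threshold]
    return kept
```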
OpenPose
Like AlphaPose, OpenPose can detect the body positions of multiple people in real time. It was the first library to jointly detect body, foot, hand, and facial key points (135 key points in total) on single images.
According to the creators of OpenPose, it is a challenge to detect the poses of multiple people in images since interactions like human contact can make it difficult to identify separate body parts. Furthermore, this is a computationally expensive task that grows in complexity as the number of people in the image increases, making it difficult to generate real-time estimations.
OpenPose solves the above problems using the following steps:
1. The entire image is fed into a convolutional neural network (CNN) to make joint predictions.
2. A confidence map is then created to identify human body parts in the image.
3. PAFs (Part Affinity Fields) are generated to encode the pairwise relationship between body parts (a simplified scoring sketch follows the figure below).
4. Body parts that belong to the same person are then associated using bipartite matching.
5. Each part is assembled into a complete pose for every human body in the image.
OpenPose’s Steps for Human Pose Estimation
(Source: https://arxiv.org/pdf/1812.08008.pdf)
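To make the PAF step concrete, here is a simplified sketch of how a candidate limb between two detected key points can be scored against a part affinity field: sample points along the segment and average the alignment between the field and the limb direction. The real OpenPose implementation adds more bookkeeping, but the core idea is the same:

```python
import numpy as np

def paf_score(paf, p1, p2, num_samples=10):
    """Score a candidate limb from key point p1 to p2 against a PAF.

    paf:    array of shape (H, W, 2), the affinity vector at each pixel
    p1, p2: (x, y) key point coordinates, assumed to lie inside the image
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-8:
        return 0.0
    direction /= norm
    # Sample the field at evenly spaced points along the segment and
    # accumulate its dot product with the limb direction.
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * (p2 - p1)).astype(int)
        score += paf[y, x] @ direction
    return score / num_samples
```

A high score means the field consistently points along the candidate limb, so the two key points are likely to belong to the same person.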
DeepCut
DeepCut is another solution for estimating the body positions of multiple people, this time in real-world scenes. According to the creators of DeepCut, many existing human pose estimators exclude scenes with multiple people and perform poorly on real-world images.
Their proposed solution to the above problem includes the following steps:
1. DeepCut follows the bottom-up approach and first detects all body parts in the image.
2. Body parts belonging to each person are jointly clustered, and every part class is colored differently.
3. The predicted pose sticks are displayed for every person in the frame.
DeepPose
DeepPose is a framework that uses deep neural networks (DNNs) to estimate human body poses. This system was created by research scientists at Google, Alexander Toshev and Christian Szegedy.
According to Toshev and Szegedy, most existing pose estimation solutions aim to search efficiently across large image frames for all possible articulated poses. However, this efficiency comes at the cost of not capturing all interactions between body parts.
Example: Drawbacks of Existing Human Pose Estimation Models
(Source: https://arxiv.org/abs/1312.4659)
In the above image, for instance, many joints are barely visible. Half the body of the person standing on the right is not visible at all, and the only way to estimate his pose is by looking at the rest of the picture and anticipating his range of motion.
Holistic reasoning is required to predict the complete picture and capture all interactions in the image.
The creators of DeepPose used DNNs to achieve this type of reasoning since these algorithms have shown outstanding performance on visual classification tasks and object localization in the past.
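Concretely, DeepPose casts pose estimation as regression: a convolutional network maps the whole image directly to 2·K joint coordinates. Below is a toy PyTorch sketch of that formulation; the backbone here is a stand-in, not the AlexNet-style architecture used in the paper:

```python
import torch
import torch.nn as nn

NUM_JOINTS = 14  # the paper regresses 14 joints on the LSP dataset

class PoseRegressor(nn.Module):
    """Toy DeepPose-style network: whole image in, 2*K joint coordinates out."""
    def __init__(self, num_joints=NUM_JOINTS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.regressor = nn.Linear(64 * 4 * 4, 2 * num_joints)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.regressor(x).view(-1, NUM_JOINTS, 2)

model = PoseRegressor()
coords = model(torch.rand(1, 3, 220, 220))  # DeepPose used 220x220 inputs
print(coords.shape)  # torch.Size([1, 14, 2])
```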
How to Evaluate Human Pose Estimation Models?
Evaluation metrics are an essential part of any artificial intelligence pipeline. They are used to assess the performance of machine learning models and ensure that progress is being made. Popular metrics for classification tasks include accuracy, precision, and recall, while regression algorithms are evaluated using measures like the Root Mean Squared Error and R-Squared.
In this section, we will explore a few evaluation metrics used to assess the performance of human pose estimation models.
Percentage of Correct Parts (PCP)
As the name suggests, PCP is a metric that tells us whether we have localized a limb correctly. A limb counts as a correct part only if the distance between its predicted key points and true key points is less than half the limb’s length.
One drawback of this metric, however, is that human bodies have limbs of different lengths. Shorter limbs are penalized more than longer ones, which can make this an unreliable measure whose results vary dramatically across datasets.
Percentage of Correct Key-points (PCK)
To overcome PCP’s bias against shorter limbs, PCK can be used. This metric considers a key point correct only if the distance between the true and predicted points is within a certain threshold.
For instance, PCK@0.2 counts a key point as correct if the distance between the true and predicted points is less than 0.2 times the person’s torso diameter, while the PCKh variant uses a fraction (typically 0.5) of the head bone link instead. Either way, the threshold scales with the person, which makes this a more reliable method of identifying correct points: people with smaller torsos or head bone links also have proportionally smaller limbs, and vice versa.
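In code, PCK reduces to a per-joint threshold test. A minimal NumPy sketch:

```python
import numpy as np

def pck(pred, gt, ref_length, alpha=0.5):
    """Percentage of correct key points for one person.

    pred, gt:   arrays of shape (num_joints, 2)
    ref_length: person-specific reference length (head bone link for
                PCKh, torso diameter for PCK)
    alpha:      fraction of the reference length used as the threshold
    """
    dists = np.linalg.norm(pred - gt, axis=-1)
    return (dists < alpha * ref_length).mean()
```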
Percentage of Detected Joints (PDJ)
PDJ is another metric that mitigates the short-limb problem. It considers a key point correct if the distance between the true and predicted joint is within a fraction of the person’s torso diameter.
Again, this is a more reasonable technique than PCP since the threshold scales across different people.
Object Keypoint Similarity (OKS) based mAP
OKS plays the same role for key points that IoU plays for bounding boxes in object detection. It measures the distance between each true and predicted joint, normalizes that distance by the person’s scale, and passes it through a per-joint falloff.
The measure is computed for each person in the frame by averaging over all of that person’s labeled key points; sweeping a threshold over OKS then yields a mean Average Precision (mAP), just as IoU thresholds do for detection boxes.
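In NumPy, the OKS computation looks like the sketch below. The per-joint constants k control how quickly the score falls off with distance; the official COCO evaluation fixes these per joint type, while the value used here is illustrative:

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity between a predicted and a true pose.

    pred, gt: arrays of shape (num_joints, 2)
    visible:  boolean array, True where the true joint is labeled
    area:     the person's segment area, i.e. the scale s squared
    k:        per-joint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)  # squared joint distances
    e = d2 / (2 * area * k ** 2)            # scale-normalized error
    return np.exp(-e)[visible].mean()       # average over labeled joints

# Illustrative usage: a 17-joint pose with a uniform falloff constant.
score = oks(np.random.rand(17, 2), np.random.rand(17, 2),
            np.ones(17, dtype=bool), area=1.0, k=np.full(17, 0.1))
```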
Applications of Human Pose Estimation
Human pose estimation is a widely used computer vision technique with various real-world applications that we'll explore below.
Fitness Apps
Human Pose Estimation for AI-Powered Fitness Apps
Have you ever used an AI fitness coaching app like Kaia or Ally App?
These applications allow you to film yourself while working out as a human pose estimator detects your body position and corrects your form.
With Kaia, for instance, you can start by simply putting your smartphone up against a wall and exercising around seven feet away from the screen. As you continue your workout, the app allows you to see your form and compare it with the correct pose so you can adjust accordingly.
As you continue exercising, a virtual trainer will correct your form using audio and visual feedback.
This application of human pose estimation allows users to start working out from the comfort of their homes. Fitness enthusiasts can save time and money on a personal trainer by actively learning to maintain proper form using the app.
Surveillance
Surveillance is another field that has benefitted extensively from human pose estimation.
In crowded areas, the automatic recognition of violent behavior and theft can immediately alert authorities and potentially save lives. Likewise, if a person has an accident or suffers an injury with no help nearby, human pose estimation models built into security cameras can identify the situation and summon assistance.
However, there are a few limitations of pose estimators in surveillance, which is why this application is not widespread.
Firstly, in a real-world setting, objects like power cables, moving trams, buildings, and even other people can block parts of the scene, obscuring the subject of interest. Also, objects that overlap with or resemble human shapes can trigger false alarms, wasting resources and time. Finally, surveillance cameras are installed to deliver a wide view of the area, which increases occlusions and can heavily distort images.
Autonomous Vehicles
There are many studies about using object detection and image classification techniques in autonomous vehicles. While these methods have helped the automotive industry make strides in building self-driving cars, human pose estimation is also crucial to help vehicles navigate better and avoid accidents.
For instance, picture yourself driving when you see a pedestrian at the side of the street. It simply is not sufficient to detect the person; you also need to gauge what they will do next. Will they continue walking down the sidewalk, or will they attempt to cross the road?
These are judgment calls that we make every day when driving, even when we aren’t consciously thinking about them.
In autonomous vehicles, human pose estimation models can be used to emulate human decision-making while driving. These algorithms can detect a pedestrian’s body language, position, and movement before deciding what to do next.
They can anticipate a person’s likely future action instead of only identifying them as an object, making human pose estimators a powerful step forward for the autonomous vehicle industry.
Human Pose Estimation: A Summary
Human pose estimation is a technique used to identify the position of a human body in an image or video frame.
There are traditional and deep-learning approaches that can be used to accomplish this task, both of which come with a set of advantages and disadvantages. However, neural networks are by far the most popular architecture used to build human pose estimators since they require little to no data preprocessing and feature extraction.
Human pose estimation can be broadly categorized into 2D and 3D estimations. There are three methods to detect human body positions, which include skeleton-based, contour-based, and volume-based models.
This technique has a variety of applications in a wide range of industries, including AI-powered fitness applications, security cameras, and autonomous vehicles.