
9.2 Related Work

Supervised 3D Pose Estimation

The goal of 2D human pose estimation is to predict the 2D locations of a pre-defined set of keypoints on the human body in image space. In the case of 3D pose estimation, we also want to predict the ‘depth’ of each keypoint (either with respect to the camera or in a specified 3D coordinate space). This is an inherently ill-posed problem when only a single image is given as input. The existing literature on single-image 3D pose estimation can be broadly divided into two main approaches: end-to-end, and two-stage (i.e. ‘lifting’).
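To make the ambiguity concrete, under a standard pinhole camera with intrinsics K and extrinsics [R | t] (generic multi-view geometry notation, not specific to any cited work), the projection and its back-projected ray are

\tilde{x} \simeq K\,[R \mid t]\,\tilde{X}, \qquad X(\lambda) = R^{\top}\!\left(\lambda\,K^{-1}\tilde{x} - t\right), \quad \lambda > 0,

i.e. every scene point on the ray X(\lambda) maps to the same pixel \tilde{x}, so the per-keypoint depth \lambda cannot be recovered from a single view without learned priors.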

End-to-end approaches take a 2D image as input and directly regress the 3D keypoints from the pixel data. Common representations for the predicted pose include 3D coordinates [10, 20], volumetric predictions [21, 22], coefficients for probabilistic pose models [23], and 3D meshes for full body shape and pose encoding [24, 25].

Semi-supervised end-to-end training also enables the use of additional 2D keypoint data that has no paired 3D information [26, 27]. Multi-view information has also been shown to improve semi-supervised training [28, 29].

For lifting approaches, the goal is to take a set of 2D keypoints as input and predict the missing depth dimension. These methods do not train a full deep feature extractor from the input image but instead use compact fully connected models to regress the missing dimensions [30]. In addition to direct regression of the missing coordinates, other lifting approaches have explored keypoint distance matrix estimation [31], direct retrieval [12, 32], and adversarial learning [33, 34].
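To give a sense of how compact these lifting models are, here is a minimal sketch of a fully connected residual lifter in the spirit of [30]; the layer widths, block count, and joint count are illustrative assumptions rather than the exact architecture of the cited work.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers with batch norm, ReLU, and dropout, plus a skip connection."""
    def __init__(self, width=1024, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        return x + self.net(x)

class Lifter(nn.Module):
    """Regresses flattened 2D keypoints (J x 2) to 3D keypoints (J x 3)."""
    def __init__(self, num_joints=17, width=1024, num_blocks=2):
        super().__init__()
        self.inp = nn.Linear(num_joints * 2, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(num_blocks)])
        self.out = nn.Linear(width, num_joints * 3)

    def forward(self, kp2d):  # kp2d: (batch, J, 2)
        b = kp2d.shape[0]
        h = self.blocks(self.inp(kp2d.reshape(b, -1)))
        return self.out(h).reshape(b, -1, 3)

# Usage: lift a batch of 2D detections to 3D.
pose3d = Lifter()(torch.randn(8, 17, 2))  # -> (8, 17, 3)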

We explore self-supervised learning of 3D pose-aware feature representations using multi-view imagery, without requiring any direct 3D supervision.


Training Data for 3D Pose Estimation

While 2D pose estimation has benefited from large supervised training datasets, e.g. [35, 36], acquiring ground-truth 3D pose data is much more difficult. Standard options include capturing data in controlled settings with motion capture markers [19, 37], multi-camera setups [38, 39], or paired depth cameras [40]. However, all of these approaches require additional hardware and calibration and are difficult to deploy in outdoor settings. Training data can also be generated synthetically, resulting in ground-truth dense depth for free [41–43]. However, it is challenging to generate varied 3D poses with realistic scene appearance and interactions without significant manual intervention. As full 3D pose information for ‘in-the-wild’ image collections is very challenging to acquire, one proposed solution is to crowd-source weak 3D information in the form of estimated relative keypoint distances from the camera [44–47]. The disadvantage of this approach is that it can result in noisy labels, as in many cases the relative distance of some pairs of keypoints is hard to disambiguate.

Our goal is to learn features that encode 3D pose information without requiring any direct 3D supervision. We show that, once trained, these features can be finetuned with a relatively small amount of 3D supervision, resulting in accurate 3D pose predictions.

Self-Supervised 3D Pose Estimation

To overcome the lack of training data for 3D pose estimation, there is growing interest in self-supervised approaches for learning features that encode 3D pose. These methods can be categorized based on their requirements for additional data at training time or the assumptions they make about how the images are related.

Given no 2D or 3D pose information at training time, but instead a set of time-synchronized multi-view images with known camera extrinsics, [16] proposed an image reconstruction based approach to learn features for 3D pose estimation. As the transformations between the cameras are known during training, the learned latent space encodes camera viewpoint information as well as pose. [48] proposed a similar approach but assumed access to 2D pose during training and performed the reconstruction in 2D pose space, rather than in image space.
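A rough sketch of this style of view-transfer objective is shown below; the toy encoder/decoder and the choice of a point-cloud latent are our own illustrative assumptions, not the actual models of [16].

import torch
import torch.nn as nn

# Toy encoder/decoder; any image backbone pair would do in practice.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32 * 3))  # latent: 32 points in R^3
decoder = nn.Sequential(nn.Linear(32 * 3, 3 * 64 * 64))

def view_transfer_loss(img_a, img_b, R_ab):
    """Encode view A, rotate the 3D latent by the known relative camera
    rotation R_ab, then reconstruct view B from the rotated latent."""
    b = img_a.shape[0]
    z = encoder(img_a).reshape(b, 32, 3)          # latent as 32 3D points
    z_rot = torch.einsum('ij,bkj->bki', R_ab, z)  # apply the known extrinsic rotation
    recon_b = decoder(z_rot.reshape(b, -1)).reshape_as(img_b)
    return nn.functional.mse_loss(recon_b, img_b)

# Usage with time-synchronized frames and a known relative rotation.
loss = view_transfer_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64), torch.eye(3))

Because R_ab is supplied during training, the latent must carry viewpoint as well as pose information for the reconstruction to succeed.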

In contrast, other embedding-based approaches that also assume the availability of time-synchronized cameras are unable to recover camera viewpoint information, as they explicitly train embeddings that are invariant to viewpoint [17, 49]. [50] still assumed access to time-synchronized views from neighboring viewpoints but used a pretrained pose estimation network, along with multi-view epipolar constraints, to learn to predict 3D pose. This removes the requirement of having known camera extrinsics.
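For intuition on how multi-view geometry can turn 2D detections into 3D pseudo-labels, the following is a standard direct linear transform (DLT) triangulation sketch; it assumes known projection matrices for simplicity, whereas [50] instead estimates the relative camera geometry.

import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views with the DLT.
    P1, P2: 3x4 projection matrices; x1, x2: 2D pixel coordinates."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)  # least-squares null vector of A
    X = vt[-1]
    return X[:3] / X[3]          # dehomogenize

# Usage: pseudo-label a keypoint from its detections in two calibrated views.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                # reference camera
P2 = np.hstack([np.eye(3), np.array([[-1.0, 0.0, 0.0]]).T])  # translated second camera
X = triangulate_dlt(P1, P2, np.array([0.1, 0.2]), np.array([-0.1, 0.2]))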

Another source of self-supervision, which has been exploited in the context of 2D pose estimation, is temporal information. [18] trained a Siamese architecture to predict whether two frames are temporally close or far apart, under the assumption that frames nearby in time are more similar in pose than frames further away. [13, 14] also utilized temporal information, with the assumption that the same action performed by two different people will have a similar temporal ordering. These models learn to align pairs of videos of the same action by enforcing consistency in both directions for the predicted frame-level embeddings. While they still require multiple videos of the same action, unlike the previous multi-view approaches, these videos do not have to be time synchronized or depict the same individual.
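As a sketch of temporal self-supervision, the snippet below samples frames that are close or far in time and applies a triplet-style objective; this is a simplified contrastive variant, not the exact close/far classification setup of [18], and the encoder and window sizes are illustrative.

import torch
import torch.nn as nn

# Hypothetical frame encoder; any image backbone producing an embedding works.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

def time_contrastive_loss(video, t, pos_window=2, neg_offset=30, margin=0.5):
    """video: (frames, C, H, W). Frames within pos_window of t are treated as
    positives (similar pose); frames at least neg_offset away as negatives."""
    anchor = encoder(video[t:t + 1])
    positive = encoder(video[t + pos_window:t + pos_window + 1])
    negative = encoder(video[t + neg_offset:t + neg_offset + 1])
    return nn.functional.triplet_margin_loss(anchor, positive, negative, margin=margin)

loss = time_contrastive_loss(torch.randn(64, 3, 64, 64), t=0)  # 64-frame clip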

The last main category of self-supervision is having access to 2D keypoint information from non-time-synchronized image collections. [15] proposed a lifting based approach that used a series of self-consistency constraints in both 2D and 3D, along with an adversarial loss to encourage that projected 3D poses are valid 2D poses. However, these lifting approaches do not explicitly train an image encoder that can be applied to arbitrary images.

One of the main challenges for many of the above embedding-based approaches to self-supervised pose representation learning is the sampling of positive and negative pairs of frames during training. Positive pairs of frames are usually selected based on some criterion, such as the same time-point imaged from a different viewpoint or two frames that are close in time. However, due to the repeated nature of poses over time, selecting negative pairs of frames (i.e. ideally those containing very different poses) is challenging. One solution is to filter out potential negative pairs based on their similarity in the embedding space during training [13, 17, 18], as sketched below.
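A minimal sketch of such similarity-based negative filtering, assuming a cosine-similarity threshold (the threshold value is an illustrative choice, not one reported in the cited works):

import torch

def filter_negatives(anchor_emb, candidate_embs, sim_threshold=0.9):
    """Drop candidate negatives whose embeddings are too similar to the anchor,
    since they likely contain (near-)identical poses.
    anchor_emb: (D,); candidate_embs: (N, D). Returns indices of kept candidates."""
    sims = torch.nn.functional.cosine_similarity(
        anchor_emb.unsqueeze(0), candidate_embs, dim=1)
    return torch.nonzero(sims < sim_threshold).squeeze(1)

# Usage: keep only candidates that look sufficiently different from the anchor.
keep = filter_negatives(torch.randn(128), torch.randn(32, 128))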

In this work, we explore design decisions related to multi-view self-supervised learning for single-view 3D pose estimation. We evaluate the impact of each of these decisions on the quality of the learned features and measure how much they affect the predicted 3D poses.
