
Few-shot Domain Adaptation for 3D Human Pose and Shape Estimation



Training on larger datasets would be one solution, but obtaining a sufficient amount of annotations is costly. Prior work has instead trained models to generalize to different camera parameters so that a model can adapt to the new camera perspective of the test data. One of the popular approaches is to use a parametric model called SMPL [9] to regress the human shape.

If we provide correct shape (β) and pose (θ) parameters, the SMPL regressor produces a human mesh. One of the limitations of this approach is the difficulty of regressing the SMPL parameters properly. VIBE [13] is one of the latest works in this direction, a monocular and temporal estimation method built on the parametric model.
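The following is a minimal sketch of this forward mapping from (β, θ) to a mesh, assuming the third-party smplx package and a locally downloaded SMPL model file; the model path and parameter values are placeholders rather than part of the original work.

```python
# Minimal sketch: SMPL forward pass from shape/pose parameters to a mesh,
# assuming the `smplx` package and a local SMPL model (path is a placeholder).
import torch
import smplx

smpl = smplx.create(model_path="models", model_type="smpl")

betas = torch.zeros(1, 10)          # shape parameters beta
body_pose = torch.zeros(1, 69)      # pose parameters theta (axis-angle per body joint)
global_orient = torch.zeros(1, 3)   # root orientation

output = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices = output.vertices          # (1, 6890, 3) mesh vertices
joints = output.joints              # 3D joint locations regressed from the mesh
```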

[19] proposed a non-temporal mesh regression model called METRO, which is built on a Transformer architecture and is one of the best-performing models on MPI-INF-3DHP. [22] adopted an attention mechanism by applying learnable weights to each intermediate feature of the neural network architecture. Their method appears heuristic and lacks an ablation study on the choice of convolution kernel.

The model then deforms the template shape toward the input shape while preserving the labels on the template points.

Problem Overview

Log-ratio Loss Review

In addition, an N × N distance matrix must be constructed to compute the losses, which causes severe scalability problems on large datasets. If the log-ratio loss is applied directly to VIBE, a 59K × 59K distance matrix must be constructed, whose number of elements exceeds 3481M.
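For reference, the sketch below shows the per-triplet log-ratio loss as described in the cited work on metric learning beyond binary supervision, together with the memory arithmetic behind the 59K × 59K figure. This is an illustrative reading, not the exact implementation used in this work.

```python
# Sketch of the log-ratio loss for one triplet (anchor a, neighbours i, j):
# the squared difference between feature-space and label-space log distance ratios.
import torch

def log_ratio_loss(f_a, f_i, f_j, y_a, y_i, y_j, eps=1e-8):
    d_f = torch.log(torch.dist(f_a, f_i) + eps) - torch.log(torch.dist(f_a, f_j) + eps)
    d_y = torch.log(torch.dist(y_a, y_i) + eps) - torch.log(torch.dist(y_a, y_j) + eps)
    return (d_f - d_y) ** 2

# Why the dense version does not scale: a full pairwise distance matrix over
# N = 59,000 samples has N * N ≈ 3.48e9 entries, i.e. roughly 14 GB in float32.
N = 59_000
print(N * N, "elements,", N * N * 4 / 1e9, "GB in float32")
```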

In-Batch Triplet Sampling

Metric Loss

Our loss is numerically more stable while preserving the idea of a continuous distance mapping between the feature space and the label space. We also sample several anchors from the mini-batch, so that the loss calculation is further generalized. One of the major disadvantages is that it is more computationally intensive than the conventional loss design, as previous works tend to consider only one anchor.
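A possible realization of this in-batch, multi-anchor sampling is sketched below, reusing the per-triplet term shown earlier. The exact sampling rule and stabilization used in this work are not fully specified here, so the function should be read as an assumption-laden illustration only.

```python
# Hedged sketch of in-batch triplet sampling with several anchors per mini-batch.
import torch

def in_batch_metric_loss(features, labels, num_anchors=4, eps=1e-8):
    """features: (B, D) embeddings; labels: (B, L) pose labels (e.g. SMPL parameters)."""
    B = features.size(0)
    anchors = torch.randperm(B)[:num_anchors].tolist()   # several anchors, not just one
    losses = []
    for a in anchors:
        rest = [k for k in range(B) if k != a]
        i, j = torch.tensor(rest)[torch.randperm(len(rest))[:2]].tolist()
        d_f = torch.log(torch.dist(features[a], features[i]) + eps) \
            - torch.log(torch.dist(features[a], features[j]) + eps)
        d_y = torch.log(torch.dist(labels[a], labels[i]) + eps) \
            - torch.log(torch.dist(labels[a], labels[j]) + eps)
        losses.append((d_f - d_y) ** 2)
    return torch.stack(losses).mean()   # average over all sampled anchors
```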

Metric Loss for Few-Shot Domain Adaptation

Segmentation Module

However, the most challenging issue is the lack of pixel-wise segmentation annotations in existing 3D human pose datasets. Since there is no ground-truth annotation, we generate pseudo ground truths for the 3D human pose dataset. After the annotations are prepared, we train the segmentation module using the IoU score between the pseudo ground truth and the predicted map.
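The text states only that the module is trained with the IoU between the pseudo ground truth and the prediction; a commonly used differentiable soft-IoU (Jaccard) objective is sketched below as one plausible formulation, not necessarily the exact loss used here.

```python
# Soft-IoU (Jaccard) training objective for a binary segmentation map.
import torch

def soft_iou_loss(pred_logits, target_mask, eps=1e-6):
    """pred_logits: (B, 1, H, W) raw scores; target_mask: (B, 1, H, W) in {0, 1}."""
    pred = torch.sigmoid(pred_logits)
    intersection = (pred * target_mask).sum(dim=(1, 2, 3))
    union = (pred + target_mask - pred * target_mask).sum(dim=(1, 2, 3))
    iou = (intersection + eps) / (union + eps)
    return (1.0 - iou).mean()   # minimizing 1 - IoU maximizes the overlap
```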

We generate the pseudo ground truth using an ensemble of Mask-RCNN [56], Cascade Mask-RCNN [57], and the 2D keypoint annotation from the MPI-INF-3DHP dataset. First, we build a 2D skeleton mask from the 2D keypoint annotation, which serves as a basic segmentation mask. Then we derive separate segmentation masks from each instance segmentation model and take their per-pixel intersection, so that only the overlapping regions remain; these are more robust and reliable to use as ground truth.

After this merging step, we finally obtain the union of the inferred segmentation maps and the 2D keypoint mask as the final pseudo ground truth, as sketched below. If we denote by O the pseudo-ground-truth segmentation mask of the input image x, then our general ground-truth generation process can be defined as in Equation 3.
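A minimal sketch of this mask-merging procedure, assuming all masks are binary arrays of identical size; the function and variable names are placeholders introduced for illustration.

```python
# Pseudo-ground-truth generation: intersect the two detectors' masks,
# then take the union with the keypoint-based skeleton mask.
import numpy as np

def make_pseudo_gt(mask_rcnn_mask, cascade_mask, skeleton_mask):
    agreed = np.logical_and(mask_rcnn_mask > 0, cascade_mask > 0)   # overlapping regions only
    pseudo_gt = np.logical_or(agreed, skeleton_mask > 0)            # union with the skeleton mask
    return pseudo_gt.astype(np.uint8)
```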

To merge the predicted segmentation maps with the human pose features, the segmentation maps must be further processed to have the same dimension as the human pose feature. The human pose feature f_p ∈ ℝ^e is a feature vector of dimension e, so we construct a dimensionality reduction function F_d : f_s ↦ ℝ^e, where f_s ∈ ℝ^(w×h). After dimension reduction, we fuse the two features with an element-wise multiplication. The fused feature f is then fed into the pose generator and the subsequent networks, which produce the predicted human shape at the final stage.
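The fusion step can be sketched as follows; using a single linear layer for F_d is an assumption made for illustration, since the text only requires that the output dimension match the pose feature.

```python
# Hedged sketch of the fusion: reduce the (w x h) segmentation map to dimension e,
# then fuse with the pose feature by element-wise multiplication.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, w, h, e):
        super().__init__()
        self.reduce = nn.Linear(w * h, e)     # F_d : R^(w*h) -> R^e

    def forward(self, f_s, f_p):
        # f_s: (B, w, h) segmentation map, f_p: (B, e) pose feature
        f_s_reduced = self.reduce(f_s.flatten(start_dim=1))
        return f_p * f_s_reduced              # fused feature f, fed to the pose generator
```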

One option is to train the segmentation module and the pose estimator simultaneously; the other is to pre-train each network and fine-tune the feature fusion network without further training the segmentation module. Since our pose estimator is based on a GAN architecture, joint training may cause mode collapse, which should be avoided if possible. Therefore, we pre-train the segmentation module, freeze its weights, and then fine-tune the feature fusion network by co-training it with the pose estimation network.
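In a typical PyTorch setup this two-stage strategy could look like the sketch below; the module names and learning rate are illustrative placeholders rather than the actual training configuration.

```python
# Freeze the pre-trained segmentation module, then optimize only the fusion
# network and the pose estimator during fine-tuning.
import torch

def build_finetune_optimizer(segmentation_module, fusion_net, pose_estimator, lr=1e-4):
    segmentation_module.eval()
    for p in segmentation_module.parameters():
        p.requires_grad_(False)               # segmentation weights stay fixed
    params = list(fusion_net.parameters()) + list(pose_estimator.parameters())
    return torch.optim.Adam(params, lr=lr)    # only fusion + pose estimator are updated
```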

Dataset

Implementation Details

Evaluation Results: Metric Loss

If the frame-sampling window is set to 3, the previous and next 3 frames are chosen. The other hyperparameter is the weighting factor for the metric learning loss. Note that the performance stated in the VIBE paper and our evaluation result on the baseline VIBE do not match. Although MPJPE and PVE are sacrificed, our method achieves better performance in terms of acceleration error and PA-MPJPE.
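For clarity, the reported metrics follow their standard definitions; the sketch below shows MPJPE and the acceleration error as commonly implemented (PA-MPJPE applies an additional Procrustes alignment before the same per-joint error), not code taken from this work.

```python
# Standard pose-error metrics on a sequence of J joints over T frames (in mm).
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (T, J, 3) joint trajectories; mean per-joint position error."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def accel_error(pred, gt):
    """Mean difference of second-order finite differences (temporal smoothness)."""
    accel_pred = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    accel_gt = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(accel_pred - accel_gt, axis=-1).mean()
```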

Since the ground-truth SMPL parameters are not provided for the MPI-INF-3DHP test set, we do not include the PVE metric there. Because of the reproducibility problem, we report both the values from the VIBE paper and our own evaluation results. On the 3DPW test set, the strategy that includes nearby frames was second to VIBE in terms of MPJPE and PVE, while in terms of PA-MPJPE and acceleration error VIBE did not outperform the 'include near' sampling strategy.

To summarize, our method sacrifices MPJPE but achieves better temporal consistency, as the acceleration error is improved. This shows that our metric loss effectively regularizes the temporal pose generator to produce temporally consistent results while discriminating between similar and dissimilar poses in the feature space.

Evaluation Results: Segmentation Module

The model trained with option 1 produces an irregular outline of the human body, since background estimation was given the highest priority. On the other hand, the model trained with option 2 produces a smoother and more continuous foreground area. Note that we did not train the segmentation module on the 3DPW dataset, yet it still produces reliable inference results.

Although it generalizes well to in-the-wild outdoor images, it produces spurious responses in the background, even though we performed noise filtering with a threshold value. As illustrated in the middle image of Figure 6, if the person interacts with an object (e.g., clothes), the segmentation module tends to pay attention to the object. Although this would be useful when considering person-object interaction, the negative effect would be much more significant, since the model may include the entire background in the mask if the background is very noisy or crowded.

From the results in Tables 1, 2, and 3, the segmentation module showed better performance in terms of PVE compared to the baseline VIBE. Although MPJPE and PA-MPJPE are degraded, our findings suggest that further improvement would be possible with some engineering effort. Unfortunately, we could not perform further experiments due to an error during data preparation.

However, our preliminary results indicate that this approach can make further progress in the future.

Evaluation Results: Few-Shot Domain Adaptation

Qualitative Analysis

Jain, "Multiview-consistent semi-supervised learning for 3d human pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), juni 2020. Zeng, "Fusing wearable imus with multi-view images for human poseestimation: A geometric approach," i Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), juni 2020. Yang, "Sequential 3d human pose and shape estimation from point clouds," in Proceedings of the IEEE/CVF Konference om computersyn og mønstergenkendelse (CVPR), juni 2020.

Wang, "Multi-context attention for human pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. Asari, "Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Bai, “Progressive pose attention transfer for person image generation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.

Fu, "Tell me where to look: Guided attention inference network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. Fujiyoshi, "Attention branch network: Learning of attention mechanism for visual explanation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. Lu, "Dual attention network for scene segmentation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, p.

Xu, "Maskflownet: Asymmetric feature matching with learnable occlusion mask," i Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), juni 2020. Triebel, "Multi-path learning for object pose estimation på tværs af domæner ," i Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), juni 2020. Hariharan, "Revisiting pose-normalization for fine-grained few-shot recognition," i Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), juni 2020.

Lim, "One-shot domain adaptation for face generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Callet, "Few-shot pill recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Achard, "Pandanet: Anchor-based single-shot multi-person 3d pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Perez, "xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Kwak, "Deep metric learning beyond binary supervision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, p.

References
