
Human Perception and Pose Estimation

Academic year: 2023

Full text

You taught me to play without fear of the unknown and to be comfortable without knowing all the answers. You shared the laughter of my best days and the struggles of my hardest moments.

FLAVORS OF ROBOTS

Another construction discussed in the Golden Age of Automata [16] is "The Flute Player" by Innocenzo Manzetti [17, 18]. Social robots are able to perform a variety of tasks in the context of social interactions with human agents.

Figure 1.1: The evolution of human-made tools. As we evolved from Homo Sapiens to Homo Deus [14], our tools also matured: from the early days of building spears for hunting, to the Mechanical Turk, all the way to modern day robots and self-driving cars.

(COMPUTER) VISION FOR SOCIAL ROBOTS

Most of the early object detection algorithms were based on hand-crafted local image feature descriptors—designed to be invariant to transformations such as scaling or rotation—that were selectively compared to parts of the image in the hope of finding a successful match. As is clear from the definition, context is an overarching concept that can be applied to all the previous work.
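
As an illustration of this classical matching pipeline (a minimal sketch, not one of the detection systems discussed in the thesis), the snippet below compares ORB descriptors, which are rotation invariant and partially scale invariant, between a template and a scene image using OpenCV; the image paths are placeholders.

```python
# Minimal sketch of hand-crafted descriptor matching with OpenCV's ORB.
import cv2

template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)
kp_t, des_t = orb.detectAndCompute(template, None)
kp_s, des_s = orb.detectAndCompute(scene, None)

# Brute-force Hamming matcher with cross-checking to keep only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_t, des_s), key=lambda m: m.distance)

# Many low-distance matches suggest the template object is present in the scene.
print(f"{len(matches)} candidate matches; best distance: {matches[0].distance}")
```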

THESIS STATEMENT

Furthermore, it implies having a correct spatial perception of the scene and being able to evaluate the depths and relative positions of the objects in the scene. We investigate the data efficiency, accuracy, and viewpoint invariance of the learned embedding applied to the task of regressing three-dimensional human pose from monocular images.

DESCRIBING COMMON HUMAN VISUAL ACTIONS

Introduction

In the present study, we focus on actions that can be detected from single images (rather than videos). We explore the visual actions that are present in the recently collected COCO image dataset [5].

Previous Work

The work closest to our own is a dataset called TUHOI [16], which extends the ImageNet [17] annotations by providing descriptions of the actions contained in the images. Second, we collect data in the form of semantic networks, where active entities and all the objects they interact with are represented as connected nodes.

Table 4.1: State-of-the-art datasets for single-frame action recognition. We indicate with ‘x’ quantities that are not annotated, with ‘-’ statistics that are not reported.

Framework

Persons labeled in COCO images were treated as potential “subjects” of actions, and AMT workers labeled all the objects they interacted with and assigned the corresponding visual actions. While taxonomy has been accepted as an appropriate means of organizing object categories (e.g. animal → mammal → dog → dalmatian), and shallow taxonomies are available for verbs in VerbNet [21], we are not interested in such a precise categorization; therefore, our set of visual actions contains no taxonomies.

Dataset Collection

We measured the accuracy (percent agreement) between the author's answer and the combined annotations in judging the visibility flag of the 'subject' in the image. Tables 4.2 and 4.3 contain a breakdown of the visual actions and adverbs in the categories presented to the AMT workers.

Figure 4.3: Visual VerbNet (VVN). (Top-Left) List of the 140 visual actions that constitute VVN – bold ones were added after the comparison with COCO captions.

Analysis

In Figure 4.8 we show the complete list of visual action labels and their occurrences. The distribution of visual action occurrences is heavy-tailed; we fit the line shown in Figure 4.9, with a slope of α ≈ −3.
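
As a rough sketch of how such a slope can be estimated, the snippet below fits a line in log-log space to synthetic Zipf-distributed counts standing in for the COCO-a statistics; a maximum-likelihood estimator would be more principled, and the numbers here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-action occurrence counts standing in for the COCO-a statistics.
counts = np.sort(rng.zipf(a=2.5, size=140))[::-1].astype(float)

# Empirical frequency of each count value, then a straight-line fit in log-log
# space; the slope plays the role of the exponent alpha (~ -3 in the thesis).
values, freqs = np.unique(counts, return_counts=True)
slope, intercept = np.polyfit(np.log(values), np.log(freqs), deg=1)
print(f"estimated power-law slope alpha ~ {slope:.2f}")
```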

Table 4.2: List of visual actions in COCO-a grouped by category. The complete list of visual actions contained in Visual VerbNet.

Discussion and Conclusions

We give two examples of the information that can be extracted and examined, for an object and a visual action in the dataset. Sample images returned as a result of searching our dataset for visual actions with rare emotion, attitude, position, or location combinations.

Figure 4.10: Sample images and rare annotations contained in the COCO-a dataset.

DISTANCE ESTIMATION OF AN UNKNOWN PERSON FROM A PORTRAIT

  • Introduction
  • Related Work
  • Caltech Multi-Distance Portraits Dataset
  • Problem Formulation
  • Method
  • Results
  • Conclusions

Is it possible to estimate the distance between the camera (or the painter's eye) and the subject's face? In the case of the regression task, we only have access to a single shape Si at test time.

Figure 5.1: Example annotations from the Caltech Multi-Distance Portraits (CMDP) dataset.

BENCHMARKING AND ERROR DIAGNOSIS IN MULTI-INSTANCE POSE ESTIMATION

Introduction

Our goal is to propose a principled method for analyzing the performance of multi-instance human pose estimation algorithms, along with a taxonomy of error types characteristic of a multi-instance pose estimation framework.

Related Work

Keypoint similarity is computed by evaluating an unnormalized Gaussian function, centered on the ground-truth position of a keypoint, at the location of the detection. Bottom-up (part-based) methods first separately detect all the parts of the human body in an image, then try to group them into individual instances.
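
The following is a minimal sketch of such a similarity measure, in the spirit of the COCO keypoint similarity; the area-based normalization and the falloff constant are assumptions modeled on the COCO evaluation rather than the exact formula used here.

```python
import numpy as np

def keypoint_similarity(det_xy, gt_xy, area, kappa):
    """Unnormalized Gaussian centered on the ground-truth keypoint, evaluated
    at the detected location; `kappa` is a per-keypoint falloff constant and
    `area` the instance's pixel area (OKS-style normalization)."""
    d2 = np.sum((np.asarray(det_xy) - np.asarray(gt_xy)) ** 2)
    return np.exp(-d2 / (2.0 * area * kappa ** 2))

# Example: a detection 5 pixels from the ground truth on a 90x200-pixel person;
# kappa=0.08 is an illustrative falloff constant.
print(keypoint_similarity((105.0, 200.0), (100.0, 200.0), area=90 * 200, kappa=0.08))
```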

Figure 6.2: Similarity measure between a keypoint detection and its ground-truth location.

Multi-Instance Pose Estimation Errors

In Figure 6.6-(a), we can see that about 75% of the detections of both algorithms are good, while the percentage of jitter and inversion errors is about the same [2]. Finally, in Figure 6.6-(d) we see the performance improvement in terms of the AUC of the PR curves computed over the entire test set.

Figure 6.4: Taxonomy of keypoint localization errors. Keypoint localization errors, Section 6.3.1, are classified based on the position of a detection as: Jitter, in the proximity of the correct ground-truth location but not within the human error margin.

Sensitivity to Occlusion, Crowding, and Size

Another significant difference is in the number of false positive (FP) detections that have a high confidence score (in the top 20th percentile of overall scores), Figure 6.9-(b): [6] has more than double the number, almost all with a small pixel area (< 32²).
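
The bookkeeping behind this kind of statistic might look like the sketch below; the function, its arguments, and the toy numbers are illustrative, with only the percentile and area thresholds taken from the text.

```python
import numpy as np

def count_high_conf_small_fps(scores, areas, is_fp, pct=80, max_side=32):
    """Count false-positive detections whose confidence is in the top 20th
    percentile of all scores and whose pixel area is below max_side**2."""
    scores, areas, is_fp = map(np.asarray, (scores, areas, is_fp))
    thresh = np.percentile(scores, pct)
    return int((is_fp & (scores >= thresh) & (areas < max_side ** 2)).sum())

# Toy example: 3 detections, the last one a confident, tiny false positive.
print(count_high_conf_small_fps(
    scores=[0.2, 0.5, 0.95], areas=[5000, 40000, 600], is_fp=[False, False, True]))
```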

Discussion and Recommendations

Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Figure 6.12: Amount of training data in each of the suggested benchmarks for the COCO dataset.

A ROTATION INVARIANT LATENT FACTOR MODEL FOR MOVEME DISCOVERY FROM STATIC POSES

Introduction

Movemes correspond to motions of parts of the body (for example, a tennis serve or a volleyball strike), making it easier to qualitatively interpret and evaluate the learned movemes. Inference of action dynamics: changing the weights of the learned basis poses is analogous to moving along a line in the high-dimensional space of human poses (2D or 3D).

Related Work

Latent factor models are variants of matrix and tensor factorization, which can easily incorporate missing values or other types of constraints. Our method is complementary to and can be integrated with other latent factor modeling approaches.

Models

Here, aj denotes the viewing-angle cluster to which example j belongs. For the model of Section 7.4.5, we only need a fairly coarse prediction of the viewing angle (e.g. in quadrants).

Figure 7.3: Training pipeline of the proposed method LFA-3D. Basis poses U, coefficient matrix V, and viewing angles θ are initialized and updated through alternating stochastic gradient descent.
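
A compact sketch of this alternating scheme is given below, assuming a simplified objective of the form ||x_j − R(θ_aj) U v_j||² with in-plane 2D rotations; the dimensions, learning rates, and rotation parameterization are illustrative and not taken from the thesis.

```python
import torch

K, D, N = 13, 8, 500                      # keypoints, latent factors, poses (illustrative)
X = torch.randn(N, 2 * K)                 # flattened 2D poses (stand-in data)
cluster = torch.randint(0, 4, (N,))       # viewing-angle cluster a_j of each pose
U = torch.randn(2 * K, D, requires_grad=True)   # basis poses
V = torch.randn(N, D, requires_grad=True)       # per-pose coefficients
theta = torch.zeros(4, requires_grad=True)      # one rotation angle per cluster

def rotate(x, ang):
    # Apply a batched 2D rotation by `ang` (one angle per pose) to each keypoint.
    xy = x.view(x.shape[0], K, 2)
    c, s = torch.cos(ang), torch.sin(ang)
    R = torch.stack([torch.stack([c, -s], -1), torch.stack([s, c], -1)], -2)  # (N, 2, 2)
    return torch.einsum("bkj,bij->bki", xy, R).reshape(x.shape[0], 2 * K)

opts = [torch.optim.SGD([p], lr=1e-3) for p in (U, V, theta)]
for step in range(300):
    recon = rotate(V @ U.T, theta[cluster])        # reconstructed poses, (N, 2K)
    loss = ((recon - X) ** 2).mean()
    for o in opts:                                 # clear stale gradients everywhere
        o.zero_grad()
    loss.backward()
    opts[step % 3].step()                          # alternate: update U, then V, then theta
```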

Experiments and Empirical Results

Matrix V describes each pose in the dataset as a linear combination of the learned latent factors (Section 7.3.1). Each column of U corresponds to a latent factor that describes some of the movement variability in the data.

Figure 7.4: Example annotation for the orientation angle collected on the LSP dataset.

Conclusion and Future Directions

Zitnick, “Predicting object dynamics in scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. “Pose-conditioned joint angle limits for 3D human pose reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.

IT’S ALL RELATIVE: MONOCULAR 3D HUMAN POSE ESTIMATION FROM WEAKLY SUPERVISED DATA

Introduction

Unfortunately, in the case of 3D pose estimation, it is much more challenging to acquire large amounts of training data containing real-world people with their corresponding 3D poses. As the amount of training data (measured in megabytes of annotations) is reduced, the performance of a fully supervised algorithm, Figure 8.1-(b), breaks down; on the other hand, our model, Figure 8.1-(c), predicts accurate 3D poses even with very small amounts of relative training data.

Related Work

A loss for 3D pose estimation of articulated objects that can be trained on sparse and easy-to-collect relative depth annotations, with performance comparable to the state of the art. 3D pose estimation: there exist two main categories of methods for 3D pose estimation: (1) end-to-end models and (2) lifting-based approaches.
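
A sketch of what such a relative-depth loss can look like follows, using a common ranking formulation (penalize a predicted depth ordering that disagrees with the annotation, and penalize the squared depth difference for pairs annotated as being at the same depth); the exact loss used in the thesis may differ.

```python
import torch

def relative_depth_loss(z, pairs, labels):
    """z: (B, K) predicted keypoint depths. pairs: (P, 2) keypoint index pairs (i, j).
    labels: (P,) with +1 if i is annotated closer than j, -1 if farther, 0 if equal."""
    diff = z[:, pairs[:, 0]] - z[:, pairs[:, 1]]          # (B, P) depth differences
    rank = torch.log1p(torch.exp(labels * diff))          # ranking term for ordered pairs
    tie = diff ** 2                                       # pull depths together for ties
    return torch.where(labels == 0, tie, rank).mean()

# Toy usage: batch of 2 poses, 4 keypoints, 2 annotated pairs.
z = torch.randn(2, 4)
pairs = torch.tensor([[0, 1], [2, 3]])
labels = torch.tensor([1.0, 0.0])
print(relative_depth_loss(z, pairs, labels))
```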

Method

Now the scaling parameters of the network are interpreted differently and are used to predict the distance from the camera to the center of the person in 3D. The output of the non-linearity is scaled using a hyperparameter so that the network can predict an arbitrarily wide range.
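
A minimal sketch of such a scaled output non-linearity is shown below; the module name, layer sizes, and range hyperparameter are assumptions rather than the thesis's actual values.

```python
import torch
import torch.nn as nn

class ScaledDepthHead(nn.Module):
    """Predicts the distance from the camera to the person's center, with a tanh
    output scaled by a hyperparameter so the network can cover a wide depth range."""
    def __init__(self, in_features, max_depth=10.0):
        super().__init__()
        self.fc = nn.Linear(in_features, 1)
        self.max_depth = max_depth           # hyperparameter scaling the output range

    def forward(self, features):
        return self.max_depth * torch.tanh(self.fc(features))
```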

Human Relative Depth Annotation Performance

We compare the performance of our algorithm with that obtained using full 3D supervision. (a) Histogram of errors across all images in the Human3.6M test set (lines and dots indicate cumulative sums and percentages, respectively). This can also be seen in the error distribution over keypoint type in Figure 8.7-(b).

Figure 8.5: Human performance analysis on the relative keypoint depth annotation task.

Conclusion

Chan, "Human 3d pose estimation from monocular images with deep convolutional neural network", inACCV, 2014 (cited on p. 116). Moreno-Noguer, "3d estimation of human pose from a single image via distance matrix regression", in CVPR, 2017 (cited on p. 116).

Figure 8.9: Inference time 3D pose predictions on the LSP dataset. Fine-tuning (FT) on LSP significantly improves the quality of our predictions, especially for images containing uncommon poses and viewpoints that are not found in Human3.6M, such as those […]

HOW MUCH DOES MULTI-VIEW SELF-SUPERVISION HELP 3D POSE ESTIMATION?

Introduction

We are therefore interested in exploring methods to reconstruct 3D poses that require the least possible supervision. However, 3D pose estimation from monocular images and robot control are the main applications we have in mind, and they both benefit from a camera-aware representation of pose.

Related Work

Our goal is to learn features that encode 3D pose information without requiring any direct 3D supervision. In this work, we investigate design decisions related to multi-view self-supervised learning for single view 3D pose estimation.

Method

The goal of the pose decoder is to map vectors e from the viewpoint-invariant embedding space to the output space of 3D human poses. The input to the viewpoint decoder is the tensor b ∈ R^(512×35×35) obtained from the “mixed-5d” layer of the Inception-V3 backbone [52].
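
A rough sketch of these two decoder heads is given below; the layer widths, embedding size, and joint count are assumptions, and the viewpoint target is reduced to a single angle for illustration.

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Maps a viewpoint-invariant embedding e to 3D joint coordinates."""
    def __init__(self, emb_dim=128, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * num_joints))

    def forward(self, e):
        return self.net(e).view(-1, self.num_joints, 3)

class ViewpointDecoder(nn.Module):
    """Consumes a (512, 35, 35) intermediate feature map (e.g. Inception-V3
    "mixed-5d") and regresses the camera viewpoint, here as a single angle."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, 1)

    def forward(self, b):
        return self.fc(self.conv(b).flatten(1))

# Shape check with dummy tensors.
print(PoseDecoder()(torch.randn(4, 128)).shape)               # torch.Size([4, 17, 3])
print(ViewpointDecoder()(torch.randn(4, 512, 35, 35)).shape)  # torch.Size([4, 1])
```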

Figure 9.2: Encoder-decoder pipeline for monocular 3D human pose estimation. The proposed pipeline for semi-supervised monocular 3D pose estimation is based on three components: i) a Time Contrastive Network encoder which embeds an image from a video into […]

Experiments

In Figure 9.3, we observe that tcn-oracle outperforms the simple baseline, regardless of the pose representation adopted. In Figure 9.6-(b) we visualize the relationship between the distance in the egocentric 3D pose space and the distance in the learned feature space.
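
One way to quantify such a relationship is sketched below, assuming arrays of learned embeddings and egocentric 3D poses; the function and variable names are illustrative.

```python
import numpy as np

def distance_agreement(embeddings, poses_3d, num_pairs=2000, seed=0):
    """Correlation between pairwise distances in the learned embedding space
    and pairwise distances in egocentric 3D pose space, over random frame pairs."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    i = rng.integers(0, n, num_pairs)
    j = rng.integers(0, n, num_pairs)
    d_emb = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    d_pose = np.linalg.norm((poses_3d[i] - poses_3d[j]).reshape(num_pairs, -1), axis=1)
    return np.corrcoef(d_emb, d_pose)[0, 1]

# Toy usage with random stand-in data: 1000 frames, 128-D embeddings, 17 joints.
print(distance_agreement(np.random.randn(1000, 128), np.random.randn(1000, 17, 3)))
```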

Figure 9.3: Amount of supervision versus reconstruction error trade-off curve for Ego versus Camera-Centric pose representation.

Discussion and Recommendations

Conclusion

“Temporal cycle-consistency learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. “Unsupervised 3D pose estimation with geometric self-supervision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.

SUMMARY OF THESIS AND CONTRIBUTIONS

When 3D ground-truth data are either missing or only available in small amounts, such understanding can be used, along with the additional constraints provided by alternative types of supervision signals, to learn models that are able to regress the full 3D pose of the human body and predict its movements from monocular 2D images alone. This is essential to provide social robots with the ability to understand the 3D pose of the people they interact with, and it opens up the possibility of richer interactions in scenarios where full 3D supervision is unrealistic to expect.

FUTURE STEPS TOWARDS SOCIAL ROBOTS

In fact, just as people transition through different poses while moving, we would like to understand if and how the dynamic state of the human body can be represented by trajectories in the learned embedding space. The embedding of Chapter 9 has the advantage of retaining an egocentric representation of the human body.

Figure 11.1: Stereo and ego-centric video capturing rig. The rig designed for capturing stereo ego-centric videos of human actions and social interactions.

BIBLIOGRAPHY

“Efficient object localization using convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.


