In Chapter 9, we described a methodology for learning an embedding of the pose of the human body using only training data in the form of synchronized videos from multiple uncalibrated cameras. The embedding is viewpoint-invariant and can be used to decode poses of the human body in both an ego-centric and camera-centric representation with a small metric error on the 3D reconstruction.
The synchronized videos, however, were taken in an indoor environment and with fixed cameras, while the proposed methodology does not require such limiting assumptions. A future direction of research we plan to pursue is to quantitatively verify the validity of our approach when training is performed on more complex, in-the-wild video collections.
To this end, we designed a stereo ego-centric video capturing rig, shown in Figure 11.1, composed of two GoPro Hero Session cameras with a 120° horizontal FOV which can record videos at 1080p and 60 fps. The cameras are positioned in a stereo setup with a baseline in the range of [60, 120] mm to provide an approximate depth range of [0.5, 5] m, sufficient for capturing most of the relevant interactions between humans which we wish to observe.
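As a quick sanity check on these numbers, the standard pinhole-stereo relation Z = f·B/d relates baseline, focal length and disparity to depth, and gives an estimate of the depth uncertainty over the intended working range. The sketch below is only illustrative: the 1920 px image width and the half-pixel disparity error are assumptions, not measured properties of the rig.

```python
import math

def stereo_depth_error(baseline_m, hfov_deg, image_width_px, depth_m, disparity_err_px=0.5):
    """Approximate disparity and depth uncertainty of a pinhole stereo pair.

    Uses Z = f * B / d, hence dZ ~ Z^2 / (f * B) * d_err.
    """
    # Focal length in pixels derived from the horizontal field of view.
    f_px = (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    disparity_px = f_px * baseline_m / depth_m
    depth_err_m = (depth_m ** 2) / (f_px * baseline_m) * disparity_err_px
    return disparity_px, depth_err_m

# Example: 1920 px wide frame, 120 deg HFOV, 100 mm baseline, subject at 5 m.
print(stereo_depth_error(baseline_m=0.10, hfov_deg=120, image_width_px=1920, depth_m=5.0))
```

Under these assumptions, a subject at 5 m still produces roughly 11 px of disparity with about 0.2 m of depth uncertainty, consistent with the intended [0.5, 5] m working range.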
In Figure 11.2, we show the data collection setup: two or more people wearing the video recording rigs will be moving while filming one or multiple actors performing a task, and possibly also interacting with each other in some way.
This will provide a multi-view setup between the pairs of cameras on different rigs, in which the camera locations and their relative orientation change in time, from a variety of viewpoints, dynamic backgrounds and illumination conditions.
Further, the change in pose of each rig could be tracked using additional sensors, such as Inertial Measurement Units (IMUs). At the same time, the use of a calibrated stereo setup between the two cameras on each individual rig will allow collecting the ground-truth 3D position of the parts of the human body1, and can be used to estimate the performance of the 3D pose estimation algorithm.
¹ The 2D locations of body parts in each monocular image can be annotated by humans or estimated through a state-of-the-art algorithm.
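For concreteness, a minimal sketch of how such ground truth could be obtained: given the projection matrices from calibrating the two cameras on a rig, the annotated or estimated 2D joint locations in each view can be triangulated into 3D with standard tools such as OpenCV. The function and variable names below are placeholders rather than part of an actual pipeline.

```python
import numpy as np
import cv2

def triangulate_joints(P_left, P_right, kps_left, kps_right):
    """Triangulate per-joint 3D positions from a calibrated stereo pair.

    P_left, P_right : 3x4 projection matrices from the stereo calibration.
    kps_left, kps_right : (J, 2) arrays of 2D joint locations in each image,
        e.g. human annotations or the output of a 2D pose estimator.
    Returns a (J, 3) array of 3D joint positions in the reference-camera frame.
    """
    pts_h = cv2.triangulatePoints(P_left, P_right,
                                  kps_left.T.astype(np.float64),
                                  kps_right.T.astype(np.float64))  # 4 x J, homogeneous
    return (pts_h[:3] / pts_h[3]).T  # dehomogenize -> (J, 3)
```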
Figure 11.1: Stereo and ego-centric video capturing rig. The rig designed for capturing stereo ego-centric videos of human actions and social interactions.
Figure 11.2: Video capture setup for collecting a dataset of ego-centric videos of actions and social interactions. The setup allows recording multiple human subjects, performing an action or interacting, with two or more recording rigs in relative motion to obtain valuable multi-view signal, while the stereo pair on each recording rig provides the 3D ground-truth.
We believe that verifying the robustness and generalization ability of the proposed learning algorithm on such a novel and truly in-the-wild video collection could be very impactful. In fact, using a portable rig with the proposed design, one can easily obtain multi-view and stereo training videos of human actions and interactions from practically any location: on mountain tops or underwater, or in heavily trafficked environments such as shopping malls, stadiums and parks.
Another research direction we are planning to pursue is studying in depth how the learned embedding can be modified to encode complex and highly articulated motions of the human body beyond its static 3D poses. In fact, similarly to how humans transition through different poses while moving, we would like to understand if and how the dynamic state of the human body can be represented by following trajectories in the learned embedding space.
Figure 11.3: Dynamic model of the leg of the robot Cassie. Figure taken from [98], showing (a) the leg joints and (b) the leg model of the robot Cassie [99].
In Chapter 7 we showed that linear interpolations between vectors in the learned embedding produced realistic-looking movemes. We would like to verify whether such an observation i) also applies to the embedding learned with the algorithm introduced in Chapter 9, which is trained on data that is easier to collect and achieves better performance, ii) holds true for more complex non-linear trajectories, and iii) can be extended to model longer actions composed of sequences of movemes.
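A minimal sketch of the kind of probe point i) calls for, assuming an encoder h(·) and a pose decoder with the roles they have in Chapter 9 (the names and interfaces here are illustrative): embeddings of a start and an end frame are linearly blended, and each intermediate vector is decoded into a 3D pose.

```python
import numpy as np

def interpolate_movemes(encoder, pose_decoder, img_start, img_end, n_steps=10):
    """Decode poses along the straight line between two embedding vectors.

    encoder      : maps an image to its embedding vector, as h(.) in Chapter 9.
    pose_decoder : maps an embedding vector to a 3D pose (J x 3 joint array).
    If the intermediate poses look realistic, movemes plausibly lie along
    (locally) linear paths in the embedding space.
    """
    z0, z1 = encoder(img_start), encoder(img_end)
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [pose_decoder((1.0 - a) * z0 + a * z1) for a in alphas]
```

Point ii) could then be probed by replacing the convex combination with, for instance, a spline through several anchor embeddings.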
In the work presented in Chapter 7 and Chapter 9, a point in the embedding space encoded a single image representing a human pose. We would like to investigate ways to encode multiple frames from consecutive timestamps into a single embedding vector, and analyze whether this results in an improved embedding, not only in terms of single-frame 3D pose estimation, but also for predicting human dynamics.
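One simple way to realize such a multi-frame encoding would be to pool per-frame embeddings with a small temporal module, as sketched below in PyTorch; the recurrent aggregation, the dimensions and the module names are illustrative choices rather than a committed design.

```python
import torch
import torch.nn as nn

class TemporalEmbedder(nn.Module):
    """Aggregate the embeddings of T consecutive frames into a single vector.

    frame_encoder: the single-image encoder h(.), assumed to output D-dim vectors.
    A GRU summarizes the short sequence; its last hidden state is projected back
    to the original embedding dimension so the existing decoders still apply.
    """

    def __init__(self, frame_encoder, embed_dim=128):
        super().__init__()
        self.frame_encoder = frame_encoder
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, frames):                          # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        z = self.frame_encoder(frames.flatten(0, 1))    # (B*T, D) per-frame embeddings
        z = z.view(B, T, -1)
        _, h_last = self.gru(z)                         # h_last: (1, B, D)
        return self.proj(h_last.squeeze(0))             # (B, D) sequence embedding
```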
Finally, we would like to explore the framework in which our work can be applied to controlling the visuomotor skills of underactuated dynamic bipedal robots that can walk and run in a fashion similar to humans or animals, such as Cassie [99], shown in Figure 11.3.
Recent work [82–84, 100] has shown that image features, extracted with convolutional neural networks, paired with self regression or reinforcement learning can be used to control manipulator robots and have them perform simple movements or object relocation tasks.
[Figure 11.4 diagram: images are mapped by h(·) into the viewpoint-invariant human pose manifold; a pose decoder outputs P(h(I)) ∈ R^{3×J} (J = number of joints) and a dynamics decoder outputs C(h(I)) ∈ R^{A} (A = number of actuators).]
Figure 11.4: Playing “Simon Says” in the viewpoint-invariant embedding. First, we learn a manifold of human poses in such a way that the videos of a person performing a movement, and those of Cassie being manually controlled to imitate that movement, follow similar trajectories in the embedding space, regardless of their viewpoint. Secondly, we learn a “dynamics decoder” that can map the learned embedding vectors to Cassie’s actuators’
control signals. The goal is to study the extent to which the dynamics decoder can be trained with few examples and generalize to novel viewpoints and movements, as is the case for the pose decoder in the task of 3D pose estimation.
We would like to study how the methodology introduced in Chapter 9 can be expanded and applied to the highly complex case of controlling an underactuated robot imitating human body motions, like walking or squatting. This presents several new challenges, such as describing the dynamic model of the robot and dealing with instabilities due to the unknown surface contact points. While designing complex feedback control algorithms for such types of robots is the main focus of state-of-the-art robotics research [98], we would like to investigate whether expanding the state variable with learned features, extracted from images of the robot moving successfully, can improve such control algorithms. This would add information to the state representation about the image appearance that is relevant to the movement being controlled, with the hope of making the control more stable, safe, and efficient.
In Figure 11.4 we exemplify one possible direction of further investigation. Similarly to how the “pose decoder” introduced in Chapter 9 is able to produce a 3D pose of the human body from the viewpoint-invariant embedding space, we would like to learn a “dynamics decoder” that can output a control signal for Cassie’s actuators, resulting in Cassie moving according to its visual input in a way that could enable it to play the game of “Simon Says”.
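A minimal sketch of what such a dynamics decoder could look like, assuming the encoder of Chapter 9 is kept frozen and a short window of its embedding vectors is regressed onto the control signals recorded while Cassie is manually driven; the embedding dimension, window length and number of actuators below are placeholders.

```python
import torch
import torch.nn as nn

class DynamicsDecoder(nn.Module):
    """Map a short trajectory in the pose embedding to actuator commands.

    Input : (B, T, D) window of embedding vectors from the frozen encoder h(.).
    Output: (B, A) control signals, one per actuator.
    Could be trained with an L2 loss against the commands recorded while the
    robot is manually controlled to imitate the demonstrated movement.
    """

    def __init__(self, embed_dim=128, window=8, n_actuators=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, T, D) -> (B, T*D)
            nn.Linear(window * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_actuators),
        )

    def forward(self, embedding_window):         # embedding_window: (B, T, D)
        return self.net(embedding_window)
```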
The embedding of Chapter 9 has the advantage of being able to retain an ego-centric representation of the human body. This could make it easier to learn a single control policy that is invariant to the viewpoint from which the robot observes the person it is imitating.