Attribution-NonCommercial-NoDerivs 2.0 Korea
You are free to copy, distribute, transmit, display, perform, and broadcast this work, under the following conditions:
Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear to others the license terms applied to this work.
Any of these conditions can be waived if you obtain permission from the copyright holder.
Your rights under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.
Disclaimer
Master's Thesis
Kyuiyong Lee
Department of Electrical Engineering
Ulsan National Institute of Science and Technology
2022
Pedestrian Trajectory Prediction with Monocular Camera
Kyuiyong Lee
Department of Electrical Engineering
Ulsan National Institute of Science and Technology
Pedestrian Trajectory Prediction with Monocular Camera
A thesis/dissertation submitted to
Ulsan National Institute of Science and Technology in partial fulfillment of the
requirements for the degree of Master of Science
Kyuiyong Lee
Seungjoon Yang
Pedestrian Trajectory Prediction with Monocular Camera
12.13.2021 of submission
Approved by
Advisor
Kyuiyong Lee
This certifies that the thesis/dissertation of Kyuiyong Lee is approved.
12.13.2021 of submission
Signature
___________________________
Advisor: Seungjoon Yang
Signature
___________________________
Seungryul Baek
Signature
___________________________
Jeong hwan Jeon
Pedestrian Trajectory Prediction with Monocular Camera
Abstract
Understanding the 3-dimensional space surrounding a vehicle is necessary for autonomous driving.
Among the obstructions on the road, localizing and tracking pedestrians is especially crucial to safety.
Many researchers approach these tasks with assistance from LiDAR or RADAR, which are highly expensive. A cost-effective alternative is to employ monocular RGB cameras and track pedestrians in 2-dimensional image coordinates. In this paper, we localize pedestrians and track their trajectories in 3-dimensional world coordinates, employing a monocular RGB camera and applying computer vision and deep learning techniques to 2-dimensional images. The essential task for preventing pedestrian-related accidents is predicting pedestrians' forthcoming movements. Pedestrians' actions and gestures give drivers cues about their intention to intrude on or move away from the vehicle's expected course. Interactions between pedestrians, or between the vehicle and a pedestrian, can also affect their future trajectories. In this paper, we detect and track pedestrians and predict their future trajectories by combining their gestural information from a Graph Convolutional Network (GCN) with relational information from Social-LSTM, an LSTM network for pedestrian trajectory prediction in crowded scenes.
Contents
I Introduction . . . 1
II Related Works . . . 2
2.1 Pedestrian 3D Localization from Monocular Image . . . 2
2.2 Visual Odometry . . . 2
2.3 Pedestrian Trajectory Prediction . . . 2
2.4 Skeleton-Based Action Recognition . . . 2
III Method . . . 3
3.1 Pipeline . . . 3
3.2 Pedestrian Localization & Tracking . . . 3
3.3 Pedestrian Trajectory Prediction . . . 5
IV Experiments . . . 7
4.1 Dataset . . . 7
4.2 Implementation . . . 7
V Results . . . 8
VI Conclusion . . . 9
References . . . 10
Acknowledgements . . . 13
List of Figures
1 Pipeline of our method. . . 3
2 Detection and Tracking. . . 3
3 Coordinate conversion. . . 4
4 Detecting background movements. . . 5
5 Compensating ego-motion. . . 5
6 Examples of Social-LSTM (left) and GCN+Social-LSTM (right) results on PIE dataset. . 8
7 Jittery data caused by jumpy tracking. . . 9
I Introduction
Autonomous driving is one of the most actively researched fields of the past decade.
Safety is the most important issue in autonomous driving. To prevent accidents, autonomous vehicles must accurately detect pedestrians and predict their future paths, so that this information can be effectively taken into account when the vehicle plans its next movements.
Typically, the pedestrian's future trajectory is estimated by considering the pedestrian's past locations [1] and the interactions between pedestrians [2]. However, it is not uncommon for pedestrians to change direction in a brief timeframe. The accumulated locations, or the relations derived from those locations, are not enough to accurately predict such sudden behavior.
While driving, drivers use a vast range of information from pedestrians to understand their intentions, ranging from direct signals to subtle gestures and glances. It is natural for neural networks to utilize the same kind of information to predict pedestrians' sudden behaviors.
Many self-driving cars are equipped with LiDAR or RADAR to perceive their surroundings. These sensors measure object depth accurately but are costly. A less expensive approach can employ a monocular RGB camera. In many pedestrian path estimation studies using monocular cameras, pedestrians' past and future trajectories are only considered on the 2-dimensional image plane. However, degrading the 3-dimensional information into lower-dimensional information hinders understanding of critical features such as the actual speed of pedestrians or the distance from them. Moreover, extracting 3D information from 2D images is an ill-posed problem.
We extract three-dimensional information by using Monoloco [3], a deep learning neural network that utilizes information on pedestrian body parts from monocular RGB images.
The movement of the vehicle affects the estimated location. To determine the absolute world coordinates independent of the vehicle's movement, we need to know the vehicle's ego-motion. The task of visual odometry, estimating the ego-motion of the camera from video, is also actively researched in the field of deep learning. In this paper, we present a method to correct the position of a pedestrian without employing the whole visual odometry pipeline. Our method presumes the dominant optical flow in each grid cell of a frame to be the ego-motion and calculates the pedestrian's actual displacement by dividing the optical flow vectors into two parts.
Also, as GPS and OBD (On-Board Diagnostics) sensors become easily accessible, the velocity and heading information from these sensors is put to use for robust ego-motion correction.
II Related Works
2.1 Pedestrian 3D Localization from Monocular Image
A pure vision-based system [4] has detected and tracked pedestrian coordinates, but only on the 2D image plane. The problem has classically been addressed by first detecting pedestrians in an image and then applying perspective transformations to the image coordinates; [5] and [6] tackle it with trigonometric and algebraic approaches. Deep learning has proven effective in estimating depth from monocular images [7], [8], and using the pedestrian skeleton further improves performance [3].
2.2 Visual Odometry
Conventional methods such as [9] effectively estimate ego-motion using trifocal geometry. However, they require a pair of stereo images due to the scale-ambiguity problem and show severe errors for rotation in certain directions or for movement at low speed. [10] and [11] attempt monocular visual odometry using deep learning, and, as in [12], the task is also studied in connection with monocular depth estimation.
2.3 Pedestrian Trajectory Prediction
[13] extracts features from a dense optical flow and predicts the future path of a pedestrian by applying a Kalman filter to the positional information. [1] employs a Dynamic Bayesian Network.
Machine learning is also actively used for this task. [14] and [15] predict pedestrians' crossing intentions from behavioral information using a random forest and deep learning, respectively. Recent works exploit an even larger range of information, such as vehicle odometry and local image context [16].
2.4 Skeleton-Based Action Recognition
In early studies [17], [18], features were extracted from handcrafted joint relations to model human actions. Deep learning then became the mainstream approach: spatial and temporal information of joint coordinates is analyzed with RNN-based techniques [19] or pre-processed into images or tensors and analyzed by CNNs [20]. After it was found effective to apply a Graph Convolutional Network (GCN) by representing the relationships between body parts as a graph [21], the task has shown dramatic improvements in performance with GCNs [22], [23].
III Method
3.1 Pipeline
Figure 1: Pipeline of our method.
First, we detect the skeletons of pedestrians in the monocular video and track them in 2-D image coordinates. The tracked coordinates are converted to world coordinates and corrected with the vehicle ego-motion to remove the dependency on the vehicle's movement. The previously detected skeleton data is then combined with the processed world coordinates and fed as input to a GCN-LSTM network, which outputs the future coordinates of each pedestrian.
3.2 Pedestrian Localization & Tracking
Localization & Tracking
To obtain the skeleton information for later processing, we run a pose estimation algorithm on the images rather than a pedestrian detection algorithm. Plenty of off-the-shelf pose estimation algorithms are available; PifPaf [24] is selected for compatibility with Monoloco [3], a 3D pedestrian localization network. The bounding box of a pedestrian can be obtained from its detected skeleton. Pedestrian bounding boxes are tracked with the Simple Online and Realtime Tracking (SORT) [25] algorithm, a fast and effective tracker for multiple bounding boxes.
Figure 2: Detection and Tracking.
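To make the tracking step concrete, the following sketch shows how per-frame skeletons could be turned into bounding boxes and associated across frames with the reference SORT implementation. The Sort class, its parameters, and the (N, 3) keypoint layout are assumptions for illustration, not the exact code used in this work.

```python
import numpy as np
from sort import Sort  # reference SORT implementation (assumed importable as `sort`)

tracker = Sort(max_age=5, min_hits=3)  # illustrative parameter choices

def keypoints_to_box(keypoints, conf_thr=0.3):
    """Bounding box (x1, y1, x2, y2) around the confidently detected joints.

    `keypoints` is assumed to be an (N, 3) array of (x, y, confidence)
    per skeleton, as produced by PifPaf-style pose estimators.
    """
    visible = keypoints[keypoints[:, 2] > conf_thr]
    return np.array([visible[:, 0].min(), visible[:, 1].min(),
                     visible[:, 0].max(), visible[:, 1].max()])

def track_frame(skeletons):
    """Associate this frame's skeletons with persistent pedestrian IDs.

    Returns rows of (x1, y1, x2, y2, track_id) from SORT, so each skeleton
    sequence can be collected under a stable identity for later stages.
    """
    if not skeletons:
        return tracker.update(np.empty((0, 5)))
    dets = np.array([np.append(keypoints_to_box(kp), 1.0) for kp in skeletons])
    return tracker.update(dets)
```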
Coordinate Conversion
The simplest approach would be to measure the extrinsic parameters of the vehicle's camera and form a one-to-one correspondence between image-plane coordinates and world coordinates using a perspective transformation. However, this assumes a planar ground and fixed extrinsic parameters (especially height and pitch), neither of which holds in real driving conditions. Deep learning has been used in previous studies for robust performance. We borrow Monoloco [3] to transform pedestrians' 2-D image-plane coordinates into 3-D world coordinates.
Figure 3: Coordinate conversion.
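For comparison, the "simplest approach" mentioned above can be written down directly. The sketch below back-projects a pedestrian's foot point to the ground plane under the flat-ground, fixed-extrinsics assumption that the text argues against; the camera intrinsics K, mounting height, and pitch are hypothetical inputs, and Monoloco replaces this step in our pipeline.

```python
import numpy as np

def foot_point_to_ground(u, v, K, cam_height, pitch):
    """Back-project an image foot point (u, v) to ground-plane coordinates.

    Assumes a pinhole camera with intrinsic matrix K, mounted cam_height
    metres above a planar road and pitched down by `pitch` radians. Returns
    the lateral offset x and forward distance z in metres.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray in the camera frame
    # Rotate the ray about the camera x-axis to undo the downward pitch.
    c, s = np.cos(pitch), np.sin(pitch)
    R = np.array([[1, 0, 0],
                  [0, c, -s],
                  [0, s,  c]])
    d = R @ ray
    t = cam_height / d[1]        # scale so the ray meets the ground plane y = cam_height
    return t * d[0], t * d[2]
```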
Ego-motion Correction
In this paper, we present two methods for compensating the vehicle ego-motion. In the first, we do not estimate the ego-motion directly but only use the sparse optical flow of the background to compensate for the camera movement. In the second, we accumulate the vehicle's per-frame displacement from its On-Board Diagnostics information to obtain the vehicle's absolute position, and from it the absolute position of the pedestrian.
The background of an image moves in the direction opposite to the ego-motion of the camera in the following frame, and this movement appears differently in each part of the frame. Therefore, we divide the frame into an M×N two-dimensional grid. Sparse optical flow vectors are obtained by finding a corner point in each grid cell (Good Features to Track) and tracking it into the adjacent frame. Not all of these vectors are useful: foreground objects such as cars and pedestrians produce optical flow vectors that are independent of, or even opposed to, the camera's ego-motion, so vectors that are not sufficiently similar to those of the surrounding cells are removed. Cells containing a pedestrian's bounding box are also emptied. An empty cell is filled by linear interpolation of the optical flow vectors from neighboring cells.
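A minimal sketch of this grid-wise background-flow estimation with OpenCV-Python is given below. The grid shape, the thresholds, and the use of a global median (instead of comparing only with surrounding cells) are simplifying assumptions for illustration.

```python
import cv2
import numpy as np

def grid_background_flow(prev_gray, curr_gray, grid=(8, 16), tol=0.5):
    """One sparse background-flow vector per grid cell (NaN where rejected).

    For each cell, pick one corner with Good Features to Track, follow it
    with pyramidal Lucas-Kanade, then drop vectors that deviate strongly
    from the median flow. Emptied cells are filled by interpolation later.
    """
    H, W = prev_gray.shape
    gh, gw = H // grid[0], W // grid[1]
    flow = np.full((grid[0], grid[1], 2), np.nan, dtype=np.float32)
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = prev_gray[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            pts = cv2.goodFeaturesToTrack(cell, maxCorners=1,
                                          qualityLevel=0.01, minDistance=5)
            if pts is None:
                continue
            p0 = (pts + np.array([j * gw, i * gh])).astype(np.float32)  # cell -> image coords
            p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
            if st[0, 0]:
                flow[i, j] = (p1 - p0)[0, 0]
    med = np.nanmedian(flow.reshape(-1, 2), axis=0)
    outlier = np.linalg.norm(flow - med, axis=2) > tol * (np.linalg.norm(med) + 1.0)
    flow[outlier] = np.nan
    return flow
```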
The optical flow vector of the background, especially of the ground plane, can be separated into horizontal and vertical components. Similarly, the movement of the vehicle can be separated into forward-backward movement and left-right rotation. The horizontal component of the optical flow is proportional to the left-right rotation of the camera. Therefore, we correct the left-right rotation of the vehicle by subtracting the horizontal component of the optical flow in the image coordinate system, before converting the pedestrian coordinates into the world coordinate system.
Figure 4: Detecting background movements.
On the other hand, the vertical component is not directly proportional to the forward-backward movement; it becomes smaller farther from the camera. However, once it is converted into the world coordinate system, its magnitude is directly proportional to the forward-backward movement of the camera. Therefore, we correct the forward-backward movement of the vehicle by subtracting the vertical component of the optical flow in the world coordinate system, after the pedestrian coordinates have been converted.
Figure 5: Compensating ego-motion.
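Putting the two corrections together, a minimal sketch of the per-observation compensation might look as follows; the function names and the to_world conversion hook are assumptions, with Monoloco playing that role in our pipeline.

```python
def compensate_ego_motion(u, v, flow_h, flow_v_world, to_world):
    """Remove camera ego-motion from one pedestrian observation.

    u, v         : pedestrian image coordinates in the current frame
    flow_h       : horizontal background flow (pixels), proxy for left-right rotation
    flow_v_world : vertical background flow already expressed in metres in the
                   world frame, proxy for forward-backward translation
    to_world     : mapping from image coordinates to (x, z) world coordinates
                   (Monoloco plays this role in our pipeline)
    """
    u_corrected = u - flow_h            # undo rotation while still in the image plane
    x, z = to_world(u_corrected, v)     # image -> world conversion
    return x, z - flow_v_world          # undo forward-backward motion in metres
```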
We can also obtain GPS and On-Board Diagnostics (OBD) information from the vehicle. Whereas GPS data shows errors of several to tens of meters in a complex urban environment, OBD data is not affected by nearby buildings. OBD data includes the vehicle speed and heading angle. We estimate the total displacement and the change of heading of the vehicle by accumulating vectors whose direction is the heading angle and whose magnitude is the velocity multiplied by the sensor's measurement period. The absolute coordinates of the pedestrian are then obtained by rotating the camera-to-pedestrian vector according to the change of heading and adding it to the accumulated displacement.
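A sketch of this OBD-based dead reckoning is shown below, assuming per-frame speed in m/s, heading in radians measured from a fixed reference direction, and a camera frame with x to the right and z forward; these conventions are illustrative assumptions.

```python
import numpy as np

def dead_reckon(speeds, headings, dt):
    """Accumulate per-frame OBD speed (m/s) and heading (rad) into a displacement.

    Each frame contributes a vector with the heading as direction and
    speed * dt as magnitude; returns the total displacement and final heading.
    """
    pos = np.zeros(2)
    for v, theta in zip(speeds, headings):
        pos += v * dt * np.array([np.sin(theta), np.cos(theta)])
    return pos, headings[-1]

def pedestrian_absolute(ped_cam_xz, veh_pos, veh_heading):
    """Rotate the camera-frame offset by the vehicle heading and add the vehicle position."""
    x, z = ped_cam_xz                          # lateral and forward offsets from the camera
    c, s = np.cos(veh_heading), np.sin(veh_heading)
    offset = np.array([c * x + s * z, -s * x + c * z])
    return veh_pos + offset
```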
3.3 Pedestrian Trajectory Prediction
We train Social-LSTM [2], a network that takes the relationships between pedestrians into account, with a 15-frame (0.5 second) sequence of the pedestrians' world coordinates as input and the next 15 frames as output. In the original paper, Social-LSTM only considers the paths of pedestrians. We reasoned that the path of the vehicle itself also has a significant impact on the paths of pedestrians, so the vehicle's world coordinates are given as an additional input.
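The following sketch illustrates how such 15-frame observation and prediction windows could be assembled, with the ego-vehicle appended as one additional agent; the data layout and function names are assumptions, not the exact preprocessing used here.

```python
OBS_LEN, PRED_LEN = 15, 15   # 0.5 s observed, 0.5 s predicted at 30 fps

def make_windows(ped_tracks, vehicle_track):
    """Slice world-coordinate tracks into (observation, target) training pairs.

    `ped_tracks` maps pedestrian id -> list of (x, y) per frame; the ego-vehicle
    track is appended as an extra agent so its motion also enters the social
    pooling of Social-LSTM.
    """
    tracks = dict(ped_tracks)
    tracks["vehicle"] = vehicle_track
    T = min(len(t) for t in tracks.values())
    samples = []
    for start in range(T - OBS_LEN - PRED_LEN + 1):
        obs = {k: t[start:start + OBS_LEN] for k, t in tracks.items()}
        tgt = {k: t[start + OBS_LEN:start + OBS_LEN + PRED_LEN] for k, t in tracks.items()}
        samples.append((obs, tgt))
    return samples
```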
Pedestrian Action Assisted Trajectory Prediction
Skeletons detected by PifPaf are collected as sequences. We use a network that extracts information from these skeleton sequences and combine it with the Social-LSTM. We choose a state-of-the-art skeleton-based action recognition network, Channel-wise Topology Refinement Graph Convolution for Skeleton-based Action Recognition (CTR-GCN) [23].
The major issue is fusing the two networks. In the original paper, CTR-GCN takes a sequence of skeletons as input and outputs scores for 60 (NTU RGB+D [26]) or 120 (NTU RGB+D 120 [27]) action classes. Since our purpose is not to classify but to provide information about pedestrian action to the Social-LSTM, we do not need the classification output itself. Nevertheless, the distribution of scores over the behavior classes clearly contains information about the behavior.
An intuitive approach would be to feed the score distribution over all classes into Social-LSTM alongside its (x, y) input. This approach treats pedestrian behavior as time-series data and instructs Social-LSTM to predict the next behavior along with the (x, y) coordinates.
However, the behavioral information is not itself a temporal sequence; it exists as a single tensor inferred from a 15-frame sequence of skeletons.
Another approach is to combine the class score distribution with the hidden state of Social-LSTM: the scores are mapped to the size of the hidden state through a first-order (linear) transformation and added to the initial hidden state.
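A minimal PyTorch sketch of this second fusion strategy is given below; the module name, hidden size, and the zero initial state are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ActionConditionedInit(nn.Module):
    """Initialise the Social-LSTM hidden state from CTR-GCN action scores.

    The per-pedestrian class-score vector (60 or 120 classes) is projected
    to the hidden size by a single linear layer and used as the initial
    hidden state before trajectory decoding starts.
    """

    def __init__(self, num_classes=60, hidden_size=128):
        super().__init__()
        self.proj = nn.Linear(num_classes, hidden_size)

    def forward(self, action_scores):
        # action_scores: (num_pedestrians, num_classes) output of CTR-GCN
        h0 = self.proj(action_scores)     # behavioural information injected into the state
        c0 = torch.zeros_like(h0)
        return h0, c0
```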
IV Experiments
4.1 Dataset
To train our network and evaluate its performance, we applied our method to the PIE dataset [16]. The PIE dataset consists of 6 hours of driving videos (1920×1080, 30 fps) in typical traffic scenes, per-frame GPS and On-Board Diagnostics data, and annotations of pedestrian bounding boxes, actions, and intentions.
4.2 Implementation
The network is implemented in Python 3 and PyTorch 1.9. Sparse optical flow is detected with the basic functions of OpenCV-Python.
The neighborhood size and grid size of Social-LSTM are set to 32 and 4, respectively, the same as in the original paper.
For training, the batch size is set to 5. Two different optimizers are used for the two parts of the network. An Adagrad optimizer trains Social-LSTM with an initial learning rate of 0.003; the learning rate is halved every 8 epochs, L2 regularization is used with λ = 0.0005, and gradient clipping is applied to gradients over 10. An SGD optimizer with momentum 0.9 and Nesterov momentum trains CTR-GCN with an initial learning rate of 0.01; the learning rate is multiplied by 0.1 every 20 epochs, and L2 regularization is used with λ = 0.0001.
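For reference, the reported optimizer settings could be set up in PyTorch roughly as follows; norm-based gradient clipping and the step-wise schedulers are assumptions consistent with the description above.

```python
import torch

def build_optimizers(social_lstm, ctr_gcn):
    """Two optimizers/schedulers matching the settings reported above."""
    opt_lstm = torch.optim.Adagrad(social_lstm.parameters(),
                                   lr=0.003, weight_decay=0.0005)
    sched_lstm = torch.optim.lr_scheduler.StepLR(opt_lstm, step_size=8, gamma=0.5)

    opt_gcn = torch.optim.SGD(ctr_gcn.parameters(), lr=0.01, momentum=0.9,
                              weight_decay=0.0001, nesterov=True)
    sched_gcn = torch.optim.lr_scheduler.StepLR(opt_gcn, step_size=20, gamma=0.1)
    return (opt_lstm, sched_lstm), (opt_gcn, sched_gcn)

# After loss.backward(), gradients exceeding 10 are clipped, e.g.:
# torch.nn.utils.clip_grad_norm_(social_lstm.parameters(), max_norm=10)
```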
Model                           Error (m)
Social-LSTM (original paper)    0.61
Social-LSTM (ours)              0.74
CTR-GCN + Social-LSTM           3.87

Table 1: Prediction error of each model.
V Results
Table 1 shows the training results. Unlike our study, the original Social-LSTM study used datasets with well-annotated (x, y) coordinates, namely the ETH [28] and UCY [29] datasets. In the original paper, the authors average the performance over these datasets and report an error of 0.61 m. When our algorithm is applied to the PIE dataset, the error is 0.74 m, which is within a reasonable range.
However, when pedestrian action recognition is combined with Social-LSTM, the error increases roughly fivefold. We analyze the reasons as follows.
Figure 6: Examples of Social-LSTM (left) and GCN+Social-LSTM (right) results on PIE dataset.
First, our dataset contained jittery data: the tracking was jumpy, as shown in Fig. 7. This breaks the consistency between the behavioral information and the trajectory, which can greatly hinder the convergence of the model. As future work, we plan to manually clean up the jittery data and retrain the model; we expect it to converge better after the clean-up.
Second, to train two models with different structures together, we used the two different optimizers from [2] and [23]. While each optimizer is apt for training its respective network, we did not account for the fact that the two networks were trained simultaneously. This inappropriate use of optimizers could have led to a local optimum. Further research on the use of combined optimizers is required.
Figure 7: Jittery data caused by jumpy tracking.
VI Conclusion
We proposed a method to predict pedestrian trajectories with a monocular camera. We first detect pedestrians' skeletons with the pose estimation algorithm PifPaf, localize their 3-D coordinates using Monoloco, and compensate for the vehicle's ego-motion with either sparse optical flow or On-Board Diagnostics data. We also combine the trajectory prediction network Social-LSTM and the action recognition network CTR-GCN to access more information about pedestrians. Using the PIE dataset, we trained and evaluated the network and analyzed the reasons behind the poor optimization.
References
[1] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, “Context-based pedestrian path prediction,” ECCV 2014: Computer Vision – ECCV 2014, pp. 618–633, 2014.
[2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” CVPR, pp. 961–971, 2016.
[3] L. Bertoni, S. Kreiss, and A. Alahi, “Monoloco: Monocular 3d pedestrian localization and uncertainty estimation,” ICCV, 2019.
[4] M. Bertozzi, A. Broggi, A. Fascioli, A. Tibaldi, R. Chapuis, and F. Chausse, “Pedestrian localiza- tion and tracking system with kalman filtering,” pp. 584–589, 2004.
[5] P. Carr, Y. Sheikh, and I. Matthews, “Monocular object detection using 3d geometric primitives,” pp. 864–878, 2012.
[6] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool, “Dynamic 3d scene analysis from a moving vehicle,” pp. 1–8, 2007.
[7] C. Yan and E. Salman, “Mono3d: Open source cell library for monolithic 3-d integrated circuits,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 3, pp. 1075–1085, 2018.
[8] C. Godard, O. M. Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” 2017.
[9] B. Kitt, A. Geiger, and H. Lategahn, “Visual odometry based on stereo image sequences with ransac-based outlier rejection scheme,” in 2010 IEEE Intelligent Vehicles Symposium, 2010, pp. 486–492.
[10] K. R. Konda and R. Memisevic, “Learning visual odometry with a convolutional network.” in VISAPP (1), 2015, pp. 486–490.
[11] S. Wang, R. Clark, H. Wen, and N. Trigoni, “Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2043–2050.
[12] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, “Visual odometry revisited: What should be learnt?” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4203–4210.
[13] C. G. Keller and D. M. Gavrila, “Will the pedestrian cross? A study on pedestrian path prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 494–506, 2013.
[14] Z. Fang and A. M. López, “Is the pedestrian going to cross? answering by 2d pose estimation,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1271–1276.
[15] ——, “Intention recognition of pedestrians and cyclists by 2d pose estimation,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 11, pp. 4773–4783, 2019.
[16] A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos, “Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6262–6271.
[17] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 588–595.
[18] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5378–5387.
[19] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110–1118.
[20] H. Liu, J. Tu, and M. Liu, “Two-stream 3d convolutional neural network for skeleton-based action recognition,” 2017.
[21] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” 2018.
[22] Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang, “Disentangling and unifying graph convolu- tions for skeleton-based action recognition,” 2020.
[23] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” 2021.
[24] S. Kreiss, L. Bertoni, and A. Alahi, “Pifpaf: Composite fields for human pose estimation,” 2019.
[25] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” 2016 IEEE International Conference on Image Processing (ICIP), Sep 2016. [Online]. Available: http://dx.doi.org/10.1109/ICIP.2016.7533003
[26] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
[27] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, “Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2684–2701, 2020.
[28] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 261–268.
[29] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,” Computer Graphics Forum, vol. 26, no. 3, pp. 655–664, 2007. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2007.01089.x
Acknowledgements
I would like to thank my advisor, Professor Seungjoon Yang. Without his insight and feedback, this thesis would never have been finished. He supported me through two years of my M.S. course with his endless encouragement. Thank you very much.