
Vision based Generous Deep Neural Network Grasping Detector for Robot Grasping Task


Academic year: 2023


Recently, robotic grasp detection and object detection with reasoning have been explored using deep neural networks (DNNs). People immediately identify multiple grasps on novel objects (perception), plan how to pick them up (planning), and can actually grasp them reliably (control). Our proposed REM was evaluated on two different tasks: robot grasp detection on the Cornell dataset [16,17] and real robot grasp tasks with novel objects that were not used during training.

Our proposed REM was able to outperform state-of-the-art methods such as [15], achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset with 5× faster computation than [15], as shown in Figure 1.1. Thanks to Cornell's open database for robotic grasp detection [16] and advances in deep learning techniques, many approaches have been proposed [13,14,18].
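For reference, Cornell-style results such as these (and the 25% IOU summary later) are typically scored with the rectangle metric: a predicted grasp rectangle counts as correct when its orientation is within 30° of a ground-truth rectangle and their Jaccard (IoU) index exceeds the threshold. Below is a minimal sketch of that check; the helper names (`rect_corners`, `is_correct_grasp`) and the use of `shapely` for rotated-rectangle overlap are our own illustrative assumptions, not the thesis's code.

```python
import math
from shapely.geometry import Polygon  # assumed available for rotated-rectangle overlap

def rect_corners(x, y, theta, w, h):
    """Corner points of a rotated grasp rectangle centered at (x, y)."""
    dx, dy = w / 2.0, h / 2.0
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * px - s * py, y + s * px + c * py)
            for px, py in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]]

def is_correct_grasp(pred, gt, iou_thresh=0.25, angle_thresh=math.radians(30)):
    """Rectangle metric: angle difference below 30 degrees and IoU above threshold."""
    # A parallel gripper is symmetric, so grasp angles are equivalent modulo pi.
    d_angle = abs(pred[2] - gt[2]) % math.pi
    d_angle = min(d_angle, math.pi - d_angle)
    if d_angle > angle_thresh:
        return False
    p, g = Polygon(rect_corners(*pred)), Polygon(rect_corners(*gt))
    inter = p.intersection(g).area
    union = p.area + g.area - inter
    return union > 0 and inter / union > iou_thresh

# pred and gt are (x, y, theta, w, h) tuples in pixels and radians.
print(is_correct_grasp((100, 100, 0.1, 60, 20), (102, 98, 0.2, 55, 22)))
```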

Figure 1.1: REM module performance summary of computation time (frames per second) vs. grasp detection accuracy on the Cornell dataset with object-wise data split.

Related works

Esteves et al. proposed a rotation-invariant network by replacing the grid generation of the spatial transformer network (STN) with a polar transform [10]. Lenz et al. proposed a two-stage detection (TSD) model that classifies the graspability of objects using a sparse auto-encoder (SAE) with sliding windows for brute-force region proposals [17]. Guo et al. used a ZFNet-based TSD [28] for robotic grasping and formulated angle prediction as a classification problem [14].

Grasping Detection

Rotation Ensemble Module

Parameter descriptions of the proposed one-stage detection (OSD) method

Rotation ensemble module (REM)

Suppose that an intermediate output for {t_x, t_y, θ, t_w, t_h, t_z} is available in the middle of the REM, denoted {t_x^m, t_y^m, θ^m, t_w^m, t_h^m, t_z^m}. Our proposed method uses the class probability (the probability to grasp) to selectively ensemble the activations along with the angle classification weights. In the REM, this intermediate output is used in part to enable rotation; it still contains valuable, compressed information about the final output and can serve as a good initial bounding box.

Thus, we designed our REM so that this intermediate output is decompressed and attached to the end of the REM, as illustrated in Fig. 2.2(b). This pipeline passes the information in {t_x^m, t_y^m, θ^m, t_w^m, t_h^m, t_z^m} indirectly to the final layer, and this structure appeared to decrease the prediction errors.
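As a rough, non-authoritative illustration of this idea (the actual architecture is the one in Fig. 2.2), the sketch below rotates an intermediate feature map by a few candidate angles, applies a shared convolution to each rotated copy, and blends the results with per-pixel angle-class probabilities. The layer sizes, candidate angles, and class names are assumptions for illustration, not the trained model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationEnsembleSketch(nn.Module):
    """Illustrative rotation-ensemble block: convolve rotated copies of a feature
    map and combine them with softmax weights from a small angle classifier."""

    def __init__(self, channels, angles=(-0.5, -0.25, 0.0, 0.25, 0.5)):
        super().__init__()
        self.angles = angles                                     # candidate rotations (rad)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # shared rotation conv
        self.angle_head = nn.Conv2d(channels, len(angles), 1)    # angle classification

    def _rotate(self, x, theta):
        # Rotate the whole feature map via an affine sampling grid.
        c, s = math.cos(theta), math.sin(theta)
        mat = torch.tensor([[c, -s, 0.0], [s, c, 0.0]], device=x.device, dtype=x.dtype)
        grid = F.affine_grid(mat.expand(x.size(0), 2, 3), list(x.size()), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

    def forward(self, x):
        weights = F.softmax(self.angle_head(x), dim=1)           # (N, A, H, W) angle weights
        out = torch.zeros_like(x)
        for i, theta in enumerate(self.angles):
            out = out + weights[:, i:i + 1] * self.conv(self._rotate(x, theta))
        return out

# Example: a 64-channel feature map of size 13x13.
y = RotationEnsembleSketch(64)(torch.randn(2, 64, 13, 13))
```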

Figure 2.2: An illustration of incorporating our proposed REM in a DNN for robot grasp detection (a) and the architecture of our proposed REM with rotation convolutions (b).

Loss functions

Simulations and Experiments

  • Implementation details
  • Benchmark dataset and novel objects
  • Results for in-house implementations of previous works
  • Results for our proposed REM on the Cornell dataset

[5] Taco Cohen and Max Welling, "Group equivariant convolutional networks," in International Conference on Machine Learning (ICML), 2016.
[13] J. Redmon and A. Angelova, "Real-time grasp detection using convolutional neural networks," in IEEE International Conference on Robotics and Automation (ICRA), 2015.
[21] Chen-Hsuan Lin and Simon Lucey, "Inverse Compositional Spatial Transformer Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[22] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia, "Rotation Equivariant Vector Field Networks," in IEEE International Conference on Computer Vision (ICCV), 2017.
[24] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] Joseph Redmon and Ali Farhadi, "YOLO9000: Better, Faster, Stronger," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] Sulabh Kumra and Christopher Kanan, "Robotic grasp detection using deep convolutional neural networks," in IEEE International Conference on Intelligent Robots and Systems (IROS), 2017.
… "Learning using a new rectangle representation," in IEEE International Conference on Robotics and Automation (ICRA), 2011.

… "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Figure 2.3: Images from the Cornell dataset

A Single Neural Network for Multi-Task

Multi-task robot grasping

  • Single object robot grasping
  • Multi-task robot grasping

A 5D robotic grasp representation is widely used for GD with a parallel gripper when a single 2D image (RGB or RGB-D) is used [17,37]. This representation is a vector {x_gd, y_gd, θ_gd, w_gd, h_gd} consisting of the location (x_gd, y_gd), orientation θ_gd, gripper opening width w_gd, and parallel gripper plate size h_gd. Grasping a specific target object among cluttered and stacked objects requires more than single-object grasping information: it also requires additional information such as the object class and relational reasoning (see Fig. 1.2) for sequential grasp planning.
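For concreteness, this 5D representation can be carried around as a small record; the class and field names below are our own illustrative choices, not the thesis's code.

```python
from dataclasses import dataclass

@dataclass
class Grasp5D:
    """5D parallel-gripper grasp rectangle from a single 2D (RGB or RGB-D) image."""
    x: float      # grasp center, image x-coordinate (pixels)
    y: float      # grasp center, image y-coordinate (pixels)
    theta: float  # gripper orientation in the image plane (radians)
    w: float      # gripper opening width (pixels)
    h: float      # parallel gripper plate size (pixels)

# Example: a horizontal grasp centered at (320, 240) with a 60-pixel opening.
grasp = Grasp5D(x=320.0, y=240.0, theta=0.0, w=60.0, h=20.0)
```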

Reparametrization of 15D representation

  • Anchor box: w, h in each cell
  • Anchor box: orientation in each cell
  • Object class: cls in each cell
  • FC and CC in each cell

Thus, the DNN estimates {t_xj, t_yj, t_θj, t_wj, t_hj, t_prj} instead of {x_j, y_j, θ_gd, w_j, h_j, pr_j} for the GD of our proposed methods. The angle θ_gd is modeled as a combination of discrete and continuous values instead of a purely continuous value. Due to the re-parameterization with the anchor box, estimating w_j, h_j is converted into estimating t_wj, t_hj, which are associated with the expected values of different sizes of w_j, h_j.

Then, the best grasp representation among all anchor box candidates is selected for the final output. While MultiGrasp took a regression approach for θ_gd [13], Guo et al. converted the regression problem of estimating θ_gd into a classification of θ_gd over a finite number of angle candidates {0, π/18, ...}. Zhang [45] proposed orientation anchor boxes so that the angle is determined using classification over the discrete anchor box orientations.
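To make the re-parameterization concrete, a minimal decode step might look like the sketch below, which assumes a YOLO-style parameterization (sigmoid-bounded center offsets within a grid cell, exponential scaling of anchor sizes, and a discrete anchor orientation plus a continuous offset). These exact formulas are an assumption for illustration, not the thesis's definitive equations.

```python
import math

def decode_grasp(t, cell_x, cell_y, anchor_w, anchor_h, anchor_theta, stride):
    """Decode network outputs t = (t_x, t_y, t_theta, t_w, t_h, t_pr) for one
    anchor in one grid cell into an absolute grasp rectangle (YOLO-style sketch)."""
    t_x, t_y, t_theta, t_w, t_h, t_pr = t
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

    x = (cell_x + sigmoid(t_x)) * stride   # center constrained to its grid cell
    y = (cell_y + sigmoid(t_y)) * stride
    w = anchor_w * math.exp(t_w)           # width/height scale the anchor box
    h = anchor_h * math.exp(t_h)
    theta = anchor_theta + t_theta         # discrete anchor angle + continuous offset
    pr = sigmoid(t_pr)                     # grasp probability
    return x, y, theta, w, h, pr

# The final output is the decoded candidate with the highest grasp probability
# over all grid cells, anchor sizes, and anchor orientations.
best = decode_grasp((0.2, -0.1, 0.05, 0.3, -0.2, 2.0),
                    cell_x=7, cell_y=5, anchor_w=40.0, anchor_h=20.0,
                    anchor_theta=math.pi / 6, stride=32)
```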

When objects are stacked in a complex way, it becomes difficult to match the OD result (object bounding box) with the GD result without additional information such as object classes. For this task, an object class is predicted for each grasp detection box, so that our proposed model can generate grasp detection boxes, their grasp points, and corresponding object classes. For object relationships, we propose to predict the father class (FC) and child class (CC) together with the other detection results.

FC and CC consist of the object class labels and a class-none label: (t_cls^fc, t_cls^cc) ∈ {class_1, class_2, ..., class_none}.
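The later "Reasoning post-processing: from class to index" step can be pictured roughly as follows: each detection's predicted FC/CC class label is matched against the classes of the other detections to recover parent/child indices. This is only a plausible sketch under assumed data structures (`detections` as a list of dicts), not the exact post-processing rule.

```python
def classes_to_indices(detections):
    """Sketch: map per-detection FC/CC class labels to indices of matching detections.
    Each detection is a dict with keys 'cls', 'fc' and 'cc' ('none' if absent)."""
    relations = []
    for d in detections:
        fathers = [i for i, o in enumerate(detections)
                   if d['fc'] != 'none' and o['cls'] == d['fc']]
        children = [i for i, o in enumerate(detections)
                    if d['cc'] != 'none' and o['cls'] == d['cc']]
        relations.append({'father_idx': fathers, 'child_idx': children})
    return relations

# Example with three stacked objects (class names are illustrative).
dets = [
    {'cls': 'box', 'fc': 'none', 'cc': 'cup'},   # box lies under the cup
    {'cls': 'cup', 'fc': 'box',  'cc': 'pen'},   # cup is on the box, pen is on the cup
    {'cls': 'pen', 'fc': 'cup',  'cc': 'none'},
]
print(classes_to_indices(dets))
```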

Proposed FCNN with predictions across scales

OD with reasoning is performed over scales of ×1, ×2 and ×4, and GD is performed over scales of ×1 and ×2. Therefore, for OD with reasoning, 9 anchor boxes are predicted with 4 bounding box offsets, an object probability, an object class (class number) and (class number + 1) × 2 reasoning classes (FC, CC). In addition, for GD, 8 anchor boxes are predicted with 4 bounding box offsets, an orientation, a grasp probability and an object class (class number).
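Under a literal reading of these counts, the per-cell output width of each prediction head can be checked with a couple of lines; the class count used in the example is a placeholder assumption, not a value taken from the thesis.

```python
def head_channels(num_classes, od_anchors=9, gd_anchors=8):
    """Per-cell output channels implied by the counts in the text (illustrative)."""
    # OD with reasoning: 4 box offsets + object probability + classes
    # + (classes + 1) father classes + (classes + 1) child classes, per anchor.
    od = od_anchors * (4 + 1 + num_classes + (num_classes + 1) * 2)
    # GD: 4 box offsets + orientation + grasp probability + classes, per anchor.
    gd = gd_anchors * (4 + 1 + 1 + num_classes)
    return od, gd

# Placeholder class count; substitute the dataset's actual number of categories.
print(head_channels(num_classes=31))
```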

Figure 3.2: Schematic pipelines of Zhang [11, 12] vs. ours. Mark 'N' denotes a neural network and 'P' denotes post-processing.

Reasoning post-processing: from class to index

Loss functions

Experimental Evaluations

  • Implementation details
  • Evaluations on VMRD and Cornell datasets

Results

  • Simulation results on VMRD dataset
  • Simulation results on Cornell dataset

[2] Hayit Greenspan, Serge Belongie, Rodney Goodman, Pietro Perona, Subrata Rakshit, and Charles H. Anderson, "Overcomplete steerable pyramid filters and rotational invariance," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.
[3] Whoi-Yul Kim and Po Yuan, "A practical pattern recognition system for translation, scale, and rotation invariance," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.
[14] Di Guo, Fuchun Sun, Huaping Liu, Tao Kong, Bin Fang, and Ning Xi, "A hybrid deep architecture for robotic grasp detection," in IEEE International Conference on Robotics and Automation (ICRA), 2017.

… "Visual manipulation relationship network for autonomous robotics," in 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids).
[26] Xuepeng Shi, Shiguang Shan, Meina Kan, Shuzhe Wu, and Xilin Chen, "Real-time rotation-invariant face detection with progressive calibration networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei, "Deformable convolutional networks," in IEEE International Conference on Computer Vision (ICCV), 2017.

[38] Edward Johns, Stefan Leutenegger, and Andrew J. Davison, "Deep learning a grasp function for grasping under gripper pose uncertainty," in IEEE International Conference on Intelligent Robots and Systems (IROS).
[43] Lerrel Pinto and Abhinav Gupta, "Supersizing self-supervision: Learning to grasp from 50K tries and 700 robot hours," in IEEE International Conference on Robotics and Automation (ICRA).
[47] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

[50] Dongwon Park, Yonghyeok Seo, Dongju Shin, Jaesik Choi, and Se Young Chun, "A Single Multi-task Deep Neural Network with Post-Processing for Object Detection with Reasoning and Robotic Grasp Detection," in IEEE International Conference on Robotics and Automation (ICRA), 2020.

Figure 3.4: Multi-task detection results for VMRD. The 1st row is GT and the 2nd row is the results of our proposed methods.

Toward Robot Demonstration in Real-environment

Evaluation of GD with 4-axis robot arm

The following 8 novel objects (toothbrush, candy, earphone cover, cable, Styrofoam bowl, L-key, pliers, pencil) were used for grasping tasks.

Results of Robot Evaluation in Real-environment

  • Results of multi-task OD, GD, reasoning with Baxter
  • Results of GD with 4-axis robot arm

However, in the stacking scene, we found that increasing the number of objects had a significant effect on accuracy, often changing the child and father class predictions. Our proposed method gave an average success rate of 95.3% with a 6.5% standard deviation for the 8 novel small objects with 8 repetitions per object.

Discussion

[6] Patrick Follmann and Tobias Böttger, "A rotationally-invariant convolution module by feature map back-rotation," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in European Conference on Computer Vision (ECCV), 2014.
[10] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis, "Polar Transformer Networks," in International Conference on Learning Representations (ICLR), 2018.

[12] Hanbo Zhang, Xuguang Lan, Xinwen Zhou, and Nanning Zheng, "ROI-based robotic grasp detection in object overlapping scenes using convolutional neural network," arXiv preprint.
… "A fully convolutional network for grasp detection with an oriented anchor box," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[16] Ian Lenz, Honglak Lee, and Ashutosh Saxena, "Deep learning for detecting robotic grasps," in Robotics: Science and Systems (RSS), June 2013.

[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), 2015.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012.
[34] Umar Asif, Jianbin Tang, and Stefan Harrer, "GraspNet: An efficient convolutional neural network for real-time grasp detection for low-powered devices," in International Joint Conference on Artificial Intelligence (IJCAI), 2018.

[39] Jeffrey Mahler, Matthew Matl, Xinyu Liu, Albert Li, David Gealy, and Ken Goldberg, "Dex-Net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning," in IEEE International Conference on Robotics and Automation (ICRA).
[44] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision (ECCV).

I hope to solve many real-world problems with this methodology, combining computer vision knowledge with robotics knowledge that interacts with the real environment.

Figure 4.2: Target grasp detection results in (a) cluttered scene, (b) stacking scene and (c) challenging invisible scene.

The ablation studies on the Cornell dataset for our REM with RC, RA and RL

Performance summary on Cornell dataset. Our proposed method yielded state-of-the-art performance.

Self-evaluation summary on VMRD

Performance summary on VMRD dataset

Summary on Cornell data (25% IOU)

Performance summary of grasping tasks for cluttered (CS), stacking (SS) and invisible scenes

Figures

Figure 1.2: (top left panel) GD with grasp candidates (black rectangles) and the best grasp (green and red rectangle); (right panels) multi-tasks of GD, OD and relationship reasoning.
Figure 2.1: (a) A 5D detection representation with location (x, y), rotation θ, gripper opening width w and plate size h.
