Academic year: 2023
RECENT RESEARCH ON CHARACTERISTIC ANALYSIS ON MULTISCALE TECHNOLOGY THROUGH ROTATION NETWORK

Amitesh Kumar Jha

Asst. Prof., CSIT Department, G.G.V. Bilaspur, C. G., India

Abstract- Precise and robust detection of multi-class objects in very high resolution (VHR) aerial images is essential to numerous real-world applications. Traditional detection techniques based on horizontal bounding boxes (HBBs) have made significant progress thanks to CNNs.

However, HBB detection methods still have limitations for densely distributed and strip-like objects, such as missed detections and redundant detection regions. In addition, large-scale variations and diverse backgrounds raise numerous obstacles. To address these issues, the Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net) is proposed as an effective region-based object detection framework for aerial images with oriented bounding boxes (OBBs); it promotes the integration of the inherent multi-scale pyramid features to generate a discriminative feature map. Meanwhile, a double-path feature attention network supervised by ground-truth mask information is implemented to direct the network toward object regions and suppress irrelevant noise.

Keywords: object detection; aerial images; feature attention.

1. INTRODUCTION

In geospatial imagery analysis, very high resolution (VHR) aerial images, with resolutions of 20 cm to 30 cm, can provide an insightful understanding of the observed scene.

Object detection, an essential component of automatic aerial imagery analysis, underpins national defense construction, urban planning, environmental monitoring, and so on. Although numerous object detection techniques for VHR aerial images have been proposed in the past, this task still presents many difficulties due to arbitrary orientations, large scale variations, and varied backgrounds. With the rapid development of convolutional neural networks (CNNs), a number of CNN-based object detection frameworks have recently been proposed, achieving very impressive results on natural-image benchmarks such as PASCAL VOC and MS COCO. CNN-based object detection techniques can commonly be divided into two categories: single-stage and two-stage approaches. In the two-stage methods, the input image is used to generate category-independent region proposals in the first stage. After features are extracted from these regions, category-specific classifiers and regressors refine them in the second stage for classification and regression. Finally, non-maximum suppression (NMS) eliminates redundant bounding boxes to produce accurate detection results. A ground-breaking piece of work is Region-based CNN (R-CNN). Its reformative versions, SPP-Net and Fast R-CNN, simplify learning and improve runtime efficiency. By sharing convolution weights, the proposed Region Proposal Network (RPN) and Fast R-CNN are combined into a single network known as Faster R-CNN, which yields end-to-end object detection that is both quick and accurate. Zhang et al. proposed a multi-scale cascaded object detection network, introducing multi-scale pyramid features and a novel attention method to obtain features for each scale; by highlighting object features, this method can effectively identify traffic signs against complex backgrounds. Many other high-performance detection approaches have been proposed to date, such as FPN, R-FCN, Mask R-CNN, Libra R-CNN, and TridentNet. In addition, the single-stage approaches treat object detection solely as a regression problem without resorting to proposal generation, which makes nearly real-time speed achievable.
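The NMS step mentioned above can be sketched as follows. This is a minimal greedy implementation for HBBs given as (x1, y1, x2, y2), shown only as an illustration of the standard algorithm, not the authors' exact code:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression for HBBs given as (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # drop highly overlapping boxes
    return keep
```

For two heavily overlapping boxes, only the higher-scoring one survives, which is exactly what removes the redundant detections described above.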

Single-stage techniques like YOLO and SSD are popular because they maintain real-time speed while guaranteeing detection accuracy. To address the class-imbalance issue of single-stage approaches, RetinaNet proposes a novel focal loss function. Drawing inspiration from the two-stage methods, RefineDet uses cascade regression and an Anchor Refinement Module (ARM) to adjust the sizes and locations of anchors and then filter out easy negative anchors, improving accuracy.
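The focal loss used by RetinaNet down-weights easy examples so that training focuses on hard ones. A minimal binary form can be sketched as below; the symbols α and γ follow the RetinaNet paper, and the default values are the commonly used ones, not taken from this article:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probability; y: ground-truth label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ = 2, a well-classified example (p_t = 0.9) contributes roughly two orders of magnitude less loss than a hard one (p_t = 0.1), which is how the huge pool of easy negatives stops dominating the gradient.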

2. PROPOSED METHOD

The proposed Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net) is described in detail in this section. The overall framework of MFIAR-Net is depicted in Figure 1. First, the Multi-scale Feature Integration Network (MFIN), built on FPN, enriches the feature map with multi-scale feature information.

The Double-Path Feature Attention Network (DPFAN) then directs the network to concentrate on foreground information. At the end of the first stage, coarse horizontal regions are still regressed to retain important information. In the Rotation Detection Network, a PS RoI Align layer and a new multi-task learning loss are added to boost location sensitivity and the precision of the five-parameter (x, y, w, h, θ) regression.
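For illustration, a five-parameter oriented box (x, y, w, h, θ) can be converted to its four corner points as below. This is a generic conversion with θ in radians measured counter-clockwise about the box center; the paper's exact angle convention is not specified here, so this is an assumption:

```python
import numpy as np

def obb_to_corners(x, y, w, h, theta):
    """Return the 4 corners of an oriented box (x, y, w, h, theta).

    (x, y) is the box center; theta is the rotation angle in radians.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])                     # 2-D rotation matrix
    half = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return half @ rot.T + np.array([x, y])                # rotate, then translate
```

At θ = 0 this degenerates to an ordinary HBB, which is why the OBB parameterization strictly generalizes the horizontal one.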

The sections that follow provide additional information.

Figure 1 Overview framework of the Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net) for oriented object detection. The proposed framework, based on FPN, consists of the Multi-scale Feature Integration Network, the Double-Path Feature Attention Network, and the Rotation Detection Network.

2.1. Multi-Scale Feature Integration Network (MFIN)

Multi-scale feature information can be extracted from a single image using the Feature Pyramid Network (FPN). After the output of each scale, an Asymmetric Convolution Block (AC Block) is used to obtain a distinct feature representation of FPN for geospatial objects. In addition, we integrate the multi-scale feature maps simultaneously into a discriminative feature map of appropriate size. This integrated feature contains balanced information from each spatial resolution, which is crucial for handling the scale variations in aerial images. MFIN thus involves two important steps: multi-scale feature extraction and multi-scale feature integration.
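The integration step can be sketched as resizing every pyramid level to one intermediate resolution and averaging them, in the spirit of the balanced integration described above. The nearest-neighbour resize and simple mean here are illustrative assumptions; the paper's actual operations also involve the AC Block and learned weights:

```python
import numpy as np

def integrate_pyramid(features, target_hw):
    """Average FPN levels after resizing each to target_hw = (H, W).

    features: list of (C, H_i, W_i) arrays; nearest-neighbour resampling.
    """
    th, tw = target_hw
    resized = []
    for f in features:
        c, h, w = f.shape
        rows = np.arange(th) * h // th      # nearest source row per target row
        cols = np.arange(tw) * w // tw      # nearest source col per target col
        resized.append(f[:, rows][:, :, cols])
    return np.mean(resized, axis=0)         # balanced, integrated feature map
```

Because every level contributes equally after resizing, small- and large-scale objects are represented in the same map, which is the property the text attributes to the integrated feature.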

2.2 PS RoI Align Layer

In the second stage, the RoI pooling layer constructs a fixed-size feature map, such as 7 × 7, for the subsequent classification and bounding-box regression tasks. R-DFPN and SCRDet employed RoI Align, which uses bilinear interpolation rather than quantized integer coordinates for candidate regions, to address the issue of feature misalignment, particularly for objects with a large aspect ratio. PS RoI Pooling is proposed in R-FCN to resolve the conflict between translation variance in object detection and translation invariance in image classification. Position-sensitive score maps are first generated by a fully convolutional network; each of these score maps encodes position information relative to a specific spatial position. A position-sensitive RoI pooling layer is then applied on top of the final convolutional layer to aggregate the score-map information for the second stage.

By combining RoI Align and PS RoI Pooling, the proposed method builds a more robust PS RoI Align layer.

Specifically, we eliminate the two rounding operations performed during PS RoI Pooling: the quantization of the candidate region's coordinates into integers, and the quantization of each bin's coordinates in the position-sensitive score maps. Like RoI Align, we obtain the precise values at these coordinates by bilinear interpolation. Besides preventing misalignment between the extracted feature maps and the inputs, PS RoI Align effectively converts the deep convolutional backbone into an object detector with good translation-variance properties. The experiments show that, compared with the common RoI Pooling layer and the RoI Align layer, the PS RoI Align layer boosts performance by 1.27 and 0.72 percentage points, respectively.
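The key idea above, sampling the feature map at fractional coordinates via bilinear interpolation instead of rounding them, can be sketched as follows (a generic single-channel implementation, not the authors' actual layer):

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a 2-D feature map at real-valued (x, y) without rounding."""
    h, w = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Blend the four surrounding cells by their fractional distances.
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bot = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    return (1 - dy) * top + dy * bot
```

Because no coordinate is snapped to an integer, a bin whose true boundary falls between cells still receives a value consistent with its exact position, which is precisely the misalignment fix described above.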

3. EXPERIMENTS

In this section, we demonstrate the effectiveness of the proposed MFIAR-Net on two publicly available aerial datasets: DOTA and HRSC2016. We begin by introducing the datasets, evaluation metrics, and implementation details.

The accuracy and speed of the MFIAR-Net method are then compared with those of the most recent methods.

3.1. Dataset Description

3.1.1. DOTA

DOTA is a large-scale dataset of VHR optical aerial images with arbitrary quadrilateral annotations, built for comparing object detection results across various sensors and platforms. DOTA consists of 2806 aerial images in total, already divided into 1411 training images, 458 validation images, and 937 testing images. The fully annotated DOTA benchmark contains 188,282 instances belonging to 15 common classes, namely plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). The objects have a wide range of sizes, orientations, and shapes, and the image sizes range from 800 × 800 to 4000 × 4000 pixels. The testing images carry no annotations.
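Each DOTA annotation line stores the four quadrilateral corners followed by the category name and a difficulty flag. A minimal parser, assuming the standard "x1 y1 x2 y2 x3 y3 x4 y4 category difficult" line layout of the DOTA devkit, could look like:

```python
def parse_dota_line(line):
    """Parse one DOTA annotation line into (corners, category, difficult).

    Assumes the layout: x1 y1 x2 y2 x3 y3 x4 y4 category difficult.
    """
    parts = line.split()
    coords = list(map(float, parts[:8]))                        # 8 corner values
    corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    return corners, parts[8], int(parts[9])
```

The quadrilateral corners, rather than a single (x, y, w, h) tuple, are what make DOTA suitable for evaluating the OBB task.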

3.1.2. HRSC2016

For ship detection, HRSC2016 is a challenging dataset collected from well-known harbors in Google Earth. The dataset contains 1061 images in total, with 436 used for training, 181 for validation, and 444 for testing. HRSC2016 has 2976 fully annotated samples belonging to more than 20 distinct ship categories. Image sizes range from 300 × 300 to 1500 × 900, and the majority of the images are larger than 1000 × 600. The strip-like ships illustrate well where OBBs are the more effective representation.

4. CONCLUSIONS

In this paper, we propose MFIAR-Net, a novel and efficient region-based rotated object detection framework for VHR aerial images with multiple categories and arbitrary orientations. The MFIN extracts discriminative features at multiple scales and integrates them into a single feature map of appropriate size, simultaneously balancing the semantically strong features at coarse resolution with the semantically weak features at high resolution. A supervised Double-Path Feature Attention Network (DPFAN) is designed to direct the entire network to capture object information and suppress the irrelevant noise introduced by diverse and complex backgrounds. In addition, a robust Rotation Detection Network is presented that successfully achieves OBB localization and classification. A carefully constructed ablation study demonstrates the contribution of each network component. Experimental results on the public DOTA and HRSC2016 datasets demonstrate that our framework is competitive with state-of-the-art methods on the OBB detection task in both accuracy and speed.

REFERENCES

1. Zhang, J.; Lu, C.; Wang, J.; Yue, X.G.; Lim, S.J.; Al-Makhadmeh, Z.; Tolba, A. Training Convolutional Neural Networks with Multi-Size Images and Triplet Loss for Remote Sensing Scene Classification. Sensors 2020, 20, 1188. [CrossRef] [PubMed]
2. Zhang, J.; Lu, C.; Li, X.; Kim, H.J.; Wang, J. A full convolutional network based on DenseNet for remote sensing scene classification. Math. Biosci. Eng. 2019, 16, 3345–3367. [CrossRef] [PubMed]
3. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [CrossRef]
4. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [CrossRef]
5. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [CrossRef]
6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef]
7. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
8. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. IJCV 2010, 88, 303–338. [CrossRef]
9. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 2014 European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef]
12. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
13. Zhang, J.; Xie, Z.; Sun, J.; Zou, X.; Wang, J. A Cascaded R-CNN with Multiscale Attention and Imbalanced Samples for Traffic Sign Detection. IEEE Access 2020, 8, 29742–29754. [CrossRef]
14. Lin, T.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
15. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the 2016 Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; pp. 379–387.
