
Types of Artificial Neural Networks

In the document automated detection and classification (Pages 46-50)

There are two approaches to performing transfer learning with pre-trained models: feature extraction and fine-tuning. The former extracts the feature maps from the pre-trained model and builds a shallow model on top of them, while the latter makes fine adjustments to the pre-trained model to increase its accuracy and performance on the new task whilst retaining the weights the model initially learned (Mustafid, Pamuji and Helmiyah, 2020). All in all, transfer learning requires a smaller dataset and less training time while improving performance and network generalization (Alzubaidi, et al., 2021).
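The feature-extraction approach can be illustrated with a minimal NumPy sketch. The "pre-trained backbone" here is just a fixed random weight matrix standing in for a real pre-trained network, and the shallow head on top is likewise a placeholder; in fine-tuning, the backbone weights would also be updated, typically with a small learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: weights are fixed ("frozen").
W_pretrained = rng.standard_normal((64, 16))

def extract_features(x):
    """Feature extraction: pass inputs through the frozen backbone."""
    return np.maximum(0, x @ W_pretrained)  # ReLU feature maps

# Shallow head built on top of the frozen features; only these weights
# would be trained for the new task (feature extraction). Fine-tuning
# would additionally adjust W_pretrained itself.
W_head = rng.standard_normal((16, 3)) * 0.01

def predict(x):
    return extract_features(x) @ W_head

x = rng.standard_normal((5, 64))      # a batch of 5 inputs
feats = extract_features(x)
print(feats.shape)       # (5, 16)
print(predict(x).shape)  # (5, 3)
```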

Figure 2.9: Transfer Learning Process (Tan, et al., 2018)

2.5.1 Recursive Neural Network

Recursive neural network (RvNN) is a type of artificial neural network that can process inputs of different sizes with various topologies such as trees and graphs, compared to conventional feature-based techniques, which use fixed-size vectors to encode the information relevant to the problem (Chinea, 2009). Socher, et al. (2011) provided some examples of the application of RvNN, such as parsing scene images, which can be helpful for computer vision. Figure 2.10 illustrates how RvNN parses scene images.

The approach of RvNN is to over-segment the image into smaller regions of interest; the features of each region are then extracted and mapped into a semantic space.

The semantic representations of the regions are then fed into the RvNN, which computes a score for each possible merge. The pair of neighbouring units with the highest score is merged, producing a larger unit. A new feature vector and a class label representing the unit are generated for every larger unit produced. The merging process happens recursively using the same neural network. As a result, an RvNN tree structure is implicitly created from the merging decisions, whereby the final output, representing the complete scene image, is the root of the structure (Alzubaidi, et al., 2021). RvNN is still uncommon among the research community due to its intricate characteristics and steep learning curve (Chinea, 2009).
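The greedy merge-and-score loop above can be sketched in a few lines of NumPy. The composition and scoring weights are random stand-ins (an actual RvNN would learn them), and the region vectors are random placeholders for the semantic representations of over-segmented image regions; only the control flow of the recursive merging is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                        # dimension of a region vector
W = rng.standard_normal((d, 2 * d)) * 0.1    # composition weights (stand-in)
w_score = rng.standard_normal(d) * 0.1       # scoring weights (stand-in)

def compose(a, b):
    """Merge two neighbouring region vectors into one parent vector."""
    return np.tanh(W @ np.concatenate([a, b]))

def score(v):
    return float(w_score @ v)

# Semantic vectors for over-segmented image regions (random stand-ins).
regions = [rng.standard_normal(d) for _ in range(5)]
tree = list(range(5))  # record merge decisions to build the implicit tree

while len(regions) > 1:
    # Score every adjacent pair, then greedily merge the best-scoring one.
    parents = [compose(regions[i], regions[i + 1])
               for i in range(len(regions) - 1)]
    best = max(range(len(parents)), key=lambda i: score(parents[i]))
    regions[best:best + 2] = [parents[best]]
    tree[best:best + 2] = [(tree[best], tree[best + 1])]

print(tree[0])  # nested tuples: the implicitly built RvNN tree
```

The root of `tree` is the complete "scene", mirroring how each merging decision implicitly grows the RvNN tree structure.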

Figure 2.10: Illustration of How RvNN Parses Scene Images (Socher, et al., 2011)

2.5.2 Recurrent Neural Network

Recurrent neural network (RNN) is a type of artificial neural network that deals with time-sequential information by adding feedback connections to feedforward neural networks (FFNN). The purpose of the feedback is to give the network something akin to the short-term and long-term memory demonstrated by humans: an RNN uses past outputs to process the present input. Hence, RNN is mainly used in speech recognition, human activity recognition, and language translation (Rezk, et al., 2020). There are three collections of layers in an RNN: input layers denoted as x, recurrent or hidden layers denoted as h, and output layers denoted as y.

An RNN may seem like a deep network, since an input at time m < t propagates through multiple nonlinear layers before producing the output at time t. However, upon unfolding the network through time steps, it has a temporal structure composed of shallow functions: input-to-hidden (x_t → h_t), hidden-to-output (h_t → y_t), and hidden-to-hidden (h_{t-1} → h_t) (Pascanu, et al., 2014).
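The three shallow functions of the unfolded network can be sketched as one NumPy step function applied along a sequence. The weight matrices and the input sequence are random stand-ins; the point is that the hidden state h_t depends on both the current input x_t and the previous state h_{t-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_h, n_out = 4, 8, 3

# The three shallow functions of an unfolded RNN (weights are stand-ins).
W_xh = rng.standard_normal((n_h, n_in)) * 0.1   # input-to-hidden
W_hh = rng.standard_normal((n_h, n_h)) * 0.1    # hidden-to-hidden
W_hy = rng.standard_normal((n_out, n_h)) * 0.1  # hidden-to-output

def rnn_step(x_t, h_prev):
    """One time step: h_t depends on the current input and the past state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(n_h)
xs = rng.standard_normal((6, n_in))  # a sequence of 6 time steps
ys = []
for x_t in xs:
    h, y_t = rnn_step(x_t, h)
    ys.append(y_t)

print(len(ys), ys[0].shape)  # 6 (3,)
```

Keeping all outputs gives a many-to-many structure; keeping only the final `y_t` would give many-to-one.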

RNN can be unfolded into different types of structure as shown in Figure 2.11: one-to-many, many-to-one, and many-to-many. An RNN is called a deep transition RNN if additional nonlinear layers are stacked within the hidden layer; it is called a deep output RNN if additional nonlinear layers are stacked between the output and the hidden layer (Rezk, et al., 2020).

Figure 2.11: (a) Typical RNN Structure (b) One-To-Many Temporal Structure of RNN (c) Many-To-One Temporal Structure of RNN (d) Many-To-Many Temporal Structure of RNN (Adapted from Rezk, et al., 2020; Su and Li, 2019)

2.5.3 Convolutional Neural Network

Convolutional neural network (CNN, or ConvNet) is a type of neural network designed specifically to process two-dimensional inputs such as images and videos. It is the first artificial neural network to truly accomplish DL, successfully training hierarchically structured layers in a robust way (Mishra and Gupta, 2017). An illustration of the CNN architecture is shown in Figure 2.12. The architecture is inspired by the structure of the visual system of humans and animals, discovered in 1962 by Hubel and Wiesel and modelled computationally in 1980 by Fukushima.
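The core operation that lets a CNN process two-dimensional inputs directly is the convolution, in which one small kernel slides over the whole image. A minimal NumPy sketch (the image and kernel values are illustrative, not from the text):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
edge_kernel = np.array([[1.0, -1.0]])             # responds to horizontal change
print(conv2d(image, edge_kernel).shape)           # (5, 4)
```

The same kernel weights are reused at every spatial position, which is the weight-sharing idea discussed further below.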

Through the discovery from the receptive fields of the cells in the primary visual cortex of a cat, Hubel and Wiesel proposed a hierarchy model of the visual neural network. The structure starts from the lateral geniculate body (LGB) to simple cells, followed by complex cells, lower-order hypercomplex cells, and finally, the higher-order hypercomplex cells (Fukushima, 1980). Then, Fukushima presented an artificial neural network model called Neocognitron that followed the work of Hubel and Wiesel. Neocognitron was one of the first models that could be simulated on a computer. It is also considered the earliest version of the CNN, since it was based on a hierarchical, multi-layered structure of neurons for image processing (Shamsaldin, et al., 2019).

CNNs remained obscure to the public until 1990, when LeCun, et al. (1990) brought the idea to the limelight by using a multi-layered artificial neural network, known as LeNet, to recognize and classify handwritten digits. LeNet was the first CNN architecture able to perform image classification using deep learning. It utilized an algorithm known as back-propagation to train the model, allowing patterns to be recognized from raw pixels. Though LeNet was incapable of solving complex classification problems, it instilled interest among the research community, paving the way for upcoming CNNs (Shamsaldin, et al., 2019).

One of the main attractions in employing CNNs is the concept of shared weights, which reduces the number of parameters that have to be learned, enabling better generalization and avoiding overfitting problems. The utilization of temporal and spatial relationships in the CNN architecture also contributes to reducing the number of parameters (Mishra and Gupta, 2017). Besides that, the classification stage is combined with the feature extraction stage, which expedites the training process and ensures the optimum output. Furthermore, CNNs also allow large-scale networks to be implemented more easily compared to other types of neural networks (Alzubaidi, et al., 2021). Hence, due to the exceptional performance that CNNs are able to provide, they are currently widely applied in multiple applications such as face detection, object detection, image classification, facial expression recognition, speech recognition, and so on (Indolia, et al., 2019).
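The parameter saving from weight sharing can be made concrete with a small back-of-the-envelope calculation (the image and kernel sizes are illustrative assumptions, not figures from the text):

```python
# Parameter count: fully connected vs shared-weight convolution,
# for a 32x32 grayscale image (illustrative numbers).
h, w = 32, 32
n_pixels = h * w

# A dense layer mapping every pixel to every unit of a 30x30 feature map
# needs one weight per (pixel, output) pair.
dense_params = n_pixels * (30 * 30)

# A single 3x3 convolution kernel producing the same 30x30 feature map
# reuses (shares) its 9 weights at every spatial position.
conv_params = 3 * 3

print(dense_params, conv_params)  # 921600 9
```

Even in this toy setting, sharing one kernel across the image cuts the learnable parameters by five orders of magnitude, which is why weight sharing helps generalization and limits overfitting.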

Figure 2.12: CNN Architecture (Adapted from Mishra and Gupta, 2017)

