
I. Introduction

1.3 Introduction to Machine Learning

1.3.4 Classification of Machine Learning

1.3.4.4 Deep Learning

Figure 1.3.16. Venn diagram of artificial intelligence.

1.3.4.4.1 Introduction to Deep Learning

If we look at ML technology from a different perspective, as shown in Figure 1.3.16, it can be classified into deep learning (DL) and traditional machine learning (TML).63-65 As mentioned earlier, in TML tasks such as classification or regression, the features of the training data are pre-defined, and a learning model is then generated to predict new data based on those features. In DL, on the other hand, the computer controls these procedures by itself.

[Figure 1.3.16 labels: artificial intelligence, a technique which enables machines to mimic human behavior; machine learning, which analyzes data, learns from it, and makes informed decisions based on the learned insights; deep learning, which makes multi-layer neural networks learn from vast data; traditional machine learning, single-task learning performed without considering knowledge previously learned in other tasks.]

For instance, suppose there is a task to distinguish species of irises. In TML, the features of the irises are pre-defined, and the machine then determines a discriminant that separates the iris species based on the dataset, whereas in DL the machine itself finds and identifies the features of the irises.

Figure 1.3.17. Representation of (a) a neuron and its synapses and (b) a neural network system composed of numerous neurons.

Historically, research in ML can be divided into three major paradigms: neural network (NN) modeling and decision-theoretic techniques, symbolic concept-oriented learning, and knowledge-intensive approaches combining various learning strategies.66 In the 2000s, the NN model was developed into deeper NNs, i.e., DL, owing to the introduction of new and innovative algorithms (the convolutional NN, recurrent NN, restricted Boltzmann machine, deep belief network, DQN, etc.) and improvements in computer performance using general-purpose graphics processing units. Here, an NN (or ANN) is an ML algorithm that imitates the human brain. As shown in Figure 1.3.17, the brain is composed of numerous neurons and the synapses that connect them; depending on how each neuron becomes active, the neurons connected behind it also determine whether they become active.

1.3.4.4.2. Basic Neural Network Model

As indicated in Figure 1.3.18, it is possible to establish an algorithm composed of neurons (or nodes) and synapses (or edges). Since each synapse has a different importance, each edge can also be assigned a different weight. Before explaining the principle, note that the term “deep” means many NN layers with many variables to consider in each layer. NNs consisting of two to three layers are called shallow learning (SHL), and networks with more layers than that are called DL.67 In Figure 1.3.18, the input layer accepts the input data, so its size corresponds to the number of features in the input data. The output layer of the model is directly connected to the property of the problem to be solved. The layers between the input and output layers are defined as hidden layers; in DL terminology, the depth, or number of layers, is obtained by adding one to the number of hidden layers. DL is also called a deep neural network for this reason. SHL, which consists of only a few NN layers, suffers from limited fields of application, and more complex problems require more NN layers and more edges connecting the nodes.


However, the complexity of the calculation is proportional to the square of the number of NN layers. Why, then, do we have to increase them? First, difficult problems can be solved by DL but not by SHL technology. Second, the brain structure that DL imitates is itself a deep architecture. In addition, the recognition process of animals and humans is organized into several hierarchies; that is, more complex concepts are abstracted by combining previously learned simple concepts in various ways.
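As a concrete illustration of this structure, the short Python sketch below (the layer sizes are arbitrary choices for this example, not values from the text) represents an MLP by one weight matrix per pair of adjacent layers and counts the depth as the number of hidden layers plus one:

```python
import numpy as np

# Arbitrary example: 4 input features, three hidden layers, 2 outputs.
layer_sizes = [4, 16, 16, 16, 2]

# One weight matrix (edges) and one bias vector per pair of adjacent layers.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

n_hidden = len(layer_sizes) - 2        # layers between input and output
depth = n_hidden + 1                   # depth = number of hidden layers + 1
n_parameters = sum(w.size + b.size for w, b in zip(weights, biases))

print(f"hidden layers: {n_hidden}, depth: {depth}, parameters: {n_parameters}")
```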

Figure 1.3.18. Schematic diagram of the multi-layer perceptron model and the information propagation process at the dashed region (between the input layer and the first hidden layer).

The multi-layer perceptron (MLP), known as the simplest NN model and belonging to SL technology, is briefly introduced in this section. The general NN model is a directed graph in which information propagation (IP) occurs in only one direction, as in Figure 1.3.18. The MLP is also a directed graph; there are no connections within the same layer, and edges are permitted only between adjacent layers. In this case, the IP proceeds in the forward direction, and such NNs are also called feed-forward networks. In the actual brain, each neuron becomes active, the result is transferred to the next neuron, that result is transferred to the neuron after it, and the information is processed according to the activation pattern of the neurons that make the final decision.

[Figure 1.3.18 node detail: inputs, weights, sum of the weighted inputs (net input), threshold, activation function, net output; layers labeled input, hidden, and output.]

Returning to the model diagram, the activation condition for the input data can be expressed as a function, called the activation function (AF). The simplest example is an AF that decides whether each node becomes active or not by summing all of its inputs and comparing the sum with a threshold. The following Eq. (1-20)–(1-23) show frequently used AFs, where $x$ is the net input formed from the variables at each edge.

Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ (1-20)

Tanh function: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (1-21)

Absolute function: $f(x) = |x|$ (1-22)

Rectified linear unit function: $f(x) = \max(0, x)$ (1-23)
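For reference, Eq. (1-20)–(1-23) can be transcribed directly into a short Python sketch (illustrative only; the test values are arbitrary):

```python
import numpy as np

def sigmoid(x):            # Eq. (1-20)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):               # Eq. (1-21)
    return np.tanh(x)

def absolute(x):           # Eq. (1-22)
    return np.abs(x)

def relu(x):               # Eq. (1-23)
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, absolute, relu):
    print(f.__name__, np.round(f(x), 3))
```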

Each node must introduce non-linearity: if a linear function is used as the AF, then no matter how many node layers are stacked, the whole network ultimately reduces to a single linear combination of its inputs. To build a model, the network shape composed of nodes and edges is defined first, and a proper AF is then selected for each node. The parameters of the model are the weights at each edge, and finding the most appropriate weights is the final purpose of training the model.
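The point about linear AFs can be checked numerically. The sketch below (with arbitrary random matrices) composes two layers that use a linear AF and shows that the result equals a single linear layer whose weight matrix is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)          # arbitrary input vector
W1 = rng.standard_normal((4, 8))    # first "linear" layer
W2 = rng.standard_normal((8, 3))    # second "linear" layer

two_layers = (x @ W1) @ W2          # stacking two layers with a linear AF
one_layer = x @ (W1 @ W2)           # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))   # True: no extra expressive power
```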

1.3.4.4.3 Inference via Neural Network

Figure 1.3.19. Inference process in the multi-layer perceptron. Pink circles correspond to activated nodes.

How does the model infer the final outputs, assuming that all parameters have been determined?


As represented in Figure 1.3.19, whether each node is activated is decided in turn, the final results are obtained at the output layer, and the inference is determined by analyzing them.
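As an illustrative sketch of this inference (forward) pass, with arbitrary layer sizes and random weights, each layer simply computes its weighted sums, applies the AF, and passes the result on until the output layer is reached:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Propagate an input vector through the network and return the output layer."""
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(activation @ W + b)   # net input -> activation function
    return activation

rng = np.random.default_rng(2)
layer_sizes = [4, 5, 5, 2]                         # input, two hidden layers, output
weights = [rng.standard_normal((i, o)) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(o) for o in layer_sizes[1:]]

print(forward(rng.standard_normal(4), weights, biases))
```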

1.3.4.4.4 Backpropagation Algorithm

Lastly, in this section we take a look at the algorithm used to find the weight parameters. As previously discussed, the final AFs are non-linear and the nodes are intricately intertwined, so the weight optimization corresponds to a non-convex optimization. Therefore, in practice, it is impossible to find the global optimum of the parameters. The gradient descent method (GDM), which converges to an appropriate value (e.g., near the global optimum), is commonly used as an alternative approach. GDM is an approach for obtaining the weights that minimize the difference between the estimated output that the model produces and the target output that we want. This difference is regarded as a loss, and it can be expressed as a mathematical formula defined as a loss function (LF); the shape of the LF is a second-order (parabolic) surface, and the final purpose is to find its minimum point. Supposing that the target output in d dimensions is $\mathbf{t} = (t_1, \ldots, t_d)$ and the estimated output is $\mathbf{y} = (y_1, \ldots, y_d)$, generally used LFs can be represented as the following Eq. (1-24)–(1-27).

Sum of squares (Euclidean) loss: $L = \dfrac{1}{2}\sum_{i=1}^{d} (t_i - y_i)^2$ (1-24)

SoftMax loss: $L = -\sum_{i=1}^{d} t_i \log\!\left(\dfrac{e^{y_i}}{\sum_{j=1}^{d} e^{y_j}}\right)$ (1-25)

Cross entropy loss: $L = -\sum_{i=1}^{d} \left[ t_i \log y_i + (1 - t_i)\log(1 - y_i) \right]$ (1-26)

Hinge loss: $L = \max(0,\, 1 - \mathbf{t}\cdot\mathbf{y})$, where $\cdot$ is the inner product (1-27)
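Written out in code, these LFs take the following form (an illustrative sketch using the standard definitions assumed above; the target and estimate values are arbitrary):

```python
import numpy as np

def sum_of_squares_loss(t, y):                 # Eq. (1-24)
    return 0.5 * np.sum((t - y) ** 2)

def softmax_loss(t, y):                        # Eq. (1-25): softmax followed by log-likelihood
    p = np.exp(y - y.max())
    p /= p.sum()
    return -np.sum(t * np.log(p))

def cross_entropy_loss(t, y):                  # Eq. (1-26): y interpreted as probabilities
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def hinge_loss(t, y):                          # Eq. (1-27): inner product of t and y
    return max(0.0, 1.0 - float(np.dot(t, y)))

t = np.array([1.0, 0.0])                       # target output
y = np.array([0.8, 0.1])                       # estimated output
for loss in (sum_of_squares_loss, softmax_loss, cross_entropy_loss, hinge_loss):
    print(loss.__name__, round(loss(t, y), 4))
```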

Once the LF is known, the gradients with respect to the given parameters are calculated, and the parameters can then be updated using them. The situation becomes more complex in the general case because this calculation procedure can be difficult. The backpropagation algorithm (BA) makes the complex gradient calculation very simple, and parallelization is also easy when obtaining the gradients of each parameter, so it is frequently used in practice. In the general case, the gradient with respect to the current parameter must be calculated in order to use the GDM; however, it becomes incredibly difficult to calculate this value in a complicated model.

In the BA, by contrast, the loss is first obtained using the current parameters, the degree to which each parameter influences that loss is calculated using the chain rule, and the weights are then updated by these values. Thus, the overall process of the BA can be divided into the two following phases: propagation and weight update.


Figure 1.3.20. Schematic diagram of the backpropagation algorithm.

1. Propagation

1) Forward propagation: The output is calculated from the input, and the loss is then obtained at each output node.

2) Backpropagation: The influence that the nodes in the previous layer have on the loss is estimated using the loss calculated at the output nodes and the weights at each edge.

2. Weight update

In calculus, the chain rule can be expressed as Eq. (1-28) in the form of Leibniz's notation: when a variable z depends on the variable y, which in turn depends on the variable x (i.e., y and z are dependent variables), z also depends on x via the intermediate variable y.

$\dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$ (1-28)

For the weight update process, the essential point is that the gradient of each parameter is calculated using the derivative with respect to the variable immediately preceding the parameter to be updated and the derivative of that previous variable with respect to the current parameter. The procedure is repeated while descending from the output layer. That is, the weights are continuously updated while going through the sequence output layer → hidden layer k, hidden layer k → hidden layer k-1, …, hidden layer 2 → hidden layer 1, hidden layer 1 → input layer. For instance, the derivative with respect to a weight adjacent to the output layer can be obtained directly, and the derivative with respect to a weight in hidden layer k can be calculated by multiplying the derivative with respect to the weight in hidden layer k-1 by the derivative of the AF in hidden layer k-1. The gradient at each edge can be obtained by repeating this process from the top layer down to the bottom layer, and the parameters can then be adjusted using these gradients.
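A compact illustrative sketch of the two phases, forward propagation followed by the chain-rule backward pass and the weight update, for a small sigmoid MLP with the sum-of-squares loss of Eq. (1-24); the network size, data, and learning rate are arbitrary choices for this example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
sizes = [2, 3, 2]                                        # input, one hidden layer, output
W = [rng.standard_normal((i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(o) for o in sizes[1:]]

x = np.array([0.5, -1.0])                                # example input
t = np.array([1.0, 0.0])                                 # example target output
lr = 0.1                                                 # learning rate

# --- Phase 1: propagation -------------------------------------------------
outs = [x]                                               # forward propagation, keep each layer's output
for Wl, bl in zip(W, b):
    outs.append(sigmoid(outs[-1] @ Wl + bl))
loss = 0.5 * np.sum((t - outs[-1]) ** 2)                 # sum-of-squares loss, Eq. (1-24)

# Backpropagation: delta = dLoss/d(net input) at each layer, via the chain rule.
delta = (outs[-1] - t) * outs[-1] * (1 - outs[-1])       # output layer
grads_W, grads_b = [], []
for l in reversed(range(len(W))):
    grads_W.insert(0, np.outer(outs[l], delta))          # dLoss/dW for this layer
    grads_b.insert(0, delta)
    if l > 0:                                            # propagate delta to the previous layer
        delta = (delta @ W[l].T) * outs[l] * (1 - outs[l])

# --- Phase 2: weight update ----------------------------------------------
for l in range(len(W)):
    W[l] -= lr * grads_W[l]
    b[l] -= lr * grads_b[l]

print("loss before update:", round(loss, 4))
```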


This is one iteration of the procedure, and it is repeated until the output is good enough. The whole procedure of the BA can be represented as in Figure 1.3.20. In this way, the BA only estimates the loss and continuously updates the weights in the direction that minimizes the loss; it does not change the weights directly. Furthermore, after the loss has been calculated once, the IP proceeds in the direction from the output layer toward the hidden layers, which is why the algorithm is called backpropagation.

Ideally, the gradients would be calculated for all of the training data and averaged to obtain an accurate gradient before the weights are updated. In practice, however, this is inefficient because of the enormous number of input data. The stochastic gradient descent method (SGDM) can be used as an alternative: instead of updating the weights with the gradient averaged over all input data (the so-called full batch), a mini-batch is formed from only some of the data, only the gradient of that one batch is calculated, and all parameters are updated with it. In the case of convex optimization, both SGDM and GDM converge to the same global optimum under specific conditions, but the actual model is not of the convex type. The convergence behavior changes depending on how the batch size is assigned; it is generally set as large as the computer's memory can handle.
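The mini-batch idea can be sketched as follows (an illustrative example on synthetic least-squares data; the batch size, learning rate, and data are arbitrary):

```python
import numpy as np

# Illustrative mini-batch SGD on a simple least-squares problem (synthetic data).
rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 5))               # 1000 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.01 * rng.standard_normal(1000)

w = np.zeros(5)                                  # parameters to learn
lr, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle, then split into mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of this one batch only
        w -= lr * grad                           # update using the batch gradient

print(np.round(w, 2))                            # close to true_w
```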

1.3.4.4.5. Example of Backpropagation Algorithm

Figure 1.3.21. (a) Example of the backpropagation algorithm and enlarged diagrams of the backpropagation procedure occurring at (b) the outside layers and (c) the inside layers, respectively.

This section introduces how to obtain the gradients in practice using the chain rule, with the example of Figure 1.3.21a.

[Figure 1.3.21: (a) network with input nodes I1 and I2, hidden nodes H3 and H4, and output nodes O5 and O6; weights $w_{13}$, $w_{14}$, $w_{23}$, $w_{24}$ connect the input and hidden layers, weights $w_{35}$, $w_{36}$, $w_{45}$, $w_{46}$ connect the hidden and output layers, an activation function is applied at each node, and target outputs are given at the output layer. (b) Backpropagation through $w_{35}$ near the output layer. (c) Backpropagation through $w_{13}$ inside the network, via $w_{35}$ and $w_{36}$.]

For our convenience, at each node the AF is denoted as the sigmoid function σ, the LF as the sum-of-squares loss, and the input and output values as In and Out, respectively; the relationship $Out_{3} = \sigma(In_{3})$ is satisfied at the H3 node. If the target output is defined as T, the LF can be expressed as Eq. (1-29), where the factor 1/2 has no meaning other than giving clean derivatives in the later calculation procedure, and $E_{O5}$ and $E_{O6}$ are the LFs of the O5 and O6 nodes, respectively.

$E_{\mathrm{total}} = E_{O5} + E_{O6} = \dfrac{1}{2}\left(T_{5} - Out_{5}\right)^{2} + \dfrac{1}{2}\left(T_{6} - Out_{6}\right)^{2}$ (1-29)

Here, the values we want to find are the gradients $\partial E_{\mathrm{total}}/\partial w$ for each weight. As shown in Figure 1.3.21b, assume that we calculate $\partial E_{\mathrm{total}}/\partial w_{35}$; by the chain rule it can be written as the following Eq. (1-30).

$\dfrac{\partial E_{\mathrm{total}}}{\partial w_{35}} = \dfrac{\partial E_{\mathrm{total}}}{\partial Out_{5}} \cdot \dfrac{\partial Out_{5}}{\partial In_{5}} \cdot \dfrac{\partial In_{5}}{\partial w_{35}}$ (1-30)

That is, the targeted derivative can be obtained by calculating three different derivatives. It is noted that $Out_{3}$, $In_{5}$, and $Out_{5}$ are values already obtained in the previous propagation step. The processes to calculate the three derivatives are as follows:

1) $\dfrac{\partial E_{\mathrm{total}}}{\partial Out_{5}} = -\left(T_{5} - Out_{5}\right) = Out_{5} - T_{5}$

2) $\dfrac{\partial Out_{5}}{\partial In_{5}} = Out_{5}\left(1 - Out_{5}\right)$, since $Out_{5} = \sigma(In_{5})$

3) $\dfrac{\partial In_{5}}{\partial w_{35}} = Out_{3}$, since $In_{5} = w_{35}\,Out_{3} + w_{45}\,Out_{4}$

We can also obtain $\partial E_{\mathrm{total}}/\partial w_{13}$, the weight gradient inside the layers, with the example of Figure 1.3.21c; it follows Eq. (1-31) by the chain rule.

$\dfrac{\partial E_{\mathrm{total}}}{\partial w_{13}} = \dfrac{\partial E_{\mathrm{total}}}{\partial Out_{3}} \cdot \dfrac{\partial Out_{3}}{\partial In_{3}} \cdot \dfrac{\partial In_{3}}{\partial w_{13}}$ (1-31)

Likewise, the processes to calculate the three derivatives can be represented as follows:

1) $\dfrac{\partial E_{\mathrm{total}}}{\partial Out_{3}} = \dfrac{\partial E_{O5}}{\partial Out_{3}} + \dfrac{\partial E_{O6}}{\partial Out_{3}}$, where, by the chain rule, $\dfrac{\partial E_{O5}}{\partial Out_{3}} = \dfrac{\partial E_{O5}}{\partial Out_{5}} \cdot \dfrac{\partial Out_{5}}{\partial In_{5}} \cdot \dfrac{\partial In_{5}}{\partial Out_{3}} = \left(Out_{5} - T_{5}\right) Out_{5}\left(1 - Out_{5}\right) w_{35}$ and, likewise by the chain rule, $\dfrac{\partial E_{O6}}{\partial Out_{3}} = \left(Out_{6} - T_{6}\right) Out_{6}\left(1 - Out_{6}\right) w_{36}$

2) $\dfrac{\partial Out_{3}}{\partial In_{3}} = Out_{3}\left(1 - Out_{3}\right)$

3) $\dfrac{\partial In_{3}}{\partial w_{13}} = Out_{1}$, since $In_{3} = w_{13}\,Out_{1} + w_{23}\,Out_{2}$
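These hand-derived expressions can be checked numerically. The sketch below (with arbitrary example values for the inputs, weights, and targets of the network in Figure 1.3.21a) evaluates Eq. (1-30) and Eq. (1-31) via the chain-rule factors above and compares them with finite-difference estimates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary example values for the network of Figure 1.3.21a.
out1, out2 = 0.05, 0.10                      # input node values I1, I2
w13, w23, w14, w24 = 0.15, 0.20, 0.25, 0.30  # input -> hidden weights
w35, w45, w36, w46 = 0.40, 0.45, 0.50, 0.55  # hidden -> output weights
T5, T6 = 0.01, 0.99                          # target outputs

def total_error(w13_, w35_):
    """Forward pass; w13 and w35 are arguments so the loss can be probed w.r.t. them."""
    out3 = sigmoid(w13_ * out1 + w23 * out2)
    out4 = sigmoid(w14 * out1 + w24 * out2)
    out5 = sigmoid(w35_ * out3 + w45 * out4)
    out6 = sigmoid(w36 * out3 + w46 * out4)
    return 0.5 * (T5 - out5) ** 2 + 0.5 * (T6 - out6) ** 2, out3, out4, out5, out6

E, out3, out4, out5, out6 = total_error(w13, w35)

# Chain-rule gradients, Eq. (1-30) and Eq. (1-31)
dE_dw35 = (out5 - T5) * out5 * (1 - out5) * out3
dE_dout3 = ((out5 - T5) * out5 * (1 - out5) * w35
            + (out6 - T6) * out6 * (1 - out6) * w36)
dE_dw13 = dE_dout3 * out3 * (1 - out3) * out1

# Finite-difference checks
eps = 1e-6
num_dw35 = (total_error(w13, w35 + eps)[0] - total_error(w13, w35 - eps)[0]) / (2 * eps)
num_dw13 = (total_error(w13 + eps, w35)[0] - total_error(w13 - eps, w35)[0]) / (2 * eps)

print(round(dE_dw35, 6), round(num_dw35, 6))   # the two values agree
print(round(dE_dw13, 6), round(num_dw13, 6))
```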


1.3.5 Machine Learning: Consideration and Evaluation