I. Introduction
1.3 Introduction to Machine Learning
1.3.4 Classification of Machine Learning
1.3.4.1 Supervised Learning
The machine learns from a data set that contains the correct answers. For example, after learning from multiplication-table examples such as 3 × 5 = 15 and 6 × 4 = 24, it can solve a new problem such as 9 × 3 = ?. In this case, 3 × 5 is the feature and 15 is the label.
1.3.4.1.1 Classification
The output must belong to one of the labels in the training data, since the label of new data is predicted according to which group it belongs to after learning from the training data. For example, the output can be one of two answers, such as determining pass or failure, or one of multiple answers, such as predicting one's hometown. Classification is commonly used in filtering spam e-mails, categorizing images, recognizing handwriting for automatic mail sorting, etc. Representative algorithms include kNN and the SVM (for classification, the support vector classifier, SVC), among others.
[Figure: classification of machine learning into supervised, unsupervised and reinforcement learning. Supervised learning trains on a labelled data set (data already tagged with the correct answer), i.e. a paired data set of features (X) and labels (Y), and comprises classification and regression. Unsupervised learning trains on an unlabelled (unpaired) data set with no correct answers and finds patterns or trends in the given data for useful categorization, e.g. clustering. Reinforcement learning is based on behavioral psychology: guided by a given problem, an agent interacts with an environment through states (S), actions (A) and rewards (R) and finds a solution through trial and error, e.g. the Deep-Q Network (DQN).]
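The agent-environment loop of reinforcement learning summarized in the figure can be sketched as a minimal toy example (not from the original text); the one-dimensional environment, reward definition and random action policy below are invented placeholders for illustration only.

```python
# Minimal agent-environment loop sketch (illustrative only): the S/A/R cycle
# of reinforcement learning. Environment and policy are hypothetical.
import random

def step(state, action):
    """Hypothetical environment: returns (next_state, reward)."""
    next_state = state + action
    reward = 1.0 if next_state == 0 else -0.1  # reward for reaching state 0
    return next_state, reward

state = 5  # initial state S_0
for t in range(20):
    action = random.choice([-1, 1])            # A_t: trial-and-error choice
    state, reward = step(state, action)        # environment returns R_{t+1}, S_{t+1}
    if state == 0:                             # goal reached
        break
```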
1.3.4.1.1.1 k-Nearest Neighbor
To predict the group a new data point belongs to, the training data nearest to the new data are identified first. As a cornerstone of the implementation, the training data are quantified in an n-dimensional space under the supposition of n characteristics (features). Given new input data, a virtual circle around the data is expanded until k data points are found. Since these k data points belong to the labelled groups, the group of the new input data is assigned by counting the number of the k data points in each group and taking the group with the highest count. In other words, the 'k' in 'kNN' indicates that the k adjacent data points are found to predict the group the new input data belongs to.
Figure 1. 3. 8. Classification process of k-nearest neighbor.
For example, at k = 3, to determine the group a new input data point belongs to, the three data points closest to it are counted: in Figure 1. 3. 8, one of them belongs to one labelled group and two belong to the other, so the new data point is assigned to the latter group.
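As a minimal sketch of this procedure (not part of the original text), the classification step can be reproduced with scikit-learn's KNeighborsClassifier; the toy two-group data set below is invented for illustration.

```python
# Minimal kNN sketch (illustrative only): two labelled groups in a 2-D plane.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: features (x, y coordinates) and group labels (0 or 1).
X_train = np.array([[1.0, 1.2], [1.5, 0.8], [1.2, 1.5],   # group 0
                    [3.0, 3.2], [3.5, 2.8], [3.2, 3.5]])  # group 1
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 3: the three nearest training points vote on the group of a new point.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

new_point = np.array([[2.8, 3.0]])
print(knn.predict(new_point))  # -> [1], the majority group among the 3 neighbors
```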
1.3.4.1.1.2 Support Vector Machine
It is supposed that the training data lie in a vector space. SVM is a geometric algorithm that finds a linear classifier separating the n groups the training data belong to. For instance, the purpose is to calculate the linear classifier $\mathbf{w} \cdot \mathbf{x} + b = 0$ separating the two groups in Figure 1. 3. 9. In this equation, $\mathbf{w}$ is a normal vector crossing the linear classifier at right angles (it rotates the classifier), $b$ is a scalar constant (it translates the classifier in parallel), and $\mathbf{x}$ is a data vector. If an input vector $\mathbf{x}$ we want to predict lies on the classifier, then $\mathbf{w} \cdot \mathbf{x} + b = 0$. By the same approach, if the vector lies in one region, then $\mathbf{w} \cdot \mathbf{x} + b > 0$, and if it lies in the other region, then $\mathbf{w} \cdot \mathbf{x} + b < 0$. It is difficult to determine the correct region for a vector close to the border between the two regions. The core concept of SVM is to solve this problem by finding the linear classifier that keeps the two groups as far apart as possible,55 because input data can be predicted with higher accuracy the farther the two groups are from each other.
Figure 1. 3. 9. Classification process of support vector machine.
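As an illustrative sketch (not from the original text), a linear SVM of this kind can be trained with scikit-learn's SVC; the toy data below are invented, and the classifier parameters $\mathbf{w}$ and $b$ are recovered from the fitted model's coef_ and intercept_ attributes.

```python
# Minimal linear-SVM sketch (illustrative only): find w and b of the
# classifier w.x + b = 0 that maximizes the margin between two groups.
import numpy as np
from sklearn.svm import SVC

# Toy training data: two linearly separable groups in a 2-D plane.
X_train = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],   # group -1
                    [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]])  # group +1
y_train = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear")  # maximum-margin linear classifier
svm.fit(X_train, y_train)

w, b = svm.coef_[0], svm.intercept_[0]
print("w =", w, "b =", b)

# The sign of w.x + b tells which side of the classifier a new point lies on.
new_point = np.array([[3.0, 3.0]])
print(np.sign(new_point @ w + b))  # +1 -> group +1
```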
1.3.4.1.2 Regression
Regression predicts continuous values depending on the features of the training data, and its purpose is to express the relationship between feature and label as a functional formula. Labels are always continuous, so output values can lie within the range of the training data set. It is utilized in predicting patterns or trends and in assessing, for example, the housing price per size. Representative techniques include linear regression (LR), kernel ridge regression (KRR), random forest (RF) regression, extra trees (ET) regression, gradient tree boosting regression, etc.
1.3.4.1.2.1 Linear Regression
LR can be classified into simple and multiple LR;56 simple LR is treated in depth in this section and further described in section 1.3.5.3.4. It is the simplest algorithm for understanding the relationship between an independent and a dependent variable. The regression formula can be expressed as the following linear Eq. (1-15):

$y = \beta_0 + \beta_1 x$  (1-15)

This is called the regression of $y$ on (to) $x$. In the equation, $\beta_0$ and $\beta_1$ are the parameters we want to know, and a standard approach to determine them is the method of least squares. Figure 1. 3. 10 represents simple LR: the blue dots indicate the given data, and the red line corresponds to the regression function. At the same given $x$, the difference between the $\hat{y}$ estimated by the linear equation and the given $y$ is defined as the residual. It is denoted as $e_i$ for the i-th data point and represented in the following Eq. (1-16):

$e_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1 x_i)$  (1-16)

To quantify the residuals, each $e_i$ is calculated and then the squares of all the $e_i$ are summed over the whole data set. This sum is called the RSS (Residual Sum of Squares) or SSE (Sum of Squared Errors) and can be obtained from Eq. (1-17):

$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$  (1-17)
Figure 1. 3. 10. Example of simple linear regression.
$\hat{\beta}_0$ and $\hat{\beta}_1$ can be obtained by taking the partial derivative of the RSS with respect to each parameter and then calculating the parameter values for which it is zero. When the slope (the rate of change of the RSS with respect to $\beta_0$ and $\beta_1$) is zero, the RSS is at a minimum or maximum point, and since the RSS is a quadratic function whose highest-degree terms in $\beta_0$ and $\beta_1$ are positive, its parabola opens upwards, so the parameters obtained correspond to the minimum. The corresponding procedures are as noted below.

1) Finding $\hat{\beta}_0$ which satisfies $\partial \mathrm{RSS} / \partial \beta_0 = 0$:

$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$  (1-18)

2) Finding $\hat{\beta}_1$ which satisfies $\partial \mathrm{RSS} / \partial \beta_1 = 0$:

$\frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$  (1-19)

(For both $\hat{\beta}_0$ and $\hat{\beta}_1$, $\bar{x}$ and $\bar{y}$ are the averages over the whole data set.)
An important characteristic of the method of least squares is that the regressed linear function is unbiased, i.e., if the method of least squares is applied to numerous data sets, the averaged regression function will be the same as the real one.
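As a minimal sketch (not from the original text), the least-squares estimates of Eqs. (1-18) and (1-19) can be computed directly with NumPy; the toy data set below is invented for illustration.

```python
# Minimal simple-LR sketch (illustrative only): least-squares estimates
# following Eqs. (1-18) and (1-19).
import numpy as np

# Toy data (invented): y roughly follows 2 + 0.5 * x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, x.size)

# Eq. (1-19): slope from centered covariance over centered variance.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Eq. (1-18): intercept from the averages.
beta0 = y.mean() - beta1 * x.mean()

# Eqs. (1-16)/(1-17): residuals and the residual sum of squares.
residuals = y - (beta0 + beta1 * x)
rss = np.sum(residuals ** 2)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, RSS = {rss:.3f}")
```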
1.3.4.1.2.2 Random Forest Regression
Figure 1. 3. 11. Schematic structure of random forest regression.
It is one of the ensemble methods frequently used in classification and regression analysis; the individual trees have slightly different characteristics due to randomization. This causes the predictions of the trees to be decorrelated, resulting in improved generalization.57 The randomization proceeds during the training of each tree and improves robustness against outliers. The value of a target variable is predicted by a hierarchy of if/else questions in each tree, and the output is the mean prediction of the individual trees (see Figure 1. 3. 11).58 Significant hyperparameters for training an RF regression algorithm are the number of decision trees in the forest, the maximum number of features considered by each tree, the maximum depth of a tree, the minimum number of data points required to become a leaf node, and the minimum number of samples required to split a node.58, 59 RF regression normally applies general techniques such as bagging (bootstrap aggregation), randomized node optimization, etc.
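As an illustrative sketch (not from the original text), the hyperparameters listed above map onto scikit-learn's RandomForestRegressor as shown below; the toy data set and the particular parameter values are invented for illustration.

```python
# Minimal RF-regression sketch (illustrative only), mapping the hyperparameters
# named above to scikit-learn's RandomForestRegressor arguments.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (invented): a noisy nonlinear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

rf = RandomForestRegressor(
    n_estimators=100,     # number of decision trees in the forest
    max_features=1.0,     # maximum fraction of features considered per split
    max_depth=8,          # maximum depth of a tree
    min_samples_leaf=2,   # minimum number of data points to become a leaf node
    min_samples_split=4,  # minimum number of samples to split a node
    bootstrap=True,       # bagging: each tree trains on a bootstrap sample
    random_state=0,
)
rf.fit(X, y)

# The forest's output is the mean prediction of the individual trees.
print(rf.predict([[5.0]]))
```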