
4.6 SEMI-SUPERVISED SUPPORT VECTOR MACHINES (S3VM)

In recent decades, the ways of collecting data have become more diverse, and the amount of data is growing exponentially. With the rapid development of various technologies, it has become easy to collect large amounts of data. To build good predictive models, it is necessary to have labeled training data sets. However, because the data obtained in most modern applications are extremely massive, it is unfeasible to invest a lot of resources in labeling them. It often happens that collecting unlabeled data samples is cheap, but obtaining the labels costs a lot of time, effort, or money. This is the case in many application areas of machine learning, and the following examples are just a few illustrations in a big data environment:

• In speech recognition, it costs almost nothing to record huge amounts of speech, but labeling it requires a human to listen to it and type a transcript.

• Billions of Web pages are directly available for automated processing, but to classify them reliably, humans have to read them.

• Protein sequences are nowadays acquired at industrial speed (by genome sequencing, computational gene finding, and automatic translation), but resolving a 3D structure or determining the functions of a single protein may require very significant scientific work.

Based on the characteristics of training data sets, the classification of machine-learning tasks can be extended from two into three main types: unsupervised learning, supervised learning, and a new type, so-called semi-supervised learning (SSL). While supervised and unsupervised learning techniques were introduced earlier, this section explains the main ideas behind SSL. Essentially, the learning models in this case are similar to models in supervised learning with labeled samples; only this time the model is enhanced using a large amount of cheap unlabeled data.

SSL represents one of the research focuses in machine learning in recent years, and it has attracted much attention in many application fields ranging from bioinformatics to Web mining. These are disciplines where it is easy to obtain unlabeled samples, while labeling requires significant effort, expertise in the field, and time. Just imagine labeling millions of emails as spam or not spam to create a high-quality automatic classification system. SSL is a machine-learning approach that combines unsupervised learning and supervised learning. The basic idea is to use a large number of unlabeled data samples to help the supervised learning method improve modeling results. More formally, in SSL there are a labeled data set L = {(x1, y1), (x2, y2), …, (xm, ym)} and an unlabeled data set U = {x1, x2, …, xn}, where m ≪ n, x is a d-dimensional input vector, and y are labels. The task is to determine a function f: X → Y, which could accurately predict a label y for each sample x ∈ X. Since unlabeled data carry less information about the function f than labeled data, they are required in large amounts in order to increase the prediction accuracy of the model.
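To make this setup concrete, here is a minimal sketch (an added illustration, not from the original text) of how such a data set might be represented in Python. It assumes the scikit-learn library and its convention of marking unlabeled samples with the label -1; the make_moons toy data set stands in for real application data:

import numpy as np
from sklearn.datasets import make_moons

# Hypothetical two-class data set of 500 samples.
X, y_true = make_moons(n_samples=500, noise=0.1, random_state=42)

# Keep labels for only m = 10 samples; the remaining n = 490 samples
# form the unlabeled set U, marked with -1 (scikit-learn convention).
rng = np.random.default_rng(42)
labeled_idx = rng.choice(len(X), size=10, replace=False)
y = np.full(len(X), -1)               # all samples start unlabeled
y[labeled_idx] = y_true[labeled_idx]  # small labeled set L

print(f"labeled m = {(y != -1).sum()}, unlabeled n = {(y == -1).sum()}")

The variables X and y defined here are reused in the sketches that follow.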

SSL will be mostly useful whenever there are far more unlabeled data samples than labeled ones. This big data assumption implies the need for fast and very efficient SSL algorithms.

An illustrative example of the influence of unlabeled data in SSL is given in Figure 4.27. Figure 4.27a shows a decision boundary we might adopt after seeing only one positive (white circle) and one negative (black circle) example. Figure 4.27b shows a decision boundary we might adopt if, in addition to the two labeled examples, we were given a collection of unlabeled data (gray circles). This could be viewed as performing clustering of the unlabeled samples and then labeling the clusters by synchronizing these labels with the given labeled data. This labeling process makes it possible to push the decision boundary away from high-density regions. In both cases in Figure 4.27, with or without unlabeled samples, the maximum margin principle has determined the final decision boundaries.

Figure 4.27. Classification model using labeled and unlabeled samples. (a) Model based on only labeled samples. (b) Model based on labeled and unlabeled samples.
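The clustering view described above can be turned into a few lines of code. The following sketch (an added illustration; it reuses the hypothetical X and y from the previous listing and assumes scikit-learn's KMeans) clusters all samples and then labels each cluster by "synchronizing" with the labeled data, here by taking the label of the labeled sample nearest to the cluster center:

import numpy as np
from sklearn.cluster import KMeans

# Cluster all samples, labeled and unlabeled, into two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Label each cluster with the label of the closest labeled sample.
X_lab, y_lab = X[y != -1], y[y != -1]
cluster_label = {}
for c, center in enumerate(kmeans.cluster_centers_):
    nearest = np.argmin(np.linalg.norm(X_lab - center, axis=1))
    cluster_label[c] = y_lab[nearest]

# Propagate the cluster labels to every sample.
y_pred = np.array([cluster_label[c] for c in kmeans.predict(X)])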

In order to evaluate an SSL model, it is necessary to make a comparison with the results of a supervised algorithm that uses only the labeled data. The question is, can an SSL implementation produce a more accurate prediction and a better model by taking into account the unlabeled points? In principle, the answer is "yes." However, there is an important condition for reaching this improved solution: the distribution of unlabeled samples has to be relevant for the classification problem. Using a more mathematical formulation, one could say that the knowledge of the p(x) distribution, which is gained through the unlabeled data, has to carry useful information for the inference of p(y|x). If this is not the case, SSL will not yield an improvement over supervised learning. It might even happen that using the unlabeled data degrades the prediction accuracy by misguiding the inference.
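Such a comparison can be run directly on the toy data from the earlier listings. The sketch below is an added illustration: note that scikit-learn's SelfTrainingClassifier implements self-training, a different SSL technique than S3VM, but it serves to show the supervised-versus-semi-supervised comparison. The accuracy here is measured against the held-back true labels y_true, which would not be available in practice:

from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

# Supervised baseline: SVM trained on the few labeled samples only.
svm = SVC(probability=True, random_state=42)
svm.fit(X[y != -1], y[y != -1])
print("supervised accuracy:     ", svm.score(X, y_true))

# Semi-supervised model: iteratively labels high-confidence
# unlabeled samples (y == -1) and retrains the base SVM.
ssl = SelfTrainingClassifier(SVC(probability=True, random_state=42))
ssl.fit(X, y)
print("semi-supervised accuracy:", ssl.score(X, y_true))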

SSL includes techniques such as semi-supervised support vector machines (S3VM), self-training algorithms, generative models, graph-based algorithms, and multi-view approaches. This short review gives some additional details about S3VM. S3VM is an extension of the standard SVM methodology that uses the additionally available unlabeled data. This approach implements the cluster assumption for SSL, that is, examples in the same data cluster have similar labels, so classes are well separated and the decision boundary does not cut through dense regions of unlabeled data. The main goal of S3VM is to build a classifier using both labeled and unlabeled data. Similar to the main idea of SVM, S3VM requires the maximum margin to separate the training samples, including all labeled and unlabeled data. The basic principle of S3VM is presented in Figure 4.28.

If the learning process is based only on the labeled samples, represented by small circles with + or −, the SVM model with maximum separation is given with dashed lines. If unlabeled samples are taken into modeling and density is accepted as a criterion for the separation margin, then the maximized classification margin is transformed into the parallel solid lines.

Figure 4.28. Semi-supervised model improves classification. (a) Supervised model trained on labeled samples alone. (b) Semi-supervised model also uses unlabeled samples.
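The principle in Figure 4.28 corresponds to a concrete optimization problem. The following sketch is an added illustration using the commonly cited S3VM formulation, not a formulation quoted from this text: the standard hinge loss on labeled samples is combined with a "hat loss" on unlabeled samples that penalizes boundaries passing through dense regions. The penalty constants C and Cu and the linear model are assumptions of this illustration:

import numpy as np
from scipy.optimize import minimize

def s3vm_objective(params, X_lab, y_lab, X_unl, C=1.0, Cu=0.5):
    # Linear boundary w.x + b = 0; params = [w_1, ..., w_d, b].
    w, b = params[:-1], params[-1]
    margins_lab = y_lab * (X_lab @ w + b)
    margins_unl = np.abs(X_unl @ w + b)
    return (0.5 * w @ w
            + C * np.sum(np.maximum(0, 1 - margins_lab))    # hinge loss, labeled
            + Cu * np.sum(np.maximum(0, 1 - margins_unl)))  # hat loss, unlabeled

# Reuse the hypothetical X, y from earlier; S3VM expects labels -1/+1.
X_lab, X_unl = X[y != -1], X[y == -1]
y_lab = np.where(y[y != -1] == 0, -1, 1)

# Nelder-Mead copes with the non-smooth hinge terms on this small problem;
# the objective is non-convex, so only a local optimum is guaranteed.
res = minimize(s3vm_objective, x0=np.ones(X.shape[1] + 1),
               args=(X_lab, y_lab, X_unl), method="Nelder-Mead")
w_opt, b_opt = res.x[:-1], res.x[-1]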

S3VM shows satisfactory results only if two assumptions about unlabeled data are satisfied:

1. Continuity assumption: Unlabeled samples in n-dimensional space that are close to each other are more likely to share the same label. This is also generally assumed in supervised learning, and it yields a preference for geometrically simple decision boundaries. In the case of SSL, the smoothness assumption represents an extension that additionally yields a preference for classification boundaries in low-density regions, so that there are fewer samples close to each other belonging to different classes.

2. Cluster assumption: The data tend to form discrete clusters, and points in the same cluster are more likely to share the same label. Label sharing may be spread across multiple clusters. For example, if the unlabeled samples are organized into X clusters, then X − Y clusters may belong to one class (one label), and the remaining Y clusters will belong to the other class (the example is for two-class problems!). The cluster assumption is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.

With the assumption that the smoothness and clustering requirements are satisfied, the core steps of the S3VM algorithm are to (a literal implementation of these steps is sketched after the list):

1. Enumerate all 2^u possible labelings of the u unlabeled samples Xu (exponential complexity of the problem: it requires analysis of all alternatives in labeling the unlabeled samples).


2. Build one standard SVM for each labeling case from the previous step (together with the labeled samples Xl).

3. Pick the SVM with the largest margin.
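For a toy problem with only a handful of unlabeled samples, these three steps can be implemented literally. The sketch below is an added illustration assuming scikit-learn; the function name and its inputs (labeled samples X_l with labels y_l in {-1, +1} and unlabeled samples X_u) are hypothetical:

import itertools
import numpy as np
from sklearn.svm import SVC

def exhaustive_s3vm(X_l, y_l, X_u):
    # Naive S3VM: try all 2^u labelings of X_u; keep the max-margin SVM.
    X_all = np.vstack([X_l, X_u])
    best_svm, best_margin = None, -np.inf
    # Step 1: enumerate all 2^u candidate labelings of the unlabeled set.
    for labeling in itertools.product([-1, 1], repeat=len(X_u)):
        y_all = np.concatenate([y_l, labeling])
        # Step 2: train a standard linear SVM on this labeling.
        svm = SVC(kernel="linear", C=1.0).fit(X_all, y_all)
        # Step 3 criterion: the margin of a linear SVM is 2 / ||w||.
        margin = 2.0 / np.linalg.norm(svm.coef_)
        if margin > best_margin:
            best_svm, best_margin = svm, margin
    return best_svm

Because the loop runs 2^u times, this is feasible only for very small u; practical S3VM implementations replace the enumeration with the approximate optimization methods discussed next.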

Obviously, the algorithm suffers from a combinatorial explosion of alternatives in the first step! A variety of optimization methods are applied instead, and different S3VM implementations show reasonable performance in practice (for example, the min-cut approach in a graph of unlabeled and labeled samples). S3VM still has a serious deficiency in the case of big (unlabeled) data: the methods require extremely long training times, and this is currently the biggest challenge in all implementations of S3VM.

In general, there is no uniform SSL solution for all applications where both unlabeled and labeled data are available. Depending on the available data and knowledge about the problem, we may use different techniques and approaches: from standard supervised learning (such as SVM) when there are enough labeled samples, to different SSL techniques, including S3VM, when the two assumptions about the data set are satisfied. If SSL is used, as a first step verify and discuss the solution by comparing it with other approaches, including supervised and even unsupervised learning models. Figure 4.29 gives illustrative examples of cases where S3VM is and is not applicable.