3.1 Multi-Label Classification Techniques
3.1.1 Problem Transformation Methods
Binary Relevance (BR)
The first multi-label classification technique tested was binary relevance.
The binary relevance (BR) technique transforms the multi-label problem into a set of single-label (binary) problems [20]. That is, for each label in the multi-label dataset a separate binary classifier is trained. To classify an unseen example, each binary classifier is consulted for a prediction of the label it was trained to identify, and the final multi-label prediction is simply the concatenation of these binary predictions. This mechanism is illustrated in figure 3.1.
Because this technique is independent of any specific binary classification algorithm, any such algorithm can be used to train the binary classification models. The two algorithms selected as the so-called base classifiers are discussed in the next two sections.
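To make the transformation concrete, the sketch below implements binary relevance as a thin wrapper around an arbitrary base classifier. It is a minimal illustration only, assuming a scikit-learn-style classifier interface (`fit`/`predict`); the class and variable names are illustrative and not taken from any particular library.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB


class BinaryRelevance:
    """Train one independent binary classifier per label (the BR transformation)."""

    def __init__(self, base_classifier):
        self.base_classifier = base_classifier
        self.models_ = []

    def fit(self, X, Y):
        # Y is an (n_examples, n_labels) binary indicator matrix.
        self.models_ = []
        for j in range(Y.shape[1]):
            model = clone(self.base_classifier)
            model.fit(X, Y[:, j])              # one binary problem per label
            self.models_.append(model)
        return self

    def predict(self, X):
        # Concatenate the per-label binary predictions column-wise.
        return np.column_stack([m.predict(X) for m in self.models_])


# Usage: 100 random examples with 5 features and 3 labels.
X = np.random.rand(100, 5)
Y = (np.random.rand(100, 3) > 0.5).astype(int)
predictions = BinaryRelevance(GaussianNB()).fit(X, Y).predict(X[:2])
print(predictions)   # two rows, each the concatenation of 3 binary predictions
```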
Support Vector Machine (BR-SVM) Support vector machines (SVMs) are classification models that are commonly used in pattern recognition problems [84, 85]. SVMs map the feature space into a much higher dimensional space using a technique referred to as the kernel trick, which enables this mapping to be performed implicitly (without having to explicitly calculate the mapping).
If the (now single-label, since the binary relevance transformation has been applied) dataset has $n$ examples, each with $p$ features, it can be represented as shown below:
$$D = \{(x_1, \lambda_1), (x_2, \lambda_2), \ldots, (x_n, \lambda_n)\}, \qquad x_i \in \mathbb{R}^p$$
where the $\lambda_i$ are the binary label values for each example $x_i$.
Figure 3.1. Binary relevance classifier. The solid arrows indicate input data flow while the dashed arrows indicate output data flow. The $y_i$ represent the features and $Z$ represents the set of labels to be predicted.
SVMs construct a maximum margin hyperplane in the higher dimensional space by solving the following optimisation problem:
$$\min_{w,\,b,\,\xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \lambda_i \left( w^T \Phi(x_i) + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \tag{3.1}$$
where $w$ is the normal vector to the hyperplane, $C > 0$ is the cost parameter of the error term and $b$ is the hyperplane offset.
Given the optimisation problem described in equation 3.1, the kernel function $K$ is defined as
$$K(x_i, x_j) \equiv \Phi(x_i)^T \Phi(x_j)$$
A number of kernel functions have been proposed, but it has been suggested [86] that the radial basis function (RBF) kernel is a reasonable choice. The RBF kernel function is defined in equation 3.2.
$$K(x_i, x_j) = \exp\!\left( -\gamma \, \| x_i - x_j \|^2 \right), \quad \gamma > 0 \tag{3.2}$$
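As an illustration only (and not the exact configuration used in this study), the snippet below computes the RBF kernel of equation 3.2 directly and shows how the cost parameter $C$ from equation 3.1 and the kernel parameter $\gamma$ would be passed to an off-the-shelf SVM implementation; scikit-learn's SVC is assumed here, and all parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC


def rbf_kernel(x_i, x_j, gamma=0.5):
    """The RBF kernel of equation 3.2: exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))


# Hypothetical binary training data, i.e. one label after the BR transformation.
X = np.random.rand(50, 4)
y = (np.random.rand(50) > 0.5).astype(int)

# C is the cost parameter of equation 3.1; gamma is the RBF parameter of equation 3.2.
svm = SVC(kernel="rbf", C=1.0, gamma=0.5)
svm.fit(X, y)

print(rbf_kernel(X[0], X[1]))   # kernel value for two training examples
print(svm.predict(X[:3]))       # binary predictions for three examples
```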
Naive Bayes (BR-NB) A naive Bayes classifier models the probability of the class variable under the simplifying assumption that the features in the feature vector are conditionally independent given the class [87]. The probabilistic model used to generate the predictions is derived from Bayes’ theorem, which is defined in equation 3.3.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3.3}$$
The term $P(A \mid B)$ is the posterior probability and can be interpreted as the probability of proposition $A$ given that proposition $B$ is true. The term $P(A)$ is the prior probability of $A$, or the probability of proposition $A$ without having any information about $B$. In terms of binary classification the theorem can be rewritten as follows.
$$P(C = \lambda_j \mid x_i = y) = \frac{P(x_i = y \mid C = \lambda_j)\, P(C = \lambda_j)}{P(x_i = y)} \tag{3.4}$$
Equation 3.4 can be interpreted as the probability that the class ($C$) has the value $\lambda_j$ given that the feature vector ($x$) has the values $y$. Expanding this to explicitly include each individual feature gives the following (omitting the variable assignments):
$$P(\lambda_j \mid y_1, y_2, \ldots, y_p) = \frac{P(y_1, y_2, \ldots, y_p \mid \lambda_j)\, P(\lambda_j)}{P(y_1, y_2, \ldots, y_p)}$$
The $y_i$ are given, which makes $P(y_1, y_2, \ldots, y_p)$ a constant (labelled $F$). Further, observing that the numerator is equivalent to the joint probability $P(\lambda_j, y_1, y_2, \ldots, y_p)$ and applying the chain rule, it can be seen that [87–89]:
$$\begin{aligned}
P(\lambda_j \mid y_1, y_2, \ldots, y_p) &= \frac{1}{F}\, P(\lambda_j, y_1, y_2, \ldots, y_p) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1, y_2, \ldots, y_p \mid \lambda_j) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2, \ldots, y_p \mid \lambda_j, y_1) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j, y_1)\, P(y_3, \ldots, y_p \mid \lambda_j, y_1, y_2) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j, y_1)\, P(y_3 \mid \lambda_j, y_1, y_2)\, P(y_4, \ldots, y_p \mid \lambda_j, y_1, y_2, y_3)
\end{aligned} \tag{3.5}$$
At this point the “naive” independence assumption is made. In the naive Bayes classification model, it is assumed that $P(y_i \mid \lambda_j, y_k) = P(y_i \mid \lambda_j)$, or more generally $P(y_i \mid \lambda_j, y_k, \ldots, y_{k+l}) = P(y_i \mid \lambda_j)$. Taking this into account, equation 3.5 simplifies to:
$$\begin{aligned}
P(\lambda_j \mid y_1, y_2, \ldots, y_p) &= \frac{1}{F}\, P(\lambda_j, y_1, y_2, \ldots, y_p) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j) \cdots P(y_p \mid \lambda_j) \\
&= \frac{1}{F}\, P(\lambda_j) \prod_{i=1}^{p} P(y_i \mid \lambda_j)
\end{aligned} \tag{3.6}$$
The naive Bayes classifier is built using equation 3.6 along with a decision rule. Intuitively, the classifier tries to find the label ($\lambda_j$) that has the highest probability given the feature vector ($y$). This is known as the
maximum a posteriori probability (MAP) hypothesis [87] and can be expressed formally as:
$$\mathrm{NB}_{\mathrm{class}}(y) = \underset{j}{\operatorname{arg\,max}} \;\; \frac{1}{F}\, P(\lambda_j) \prod_{i=1}^{p} P(y_i \mid \lambda_j) \tag{3.7}$$
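To illustrate equations 3.6 and 3.7, the sketch below implements a naive Bayes classifier for categorical features by frequency counting and applies the MAP decision rule. It is a minimal, unsmoothed illustration on assumed toy data, not the implementation used in the experiments; the constant $1/F$ is omitted since it does not affect the arg max.

```python
import numpy as np


def train_naive_bayes(X, y):
    """Estimate the priors P(lambda_j) and the likelihoods P(y_i = v | lambda_j) by counting."""
    classes, counts = np.unique(y, return_counts=True)
    priors = {c: n / len(y) for c, n in zip(classes, counts)}
    likelihoods = {}                       # (class, feature index) -> {value: probability}
    for c in classes:
        X_c = X[y == c]
        for i in range(X.shape[1]):
            values, v_counts = np.unique(X_c[:, i], return_counts=True)
            likelihoods[(c, i)] = dict(zip(values, v_counts / len(X_c)))
    return priors, likelihoods


def nb_classify(x, priors, likelihoods):
    """The MAP decision rule of equation 3.7 (without the constant 1/F)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[(c, i)].get(value, 0.0)   # P(y_i | lambda_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Usage with a tiny categorical dataset: two features, binary class.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([1, 0, 1, 0, 1])
priors, likelihoods = train_naive_bayes(X, y)
print(nb_classify((0, 1), priors, likelihoods))   # prints 1, the MAP class for this example
```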
One of the efficiency problems present in the binary relevance method is that all base classifiers must be trained with every single training example. This issue is addressed by the following technique.
HOMER
The second multi-label classification technique considered in this research is referred to as the Hierarchy Of Multi-label classifiERs (HOMER) method [30]. HOMER constructs a hierarchy of multi-label classifiers as depicted in figure 3.2. Each node in the hierarchy ($H_i$) only considers a subset of the label set ($L_{H_i} \subseteq L$).
The primary goal of structuring the technique in this way is to increase training and testing efficiency. The mechanism of this optimisation is explained below.
HOMER attempts to overcome the inefficiency of the binary relevance method by defining the concept of a meta-label ($\mu_{H_i}$) as follows:
$$\mu_{H_i} \equiv \bigvee_{\lambda_j \in L_{H_i}} \lambda_j \tag{3.8}$$
That is, a training example can be considered as being labelled with $\mu_{H_i}$ if it possesses at least one of the labels in $L_{H_i}$. As an example, consider figure 3.2 and the training example shown below.
$$x_i = \left( \{x_1, x_2, \ldots, x_p\}, \{\lambda_1, \lambda_2, \lambda_3\} \right) \tag{3.9}$$
The HOMER technique would label the training example in equation 3.9 with the meta-label $\mu_{H_1}$ in figure 3.2, since it is labelled with $\lambda_1$ and $\lambda_2$. During training, only the examples that are labelled with $\mu_{H_i}$ are passed down the hierarchy to $H_i$ to be given as training input to the base classifiers at the leaf nodes below $H_i$. In this study, naive Bayes classifiers (described in some detail in section 3.1.1 above) are used as the base classifiers.
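To make the meta-label assignment concrete, the short sketch below assigns meta-labels to an example according to equation 3.8. The label partition is hypothetical (chosen only to resemble figure 3.2), and the function name is illustrative.

```python
# Hypothetical partition of the 8 labels into three disjoint subsets L_Hi.
label_subsets = {
    "H1": {1, 2, 3},
    "H2": {4, 5, 6},
    "H3": {7, 8},
}


def meta_labels(example_labels, label_subsets):
    """Equation 3.8: an example carries mu_Hi if it has at least one label in L_Hi."""
    return {node for node, subset in label_subsets.items() if example_labels & subset}


# The example of equation 3.9 is labelled with {lambda_1, lambda_2, lambda_3}.
print(meta_labels({1, 2, 3}, label_subsets))   # {'H1'}: all of its labels fall in L_H1
```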
Figure 3.2. A simple hierarchy constructed by HOMER for a hypothetical problem with 8 labels. The $H_i$ are the nodes in the hierarchy and the meta-labels ($\mu_{H_i}$) are shown in parentheses for each node. The $y_i$ represent the features and $Z$ represents the set of labels to be predicted. The $BC_{\lambda_i}$ are the base classifiers for each label.
To generate a label set prediction for an unseen example, the example is first passed to the root node of the hierarchy (HOMER in figure 3.2). Then, the example is recursively passed to each child node ($H_i$) using a depth-first traversal. The concatenation of the binary predictions produced by each base classifier is then taken to be the final predicted label set.
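The prediction step can be sketched as a depth-first traversal over a small node structure, as below. This is a simplified illustration: it passes the example to every child unconditionally and simply concatenates the leaf predictions, and it assumes base classifiers with a scikit-learn-style `predict` method; the class and function names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


class HomerNode:
    """A node in the HOMER hierarchy: either an internal node with children,
    or a leaf holding a trained binary base classifier for a single label."""

    def __init__(self, label=None, base_classifier=None, children=None):
        self.label = label
        self.base_classifier = base_classifier
        self.children = children or []


def predict_labels(node, x):
    """Depth-first traversal that concatenates the binary predictions of the leaf classifiers."""
    if node.base_classifier is not None:                      # leaf node
        return {node.label: int(node.base_classifier.predict([x])[0])}
    predictions = {}
    for child in node.children:                               # internal node: recurse
        predictions.update(predict_labels(child, x))
    return predictions


# Usage: a tiny hierarchy with two leaves, each trained on one label column of Y.
X = np.random.rand(20, 3)
Y = (np.random.rand(20, 2) > 0.5).astype(int)
leaves = [HomerNode(label=j, base_classifier=GaussianNB().fit(X, Y[:, j])) for j in range(2)]
root = HomerNode(children=leaves)
print(predict_labels(root, X[0]))   # maps each label index to its binary prediction
```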
One problem that needs to be solved when constructing a classifier using the HOMER technique is how to create the disjoint label sets $L_{H_i}$. HOMER solves this problem by trying to distribute the labels as evenly as possible into $k$ subsets such that labels belonging to the same subset are as similar as possible. HOMER achieves this by extending the balanced clustering algorithm [90] and introducing a novel clustering algorithm called balanced $k$ means [30], which clusters the labels with an explicit constraint on the size of each cluster. These clusters are then used as the meta-labels.
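The sketch below is a naive capacity-constrained variant of $k$-means, shown only to convey the idea of evenly sized, similarity-based label clusters; it is not the balanced $k$ means algorithm of [30]. Labels are represented by their columns in the binary label matrix, and each cluster may hold at most $\lceil |L| / k \rceil$ labels.

```python
import numpy as np


def balanced_label_clustering(label_vectors, k, n_iter=10, seed=0):
    """Greedy capacity-constrained k-means over label representations (a simplified stand-in)."""
    rng = np.random.default_rng(seed)
    n_labels = len(label_vectors)
    capacity = int(np.ceil(n_labels / k))                 # at most ceil(|L|/k) labels per cluster
    centroids = label_vectors[rng.choice(n_labels, size=k, replace=False)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        # Assign each label to the nearest centroid that still has room.
        for j in rng.permutation(n_labels):
            distances = np.linalg.norm(centroids - label_vectors[j], axis=1)
            for c in np.argsort(distances):
                if len(clusters[c]) < capacity:
                    clusters[c].append(int(j))
                    break
        # Recompute each centroid from its (balanced) cluster.
        centroids = np.array([label_vectors[members].mean(axis=0) for members in clusters])
    return clusters


# Usage: 8 labels represented by the columns of a random binary label matrix Y.
Y = (np.random.rand(100, 8) > 0.5).astype(float)
print(balanced_label_clustering(Y.T, k=3))   # three label clusters, each with at most 3 labels
```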