3.1 Multi-Label Classification Techniques
3.1.1 Problem Transformation Methods
Binary Relevance (BR)
The first multi-label classification technique tested was binary relevance.
The binary relevance (BR) technique transforms the multi-label problem into a set of single-label (binary) problems [20]. That is, for each label in the multi-label dataset a separate binary classifier is trained. To classify an unseen example, each binary classifier is consulted for a prediction of the label it was trained to identify, and the final multi-label prediction is simply the concatenation of these binary predictions. This mechanism is illustrated in figure 3.1.
Because this technique is independent of any specific binary classification algorithm, any such algorithm can be used to train the binary classification models. The two algorithms selected as the so-called base classifiers are discussed in the next two sections.
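To make the transformation concrete, the sketch below implements binary relevance as a thin wrapper around an arbitrary base classifier. It is a minimal illustration only, assuming a scikit-learn-style classifier interface (`fit`/`predict`); the class and variable names are illustrative and not taken from any particular library.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB


class BinaryRelevance:
    """Train one independent binary classifier per label (the BR transformation)."""

    def __init__(self, base_classifier):
        self.base_classifier = base_classifier
        self.models_ = []

    def fit(self, X, Y):
        # Y is an (n_examples, n_labels) binary indicator matrix.
        self.models_ = []
        for j in range(Y.shape[1]):
            model = clone(self.base_classifier)
            model.fit(X, Y[:, j])              # one binary problem per label
            self.models_.append(model)
        return self

    def predict(self, X):
        # Concatenate the per-label binary predictions column-wise.
        return np.column_stack([m.predict(X) for m in self.models_])


# Usage: 100 random examples with 5 features and 3 labels.
X = np.random.rand(100, 5)
Y = (np.random.rand(100, 3) > 0.5).astype(int)
predictions = BinaryRelevance(GaussianNB()).fit(X, Y).predict(X[:2])
print(predictions)   # two rows, each the concatenation of 3 binary predictions
```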
Support Vector Machine (BR-SVM) Support vector machines (SVMs) are classification models that are commonly used in pattern recognition problems [84, 85]. SVMs map the feature space into a much higher dimensional space using a technique referred to as the kernel trick, which enables this mapping to be performed implicitly (without having to explicitly calculate the mapping).
If the (now single-label, since the binary relevance transformation has been applied) dataset has $n$ examples, each with $p$ features, it can be represented as shown below:
$$D = \{(x_1, \lambda_1), (x_2, \lambda_2), \ldots, (x_n, \lambda_n)\}, \qquad x_i \in \mathbb{R}^p$$
where the $\lambda_i$ are the binary label values for each example $x_i$.
Figure 3.1. Binary relevance classifier. The solid arrows indicate input data flow while the dashed arrows indicate output data flow. The $y_i$ represent the features and $Z$ represents the set of labels to be predicted.
SVMs construct a maximum margin hyperplane in the higher dimensional space by solving the following optimisation problem:
$$\min_{w,\,b,\,\xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \lambda_i \left( w^T \Phi(x_i) + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \tag{3.1}$$
where $w$ is the normal vector to the hyperplane, $C > 0$ is the cost parameter of the error term and $b$ is the hyperplane offset.
Given the optimisation problem described in equation 3.1, the kernel function $K$ is defined as
$$K(x_i, x_j) \equiv \Phi(x_i)^T \Phi(x_j)$$
A number of kernel functions have been proposed, but it has been suggested [86] that the radial basis function (RBF) kernel is a reasonable choice. The RBF kernel function is defined in equation 3.2.
$$K(x_i, x_j) = \exp\!\left( -\gamma \, \| x_i - x_j \|^2 \right), \quad \gamma > 0 \tag{3.2}$$
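As an illustration only (and not the exact configuration used in this study), the snippet below computes the RBF kernel of equation 3.2 directly and shows how the cost parameter $C$ from equation 3.1 and the kernel parameter $\gamma$ would be passed to an off-the-shelf SVM implementation; scikit-learn's SVC is assumed here, and all parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC


def rbf_kernel(x_i, x_j, gamma=0.5):
    """The RBF kernel of equation 3.2: exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))


# Hypothetical binary training data, i.e. one label after the BR transformation.
X = np.random.rand(50, 4)
y = (np.random.rand(50) > 0.5).astype(int)

# C is the cost parameter of equation 3.1; gamma is the RBF parameter of equation 3.2.
svm = SVC(kernel="rbf", C=1.0, gamma=0.5)
svm.fit(X, y)

print(rbf_kernel(X[0], X[1]))   # kernel value for two training examples
print(svm.predict(X[:3]))       # binary predictions for three examples
```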
Naive Bayes (BR-NB) A naive Bayes classifier models the probability of the class variable under the simplifying assumption that the features in the feature vector are conditionally independent given the class [87]. The probabilistic model used to generate the predictions is derived from Bayes’ theorem, which is defined in equation 3.3.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{3.3}$$
The term $P(A \mid B)$ is the posterior probability and can be interpreted as the probability of proposition $A$ given that proposition $B$ is true. The term $P(A)$ is the prior probability of $A$, or the probability of proposition $A$ without having any information about $B$. In terms of binary classification the theorem can be rewritten as follows.
$$P(C = \lambda_j \mid x_i = y) = \frac{P(x_i = y \mid C = \lambda_j)\, P(C = \lambda_j)}{P(x_i = y)} \tag{3.4}$$
Equation 3.4 can be interpreted as the probability that the class ($C$) has the value $\lambda_j$ given that the feature vector ($x$) has the values $y$. Expanding this to explicitly include each individual feature gives the following (omitting the variable assignments):
$$P(\lambda_j \mid y_1, y_2, \ldots, y_p) = \frac{P(y_1, y_2, \ldots, y_p \mid \lambda_j)\, P(\lambda_j)}{P(y_1, y_2, \ldots, y_p)}$$
The $y_i$ are given, which makes $P(y_1, y_2, \ldots, y_p)$ a constant (labelled $F$). Further, observing that the numerator is equivalent to the joint probability $P(\lambda_j, y_1, y_2, \ldots, y_p)$ and applying the chain rule, it can be seen that [87–89]:
$$\begin{aligned}
P(\lambda_j \mid y_1, y_2, \ldots, y_p) &= \frac{1}{F}\, P(\lambda_j, y_1, y_2, \ldots, y_p) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1, y_2, \ldots, y_p \mid \lambda_j) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2, \ldots, y_p \mid \lambda_j, y_1) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j, y_1)\, P(y_3, \ldots, y_p \mid \lambda_j, y_1, y_2) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j, y_1)\, P(y_3 \mid \lambda_j, y_1, y_2)\, P(y_4, \ldots, y_p \mid \lambda_j, y_1, y_2, y_3)
\end{aligned} \tag{3.5}$$
At this point the “naive” independence assumption is made. In the naive Bayes classification model, it is assumed that $P(y_i \mid \lambda_j, y_k) = P(y_i \mid \lambda_j)$, or more generally $P(y_i \mid \lambda_j, y_k, \ldots, y_{k+l}) = P(y_i \mid \lambda_j)$. Taking this into account, equation 3.5 simplifies to:
$$\begin{aligned}
P(\lambda_j \mid y_1, y_2, \ldots, y_p) &= \frac{1}{F}\, P(\lambda_j, y_1, y_2, \ldots, y_p) \\
&= \frac{1}{F}\, P(\lambda_j)\, P(y_1 \mid \lambda_j)\, P(y_2 \mid \lambda_j) \cdots P(y_p \mid \lambda_j) \\
&= \frac{1}{F}\, P(\lambda_j) \prod_{i=1}^{p} P(y_i \mid \lambda_j)
\end{aligned} \tag{3.6}$$
The naive Bayes classifier is built using equation 3.6 along with a decision rule. Intuitively, the classifier tries to find the label ($\lambda_j$) that has the highest probability given the feature vector ($y$). This is known as the
maximum a posteriori probability (MAP) hypothesis [87] and can be expressed formally as:
$$\mathrm{NB}_{\mathrm{class}}(y) = \underset{j}{\operatorname{arg\,max}} \;\; \frac{1}{F}\, P(\lambda_j) \prod_{i=1}^{p} P(y_i \mid \lambda_j) \tag{3.7}$$
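To illustrate equations 3.6 and 3.7, the sketch below implements a naive Bayes classifier for categorical features by frequency counting and applies the MAP decision rule. It is a minimal, unsmoothed illustration on assumed toy data, not the implementation used in the experiments; the constant $1/F$ is omitted since it does not affect the arg max.

```python
import numpy as np


def train_naive_bayes(X, y):
    """Estimate the priors P(lambda_j) and the likelihoods P(y_i = v | lambda_j) by counting."""
    classes, counts = np.unique(y, return_counts=True)
    priors = {c: n / len(y) for c, n in zip(classes, counts)}
    likelihoods = {}                       # (class, feature index) -> {value: probability}
    for c in classes:
        X_c = X[y == c]
        for i in range(X.shape[1]):
            values, v_counts = np.unique(X_c[:, i], return_counts=True)
            likelihoods[(c, i)] = dict(zip(values, v_counts / len(X_c)))
    return priors, likelihoods


def nb_classify(x, priors, likelihoods):
    """The MAP decision rule of equation 3.7 (without the constant 1/F)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[(c, i)].get(value, 0.0)   # P(y_i | lambda_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Usage with a tiny categorical dataset: two features, binary class.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([1, 0, 1, 0, 1])
priors, likelihoods = train_naive_bayes(X, y)
print(nb_classify((0, 1), priors, likelihoods))   # prints 1, the MAP class for this example
```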
One of the efficiency problems present in the binary relevance method is that all base classifiers must be trained with every single training example. This issue is addressed by the following technique.
HOMER
The second multi-label classification technique considered in this research is referred to as the Hierarchy Of Multi-label classifiERs (HOMER) method [30]. HOMER constructs a hierarchy of multi-label classifiers as depicted in figure 3.2. Each node in the hierarchy ($H_i$) only considers a subset of the label set ($L_{H_i} \subseteq L$).
The primary goal of structuring the technique in this way is to increase training and testing efficiency. The mechanism of this optimisation is explained below.
HOMER attempts to overcome the inefficiency of the binary relevance method by defining the concept of a meta-label ($\mu_{H_i}$) as follows:
$$\mu_{H_i} \equiv \bigvee_{\lambda_j \in L_{H_i}} \lambda_j \tag{3.8}$$
That is, a training example can be considered as being labelled with $\mu_{H_i}$ if it possesses at least one of the labels in $L_{H_i}$. As an example, consider figure 3.2 and the training example shown below.
$$x_i = \left( \{x_1, x_2, \ldots, x_p\}, \{\lambda_1, \lambda_2, \lambda_3\} \right) \tag{3.9}$$
The HOMER technique would label the training example in equation 3.9 with the meta-label $\mu_{H_1}$ in figure 3.2, since it is labelled with $\lambda_1$ and $\lambda_2$. During training, only the examples that are labelled with $\mu_{H_i}$ are passed down the hierarchy to $H_i$ to be given as training input to the base classifiers at the leaf nodes below $H_i$. In this study, naive Bayes classifiers (described in some detail in section 3.1.1 above) are used as the base classifiers.
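To make the meta-label assignment concrete, the short sketch below assigns meta-labels to an example according to equation 3.8. The label partition is hypothetical (chosen only to resemble figure 3.2), and the function name is illustrative.

```python
# Hypothetical partition of the 8 labels into three disjoint subsets L_Hi.
label_subsets = {
    "H1": {1, 2, 3},
    "H2": {4, 5, 6},
    "H3": {7, 8},
}


def meta_labels(example_labels, label_subsets):
    """Equation 3.8: an example carries mu_Hi if it has at least one label in L_Hi."""
    return {node for node, subset in label_subsets.items() if example_labels & subset}


# The example of equation 3.9 is labelled with {lambda_1, lambda_2, lambda_3}.
print(meta_labels({1, 2, 3}, label_subsets))   # {'H1'}: all of its labels fall in L_H1
```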
Figure 3.2. A simple hierarchy constructed by HOMER for a hypothetical problem with 8 labels. The $H_i$ are the nodes in the hierarchy and the meta-labels ($\mu_{H_i}$) are shown in parentheses for each node. The $y_i$ represent the features and $Z$ represents the set of labels to be predicted. The $BC_{\lambda_i}$ are the base classifiers for each label.
To generate a label set prediction for an unseen example, the example is first passed to the root node of the hierarchy (HOMER in figure 3.2). Then, the example is recursively passed to each child node ($H_i$) using a depth-first traversal. The concatenation of the binary predictions produced by each base classifier is then taken to be the final predicted label set.
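The prediction step can be sketched as a depth-first traversal over a small node structure, as below. This is a simplified illustration: it passes the example to every child unconditionally and simply concatenates the leaf predictions, and it assumes base classifiers with a scikit-learn-style `predict` method; the class and function names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


class HomerNode:
    """A node in the HOMER hierarchy: either an internal node with children,
    or a leaf holding a trained binary base classifier for a single label."""

    def __init__(self, label=None, base_classifier=None, children=None):
        self.label = label
        self.base_classifier = base_classifier
        self.children = children or []


def predict_labels(node, x):
    """Depth-first traversal that concatenates the binary predictions of the leaf classifiers."""
    if node.base_classifier is not None:                      # leaf node
        return {node.label: int(node.base_classifier.predict([x])[0])}
    predictions = {}
    for child in node.children:                               # internal node: recurse
        predictions.update(predict_labels(child, x))
    return predictions


# Usage: a tiny hierarchy with two leaves, each trained on one label column of Y.
X = np.random.rand(20, 3)
Y = (np.random.rand(20, 2) > 0.5).astype(int)
leaves = [HomerNode(label=j, base_classifier=GaussianNB().fit(X, Y[:, j])) for j in range(2)]
root = HomerNode(children=leaves)
print(predict_labels(root, X[0]))   # maps each label index to its binary prediction
```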
One problem that needs to be solved when constructing a classifier using the HOMER technique is how to create the disjoint label sets $L_{H_i}$. HOMER solves this problem by trying to distribute the labels as evenly as possible into $k$ subsets such that labels belonging to the same subset are as similar as possible. HOMER achieves this by extending the balanced clustering algorithm [90] and introducing a novel clustering algorithm called balanced $k$ means [30], which clusters the labels with an explicit constraint on the size of each cluster. These clusters are then used as the meta-labels.
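The sketch below is a naive capacity-constrained variant of $k$-means, shown only to convey the idea of evenly sized, similarity-based label clusters; it is not the balanced $k$ means algorithm of [30]. Labels are represented by their columns in the binary label matrix, and each cluster may hold at most $\lceil |L| / k \rceil$ labels.

```python
import numpy as np


def balanced_label_clustering(label_vectors, k, n_iter=10, seed=0):
    """Greedy capacity-constrained k-means over label representations (a simplified stand-in)."""
    rng = np.random.default_rng(seed)
    n_labels = len(label_vectors)
    capacity = int(np.ceil(n_labels / k))                 # at most ceil(|L|/k) labels per cluster
    centroids = label_vectors[rng.choice(n_labels, size=k, replace=False)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        # Assign each label to the nearest centroid that still has room.
        for j in rng.permutation(n_labels):
            distances = np.linalg.norm(centroids - label_vectors[j], axis=1)
            for c in np.argsort(distances):
                if len(clusters[c]) < capacity:
                    clusters[c].append(int(j))
                    break
        # Recompute each centroid from its (balanced) cluster.
        centroids = np.array([label_vectors[members].mean(axis=0) for members in clusters])
    return clusters


# Usage: 8 labels represented by the columns of a random binary label matrix Y.
Y = (np.random.rand(100, 8) > 0.5).astype(float)
print(balanced_label_clustering(Y.T, k=3))   # three label clusters, each with at most 3 labels
```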