
Nonnegative Matrix Factorization (NMF) Based Supervised Feature Selection and Adaptation


C. Fyfe et al. (Eds.): IDEAL 2008, LNCS 5326, pp. 120–127, 2008.

© Springer-Verlag Berlin Heidelberg 2008

Abstract. We propose a novel algorithm of supervised feature selection and adaptation for enhancing the classification accuracy of the unsupervised Nonnegative Matrix Factorization (NMF) feature extraction algorithm. The algorithm first extracts feature vectors for given high-dimensional data, then reduces the feature dimension using mutual information based relevant feature selection, and finally adapts the selected NMF features using the proposed Non-negative Supervised Feature Adaptation (NSFA) learning algorithm. The supervised feature selection and adaptation improve the classification performance, which is fully confirmed by simulations on a text-document classification problem. Moreover, the non-negativity constraint of this algorithm provides biologically plausible and meaningful features.

Keywords: Nonnegative Matrix Factorization, Feature Adaptation, Feature extraction, Feature selection, Document classification.

1 Introduction

In machine learning systems, feature extraction of high-dimensional data such as text is an important step. The extracted feature vectors, individually or in combination, represent the data as patterns or knowledge units. In a text mining system, pattern or knowledge discovery is an important task. Some text classification works such as [1, 2] tried to reduce the dimension of the original feature (word or term) set using a statistical measure of relevance, i.e., the mutual information between the terms and a given category. These selected terms are used to represent the natural language text in numerical form by counting the term-document frequencies. However, the terms or words have properties such as polysemy and synonymy that do not make the words optimal features. These problems motivated text feature extraction such as clustering [3].

The unsupervised Nonnegative Matrix Factorization (NMF) algorithm [4] has been used to extract the semantic features from a document collection, where the algorithm

provides two encoded factors with the approximation X ≈ BS. Here X, B and S are



Fig. 1. NMF based feature selection and adaptation model for classification

the input data, the basis factor and the coefficient factor respectively, of which the elements are all non-negative. The basis factor may be regarded as a set of feature vectors or patterns. The NMF and its extended versions [5, 6] have become popular in many scientific and engineering applications including pattern clustering [7], blind source separation [5], text analysis [4], and music transcription [8].

Although the NMF feature extraction process is based on the internal distribution of the given data, the algorithm is blind to the category information and may not be optimal for classification. Therefore, one may select only the features relevant to the given category information and/or adapt the features for the best classification performance. In general, the hidden layer of a multilayer Perceptron (MLP) may be regarded as a feature extractor with category information. The gradient-based learning for MLP has a tendency to get stuck in local minima, and its performance depends on a good initialization of the synaptic weights [9]. The first layer of an MLP may be initialized by NMF-based features, which can provide a good approximation of the initial synaptic weights. With this initialization, we modify the general error back-propagation learning rule to train the network, and propose a Non-negative Supervised Feature Adaptation (NSFA) algorithm.

2 Algorithm

We divide the whole classification process into four stages: unsupervised NMF feature extraction, feature selection based on Mutual Information (MI), non-negative supervised feature adaptation, and Single Layer Perceptron (SLP) classification, as shown in Fig. 1. Let X be a given input data matrix of dimension N×M, of which each column represents a training sample with non-negative elements and M >> N. The samples are normalized to have unit sum, i.e., $x_{ij} = x_{ij} / \sum_i x_{ij}$. At the feature extraction stage one decomposes X into a feature or basis matrix B and a coefficient matrix S with the following update rules, which are based on the generalized KL divergence [4]:

$$ s_{rj} \leftarrow s_{rj}\,\frac{\sum_i b_{ir}\, x_{ij}/(BS)_{ij}}{\sum_i b_{ir}}, \qquad b_{ir} \leftarrow b_{ir}\,\frac{\sum_j x_{ij}/(BS)_{ij}\, s_{rj}}{\sum_j s_{rj}} \qquad (1)$$

here the number of NMF features is an empirical parameter R.
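For concreteness, a minimal NumPy sketch of these multiplicative KL-divergence updates is given below; the random initialization, iteration count, and the small constant added for numerical stability are illustrative assumptions rather than part of the original algorithm.

```python
import numpy as np

def nmf_kl(X, R, n_iter=200, eps=1e-16):
    """Factorize a non-negative X (N x M) into B (N x R) and S (R x M) with the
    multiplicative updates for the generalized KL divergence [4]."""
    N, M = X.shape
    rng = np.random.default_rng(0)
    B = rng.random((N, R)) + eps      # basis (feature) matrix
    S = rng.random((R, M)) + eps      # coefficient matrix
    for _ in range(n_iter):
        # s_rj <- s_rj * sum_i b_ir x_ij/(BS)_ij / sum_i b_ir
        S *= (B.T @ (X / (B @ S + eps))) / (B.sum(axis=0)[:, None] + eps)
        # b_ir <- b_ir * sum_j x_ij/(BS)_ij s_rj / sum_j s_rj
        B *= ((X / (B @ S + eps)) @ S.T) / (S.sum(axis=1)[None, :] + eps)
    return B, S
```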

At the selection stage one may remove irrelevant features and keep only the important features for the classification. As the selection criterion we choose the mutual information between the feature coefficient and the category variable as

$$ MI(s_r, c) = \sum_{q=1}^{Q} \sum_{k=1}^{K} \Pr(s_{rq}, c_k)\, \log_2 \frac{\Pr(s_{rq}, c_k)}{\Pr(s_{rq})\, \Pr(c_k)} \qquad (2)$$

where Pr(a,b) is the joint probability of variables a and b, Pr(a) is the probability of a, K is the number of categories, c is the category label vector, and Q is the number of quantization levels for the $s_r$'s.

In general the features have mutual information among themselves, and the importance of a feature needs to be evaluated by the Information Gain (IG), which is defined as the additional information gained by adding a feature to an existing set of features. This requires the calculation of a higher-dimensional joint probability, which in turn requires a huge amount of data for accurate estimation. Since the cost function of our NMF feature extractor is a generalized version of the KL divergence, the extracted features have small mutual information among themselves and the IG may be approximated by the simpler MI.
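The MI score of eq. (2) can be computed from quantized coefficients as in the following sketch; the equal-width quantization into Q bins and the histogram-based probability estimates are assumptions about details not specified in the text.

```python
import numpy as np

def mi_score(s_r, labels, n_categories, Q=10):
    """Mutual information (eq. 2) between one feature's coefficients s_r (length M)
    and integer category labels in 0..n_categories-1."""
    edges = np.linspace(s_r.min(), s_r.max() + 1e-12, Q + 1)
    q = np.digitize(s_r, edges[1:-1])                 # quantized levels 0..Q-1
    joint = np.zeros((Q, n_categories))
    for qi, ci in zip(q, labels):
        joint[qi, ci] += 1.0
    joint /= joint.sum()                              # Pr(s_rq, c_k)
    p_q = joint.sum(axis=1, keepdims=True)            # Pr(s_rq)
    p_c = joint.sum(axis=0, keepdims=True)            # Pr(c_k)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (p_q @ p_c)[nz])).sum())

def select_features(S, labels, n_categories, n_keep):
    """Rank the rows of the coefficient matrix S (R x M) by MI and keep the best n_keep."""
    scores = np.array([mi_score(S[r], labels, n_categories) for r in range(S.shape[0])])
    return np.argsort(scores)[::-1][:n_keep]
```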

The features may be adapted to provide the best classification performance. The feature adaptation and classification network can be considered as a two-layer feed-forward network, where the first layer is for feature adaptation and the second layer for classification. With the first layer synaptic weight W, the hidden neuron activation H is defined as

H = WX .

The initial value of W is set to the Moore-Penrose pseudo-inverse of F as $W_0 = [F^{+}]_{\varepsilon}$, where F is the selected columns of the NMF feature matrix B. The operator $[a]_{\varepsilon}$ represents a half-wave transfer function of the variable a, such that $a = \max(a, \varepsilon)$, with $\varepsilon$ a very small positive number (e.g., $\varepsilon = 10^{-16}$).
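A short sketch of this initialization step is shown below, assuming F holds the MI-selected columns of B; np.linalg.pinv supplies the Moore-Penrose pseudo-inverse and the clipping implements the half-wave operator $[\cdot]_{\varepsilon}$.

```python
import numpy as np

EPS = 1e-16

def init_adaptation_weights(B, selected):
    """W0 = [F+]_eps, where F holds the MI-selected columns of the NMF basis B."""
    F = B[:, selected]             # N x R_sel selected feature vectors
    W0 = np.linalg.pinv(F)         # Moore-Penrose pseudo-inverse, R_sel x N
    return np.maximum(W0, EPS)     # half-wave operator [.]_eps keeps W0 non-negative
```

With this W0, the initial hidden activation H = W0 X approximately recovers the selected NMF coefficients.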

Let Y be the input to the classifier layer and O be the output of the classifier layer as

$$ y_{rj} = \frac{1}{1+\exp(-\alpha h_{rj})} \quad \text{and} \quad o_{kj} = f\Big(\sum_r u_{kr}\, y_{rj}\Big) = \frac{1}{1+\exp\big(-\beta \sum_r u_{kr}\, y_{rj}\big)} \qquad (3)$$

here U is the synaptic weight of the classifier layer. Let the error be $e_{kj} = t_{kj} - o_{kj}$ for a given target $t_{kj}$; then the synaptic weight U can be updated by the simple gradient rule

$$ u_{kr} \leftarrow u_{kr} + \Delta u_{kr} \qquad (4)$$

and based on the error gradient the hidden activation H can be modified as

$$ h_{rj} \leftarrow [h_{rj} + \Delta h_{rj}]_{\varepsilon} \qquad (5)$$

where $\Delta h_{rj} = -\eta\, \partial E_j / \partial h_{rj}$, $E_j = \frac{1}{2}\sum_{k=1}^{K} e_{kj}^2$, the $\eta$'s are learning rates, and K is the total number of categories.
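One adaptation step combining eqs. (3)-(5) may be sketched as follows, assuming one-hot targets T and logistic activations with slopes alpha and beta; the specific learning rates and the exact form of $\Delta u_{kr}$ are assumptions, since the text only states the generic gradient rule.

```python
import numpy as np

EPS = 1e-16

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def supervised_step(X, W, U, T, alpha=1.0, beta=1.0, eta_u=0.1, eta_h=0.1):
    """One step of eqs. (3)-(5): forward pass, classifier-weight update, hidden update."""
    H = W @ X                                  # hidden activations (R x M)
    Y = sigmoid(alpha * H)                     # input to the classifier layer
    O = sigmoid(beta * (U @ Y))                # classifier output (K x M)
    E = T - O                                  # error e_kj = t_kj - o_kj
    dO = beta * O * (1.0 - O) * E              # output-layer delta
    U = U + eta_u * (dO @ Y.T)                 # (4) gradient update of U
    dH = alpha * Y * (1.0 - Y) * (U.T @ dO)    # negative gradient of E_j w.r.t. h_rj
    H_new = np.maximum(H + eta_h * dH, EPS)    # (5) non-negative hidden update
    return H_new, U
```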

Now, based on the NMF approximation, we can estimate the hidden or coefficient matrix from the given input data X and feature matrix W as

$$ \hat{H} = WX \quad \text{or} \quad \hat{h}_{rj} = \sum_i w_{ri}\, x_{ij} \qquad (6)$$

Now we can minimize the Euclidean distance $D(H, \hat{H}) = \|H - \hat{H}\|^2$ between the updated and the estimated H with respect to W as

$$ W \leftarrow W + \Delta W; \qquad \Delta w_{ri} = -\eta\, \frac{\partial D}{\partial w_{ri}} \qquad (7)$$

Let us consider the instantaneous distance $D_j$ for a certain training sample j, defined as $D_j = \frac{1}{2}\sum_r d_{rj}^2$ with $d_{rj} = h_{rj} - [WX]_{rj} = h_{rj} - \sum_i w_{ri}\, x_{ij}$, where R is the number of features. The minimization of $D_j$ with respect to W gives

$$ \Delta w_{ri} = -\sum_j \frac{\partial D_j}{\partial w_{ri}} = \sum_j h_{rj}\, x_{ij} - \sum_j [WX]_{rj}\, x_{ij} \qquad (8)$$

By choosing an appropriate learning rate $\eta_{ri}$, we obtain the following multiplicative update rule for W:

$$ w_{ri} \leftarrow w_{ri} + \eta_{ri}\, \Delta w_{ri} \quad \text{with} \quad \eta_{ri} = \frac{w_{ri}}{\sum_j [WX]_{rj}\, x_{ij}}; \qquad w_{ri} \leftarrow w_{ri}\, \frac{\sum_j h_{rj}\, x_{ij}}{\sum_j [WX]_{rj}\, x_{ij}} \qquad (9)$$

In matrix form:

$$ W \leftarrow W \odot \frac{H X^{T}}{[WX]\, X^{T}} \qquad (10)$$

where $\odot$ and the fraction bar denote element-wise multiplication and division, respectively.
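A minimal sketch of the multiplicative update of eqs. (8)-(10) is given below; H is the gradient-adapted hidden activation from eq. (5), and the small constant added to the denominator to avoid division by zero is an implementation assumption.

```python
import numpy as np

EPS = 1e-16

def update_W(W, X, H):
    """Multiplicative rule (10): W <- W * (H X^T) / ([WX] X^T), element-wise."""
    numer = H @ X.T                    # R x N
    denom = (W @ X) @ X.T + EPS        # [WX] X^T, R x N
    return np.maximum(W * numer / denom, EPS)
```

Because the rule only rescales each weight by a non-negative ratio, an initialization $W_0 \ge \varepsilon$ keeps W non-negative throughout the adaptation.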

3 Experiments

To observe the performance of our proposed algorithms, we selected 8137 documents from the top eight document categories of the Reuters-21578 text database. We randomly generated four-fold training, validation and test data sets. In each fold we use 50% of the documents for training, 25% for validation and the remaining 25% for testing. In the pre-processing stage of representing the documents as a term-document frequency matrix, we first apply stop-word elimination and category-based rare-word elimination [10], and select the 500 terms with the highest frequencies. The frequencies are normalized to have unit sum for each document.
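One possible pre-processing sketch for this setup uses scikit-learn's CountVectorizer to build the 500-term term-document frequency matrix; the English stop-word list and the frequency cutoff are only approximations here, since the category-based rare-word elimination of [10] is not reproduced.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_term_doc_matrix(documents, n_terms=500):
    """Term-document frequency matrix with stop-word removal, keeping the
    n_terms highest-frequency terms; columns are normalized to unit sum."""
    vec = CountVectorizer(stop_words="english", max_features=n_terms)
    counts = vec.fit_transform(documents).toarray().T    # terms x documents
    X = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
    return X, vec.get_feature_names_out()
```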

To observe the classification performance of our proposed algorithm, we consider a baseline system in which the optimum extracted coefficient factor H = S of the NMF algorithm with a predefined number of features (2, 3, 4, 5, 6, 10, 15, 25, 50, 100, 150, 300, and 500) is classified by training a Single Layer Perceptron (SLP) classifier; this is defined as the NMF-SLP approach.

We verified the MI-based feature selection approach without feature adaptation. In this case we first extract a very large number of NMF features (500, the same as the number of selected terms), then select important feature subsets of different dimensions (2, 3, 4, 5, 6, 10, 15, 25, 50, 100, 150, 300, and 500), and finally test the classification performance with these selected NMF coefficient factors (the selected rows H of S) by training an SLP classifier; this is defined as the MI-NMF-SLP approach.

We also verified the feature adaptation approach without feature selection. In this case the feature extraction process is similar to the NMF-SLP approach. For the adaptation, the network can be considered as an MLP structure where the first-layer synaptic weight is initialized by the pseudo-inverse of the NMF-optimized features and the learning algorithm is similar to MLP learning with the non-negativity constraint; this is defined as MLP-NMFI-NC.

Finally, the MI-selected NMF features are adapted using the proposed algorithm, which is defined as MI-MLP-NMFI-NC. Here the feature extraction and selection processes are similar to the MI-NMF-SLP approach. The first-layer synaptic weight of the MLP network is initialized by the selected NMF features and adapted by the proposed multiplicative learning rule with the non-negativity constraint.

4 Results

The connectivity of the terms to the feature vectors is non-negative, i.e., if a term is connected to a feature vector the synaptic weight has a positive value,


Fig. 2. (a) Synaptic connectivity of terms to feature vectors, (b) Synaptic connection probability distributions of 10 feature vectors

otherwise it is zero, as shown in Fig. 2a. Here we present only the case where we select 10 NMF features using MI and then adapt them using the NSFA algorithm. The features are sparse, as shown in Fig. 2b, where the histogram of the 10 adapted feature vectors shows the number of terms (Y-axis in each panel) connected to a feature vector with different synaptic weight strengths, i.e., the bin centres of the histogram (X-axis). This figure suggests that in each feature vector only a few terms are connected (nonzero synaptic weight) and most are not connected (zero synaptic weight).

Based on the mutual information measure, it has been observed that not all the NMF-extracted features are important or relevant for classification. A few of them are more relevant to the categories; for example, the 50 most relevant NMF features according to MI score are shown in Fig. 3a. The overall average classification performance on the 4-fold data set with the different classification approaches NMF-SLP, MI-NMF-SLP, MLP-NMFI-NC and MI-MLP-NMFI-NC is shown in Fig. 3b, corresponding to the different numbers of features indicated on the X-axis (in log scale). In general the unsupervised NMF algorithm optimizes and extracts almost independent and sparse features, and the optimum feature dimension R is empirical. In our observation, the classification performance of the NMF-SLP approach is proportionally dependent on the feature dimension. The MI-NMF-SLP approach shows a similar performance dependency as the NMF-SLP approach. In the MI-NMF-SLP case, we select a few features from a large set of independent and optimum NMF features; due to the selection the features become suboptimal, so the performance is worse. The MLP-NMFI-NC and MI-MLP-NMFI-NC algorithms perform better with a small number of features owing to the adaptation based on the given category information. Although the


Fig. 3. (a) Mutual information score for top 100 NMF features, (b) Classification performance for different algorithms

performance of MLP-NMFI-NC is almost similar to that of MI-MLP-NMFI-NC, the second approach is more efficient. In the first case we need to extract NMF features separately for each predefined number of basis vectors, while in the MI-MLP-NMFI-NC case we extract the NMF features only once with a large number of basis vectors and then select the relevant features based on the MI score. A small number of adapted features (about 25 to 50) is enough to represent the data properly. We can consider these adapted features as the representative patterns or knowledge of the given data or document collection.

5 Conclusion

In this work we have presented a supervised algorithm for effective feature extraction, dimension reduction and adaptation. This algorithm extends the application area of the unsupervised NMF feature extraction algorithm to supervised classification problems. The non-negativity constraint of this algorithm represents the data by sparse patterns or feature vectors that are understandable in text mining applications.

The better performance with a small number of features can be useful for building a faster classifier system. We are going to implement our idea of NMF feature adaptation using the standard error back-propagation algorithm, which will be helpful in reducing the risk of local minima in MLP training.


Acknowledgement

This work was supported as the Brain Neuroinformatics Research Program by Korean Ministry of Commerce, Industry and Energy.

References

1. Makrehchi, M.M., Kamel, M.S.: Text classification using small number of features. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 580–589. Springer, Heidelberg (2005)
2. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proc. of 14th ICML, pp. 412–420 (1997)
3. Slonim, N., Tishby, N.: The power of word clusters for text classification. In: Proc. ECIR (2001)
4. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
5. Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorization. Electronics Letters 42, 947–948 (2006)
6. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, 1457–1469 (2004)
7. Shahnaz, F., Berry, M.W., Paul, P., Plemmons, R.J.: Document clustering using nonnegative matrix factorization. Information Processing and Management 42, 373–386 (2006)
8. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music transcription. In: Proc. WASPAA 2003. IEEE Press, Los Alamitos (2003)
9. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall, Englewood Cliffs (1999)
10. Barman, P.C., Lee, S.Y.: Document classification with unsupervised nonnegative matrix factorization and supervised perceptron learning. In: Proc. ICIA 2007, pp. 182–186. IEEE Press, Korea (2007)
