

Knowledge-Based Systems 216 (2021) 106549

Contents lists available atScienceDirect

Knowledge-Based Systems

journal homepage:www.elsevier.com/locate/knosys

Feature-reduction fuzzy co-clustering approach for hyper-spectral image analysis

Nha Van Pham a,b, Long The Pham a, Witold Pedrycz c, Long Thanh Ngo a,∗

a Department of Information Systems, Le Quy Don Technical University, Hanoi, Viet Nam
b MIST Institute of Science and Technology, 17 Hoang Sam, Hanoi, Viet Nam
c Department of Electrical & Computer Engineering, University of Alberta, Edmonton T6R 2V4 AB, Canada

A R T I C L E   I N F O

Article history:
Received 1 January 2020
Received in revised form 2 August 2020
Accepted 19 October 2020
Available online 26 December 2020

Keywords:
Fuzzy co-clustering
Dimensionality reduction
Cluster tendency
Hyper-spectral satellite image
Land-cover classification

A B S T R A C T

Fuzzy co-clustering algorithms are effective techniques for multi-dimensional clustering in which all features are treated as equally important (relevant). In practice, the importance of individual features can differ, and several of them may even be redundant. Removing such redundant features is the idea behind feature reduction in big-data processing. In this paper, we propose a new unsupervised learning scheme, called the Feature-Reduction Fuzzy Co-Clustering algorithm (FRFCoC), obtained by incorporating a feature-weighted entropy into the objective function of fuzzy co-clustering. First, a new objective function is formed on the basis of the original fuzzy co-clustering objective function by adding parameters that represent the entropy weights of the different features. Next, an automatic feature-reduction and clustering schema is derived from FCoC's original learning schema by adding steps that compute the new parameters and the conditions for eliminating irrelevant feature components. The FRFCoC algorithm can be shown mathematically to converge after a finite number of iterations. Experiments conducted on several many-feature data sets and hyperspectral images demonstrate the outstanding performance of FRFCoC compared with some previously proposed algorithms.

©2020 Elsevier B.V. All rights reserved.

1. Introduction

Hyperspectral imaging is a progressive technology that has developed alongside aeronautics and space remote sensing. It combines two of the latest technologies, imagery and spectroscopy. Through a hyperspectral imaging sensor, sequential imagery data can be acquired in narrow bands with a high spectral resolution, which benefits theoretical research on hyperspectral data analysis in various fields [1]. Today, this technology is widely used in different fields, such as the military, search and rescue, environmental monitoring, mineral exploration and public security. The advantages of hyperspectral imaging are its high spectral resolution and its ability to provide spectral characteristics and spatial information simultaneously. The large number of bands carries knowledge for target detection and recognition. In fact, hyperspectral images contain many redundant feature components, which reduce the performance and quality of hyperspectral image processing [2,3]. Therefore, dimensionality reduction is an important phase to enhance

∗ Corresponding author.
E-mail addresses: famvannha@gmail.com (N.V. Pham), longpt@mta.edu.vn (L.T. Pham), wpedrycz@ualberta.ca (W. Pedrycz), ngotlong@mta.edu.vn (L.T. Ngo).

the performance. Some previously proposed dimensionality-reduction methods include band selection [4,5], discriminant analysis [6–8] and principal component analysis [9–11]. Recently, Yang et al. have proposed an algorithm called feature-reduction FCM (FRFCM) [12] that can automatically compute feature weights by adding a feature-weighted entropy into the FCM objective function. The algorithm exhibits a feature-reduction schema that eliminates the irrelevant features taking the lowest weights, decreasing the computational time and improving the clustering performance.

Co-clustering is a useful tool for data analysis, where data objects are grouped into a number of clusters according to their similarity in multi-dimensional spaces. These techniques can operate simultaneously on both the spatial dimensions and the feature dimension, which makes them suitable for problems involving complex data such as multi-dimensional data [13,14] or hyperspectral images.

Some co-clustering techniques have been studied and improved to resolve problems such as clustering documents and keywords [15–17], color segmentation [18], categorical multivariate data [15] and high-dimensional data [19]. Recently, we proposed a co-clustering algorithm, called IVFCoC [20], that combines the advantages of fuzzy co-clustering and interval-valued fuzzy sets. The IVFCoC algorithm is guided by a new objective function with two values of the fuzzifier producing the FOU. Experiments were conducted on datasets

https://doi.org/10.1016/j.knosys.2020.106549
0950-7051/© 2020 Elsevier B.V. All rights reserved.


of color images, multi-spectral satellite images and high-dimensional datasets, and IVFCoC proved to achieve better performance than some previously proposed clustering algorithms. However, a limitation of IVFCoC is its high computational complexity, so when dealing with high-dimensional data types (where the number of features is large), such as hyperspectral images, we pay an expensive price in running time and in the accuracy of the clusters. In addition, co-clustering algorithms generally treat the feature components of data points as being of equal importance. In fact, however, many data sets involve irrelevant features in the clustering process, which may degrade the performance of clustering algorithms. That is, different feature components should be weighted differently to reflect their relative importance; removing the least significant features then increases the performance of clustering [12].

Some variants of the FCM algorithm have been proposed, such as weighted K-Means [21,22] and weighted FCM using feature-weight learning (WFCM) [23–27]. These works consider the various features of the data by weighting them in the scheme of the algorithms to improve the accuracy of the clusters. However, a feature-reduction schema has not been proposed in these works. This may be the motivation for Yang et al. [12] to propose a new schema that improves FCM by adding feature reduction based on a feature-weighted entropy (FRFCM). The FRFCM algorithm can automatically compute individual feature weights and simultaneously remove irrelevant feature components. The authors first consider the FCM objective function with feature-weighted entropy, construct a learning schema for the parameters, and then reduce the irrelevant feature components. A new procedure for eliminating irrelevant features with small weights is created for feature reduction. Experiments were conducted on several numerical and real data sets to compare FRFCM with various feature-weighted FCM methods in the literature. The experimental results and comparisons demonstrate the effectiveness and usefulness of FRFCM in practice.

Currently, there are many clustering algorithms with different mathematical models and objective functions, such as K-Means, Fuzzy C-Means, DBSCAN, etc. [28]. However, most of these algorithms only consider data in terms of data objects; feature components are only considered indirectly through the data objects. This means that the objective functions of such clustering algorithms contain neither a number-of-features parameter nor feature weights. Therefore, feature reduction for these algorithms is not mathematically significant, although it is significant in data pre-processing, similar to band selection methods for hyperspectral images [4,5]. Meanwhile, co-clustering algorithms in general, and fuzzy co-clustering algorithms in particular, always involve parameters of the feature components, such as the number of features, the feature weights and the feature membership functions (as described in Section 2 below). Therefore, co-clustering algorithms are more suitable than other clustering algorithms for clustering multi-dimensional, many-feature data types, including hyperspectral images [29]. We are thus motivated to create a new feature-reduction mathematical model that is specific to clustering algorithms.

In this paper, we propose a new unsupervised learning scheme, called the Feature-Reduction Fuzzy Co-Clustering algorithm (FRFCoC), obtained by incorporating a feature-weighted entropy into the objective function of fuzzy co-clustering. First, a new objective function is formed on the basis of the original fuzzy co-clustering objective function by adding parameters that represent the entropy weights of the different features in the data. Next, an automatic feature-reduction and clustering schema is adjusted from FCoC's original learning schema by adding steps that compute the new parameters and the conditions for eliminating irrelevant feature components. Simultaneously, we prove, using mathematical theorems and a lemma, that the FRFCoC algorithm converges after a finite number of iterations. Experiments were conducted on several many-feature data sets and hyperspectral images, and the results show the outstanding performance of the FRFCoC algorithm compared with some previously proposed algorithms.

The rest of the paper is organized as follows. Section 2 briefly reviews related works on fuzzy co-clustering and feature-reduction-based FCM; Section 3 presents the proposed fuzzy co-clustering framework; Section 4 compares the proposed method with other existing algorithms using validity indices; Section 5 presents the conclusions and some future work.

2. Related works

In this section, we briefly present the fuzzy co-clustering algorithm and the feature-reduction-based FCM algorithm.

2.1. Feature-reduction fuzzy clustering algorithm

The fuzzy clustering algorithm, also known as Fuzzy C-Means (FCM) [30], initially did not consider the feature components; that is, the objective function of FCM does not include the feature components of the data. Some feature-reduction methods have been proposed to improve the performance of FCM [23–27]: since data may contain irrelevant features, feature reduction becomes a necessary phase in the FCM algorithm, especially for processing many-feature data. Yang et al. [12] proposed a new schema that improves FCM by adding feature reduction based on a feature-weighted entropy (FRFCM). In this schema, each feature is assigned a weight that is updated through the iterations; features taking the smallest weight values are then eliminated.

Let $N$ be the number of data objects and $C$ the number of clusters ($2 \le C \le N$). Let $X=(x_1,x_2,\ldots,x_N)$, $x_i \in X$, $i=1,\ldots,N$, be the input dataset in the $D$-dimensional feature space; $x_{ij}\in R$, $j=1,\ldots,D$, denotes feature $j$ of data object $i$; $P=(p_1,p_2,\ldots,p_C)$, $p_c\in P$ ($c=1,\ldots,C$), $p_{cj}\in R$ ($j=1,\ldots,D$), is the set of feature-based centroids in the $D$-dimensional feature space. Let $u_{ci}$ stand for the object membership degree of $x_i$ to cluster $c$, and let $U=\{u_{ci}\}$ be the object membership degree matrix. The objective function of the FRFCM algorithm is indicated in (1).

$$J_{FRFCM}(U,P,W)=\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci}^{m}\,\delta_j w_j (x_{ij}-p_{cj})^2 + T_w \sum_{j=1}^{D} w_j \log(\delta_j w_j) \tag{1}$$

where $W=[w_j]_{1\times D}$, with $w_j$ the feature weight of the $j$th feature; $\delta=[\delta_j]_{1\times D}$, with $\delta_j$ used to adjust the feature weight $w_j$; and $T_w$ is a coefficient calculated as $T_w=N/C$ [12]. That is, $T_w$ depends on the size of the data $N$ and the number of clusters $C$.

The components of the objective function $J_{FRFCM}$ are defined by formulas (2), (3) and (4).

$$u_{ci}=\frac{\left(\sum_{j=1}^{D}\delta_j w_j (x_{ij}-p_{cj})^2\right)^{-1/(m-1)}}{\sum_{q=1}^{C}\left(\sum_{k=1}^{D}\delta_k w_k (x_{ik}-p_{qk})^2\right)^{-1/(m-1)}} \tag{2}$$

$$w_j=\frac{\delta_j^{-1}\exp\left(-\frac{1}{T_w}\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\,\delta_j (x_{ij}-p_{cj})^2\right)}{\sum_{k=1}^{D}\delta_k^{-1}\exp\left(-\frac{1}{T_w}\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\,\delta_k (x_{ik}-p_{ck})^2\right)} \tag{3}$$

$$p_{cj}=\frac{\sum_{i=1}^{N} u_{ci}^{m}\, x_{ij}}{\sum_{i=1}^{N} u_{ci}^{m}} \tag{4}$$

There are two terms in the FRFCM objective function in Eq. (1). The first term, $\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci}^{m}\delta_j w_j (x_{ij}-p_{cj})^2$, is the sum of feature-weighted distances between data points and cluster centers, which is minimized when the distances between points and centers are small. The second term, $T_w\sum_{j=1}^{D} w_j\log(\delta_j w_j)$, is a variant of the feature-weight entropy. Yang et al. [12] have explained why the constant $T_w=N/C$ is used to handle the effects of the term $\sum_{j=1}^{D} w_j\log(\delta_j w_j)$, which also appears in Eq. (3) as the updating equation for feature weights. As seen in Eq. (3), if the term $\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\delta_j (x_{ij}-p_{cj})^2$ is too large, then the numerator $\exp\left(-\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\delta_j (x_{ij}-p_{cj})^2\right)$ becomes very small, close to zero. This case must be avoided to prevent too many feature weights from being discarded during the updating step. On the other hand, if that term is too small, then the numerator is close to one, so that it is difficult for the feature(s) to be discarded during the updating step; this case must also be avoided. In this sense, a suitable constant is needed for control. In the FRFCM clustering algorithm, one goal is to cluster a data set (with $N$ data points) into $C$ clusters, and the numbers $N$ and $C$ are the two commonly given constants, so the constant $T_w=N/C$ can be used to control the term $\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\delta_j (x_{ij}-p_{cj})^2$. However, in the experiments of this paper, the formula $T_w=N/C$ does not give exact results for either FRFCM or FRFCoC. Therefore, in this paper we consider $T_w$ as a fuzzy parameter (fuzzy feature weight).

Since the $\delta_j$ in $\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci}^{m}\delta_j w_j (x_{ij}-p_{cj})^2$ are used for controlling the variation of the feature weights, the choice of $\delta_j$ is important. To estimate the value of $\delta_j$, Yang et al. [12] have proposed a learning procedure. In probability theory and statistics, the variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution. VMR is used to quantify whether a set of observed occurrences is clustered or dispersed compared to a standard statistical model. Smaller dispersion means the data are closer to the cluster center, while larger dispersion means the data are far from the cluster center. VMR is defined as the ratio of the variance $\sigma^2$ to the mean $\mu$, $VMR=\sigma^2/\mu$. Because we need to retain features that have small dispersion and discard those that have large dispersion, we consider the reciprocal of VMR, i.e., the mean-to-variance ratio (MVR), $MVR=\mu/\sigma^2$; the coefficient of variation (CV) is defined as the ratio of the standard deviation $\sigma$ to the mean, $CV=\sigma/\mu$ [31]. Thus these ratios can be used to calculate $\delta_j$ [32]. However, [12] has shown experimentally that VMR and CV cannot produce small weights, while the term $Mean(x)_j/Var(x)_j$ for feature $j$ can actually handle the dispersion between clusters in the data set. Therefore, we use $Mean(x)_j/Var(x)_j$ to estimate $\delta_j$. That is, we consider the following estimate for $\delta_j$:

$$\delta_j=\frac{Mean(x)_j}{Var(x)_j} \tag{5}$$

in which $Mean(x)_j$ is the mean of the $N$ data points and $Var(x)_j$ is the variance of the $N$ data points according to the $j$th feature.

To create a feature-reduction schema in the FRFCM algorithm, we need to select the unimportant features (i.e., those with small weights) during the clustering process. In our construction, we use a threshold to determine which features will be selected and then discarded.

The FRFCM algorithm consists of the learning processes of the membership matrix $U$ and the feature weight matrix $W$, as shown in Algorithm 1.

Algorithm 1: FRFCM algorithm
Input: Data set $X=\{x_i \mid x_i\in R^D\}$, $i=1,\ldots,N$; the number of clusters $C$.
Output: The results of clustering and feature reduction: centroids, the distribution of data points in each cluster, and the number of reduced features.
1. Initialize the maximum error $\varepsilon$ and the maximum number of iterations $\tau_{max}$.
2. Initialize $u_{ci}$ such that $0\le u_{ci}\le 1$ and the constraint conditions in Eq. (8) hold.
3. Randomly initialize the feature weight matrix $W(0)$.
4. Set iteration $\tau=1$.
5. REPEAT
6.   Calculate and update $\delta_j$ using Eq. (5).
7.   Calculate and update $u_{ci}$ using Eq. (2).
8.   Calculate and update $p_{cj}$ using Eq. (4).
9.   Update $w_j$ using Eq. (3).
10.  Eliminate the $D_r$ features whose weights are lower than the threshold; compute $D(new)=D-D_r$ and set $D=D(new)$.
11.  $\tau=\tau+1$.
12. UNTIL $\|W(\tau)-W(\tau-1)\|<\varepsilon$ or $\tau>\tau_{max}$
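To make steps 6–9 of Algorithm 1 concrete, the sketch below implements one FRFCM iteration, i.e., Eqs. (2)–(4) with the weight update (3) and $T_w=N/C$. It is our illustrative reading of the formulas (function and variable names are hypothetical, and $\delta_j$ is fixed to ones for simplicity); the feature-elimination step 10 is omitted:

```python
import numpy as np

def frfcm_iteration(X, P, w, delta, m=2.0):
    """One FRFCM update: memberships (Eq. (2)), centroids (Eq. (4)), weights (Eq. (3))."""
    N, D = X.shape
    C = P.shape[0]
    d = (X[None, :, :] - P[:, None, :]) ** 2          # squared distances, shape (C, N, D)
    # Eq. (2): inverse-power normalization of the feature-weighted distances
    dist = (delta * w * d).sum(axis=2)                # shape (C, N)
    inv = dist ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=0, keepdims=True)          # each object's memberships sum to 1
    # Eq. (4): centroid update with fuzzified memberships
    um = u ** m
    P_new = (um @ X) / um.sum(axis=1, keepdims=True)
    # Eq. (3): entropy-regularized feature-weight update with T_w = N / C
    Tw = N / C
    scatter = (um[:, :, None] * delta * d).sum(axis=(0, 1))   # per-feature scatter
    num = np.exp(-scatter / Tw) / delta
    return u, P_new, num / num.sum()

# Toy run: 4 points, 2 features, 2 clusters.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 4.9], [5.1, 5.0]])
P = np.array([[0.0, 0.0], [5.0, 5.0]])
w = np.full(2, 0.5)
delta = np.ones(2)
u, P_new, w_new = frfcm_iteration(X, P, w, delta)
```

Features whose within-cluster scatter is large relative to $T_w$ end up with exponentially smaller weights, which is exactly what step 10 of Algorithm 1 thresholds on.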

The FRFCM algorithm is divided into three parts: (1) computing the membership functions $u_{ci}$; (2) updating the cluster centroids $p_{cj}$; and (3) updating the weights $w_j$; their computational complexities are $O(\tau NC^2D)$, $O(\tau NC)$ and $O(\tau NCD^2)$, respectively. The total complexity is $O(\tau NC^2D+\tau NCD^2)$. Because $C$ is normally much smaller than $D$, the final computational complexity of FRFCM is $O(\tau NCD^2)$, where $C$ is the number of clusters (a known input parameter for each data set), $N$ is the number of data objects and $D$ is the number of features of the data.

Theoretically, the FRFCM algorithm contributes a new objective function for fuzzy clustering based on feature-weighted entropy and feature-reduction techniques. Compared to fuzzy clustering (for example, FCM), FRFCM considers the data in detail, feature by feature, through the feature weights, and integrates a feature-reduction scheme to eliminate small-weighted features. Comparisons and experiments conducted on synthetic and real data sets have demonstrated the superior performance of FRFCM over some traditional clustering algorithms. In this paper, we use FRFCM's contributions to improve fuzzy co-clustering for many-feature data and hyperspectral image processing.

2.2. Fuzzy co-clustering algorithm

Fuzzy co-clustering is a clustering method that follows the same principle as general co-clustering: it simultaneously performs clustering of objects and features [33]. The difference is that the boundary between any two clusters is described in terms of membership functions rather than characteristic functions. Each work in the literature has applied fuzzy co-clustering to describe and solve specific problems in its own way. The authors in [34] denote their fuzzy co-clustering algorithm as FCCM, for categorical multivariate data; the Fuzzy CoDoK algorithm in [15] targets clustering documents and keywords; the fuzzy co-clustering algorithm FCCI is described in [18] for the color image segmentation problem class; and the fuzzy co-clustering algorithm ibFCC described in [35] addresses biomedical data segmentation. Recently, in [20] we denoted a fuzzy co-clustering algorithm as FCoC and proposed the IVFCoC algorithm, which combines the advantages of fuzzy co-clustering and interval-valued fuzzy sets and is aimed at solving clustering problems in the presence


of complex data. These fuzzy co-clustering algorithms share the same principle, in which co-clustering is seen as a combination of object partitioning and feature ranking. Although the notations of the fuzzy co-clustering algorithms differ across problems, their main contents can be summarized in a uniform manner as follows.

Let $N$ be the number of data objects, $C$ the number of clusters ($2\le C\le N$), and $T_u$ and $T_v$ fuzzy weights. Let $X=(x_1,x_2,\ldots,x_N)$, $x_i\in X$, $i=1,\ldots,N$, be the input dataset in the $D$-dimensional feature space; $x_{ij}\in R$, $j=1,\ldots,D$, denotes feature $j$ of data object $i$; $P=(p_1,p_2,\ldots,p_C)$, $p_c\in P$ ($c=1,\ldots,C$), $p_{cj}\in R$ ($j=1,\ldots,D$), is the set of feature-based centroids in the $D$-dimensional feature space.

Let $u_{ci}$ stand for the object membership degree of $x_i$ to cluster $c$, with $U=\{u_{ci}\}$ the object membership degree matrix; $v_{cj}$ stands for the feature membership degree, defined as the membership grade of feature $j$ to cluster $c$, with $V=\{v_{cj}\}$ the feature membership matrix of size $C\times D$. Let $d_{cij}$ be the distance between $x_{ij}$ and $p_{cj}$, given by:

$$d_{cij}=\|x_{ij}-p_{cj}\|^2=(x_{ij}-p_{cj})^2 \tag{6}$$

The objective function $J_{FCoC}(U,V,P)$ is revised in Eq. (7).

$$J_{FCoC}(U,V,P)=\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci} v_{cj} d_{cij} + T_u\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\log u_{ci} + T_v\sum_{c=1}^{C}\sum_{j=1}^{D} v_{cj}\log v_{cj} \tag{7}$$

To obtain optimal clustering results, the objective function of FCoC (7) is minimized with respect to the object membership functions $u_{ci}$ and the feature membership functions $v_{cj}$, which are specific to each cluster, subject to the following constraints:

$$\sum_{c=1}^{C} u_{ci}=1,\quad u_{ci}\in[0,1],\ \forall i=1,\ldots,N,\ \forall c=1,\ldots,C$$
$$\sum_{j=1}^{D} v_{cj}=1,\quad v_{cj}\in[0,1],\ \forall j=1,\ldots,D,\ \forall c=1,\ldots,C \tag{8}$$

That is, for every data object ($i=1,\ldots,N$), its membership degree in each cluster takes a value in the range $[0,1]$ ($u_{ci}\in[0,1]$), and the total membership degree of a data object over all clusters is $1.0$ ($\sum_{c=1}^{C} u_{ci}=1$). Likewise, for every data feature ($j=1,\ldots,D$), its membership degree in each cluster takes a value in the range $[0,1]$ ($v_{cj}\in[0,1]$, $c=1,\ldots,C$), and the total membership degree of the features within each cluster is $1.0$ ($\sum_{j=1}^{D} v_{cj}=1$). The components of the objective function $J_{FCoC}$ are defined by the formulas (9), (10) and (11).

$$u_{ci}=\frac{\exp\left(-\sum_{j=1}^{D} v_{cj}\, d_{cij}/T_u\right)}{\sum_{q=1}^{C}\exp\left(-\sum_{j=1}^{D} v_{qj}\, d_{qij}/T_u\right)} \tag{9}$$

$$v_{cj}=\frac{\exp\left(-\sum_{i=1}^{N} u_{ci}\, d_{cij}/T_v\right)}{\sum_{k=1}^{D}\exp\left(-\sum_{i=1}^{N} u_{ci}\, d_{cik}/T_v\right)} \tag{10}$$

$$p_{cj}=\frac{\sum_{i=1}^{N} u_{ci}\, x_{ij}}{\sum_{i=1}^{N} u_{ci}} \tag{11}$$

The FCoC algorithm is shown in Algorithm 2. The computational complexity of the FCoC algorithm is $O(CND)$, where $C$ is the number of clusters (a known input parameter for each data set), $N$ is the number of data objects and $D$ is the number of features of the data.

Algorithm 2: FCoC algorithm
Input: Data set $X=\{x_i \mid x_i\in R^D\}$, $i=1,\ldots,N$; the number of clusters $C$.
Output: The clustering results: centroids and the distribution of data points in each cluster.
1. Initialize the parameters $T_u$, $T_v$, the maximum error $\varepsilon$ and the maximum number of iterations $\tau_{max}$.
2. Initialize $u_{ci}$ such that $0\le u_{ci}\le 1$ and the constraint conditions in Eq. (8) hold.
3. Set iteration number $\tau=1$.
4. REPEAT
5.   Calculate and update $p_{cj}$ using Eq. (11).
6.   Calculate and update $d_{cij}$ using Eq. (6).
7.   Calculate and update $v_{cj}$ using Eq. (10).
8.   Calculate and update $u_{ci}$ using Eq. (9).
9.   Increase $\tau=\tau+1$.
10. UNTIL $\max(|u_{ci}[\tau]-u_{ci}[\tau-1]|)\le\varepsilon$ or $\tau>\tau_{max}$.
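One pass through steps 5–8 of Algorithm 2 can be sketched in NumPy as follows; this is an illustrative reading of Eqs. (6) and (9)–(11) under assumed inputs (the toy data, the near-hard initial memberships and the function name are ours, not from the paper):

```python
import numpy as np

def fcoc_iteration(X, u, v, Tu=1.0, Tv=1.0):
    """One pass of steps 5-8 in Algorithm 2: Eqs. (11), (6), (10), (9)."""
    P = (u @ X) / u.sum(axis=1, keepdims=True)            # Eq. (11): centroids
    d = (X[None, :, :] - P[:, None, :]) ** 2              # Eq. (6): d_cij, shape (C, N, D)
    ev = np.exp(-(u[:, :, None] * d).sum(axis=1) / Tv)    # Eq. (10): feature memberships
    v = ev / ev.sum(axis=1, keepdims=True)                # each cluster's features sum to 1
    eu = np.exp(-(v[:, None, :] * d).sum(axis=2) / Tu)    # Eq. (9): object memberships
    u = eu / eu.sum(axis=0, keepdims=True)                # each object's clusters sum to 1
    return u, v, P

# Toy run: 4 points, 2 features, 2 clusters, near-hard initial memberships.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 4.9], [5.1, 5.0]])
u = np.array([[0.9, 0.9, 0.1, 0.1],
              [0.1, 0.1, 0.9, 0.9]])
v = np.full((2, 2), 0.5)
u, v, P = fcoc_iteration(X, u, v)
```

Note how the two entropy terms of Eq. (7) turn both update rules into softmax-like expressions over the (feature-weighted) distances, with $T_u$ and $T_v$ acting as temperatures.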

The difference in mathematical model between FCoC and traditional fuzzy clustering algorithms (for example, FCM) lies in their objective functions. The objective function of FCM considers the data only in terms of data objects, whereas the objective function of FCoC considers both the data objects and the features of the data. That is, FCM only accounts for the data objects, while FCoC accounts for each feature of the data objects. Therefore, FCoC can be considered a feature-weighted fuzzy clustering method that can be used instead of FCM for processing multi-dimensional, many-feature data. Recent experiments with FCoC on color images [18], biomedical images [35], documents [15], many-feature data [20,34] and multi-spectral and hyperspectral images [20,29] have shown that the performance of FCoC is better than that of traditional fuzzy clustering. However, the consideration of feature components leads to the problem of quantifying the importance of features, which increases the computational complexity. Therefore, feature reduction is a suitable approach to improve the efficiency of FCoC. In this paper, we use the feature-weighted entropy and feature-reduction techniques of [12] to build a new feature-reduction model for FCoC.

3. Feature-Reduction Fuzzy Co-Clustering algorithm

In this section, we present the new FRFCoC clustering algorithm and mathematically prove the convergence of FRFCoC.

3.1. FRFCoC algorithm

In this section, we present a new method to improve the performance of the FCoC algorithm by applying feature reduction, called the FRFCoC algorithm. In this method, each feature is assigned a weight, which is adjusted across iterations; features taking the smallest weights are then removed. Considering the $D$-dimensional dataset $X=\{x_1,x_2,\ldots,x_N\}$, the weights of the features are $W=\{w_1,w_2,\ldots,w_D\}$. The objective function of the FRFCoC algorithm is modified as follows,

$$J_{FRFCoC}(U,V,P,W,\delta)=\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci} v_{cj}\,\delta_j w_j d_{cij} + T_u\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\log u_{ci} + T_v\sum_{c=1}^{C}\sum_{j=1}^{D} v_{cj}\log v_{cj} + T_w\sum_{j=1}^{D} w_j\log(\delta_j w_j) \tag{12}$$


To obtain optimal clustering results, the objective function of FRFCoC (12) is minimized with respect to the object membership functions $u_{ci}$, the feature membership functions $v_{cj}$ (specific to each cluster) and the feature weights $w_j$, subject to the following constraints:

$$\sum_{c=1}^{C} u_{ci}=1,\quad u_{ci}\in[0,1],\ \forall i=1,\ldots,N,\ \forall c=1,\ldots,C$$
$$\sum_{j=1}^{D} v_{cj}=1,\quad v_{cj}\in[0,1],\ \forall c=1,\ldots,C,\ \forall j=1,\ldots,D$$
$$\sum_{j=1}^{D} w_j=1,\quad w_j\in[0,1],\ \forall j=1,\ldots,D \tag{13}$$

That is, $u_{ci}$ and $v_{cj}$ are as analyzed in Eq. (8), Section 2.1, and $w_j$ is the weight of each feature, in the range $[0,1]$ ($w_j\in[0,1]$, $\forall j=1,\ldots,D$), with the total weight of all features equal to $1.0$ ($\sum_{j=1}^{D} w_j=1$). Here $T_u$, $T_v$ and $T_w$ are weights indicating the fuzziness of $U$, $V$ and $W$; $\delta_j$ is used to adjust the feature weight $w_j$, and the learning process for $\delta_j$ is presented in Section 2.1.

To minimize the objective function $J_{FRFCoC}$ under the constraints given by (13), we construct an objective function with $C$ Lagrange coefficients $\lambda_c$ ($c=1,\ldots,C$) corresponding to the $C$ clusters for the constraint $\sum_{c=1}^{C} u_{ci}=1$, $D$ Lagrange coefficients $\gamma_j$ ($j=1,\ldots,D$) corresponding to the $D$ features for the constraint $\sum_{j=1}^{D} v_{cj}=1$, and $D$ Lagrange coefficients $\xi_j$ ($j=1,\ldots,D$) corresponding to the $D$ features for the constraint $\sum_{j=1}^{D} w_j=1$, obtaining:

$$J_{FRFCoC}(U,V,P,W,\delta)=\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{D} u_{ci} v_{cj}\,\delta_j w_j d_{cij} + T_u\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\log u_{ci} + T_v\sum_{c=1}^{C}\sum_{j=1}^{D} v_{cj}\log v_{cj} + T_w\sum_{j=1}^{D} w_j\log(\delta_j w_j) + \sum_{c=1}^{C}\lambda_c\,(u_{ci}-1) + \sum_{j=1}^{D}\gamma_j\,(v_{cj}-1) + \sum_{j=1}^{D}\xi_j\,(w_j-1) \tag{14}$$

The FRFCoC algorithm is resolved by the following steps. First, we calculate the membership functions $U$ by fixing $V$, $P$ and $W$ and minimizing the objective function (14) with respect to $U$: taking the derivatives of the objective function with respect to the fuzzy object memberships and setting them to zero, we obtain,

$$\frac{\partial J_{FRFCoC}}{\partial u_{ci}}=\sum_{j=1}^{D} v_{cj}\,\delta_j w_j d_{cij} + T_u(\log u_{ci}+1) + \lambda_c = 0 \tag{15}$$

By doing some algebraic simplifications in (15), we obtain,

$$u_{ci}=\frac{\exp\left(-\sum_{j=1}^{D} v_{cj}\,\delta_j w_j d_{cij}/T_u\right)}{e^{\lambda_c/T_u}} \tag{16}$$

Because of the constraint $\sum_{c=1}^{C} u_{ci}=1$, the Lagrange multiplier $\lambda_c$ is eliminated as follows:

$$\sum_{c=1}^{C} u_{ci}=\sum_{c=1}^{C}\frac{\exp\left(-\sum_{j=1}^{D} v_{cj}\,\delta_j w_j d_{cij}/T_u\right)}{e^{\lambda_c/T_u}}=1 \tag{17}$$

$$e^{\lambda_c/T_u}=\sum_{c=1}^{C}\exp\left(-\sum_{j=1}^{D} v_{cj}\,\delta_j w_j d_{cij}/T_u\right) \tag{18}$$

By using Eq. (18) in Eq. (16), the closed-form solution for the optimal object membership function is obtained as

$$u_{ci}=\frac{\exp\left(-\sum_{j=1}^{D} v_{cj}\,\delta_j w_j d_{cij}/T_u\right)}{\sum_{q=1}^{C}\exp\left(-\sum_{j=1}^{D} v_{qj}\,\delta_j w_j d_{qij}/T_u\right)} \tag{19}$$

Similarly, to find the optimal fuzzy feature memberships $V$, taking the derivatives of the objective function with respect to the fuzzy feature memberships and setting them to zero, we obtain,

$$\frac{\partial J_{FRFCoC}}{\partial v_{cj}}=\sum_{i=1}^{N} u_{ci}\,\delta_j w_j d_{cij} + T_v(\log v_{cj}+1) + \gamma_j = 0 \tag{20}$$

After some algebraic simplifications in Eq. (20), we reach,

$$v_{cj}=\frac{\exp\left(-\sum_{i=1}^{N} u_{ci}\,\delta_j w_j d_{cij}/T_v\right)}{\sum_{k=1}^{D}\exp\left(-\sum_{i=1}^{N} u_{ci}\,\delta_k w_k d_{cik}/T_v\right)} \tag{21}$$

Before finding the cluster centroids $P$, the square of the Euclidean distance is expanded as $\|x_{ij}-p_{cj}\|^2=(x_{ij}-p_{cj})^2=x_{ij}^2-2x_{ij}p_{cj}+p_{cj}^2$; taking the derivative with respect to $p_{cj}$ and setting it to zero, we obtain,

$$\frac{\partial J_{FRFCoC}}{\partial p_{cj}}=-2\, v_{cj}\,\delta_j w_j\sum_{i=1}^{N} u_{ci}\, x_{ij} + 2\, v_{cj}\,\delta_j w_j\, p_{cj}\sum_{i=1}^{N} u_{ci} = 0 \tag{22}$$

Simplifying, we reach,

$$p_{cj}=\frac{\sum_{i=1}^{N} u_{ci}\, x_{ij}}{\sum_{i=1}^{N} u_{ci}} \tag{23}$$

Next, $w_j$ is calculated by fixing $U$, $V$ and $P$ (the distances $\|x_{ij}-p_{cj}\|^2$ are then constants); taking the derivatives of the objective function (14) with respect to $w_j$ and setting them to zero, we obtain,

$$\frac{\partial J_{FRFCoC}}{\partial w_j}=\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\, v_{cj}\,\delta_j d_{cij} + T_w(\log(\delta_j w_j)+1) + \xi_j = 0 \tag{24}$$

In a similar way to $U$ and $V$, we easily obtain,

$$w_j=\frac{\delta_j^{-1}\exp\left(-\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\, v_{cj}\,\delta_j d_{cij}/T_w\right)}{\sum_{k=1}^{D}\delta_k^{-1}\exp\left(-\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\, v_{ck}\,\delta_k d_{cik}/T_w\right)} \tag{25}$$

The FRFCoC algorithm consists of the learning processes of the membership matrices $U$ and $V$ and the feature weight matrix $W$, as shown in Algorithm 3.

The output of the FRFCoC algorithm consists of the centroids and the distribution of data points in each cluster. For image data, the clustering results are also presented as images by coloring the pixels of each cluster with a different color. In addition, to compare and evaluate the performance of the different algorithms, we also report the feature-reduction results (the number of reduced features for each data set), the cluster quality evaluation indexes, the number of iterations, and the clustering time.

We next analyze the computational complexity of the FRFCoC algorithm. The algorithm runs up to $\tau_{max}$ iterations to update the components of the objective function (12): (1) calculating the membership functions $u_{ci}$ needs $O(NC^2D)$; (2) calculating the membership functions $v_{cj}$ needs $O(NCD^2)$; (3) updating the cluster centroids $p_{cj}$ needs $O(NC)$; and (4) updating the weights $w_j$ needs $O(NCD^2)$. The total computational complexity is $O(NC^2D+NCD^2)$, where $N$ is the number of data patterns, $C$ is the number of clusters, and $D$ is the number of data dimensions. Normally, in complex multi-dimensional problems $D$ is larger than $C$; therefore, the time complexity of FRFCoC is approximately


Algorithm 3: Feature-Reduction Fuzzy Co-Clustering algorithm
Input: Data set $X=\{x_i \mid x_i\in R^D\}$, $i=1,\ldots,N$; the number of clusters $C$.
Output: The results of clustering and feature reduction: centroids, the distribution of data points in each cluster, and the number of reduced features.
1. Initialize the parameters $T_u$, $T_v$, $T_w$, $\varepsilon_1$, $\varepsilon_2$ and the maximum number of iterations $\tau_{max}$.
2. Initialize $u_{ci}$ and $w_j$ such that the constraint conditions in (13) hold.
3. Set iteration $\tau=1$.
4. REPEAT
5.   Calculate $\delta_j$ using formula (5).
6.   Update $p_{cj}$ using formula (23).
7.   Calculate $d_{cij}$ using Eq. (6).
8.   Update $v_{cj}$ using formula (21).
9.   Update $u_{ci}$ using formula (19).
10.  Update $w_j$ using formula (25).
11.  For each $w_j(\tau)$ in $W(\tau)$: If $w_j(\tau)$ is smaller than the elimination threshold Then
12.     $W(\tau)=W(\tau)\setminus w_j(\tau)$; $\delta(\tau)=\delta(\tau)\setminus\delta_j(\tau)$;
13.     Remove the $j$th feature components from $X(\tau)$, $V(\tau)$ and $P(\tau)$.
14.  End For
15.  $\tau=\tau+1$.
16. UNTIL $|\|W(\tau)\|-\|W(\tau-1)\||<\varepsilon$ or $\tau>\tau_{max}$

O(CD2N

τ

), i.e. similar to the FCoC algorithms. However, FRF- CoC algorithm reducesDafter each iteration, therefore, the time complexity of FRFCoC algorithm is smaller than FCoC algorithm.
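The feature-elimination step (lines 10–13 of Algorithm 3) can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the helper name `prune_features`, the toy weight vector, the renormalization of the surviving weights, and the threshold form 1/(√N·D) are all assumptions made here for illustration; the actual weight update follows formula (25).

```python
import numpy as np

def prune_features(X, W):
    """One feature-elimination pass of the kind used in FRFCoC (sketch).

    X : (N, D) data matrix; W : (D,) feature weights summing to 1.
    Features whose weight falls below 1/(sqrt(N)*D) are removed,
    shrinking D for all subsequent iterations.
    """
    N, D = X.shape
    threshold = 1.0 / (np.sqrt(N) * D)   # assumed form of the elimination condition
    keep = W >= threshold                # boolean mask of surviving features
    X_reduced = X[:, keep]
    W_reduced = W[keep]
    W_reduced = W_reduced / W_reduced.sum()  # renormalize (assumption, not from the paper)
    return X_reduced, W_reduced, keep

# toy example: 5 points, 4 features, one feature given negligible weight
X = np.random.default_rng(0).normal(size=(5, 4))
W = np.array([0.4, 0.35, 0.2485, 0.0015])
X2, W2, keep = prune_features(X, W)
print(X2.shape, keep)  # the low-weight feature is dropped
```

Because the mask is applied to the data matrix itself, every later iteration operates on a smaller D, which is exactly why the per-iteration cost of FRFCoC shrinks over time.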

Mathematically, the FRFCoC algorithm contributes a new objective function and a new learning scheme for clustering data. FRFCoC combines the advanced techniques of the FRFCM and FCoC algorithms: it simultaneously considers objects and features, and reduces features within the machine-learning scheme. The objective function of FRFCoC considers not only the object membership functions but also the feature membership functions, the feature weights, and the condition for removing a feature from the data-processing loop. The FRFCoC scheme not only computes the object and feature membership functions but also decides which low-influence features to eliminate, improving performance in terms of both accuracy and processing speed. To clarify the contributions of the FRFCoC algorithm, we give some discussion below.

Compared with traditional clustering techniques, FRFCoC considers the data in detail, i.e., the features of the data, the weight of each feature, and the ranking of the features, eliminating low-ranking features from the data processing to improve clustering performance.

Compared with traditional fuzzy clustering algorithms, FRFCoC ranks the features of the data and eliminates low-ranking ones, thereby reducing the number of features and improving clustering performance.

Compared with previous feature-reduction techniques, which reduce the features before the data processing begins and thus act as a preprocessing step, FRFCoC automatically calculates the parameters needed to reduce the features during the data processing itself.

Compared with the FRFCM algorithm, which is of the same kind as FRFCoC: FRFCoC is built on the FCoC algorithm, whose original formulation already considers the features and their weights. Integrating the feature-reduction technique into FCoC to form FRFCoC is therefore very natural, and it is understandable that FRFCoC achieves higher efficiency than FCoC. FRFCM, in contrast, is built on FCM, which considers the data only through the object membership functions; to form FRFCM, its authors had to integrate feature-weighting and feature-ranking techniques in order to eliminate low-ranking features and improve clustering performance. Logically, the integration of techniques in FRFCM is reasonable; however, FRFCM does not consider the feature membership functions (as FRFCoC does) when ranking features, so the feature ranking in FRFCM is weaker.

Based on the comments above, the FRFCoC algorithm can achieve better performance than previously proposed clustering algorithms, including the recently proposed FRFCM algorithm. Proposing FRFCoC is essential: it demonstrates scientific novelty in automatically reducing the features and improving the clustering performance on complex data such as multi-dimensional, many-feature data. To validate these predictions, this paper uses theorems and lemmas to prove the convergence of the FRFCoC algorithm (Section 3.2) and conducts experiments (Section 4) on sample data sets to demonstrate the performance of FRFCoC compared with previously proposed clustering algorithms.

3.2. Convergence theorems for FRFCoC algorithm

We next provide convergence theorems for the FRFCoC clustering algorithm, showing that any convergent subsequence generated by FRFCoC tends to an optimal solution. Zangwill's convergence theorem [29] and the bordered Hessian matrix [36] will be used in our proof. We note that this approach was also used by Yang and Tian [37].

Originally, Zangwill defined a point-to-set map T : V → P(V), where P(V) denotes the power set of V, and required the point-to-set map to be closed. However, the FRFCoC algorithm is a point-to-point map, and the "closed" property reduces exactly to "continuity" in the point-to-point case. Thus, Zangwill's convergence theorem is given as follows.

3.2.1. Zangwill’s convergence theorem

Let the point-to-point map T : V → V generate a sequence {z_k}_{k=0}^∞ by z_{k+1} = T(z_k). Let a solution set Ω ⊂ V be given and suppose that:

1. There is a continuous function Z : V → R such that, if z_k ∉ Ω, then Z(T(z_k)) < Z(z_k), and if z_k ∈ Ω, then Z(T(z_k)) ≤ Z(z_k).
2. The map T is continuous on V \ Ω.
3. All points z_k are contained in a compact set S ⊂ V.

Then the limit of any convergent subsequence lies in the solution set Ω, and Z(z_k) monotonically converges to Z(z*) for some z* ∈ Ω.

Set M_u = {U = [u_ci]_{C×N} | Σ_{c=1}^{C} u_ci = 1, u_ci ≥ 0}, M_v = {V = [v_cj]_{C×D} | Σ_{c=1}^{C} v_cj = 1, v_cj ≥ 0}, M_w = {W = [w_j]_{D×1} | Σ_{j=1}^{D} w_j = 1, w_j ≥ 0}, and P = {p_1, p_2, . . . , p_C}.
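Step 2 of Algorithm 3 requires initial matrices that satisfy exactly these constraints. One simple way to construct feasible U ∈ M_u, V ∈ M_v, and W ∈ M_w is to normalize random nonnegative entries; this is an illustrative sketch of the constraint sets, not the initialization prescribed by the paper, and the sizes N, C, D are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
N, C, D = 6, 3, 4

# U in M_u: each column (one per data point) sums to 1 over the C clusters
U = rng.random((C, N)); U /= U.sum(axis=0, keepdims=True)
# V in M_v: each column (one per feature) sums to 1 over the C clusters
V = rng.random((C, D)); V /= V.sum(axis=0, keepdims=True)
# W in M_w: nonnegative feature weights summing to 1
W = rng.random(D); W /= W.sum()

assert np.allclose(U.sum(axis=0), 1) and (U >= 0).all()
assert np.allclose(V.sum(axis=0), 1) and (V >= 0).all()
assert np.isclose(W.sum(), 1) and (W >= 0).all()
```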

Let Ω_FRFCoC be a solution set for the FRFCoC algorithm, defined as Ω_FRFCoC
