
Robust Pruning of Training Patterns for Optimum-Path Forest Classification Applied to Satellite-Based Rainfall Occurrence Estimation

João Paulo Papa, Alexandre Xavier Falcão, Greice Martins de Freitas, and Ana Maria Heuminski de Ávila

Abstract—The decision correctness in expert systems strongly depends on the accuracy of a pattern classifier, whose learning is performed from labeled training samples. Some systems, however, have to manage, store, and process a large amount of data, making the computational efficiency of the classifier an important requirement as well. Examples are expert systems based on image analysis for medical diagnosis and weather forecasting. The learning time of any pattern classifier increases with the training set size, and a larger set might be necessary to improve accuracy. The problem, however, is more critical for some popular methods, such as artificial neural networks and support vector machines (SVM), than for a recently proposed approach, the optimum-path forest (OPF) classifier. In this letter, we go further by presenting a robust approach that reduces the training set size while preserving good accuracy in OPF classification. We validate the method on several data sets and for rainfall occurrence estimation based on satellite image analysis. The experiments use SVM and OPF without pruning of training patterns as baselines.

Index Terms—Expert systems, image analysis, pattern recognition, rainfall estimation, remote sensing.

I. INTRODUCTION

THE NEED for expert systems based on image analysis tends to increase with the advances in imaging technologies. The correctness of these systems strongly depends on the accuracy of a pattern classifier, which learns patterns of each class from labeled training samples. Their computational efficiency, however, decreases as the training set size increases, which might be necessary to improve accuracy. Given that imaging technologies are providing larger image data sets, it is paramount to choose a fast and accurate pattern classifier for such expert systems. Examples are systems which use satellite images for rainfall estimation and magnetic resonance images for medical diagnosis. They can produce millions of pixels per image, and retraining is very often needed due to changes in the image acquisition process.

Manuscript received February 25, 2009; revised June 15, 2009 and October 16, 2009. Date of publication December 22, 2009; date of current version April 14, 2010. This work was supported by grants from Projects CNPq 302617/2007-8 and FAPESP 07/52015-0.

J. P. Papa and A. X. Falcão are with the Institute of Computing, University of Campinas, 13083-970 Campinas-SP, Brazil (e-mail: papa.joaopaulo@gmail.com; alexandre.falcao@gmail.com).

G. M. de Freitas is with the School of Electrical and Computer Engineering, University of Campinas, 13083-970 Campinas-SP, Brazil (e-mail: jppbsi@bol.com.br).

A. M. H. de Ávila is with the Center of Meteorological and Climatic Research Applied to Agriculture, University of Campinas, 13083-970 Campinas-SP, Brazil (e-mail: avila@cpa.unicamp.br).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LGRS.2009.2037344


The learning time of some popular pattern classification methods, such as support vector machines (SVM) [1] and artificial neural networks using multilayer perceptrons (ANN-MLP) [2], may be prohibitive depending on the training set size. We have proposed the optimum-path forest (OPF) classifier, which can be from tens to thousands of times faster than SVM and ANN-MLP with similar to higher accuracies [3], and have demonstrated its effectiveness in several applications [4], [5]. The OPF classifier interprets a training set as a graph, taking the samples (e.g., images, objects, pixels) as nodes and using an adjacency relation to define pairs of samples as arcs. A path is a sequence of distinct samples, and each path in the graph has a value given by a connectivity function. The classifier first identifies some key samples in each class, called prototypes, and then computes optimum paths from the prototypes to the remaining samples, such that the prototypes compete with each other for the most strongly connected samples in the graph. The result is an optimal partition of the training set, i.e., a disjoint set of optimum-path trees, each rooted at one prototype. The nodes of each tree are assumed to have the same label as their root. The classification of a new sample is performed by identifying which tree would contain it, if that sample were part of the original graph. The choice of adjacency relation and connectivity function defines a new classification model [6], [7]. The one presented in [3] uses complete graphs, wherein the value of a path is given by its maximum arc weight and optimum paths are those with minimum value.

Despite the efficiency gains of OPF with respect to SVM and ANN-MLP, it is still important to achieve good accuracy on large data sets with a minimum training set size. In this letter, we propose a method which selects for the training set the most relevant samples from a larger evaluation set (learning step) and reduces the training set size by eliminating the irrelevant ones (pruning step). By moving the irrelevant samples from the training set to the evaluation set, however, the accuracy on the evaluation set may be affected, so we repeat the learning and pruning processes until no irrelevant sample remains for pruning. For validation, we also keep a large test set with unseen samples.

We evaluate the results with three experiments. In the first experiment, we show that the accuracy of SVM on the test set might be drastically affected when we limit its training set size in order to considerably reduce its learning time. Second, we show that the proposed approach, OPF with pruning, can considerably reduce the training set size with little impact on its accuracy on the test set. Third, we evaluate the new


method for rainfall occurrence estimation based on satellite image analysis, using SVM and OPF with no pruning as baselines.


We first proposed the OPF classifier for rainfall occurrence estimation in [8]. This problem had already been addressed by ANN [9], but the authors used only 32 × 32 images. Here, we propose OPF with pruning and validate it for rainfall estimation using 256 × 256 images.

The remainder of this letter is organized as follows. A background on the OPF classifier is given in Section II, and Section III presents the proposed learning algorithm with pruning of irrelevant patterns. Experiments and results, including evaluation in rainfall estimation, are described in Section IV. We state conclusions and discuss future work in Section V.

II. OPF CLASSIFIER

This section describes the OPF classifier as proposed in [3]. Let λ(s) be the function that assigns the correct label i, i = 1, 2, ..., c, of class i to any sample s of a given data set. Let (Z1, A) be a graph whose nodes are samples from a λ-labeled training set Z1 and whose arcs are all pairs of distinct samples in A = Z1 × Z1. The arcs do not need to be stored, and so the graph does not need to be explicitly represented. Each sample s in Z1 (e.g., a pixel) has a feature vector v(s), and the weight d(s, t) of an arc (s, t) ∈ A is a distance between the feature vectors v(s) and v(t).

A path is a sequence of distinct samples πt = ⟨s1, s2, ..., t⟩ with terminus at a sample t. A path is said to be trivial if πt = ⟨t⟩. A connectivity function fmax assigns a cost fmax(πt) to any path πt (i.e., the maximum arc weight along the path):

$$
f_{\max}(\langle s \rangle) = \begin{cases} 0, & \text{if } s \in S \\ +\infty, & \text{otherwise} \end{cases}
\qquad
f_{\max}(\pi_s \cdot \langle s, t \rangle) = \max\left\{ f_{\max}(\pi_s),\, d(s,t) \right\} \quad (1)
$$

where S ⊂ Z1 is a set of prototypes (i.e., representative samples from each class) and πs · ⟨s, t⟩ is the concatenation of a path πs and an arc (s, t).

A path πt is optimum if fmax(πt) ≤ fmax(τt) for any other path τt. By computing minimum-cost paths from S to every node t in Z1, we uniquely define a cost map C(t):

$$
C(t) = \min_{\forall \pi_t \in (Z_1, A)} \left\{ f_{\max}(\pi_t) \right\}. \quad (2)
$$

This is done by Algorithm 1, called the OPF algorithm, which is an extension of the general image foresting transform algorithm [10] from the image domain to the feature space, here specialized for fmax.
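Algorithm 1: OPF Algorithm

A minimal Python sketch of the training procedure, written to match the line-by-line description below. The array names are illustrative, and the lazy-deletion priority queue is used here as a simpler stand-in for the decrease-key queue update (and the first-in-first-out tie-breaking) of the original algorithm:

```python
import heapq
import numpy as np

def opf_train(X, labels, prototypes, d):
    """Compute the cost map C, label map L, and predecessor map P
    over the training samples X, given the prototype indices S."""
    n = len(X)
    C = np.full(n, np.inf)            # cost map; +inf for nonprototypes
    L = np.array(labels)              # label map, propagated from the roots
    P = np.full(n, -1)                # predecessor map; -1 stands for nil
    Q = []                            # priority queue ordered by path cost
    for s in prototypes:              # Lines 1-3: initialize maps, fill Q
        C[s] = 0.0
        heapq.heappush(Q, (0.0, s))
    done = np.zeros(n, dtype=bool)
    while Q:                          # Lines 4-11: main loop
        _, s = heapq.heappop(Q)       # Line 5: remove minimum-cost node
        if done[s]:
            continue                  # stale entry left by lazy deletion
        done[s] = True
        for t in range(n):            # complete graph: all other samples
            if t == s or done[t]:     # Line 6: skip t with C(t) <= C(s)
                continue
            cst = max(C[s], d(X[s], X[t]))  # f_max along the extended path
            if cst < C[t]:            # Lines 8-11: cheaper path found
                C[t], L[t], P[t] = cst, L[s], s
                heapq.heappush(Q, (cst, t))
    return C, L, P
```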

Algorithm 1 assigns one optimum path P*(t) from S to each training sample t in a nondecreasing order of minimum cost, such that the graph is partitioned into an OPF P: a function with no cycles which assigns to each t ∈ Z1\S its predecessor P(t) in P*(t), or a marker nil when t ∈ S. The root R(t) ∈ S of P*(t) can be obtained from P(t) by following the predecessors backwards along the path, but its label is propagated during the algorithm by setting L(t) ← λ(R(t)). Lines 1–3 initialize maps and insert the prototypes in Q. The main loop computes an optimum path from S to every sample s (Lines 4–11). At each iteration, a path of minimum cost C(s) is obtained in P when we remove its last node s from Q (Line 5). Ties are broken in Q using a first-in-first-out policy, i.e., when two optimum paths reach an ambiguous sample s with the same minimum cost, s is assigned to the first path that reached it. Note that C(t) > C(s) in Line 6 is false when t has been removed from Q and, therefore, C(t) ≠ +∞ in Line 9 is true only when t ∈ Q. Lines 8–11 evaluate whether the path that reaches an adjacent node t through s is cheaper than the current path with terminus t and update the position of t in Q, C(t), L(t), and P(t) accordingly. At the end, each prototype in S will be the root of an optimum-path tree (possibly a trivial one) containing its most strongly connected nodes in Z1, i.e., the label map L classifies the nodes in a given tree with the true label of its root.

The classification of a new node t ∉ Z1 considers all arcs connecting t with samples s ∈ Z1, as though t were part of the training graph. Considering all possible paths from S to t, we find the optimum path P*(t) from S and label t with the class λ(R(t)) of its most strongly connected prototype R(t) ∈ S. This path can be identified incrementally, by evaluating the optimum cost C(t) as

$$
C(t) = \min_{\forall s \in Z_1} \left\{ \max\{C(s),\, d(s,t)\} \right\}. \quad (3)
$$

Let s* ∈ Z1 be the node that satisfies (3) [i.e., the predecessor P(t) in the optimum path P*(t)]. Given that L(s*) = λ(R(t)), the classification simply assigns L(s*) as the class of t. An error occurs when L(s*) ≠ λ(t).
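A minimal sketch of this incremental classification rule (3), reusing the maps produced by the training sketch above (names are illustrative):

```python
def opf_classify(X_train, C, L, x_new, d):
    """Label a new sample by (3): find s* minimizing max{C(s), d(s, t)}."""
    costs = [max(C[s], d(X_train[s], x_new)) for s in range(len(X_train))]
    s_star = int(np.argmin(costs))    # predecessor P(t) of the new sample
    return L[s_star], s_star          # class of t and its predecessor s*
```

Returning s* as well is convenient for the pruning step of Section III, where relevant training samples are marked along the optimum path of each classified sample.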

The prototypes in S are computed by exploiting the relation between the minimum-spanning tree (MST) and the optimum-path tree for fmax [11]. By computing an MST in the complete graph (Z1, A), we obtain a connected acyclic graph whose nodes are all samples of Z1 and whose arcs are undirected and weighted by the distances d between adjacent samples. The spanning tree is optimum in the sense that the sum of its arc weights is minimum as compared to any other spanning tree in the complete graph. In the MST, every pair of samples is connected by a single path which is optimum according to fmax, i.e., the MST contains one optimum-path tree for any selected root node. The prototypes in S are then the nodes of the MST arcs whose endpoints have different true labels in Z1. These prototypes minimize the classification errors in Z1 and, hopefully, the classification errors of new samples, because the optimum paths between classes tend to pass through the prototypes, which block that passage, thus reducing the chances of samples in any given class being reached by optimum paths from prototypes of other classes.
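A sketch of this prototype estimation under the stated MST property, using SciPy's minimum-spanning-tree routine (function and variable names are illustrative):

```python
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def find_prototypes(X, labels):
    """Prototypes: endpoints of MST arcs that connect different classes."""
    W = squareform(pdist(X))                 # arc weights of (Z1, A)
    mst = minimum_spanning_tree(W).tocoo()   # undirected MST as an edge list
    S = set()
    for s, t in zip(mst.row, mst.col):
        if labels[s] != labels[t]:           # arc crossing a class boundary
            S.update((int(s), int(t)))
    return sorted(S)
```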

III. LEARNING WITH PRUNING OF IRRELEVANT PATTERNS


Let Z1, Z2, and Z3 be the training, evaluation, and test sets, respectively, such that |Z1| < |Z2| and |Z2| ≈ |Z3|. The idea is to assume that Z2 represents the never-seen problem (i.e., the samples in Z3), select for Z1 the most informative samples from Z1 ∪ Z2, project the OPF classifier (Algorithm 1), and apply (3) to classify new unseen samples t ∈ Z3. In a real problem, Z1 and Z2 must be λ-labeled sets. The true label of samples in Z3 may be unknown. Here, we know the true label of all samples in order to validate the method (Section IV).

The accuracy of classification on Z2 (and Z3) is measured by

$$ \mathrm{acc} = \frac{2c - \sum_{i=1}^{c} E(i)}{2c} \quad (4) $$

$$ E(i) = e_{i,1} + e_{i,2} \quad (5) $$

$$ e_{i,1} = \frac{FP(i)}{|Z_2| - |NZ_2(i)|} \quad (6) $$

$$ e_{i,2} = \frac{FN(i)}{|NZ_2(i)|} \quad (7) $$

where FP(i) and FN(i) are the false positives and false negatives, respectively, for each class i = 1, 2, ..., c, and NZ2(i) is the set of samples from class i in Z2. This measure penalizes classifiers whose errors are concentrated on the smaller classes.
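A direct transcription of (4)-(7) as a Python function (a sketch; label vectors as NumPy arrays are assumed):

```python
import numpy as np

def opf_accuracy(y_true, y_pred):
    """Balanced accuracy of (4)-(7); errors on small classes weigh more."""
    classes = np.unique(y_true)
    c, n = len(classes), len(y_true)
    E = 0.0
    for i in classes:
        n_i = np.sum(y_true == i)                   # |NZ2(i)|
        fp = np.sum((y_pred == i) & (y_true != i))  # FP(i)
        fn = np.sum((y_pred != i) & (y_true == i))  # FN(i)
        E += fp / (n - n_i) + fn / n_i              # e_{i,1} + e_{i,2}
    return (2 * c - E) / (2 * c)                    # (4)
```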

A general learning algorithm was proposed in [3] for any classification method. A simple variant of this algorithm is presented in Algorithm 2 for the OPF classifier.
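Algorithm 2: Learning Algorithm

A high-level sketch of this learning step. Here, train_opf and evaluate are assumed wrappers around the training, classification, and accuracy sketches above, and the random swap policy is one reasonable reading of the description that follows, not the authors' exact procedure:

```python
import random

def opf_learn(Z1, Z2, iterations=10):
    """Algorithm 2 (sketch): swap misclassified evaluation samples into Z1
    in place of nonprototypes, keeping the most accurate instance."""
    best, best_acc = None, -1.0
    for _ in range(iterations):
        clf = train_opf(Z1)                  # Algorithm 1 over current Z1
        acc, errors = evaluate(clf, Z2)      # accuracy (4), errors on Z2
        if acc > best_acc:
            best, best_acc = clf, acc
        for t in errors:                     # most informative samples of Z2
            pool = [s for s in Z1 if s not in clf.prototypes]
            if not pool:
                break
            s = random.choice(pool)          # any class may be replaced
            Z1.remove(s); Z1.append(t)       # sizes of Z1 and Z2 are
            Z2.remove(t); Z2.append(s)       # preserved by the swap
    return best
```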

Algorithm 2 projects an OPF classifier from Z1 and evaluates it on Z2 during a few iterations, in order to select the instance of classifier (Z1) with the highest accuracy on Z2. It essentially assumes that the most informative samples in Z2 are the misclassified ones and replaces these samples by nonprototype samples in Z1. We have relaxed the restriction, proposed in [3], of replacing only samples that belong to the same class. The idea is to allow classes that require more samples in the training set to have them without increasing the training set size. The restriction of preserving prototypes is kept, but observe that those samples may be selected for replacement in a future iteration, if they are no longer prototypes.

In practice, we would like to start from a reasonable number of samples in Z1 in order to speed up learning and further reduce that number to speed up classification. Therefore, we propose a new learning algorithm which starts from |Z1| < |Z2| and combines Algorithm 2 with the identification and elimination of irrelevant samples from Z1 to reduce its size. A sample is said to be irrelevant if it is not used to classify any sample in Z2, i.e., if it does not belong to any optimum path used to classify the samples in Z2. The remaining samples, said to be relevant, can be easily identified during classification by marking them in the optimum path P*(t) (3), as we follow the predecessor nodes of t backward in P until its root prototype R(t) [note that P(R(t)) = nil]. However, by moving the irrelevant samples (samples that did not participate in the classification process on Z2) from Z1 to Z2, the accuracy on Z2 might be affected. We can then repeat the learning and pruning processes until no irrelevant sample remains in Z1. This procedure is presented in Algorithm 3.
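Algorithm 3: Learning-With-Pruning Algorithm

A sketch of this loop, following the line-by-line description below. It assumes classify returns the predecessor s* of (3) and predecessor follows the map P toward the root, so that relevance marking walks each optimum path backward; both are illustrative method names:

```python
def opf_learn_with_pruning(Z1, Z2):
    """Algorithm 3 (sketch): alternate learning and pruning until every
    sample left in Z1 helps classify some sample of Z2."""
    while True:
        clf = opf_learn(Z1, Z2)              # Line 2: Algorithm 2
        relevant = set()
        for t in Z2:                         # Lines 4-8: mark relevant set
            s = clf.classify(t)              # s* of (3) for sample t
            while s is not None:             # walk back to the root R(t)
                relevant.add(s)
                s = clf.predecessor(s)
        irrelevant = [s for s in Z1 if s not in relevant]   # Line 9
        if not irrelevant:
            return clf
        for s in irrelevant:                 # Line 10: move from Z1 to Z2
            Z1.remove(s)
            Z2.append(s)
```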

The main loop (Lines 1–10) executes the learning algorithm with pruning until no irrelevant sample remains in Z1. Line 2 obtains the most informative samples from Z1 ∪ Z2 to project an OPF classifier on Z1. All samples in Z1 used to classify samples in Z2 are included in the relevant set R (Lines 4–8). Line 9 inserts in I the irrelevant samples of Z1, and these samples are moved from Z1 to Z2 in Line 10.

IV. EXPERIMENTAL RESULTS

We have evaluated OPF with pruning by taking SVM and OPF with no pruning [3] as baselines. For the SVM implementation, we used the LibSVM package [12] with the radial basis function kernel, parameter optimization, and the one-versus-one strategy for the multiclass problem; for OPF, we used the LibOPF package [13]. Table I presents the data sets used in our experiments. These data sets are not large, but they are enough to predict the behavior of the method in the case of large data sets. The data sets Cone Torus, Petals, Saturn, and Boat are synthetic and defined by (x, y) coordinates (a 2-D feature space). The data set MPEG7 contains shapes, and we illustrate the behavior of the method on it with two different shape descriptors, the Fourier coefficients and the Beam Angle Statistics. These data sets are better explained in [3]. The data set Rainfall comes from a real application, which uses image analysis for rainfall estimation (Section IV-C).
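The paper uses the LibSVM package directly; an analogous baseline can be sketched with scikit-learn's LibSVM-based wrapper (the parameter grid shown is an illustrative assumption, not the grid used in the experiments):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF kernel with grid-searched parameters; SVC handles multiclass
# problems with the one-versus-one strategy internally.
param_grid = {"C": [1, 10, 100, 1000], "gamma": [1e-3, 1e-2, 1e-1, 1]}
svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# svm.fit(X_train, y_train); y_pred = svm.predict(X_test)
```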


TABLE I
DESCRIPTION OF THE DATA SETS USED IN THE EXPERIMENTS

TABLE II
MEAN ACCURACY AND STANDARD DEVIATION FOR OPF AND SVM CLASSIFIERS AFTER TEN ROUNDS OF EXECUTIONS

TABLE III
TOTAL MEAN TIMES (IN SECONDS) OF PROJECT ON Z1 AND CLASSIFICATION ON Z3 FOR OPF AND SVM CLASSIFIERS AFTER TEN ROUNDS OF EXECUTIONS

A. Impact of the Training Set Size

We have demonstrated in [3] that OPF (without pruning) and SVM can produce similar accuracies on several data sets. However, the accuracy of SVM on a test set might be drastically affected when we limit the size of the training set to considerably reduce the project time of SVM. This is shown in Table II. In this experiment, we did not use the learning algorithm, only projection on the training set Z1 and classification on the test set Z3. Projection and classification with randomly selected samples were repeated ten times for each classifier in order to compute the mean accuracy (robustness) and its standard deviation (precision). We randomly selected 40% of the samples to constitute Z3, and the remaining 60% were selected for Z1 in OPF, but only 30% of them for Z1 in SVM. This was enough to show that the accuracy of SVM drops drastically with respect to that of OPF. Even so, OPF was still 207 to 876 times faster than SVM (Table III).

B. Effectiveness of the Learning With Pruning

In this experiment, we divided the data sets into 20% of randomly selected samples for Z1, 40% for Z2, and 40% for Z3. OPF with learning on Z2 (Algorithm 2) was compared with OPF using the learning-with-pruning algorithm (Algorithm 3). Their mean accuracies on Z3 and standard deviations were obtained after ten rounds of experiments. Table IV presents the results. Note that the accuracy of OPF was little affected.

TABLE IV
MEAN ACCURACIES AND STANDARD DEVIATIONS FOR OPF WITH LEARNING AND OPF USING LEARNING WITH PRUNING AFTER TEN ROUNDS OF EXECUTIONS

Table V shows the effectiveness of the proposed approach in reducing the memory required to store the classifiers. The mean percentage of pruning ranged from 81.28% to 86.27%. The total mean times to learn from Z2 and classify the samples of Z3 are also shown. They indicate that the increase in the learning time of OPF with pruning was not significant as compared to its gain in classification time.

C. Rainfall Estimation

In this experiment, we used the data set Rainfall (Table I) to compare the performance of SVM and OPF, both using learning on Z2 [3], with that of OPF using learning with pruning. We divided the data set into 20% of samples in Z1, 40% in Z2, and 40% in Z3 for the OPF classifiers. In order to make the learning time of SVM acceptable, we limited the number of samples in Z1 to 3% for it.

This data set was obtained by taking pixels as samples from cropped regions in one image (256 × 256, 8 bits/pixel, GOES-12 infrared channel), covering the area of Bauru, São Paulo, Brazil (Fig. 1). The true labels of its samples were obtained from another image of the same location using the Meteorological Radar of the São Paulo State University.

One of the main characteristics for identifying precipitation pixels in infrared channel images is their temperature: high precipitation levels are associated with low temperatures, but the opposite is not always true, i.e., cirrus clouds are easily mistaken for precipitation clouds. The temperature is linearly related to a Slope parameter [14] computed for each pixel s:

$$ \mathrm{Slope}(s) = 0.568\,(T(s) - 217) \quad (8) $$

where T(s) is the cloud temperature at s. This equation determines whether the measured temperature comes from a cirrus or a precipitation cloud. Later, Adler and Negri [15] proposed a method to correlate the Slope parameter with the gray values obtained from infrared images. A parameter W(s), given by

$$ W(s) = I(s) - \bar{I}(s) \quad (9) $$

is calculated, in which I(s) and Ī(s) are, respectively, the pixel gray value and the average intensity value within its eight-neighborhood. Values of W(s) less than Slope(s) (8) are generally associated with cirrus clouds; otherwise, they indicate the presence of precipitation clouds. We thus used v(s) = (I(s), W(s)) as the feature vector and the Euclidean metric as the distance function.
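A sketch of this feature extraction for a whole image, assuming T holds the per-pixel cloud temperatures and I the infrared gray values as 2-D arrays (array names and the border mode are illustrative choices):

```python
import numpy as np
from scipy.ndimage import convolve

def rainfall_features(T, I):
    """Per-pixel features v(s) = (I(s), W(s)) from (8) and (9)."""
    slope = 0.568 * (T - 217.0)            # (8): temperature-based threshold
    kernel = np.ones((3, 3)) / 8.0
    kernel[1, 1] = 0.0                     # average excludes the center pixel
    I_bar = convolve(I.astype(float), kernel, mode="nearest")
    W = I - I_bar                          # (9)
    likely_cirrus = W < slope              # cue from [15]
    return np.stack([I, W], axis=-1), likely_cirrus
```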

Table VI shows the results of mean accuracy and standard deviation, the total mean time for learning from Z2 (in seconds), and the total mean time for classification on Z3 (in seconds).


TABLE V
MEAN STORAGE SPACE (IN KILOBYTES) AND TOTAL MEAN TIMES (IN SECONDS) TO LEARN FROM Z2 AND CLASSIFY SAMPLES OF Z3 FOR EACH CLASSIFIER AFTER TEN ROUNDS OF EXECUTIONS

Fig. 1. Image obtained from the visible channel of the GOES-12 satellite, covering the area of Bauru, SP, Brazil. Some rainfall regions are indicated by the bounded locations.

TABLE VI
MEAN ACCURACY AND STANDARD DEVIATION AND MEAN EXECUTION TIMES (IN SECONDS) FOR LEARNING AND TESTING, RESPECTIVELY, AFTER TEN ROUNDS OF EXPERIMENTS

Note that OPF with pruning spent only 60 more seconds for learning to reduce the storage space from 53.76 to 24.55 kB (a 54.33% reduction in storage space) and the test time from 13.52 to 11.41 s (a 15.61% reduction in classification time), while maintaining the same accuracy as OPF. In the case of large data sets, Z1 and Z2 can be fixed and Z3 will be much larger. Since the learning time depends only on |Z1| and |Z2|, the gain in classification time will be much more significant in practice (the classification time is proportional to |Z3| × |Z1|). The accuracy of both OPF approaches was 1.26 times higher than that of SVM, which also required about 4.4 h for learning.

V. CONCLUSION

We have presented a novel learning algorithm for the OPF classifier, which can reduce the training set size by pruning irrelevant samples. The method was first evaluated on several data sets and then validated for rainfall occurrence estimation using satellite image analysis. The results indicate good accuracy and considerable gains in storage space and classification time. These advantages become more relevant when we consider data sets larger than the one we had available (i.e., test sets with millions of pixels). In any case, the advantages of the method over SVM in efficiency and accuracy are clear, demonstrating its potential for satellite image analysis. Future work includes evaluation with more images and larger data sets from different regions and satellites.

REFERENCES

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Workshop Comput. Learn. Theory, 1992, pp. 144–152.

[2] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1994.

[3] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki, "Supervised pattern classification based on optimum-path forest," Int. J. Imaging Syst. Technol., vol. 19, no. 2, pp. 120–131, Jun. 2009.

[4] J. P. Papa, A. A. Spadotto, A. X. Falcão, and J. C. Pereira, "Optimum path forest classifier applied to laryngeal pathology detection," in Proc. 15th IWSSIP, 2008, vol. 1, pp. 249–252.

[5] J. A. Montoya-Zegarra, J. P. Papa, N. J. Leite, R. S. Torres, and A. X. Falcão, "Learning how to extract rotation-invariant and scale-invariant features from texture images," EURASIP J. Adv. Signal Process., vol. 2008, no. 18, pp. 1–16, 2008.

[6] L. M. Rocha, F. A. M. Cappabianco, and A. X. Falcão, "Data clustering as an optimum-path forest problem with applications in image analysis," Int. J. Imaging Syst. Technol., vol. 19, no. 2, pp. 50–68, Jun. 2009.

[7] J. P. Papa and A. X. Falcão, "A new variant of the optimum-path forest classifier," in Proc. 4th Int. Symp. Adv. Vis. Comput., vol. 5358, Lecture Notes in Computer Science, Berlin, Germany, 2008, pp. 935–944.

[8] G. M. Freitas, A. M. H. Ávila, J. P. Papa, and A. X. Falcão, "Optimum-path forest-based models for rainfall estimation," presented at the 16th Int. Workshop Syst., Signals Image Process., 2009.

[9] H. Murao, I. Nishikawa, S. Kitamura, M. Yamada, and P. Xie, "A hybrid neural network system for the rainfall estimation using satellite imagery," in Proc. Int. Joint Conf. Neural Netw., 1993, vol. 2, pp. 1211–1214.

[10] A. X. Falcão, J. Stolfi, and R. A. Lotufo, "The image foresting transform: Theory, algorithms, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1, pp. 19–29, Jan. 2004.

[11] A. Rocha, P. A. V. Miranda, A. X. Falcão, and F. P. G. Bergo, "Object delineation by κ-connected components," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1–15, Jan. 2008.

[12] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[13] J. P. Papa, C. T. N. Suzuki, and A. X. Falcão, LibOPF: A Library for the Design of Optimum-Path Forest Classifiers, 2008, Software ver. 1.0. [Online]. Available: http://www.ic.unicamp.br/~afalcao/LibOPF

[14] H. A. Panofsky and G. W. Brier, Some Applications of Statistics to Meteorology. Philadelphia, PA: Mineral Ind. Continuing, 1968.

[15] R. F. Adler and A. J. Negri, "A satellite infrared technique to estimate tropical convective and stratiform rainfall," J. Appl. Meteorol., vol. 27, no. 1, pp. 30–51, 1988.
