4.2 Method
4.2.3 Classification Quality Measures
It is important to quantify the sensitivity of the CP classification algorithm to changes in the parameters that define the methodology. For example, what are the effects of changing the number of classes used for classification, or of using different temporal resolutions in the data? A good classification contains dissimilar classes while reducing the variability within each class. Furthermore, for this study it should balance the degree of certainty (or confidence) with which a particular subset of CP classes explains the extreme wave events against the total number of classes required.
Changing the number of CP classes has direct implications for the classification quality. If the number of classes is too large, the classification is unable to capture the general climatic features that describe extreme wave events. Alternatively, too few classes result in CPs that cannot describe the climatic mechanisms in sufficient detail to distinguish between them (Pringle et al., 2014).
Huth et al. (2008) list a number of quality measures, such as the explained variance (EV), the pattern correlation ratio (PCR) and the within-type standard deviation (WSD).
These measures all relate to the model's ability to maximise dissimilarity between classes while reducing the variability within classes. The explained variance incorporates a type of Euclidean distance measure using the sum of squares. It is given by
\[
\mathrm{EV} = 1 - \frac{ss_i}{ss_t} \tag{4.3}
\]
with
\[
ss_i = \frac{\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1,\, j \neq i}^{N_k}\sum_{l=1}^{L}\sum_{m=1}^{M}\left(x_{lmik} - x_{lmjk}\right)^2}{\sum_{k=1}^{K} N_k (N_k - 1)}, \tag{4.4}
\]
\[
ss_t = \frac{\sum_{i=1}^{N}\sum_{j=1,\, j \neq i}^{N}\sum_{l=1}^{L}\sum_{m=1}^{M}\left(x_{lmi} - x_{lmj}\right)^2}{N(N-1)}, \tag{4.5}
\]
where $ss_i$ is the mean total sum of squares within a class, $ss_t$ is the mean total sum of squares without classification, and $x_{lmik}$ is the anomaly value at the $(l, m)$ grid point for the $i$th index belonging to the $k$th class, with $K$ the total number of classes and $N_k$ the number of members of the $k$th class.
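As a concrete illustration, Equations 4.3–4.5 could be evaluated as in the following sketch. This is not code from the study itself; it assumes the anomaly fields are supplied as a list of 2-D NumPy arrays with one integer class label per field (both names are illustrative):

```python
import numpy as np

def mean_pairwise_ss(fields):
    """Mean sum of squared differences over all ordered pairs (i, j), i != j."""
    n = len(fields)
    total = sum(np.sum((fields[i] - fields[j]) ** 2)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def explained_variance(fields, labels):
    """EV = 1 - ss_i / ss_t (Eqs. 4.3-4.5)."""
    fields = [np.asarray(f, dtype=float) for f in fields]
    labels = np.asarray(labels)
    # ss_t: mean pairwise sum of squares without classification (Eq. 4.5)
    ss_t = mean_pairwise_ss(fields)
    # ss_i: pooled mean pairwise sum of squares within classes (Eq. 4.4)
    num, den = 0.0, 0
    for k in np.unique(labels):
        members = [f for f, lab in zip(fields, labels) if lab == k]
        n_k = len(members)
        if n_k < 2:
            continue
        # recover the raw within-class sum of squares for the pooled mean
        num += mean_pairwise_ss(members) * n_k * (n_k - 1)
        den += n_k * (n_k - 1)
    ss_i = num / den
    return 1.0 - ss_i / ss_t
```

A well-separated classification drives $ss_i$ far below $ss_t$, so EV approaches one.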
Another useful measure of the classification performance is the pattern correlation ratio. This is defined as the ratio between the mean correlation coefficient between classes and the mean correlation coefficient within the classes (Huth, 1996). It is given by
\[
\mathrm{PCR} = \frac{Pc_b}{Pc_i} \tag{4.6}
\]
with
\[
Pc_b = \frac{\sum_{k=1}^{K}\sum_{l=1,\, l \neq k}^{K}\sum_{i=1}^{N_k}\sum_{j=1}^{N_l} r(x_{ik}, x_{jl})}{\sum_{k=1}^{K}\sum_{l=1,\, l \neq k}^{K} N_k N_l}, \tag{4.7}
\]
\[
Pc_i = \frac{\sum_{k=1}^{K}\sum_{i=1}^{N_k}\sum_{j=1,\, j \neq i}^{N_k} r(x_{ik}, x_{jk})}{\sum_{k=1}^{K} N_k (N_k - 1)}, \tag{4.8}
\]
where $Pc_b$ is the mean correlation coefficient between fields from different classes, $Pc_i$ is the mean correlation coefficient between fields within a given class, $r(x_{ik}, x_{jl})$ is the correlation coefficient between the $i$th index for the $k$th class and the $j$th index for the $l$th class, $x$ is a vector containing all anomaly values, and $N_k$ is the number of anomalies belonging to the $k$th class.
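A minimal sketch of the pattern correlation ratio, under the same assumed data layout (a list of anomaly fields with one class label per field); $r$ is taken here to be the ordinary Pearson correlation between flattened fields:

```python
import numpy as np

def pattern_correlation_ratio(fields, labels):
    """PCR = Pc_b / Pc_i: mean between-class over mean within-class
    pattern correlation (Eqs. 4.6-4.8)."""
    flat = [np.asarray(f, dtype=float).ravel() for f in fields]
    labels = np.asarray(labels)
    # pairwise Pearson correlations between all flattened anomaly fields
    r = np.corrcoef(np.vstack(flat))
    between, n_b, within, n_w = 0.0, 0, 0.0, 0
    n = len(flat)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if labels[i] == labels[j]:
                within += r[i, j]
                n_w += 1
            else:
                between += r[i, j]
                n_b += 1
    pc_b = between / n_b   # Eq. 4.7
    pc_i = within / n_w    # Eq. 4.8
    return pc_b / pc_i
```

For a good classification, fields within a class correlate strongly while fields from different classes do not, so PCR should be well below one.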
The within-class standard deviation also provides insight into the ability of the classification to minimize variability within a class. It is given by
\[
\mathrm{WSD} = \frac{1}{K}\sum_{i=1}^{K} sd_i \tag{4.9}
\]
where $K$ is the total number of classes and $sd_i$ is the standard deviation within the $i$th class.
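Equation 4.9 can be sketched as follows. Note that the text does not specify how $sd_i$ is computed, so pooling all anomaly values within a class into a single standard deviation is an assumption made here purely for illustration:

```python
import numpy as np

def within_class_sd(fields, labels):
    """WSD = (1/K) * sum of per-class standard deviations (Eq. 4.9).
    Assumption: sd_i pools all anomaly values belonging to class i."""
    labels = np.asarray(labels)
    flat = np.array([np.asarray(f, dtype=float).ravel() for f in fields])
    sds = []
    for k in np.unique(labels):
        members = flat[labels == k]
        # pooled standard deviation of every anomaly value in the class
        sds.append(members.std())
    return float(np.mean(sds))
```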
The aforementioned performance measures quantify the quality of the classification with respect to a single variable only, i.e. the pressure anomalies. However, for this study it is also important to evaluate how well the classification performs in deriving CPs that are associated with extreme wave events. For this we propose the Shannon entropy.
Shannon Entropy as a measure of classification quality
A good classification should balance the degree to which a particular subset of CP classes explains extreme wave events against the total number of classes required. Furthermore, this performance measure should be based on the wave climate and be strongly coupled with the classification, yet be applied in an independent manner.
Since our objective is to successfully identify atmospheric circulation patterns that drive extreme wave events, a method that quantifies the model's performance against this objective is required. Simply measuring the variability within and between classes reveals little about how well the classification explains extreme events.
Bárdossy (2010) used the objective functions themselves to identify an optimal number of CP classes. However, the objective functions are not an independent measure of the classification quality, since they are used in the optimization process to derive the classes.
The Shannon entropy (Shannon, 1948) provides useful information on the certainty of an outcome or the expected information quantity and is defined as
\[
H = -\sum_{n=1}^{N} p_n \log p_n \tag{4.10}
\]
where $p_n$ is the probability of occurrence of event $n$ and $N$ is the total number of events.
Shannon entropy has found wide application. Originally introduced to gauge the certainty of an outcome, it is used extensively in information theory (Shannon, 1948). It has also been shown to be a useful tool in statistical modeling; see, for example, Petrov et al. (2013) and Gotovac & Gotovac (2009). Equation 4.10 has the following properties that are important for the present study (Shannon, 1948):
(a) a smaller number of classes contains more information, i.e. more CPs are assigned to each class; this reduces the certainty with which the classification explains extreme wave events, thereby increasing the entropy,
(b) $H$ is always positive unless all but one of the $p_n$ are zero, the remaining one having the value unity, in which case $H = 0$, and
(c) the entropy of two independent events is the sum of their individual entropies.
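These properties can be verified with a direct implementation of Equation 4.10. The sketch below uses the natural logarithm (the choice of base only rescales $H$) and the standard convention that terms with $p_n = 0$ contribute nothing:

```python
import math

def shannon_entropy(p):
    """H = -sum p_n log p_n (Eq. 4.10); terms with p_n = 0 contribute 0."""
    return -sum(pn * math.log(pn) for pn in p if pn > 0)
```

For instance, a distribution concentrated on a single outcome gives $H = 0$ (property b), while combining two independent distributions adds their entropies (property c).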
The contribution of CPs to extreme events provides useful insight into the drivers of those events. This information is also unbiased towards the frequency of occurrence, since it is conditioned on the occurrence of a CP given an event. Therefore it is possible, for example, to identify a class that both drives extreme events and occurs frequently.
However, it is preferable to find classes that occur infrequently but still contribute to the event space, since extreme wave events themselves occur infrequently. Hence the contribution of CPs to events, defined as $P_n = p(\mathrm{CP} \mid H_s \geq 3.5)$, can be used to calculate the entropy as follows:
\[
H = -\frac{1}{K}\sum_{n=1}^{K} P_n \ln P_n \tag{4.11}
\]
where $P_n$ is the contribution of class $n$ to extreme events and $K$ is the total number of classes. Equation 4.11 measures the average entropy per CP class. Increasing the number of classes improves the ability of the classification algorithm to find classes driving extreme wave events. Therefore $P_n$ for the classes driving the events should increase, whereas $P_n$ for all other classes should decrease. This increases the certainty with which the CP classes drive the events, thus reducing the average entropy. However, increasing the number of classes eventually leads to diminishing returns: a larger set of classes improves the classification's ability to capture specific climatic features while simultaneously reducing its ability to capture general features.
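Equation 4.11 can be sketched as below. The sketch assumes a sequence of CP class labels (e.g. one per day) and a parallel boolean indicator marking extreme wave events; both names are illustrative, and $P_n$ is estimated as the empirical frequency of class $n$ among the extreme events:

```python
import math

def average_class_entropy(cp_labels, is_extreme):
    """Average entropy per CP class (Eq. 4.11), with P_n estimated as the
    fraction of extreme events occurring under CP class n."""
    K = len(set(cp_labels))
    # count CP occurrences conditioned on extreme events
    counts, total = {}, 0
    for cp, extreme in zip(cp_labels, is_extreme):
        if extreme:
            counts[cp] = counts.get(cp, 0) + 1
            total += 1
    # H = -(1/K) * sum P_n ln P_n over classes that contribute to events
    H = 0.0
    for n in counts:
        p = counts[n] / total
        H -= p * math.log(p)
    return H / K
```

When a single class explains all extreme events, the average entropy falls to zero; when the events are spread evenly across classes, it rises towards $\ln K / K$.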