Greek Statistical Institute
Proceedings of the 18th Panhellenic Statistics Conference (2005), pp. 485-494
A DISCREPANCY BASED MODEL SELECTION
CRITERION
Kyriacos Mattheou and Alex Karagrigoriou
University of Cyprus
ABSTRACT
The aim of this work is to develop a new criterion of model selection using a general technique based on measures of discrepancy. The new criterion is constructed using the Power Divergence introduced by Basu et al. (1998) and is shown to be an asymptotically unbiased estimator of the expected overall discrepancy between the true and the fitted models.
1. INTRODUCTION
A model selection criterion can be constructed by an approximately unbiased estimator of an expected “overall discrepancy” (or divergence), a nonnegative quantity which measures the “distance” between the true model and a fitted approximating model. A well known divergence is the Kullback-Leibler discrepancy that was used by Akaike (1973) to develop the Akaike Information Criterion (AIC).
Measures of discrepancy or divergence between two probability distributions have a long history. A unified analysis was recently provided by Cressie and Read (1984), who introduced the so-called power divergence family of statistics for multinomial goodness-of-fit tests. The fit of the model for the behaviour of a population can be assessed by comparing expected ($n\pi_i$) and observed ($X_i$) frequencies using the family of power divergence statistics

(1.1) $2nI^{\lambda} = \dfrac{2}{\lambda(\lambda+1)} \sum_{i} X_i \left[ \left( \dfrac{X_i}{n\pi_i} \right)^{\lambda} - 1 \right], \quad \lambda \in \mathbb{R}.$

For $\lambda = 1$ the statistic (1.1) becomes the well-known Pearson's $X^2$ statistic, while for $\lambda \to 0$ it becomes

(1.2) $G^2 = 2 \sum_{i} X_i \log\left( \dfrac{X_i}{n\pi_i} \right),$

which coincides with the Kullback-Leibler distance.
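As an aside (this illustration and its function names are ours, not the paper's), the family (1.1) and its two named members are easy to check numerically:

```python
import math

def power_divergence(obs, exp, lam):
    """Cressie-Read power divergence statistic 2nI^lambda of (1.1).

    For lam == 0 the limiting form, the likelihood-ratio statistic
    G^2 = 2 * sum(obs * log(obs / exp)) of (1.2), is used.
    """
    if lam == 0:
        return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp))
    return (2.0 / (lam * (lam + 1))) * sum(
        o * ((o / e) ** lam - 1.0) for o, e in zip(obs, exp)
    )

def pearson_x2(obs, exp):
    """Pearson's X^2 = sum((obs - exp)^2 / exp)."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

observed = [18.0, 55.0, 27.0]   # X_i
expected = [20.0, 50.0, 30.0]   # n * pi_i

# lambda = 1 recovers Pearson's X^2 exactly ...
assert abs(power_divergence(observed, expected, 1.0)
           - pearson_x2(observed, expected)) < 1e-12
# ... and small lambda approaches the G^2 (Kullback-Leibler) statistic.
assert abs(power_divergence(observed, expected, 1e-6)
           - power_divergence(observed, expected, 0.0)) < 1e-4
```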
The term power divergence describes the fact that the statistic measures the divergence of the “expected” from the “observed frequencies” through a (weighted) sum of powers of the term (observed / expected).
Note that in the continuous case, the power divergence between the true and the hypothesized distributions $g$ and $f$ takes the form [Cressie and Read, 1988, p. 125]

(1.3) $I^{\lambda}(g,f) = \dfrac{1}{\lambda(\lambda+1)} \int g(z) \left[ \left( \dfrac{g(z)}{f(z)} \right)^{\lambda} - 1 \right] dz,$

which for $\lambda \to 0$ becomes the Kullback-Leibler distance used, as mentioned before, to construct the AIC criterion. A new measure of discrepancy was recently introduced by Basu et al. (1998). This new family of discrepancy measures, which is given in (2.1), is referred to as the class of density power divergences and is indexed by a single parameter $a$. Its members measure the divergence between two densities $f$ and $g$ through integrals of powers of the densities and, as $a \to 0$, reduce to the Kullback-Leibler distance (see Lemma 2.1, Section 2).
In this paper, we develop a new model selection criterion which is shown to be an approximately unbiased estimator of the expected overall discrepancy that corresponds to Basu’s density power divergence.
2. POWER DIVERGENCE AND THE EXPECTED OVERALL
DISCREPANCY
In parametric estimation, many methods have been introduced for extracting an estimator of the true parameter. Some of them are minimum distance methods, based on measures of discrepancy such as the density power divergence of Basu et al. (1998):

(2.1) $d_a(g,f) = \int \left\{ f^{1+a}(z) - \left( 1 + \dfrac{1}{a} \right) g(z) f^{a}(z) + \dfrac{1}{a}\, g^{1+a}(z) \right\} dz, \quad a > 0,$

where $g$ is the true model, $f$ the fitted approximating model, and $a$ a positive number.
The minimum density power divergence estimator $\hat{\theta}$ of the parameter $\vartheta$ is generated by minimising

(2.2) $H_n(\vartheta) = \int f_{\vartheta}^{1+a}(z)\,dz - \left( 1 + \dfrac{1}{a} \right) \dfrac{1}{n} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i).$

For general families, as can easily be seen from equation (2.2) and Lemma (2.2), the estimating equations are of the form

(2.3) $\dfrac{1}{n} \sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) - \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\,dz = 0,$

where $u_{\vartheta}(z) = \partial \log f_{\vartheta}(z) / \partial \vartheta$ is the score function. Note that this estimating equation is unbiased when $g = f_{\vartheta}$. Some motivation for the form of the divergence (2.1) can be obtained by looking at the location model, where $\int f_{\vartheta}^{1+a}(z)\,dz$ is independent of $\vartheta$. In this case, the proposed estimators maximise $\dfrac{1}{n} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i)$, with the corresponding estimating equations being of the form

(2.4) $\sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) = 0.$
This can be viewed as a weighted version of the efficient maximum likelihood score equation. When a>0, (2.4) provides a relative-to-the-model downweighting for outlying observations; observations that are wildly discrepant with respect to the model will get nearly zero weights. In the fully efficient case a=0, all observations, including very severe outliers, get weights equal to one.
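To see the downweighting in action, here is a minimal numerical sketch (ours, not the authors'): the minimum density power divergence estimate of a normal location parameter with unit variance, found by a grid search. For the location model the integral term of (2.2) is constant in $\vartheta$, so minimising (2.2) amounts to maximising $\frac{1}{n}\sum_i f_{\vartheta}^{a}(X_i)$, and a gross outlier contributes almost nothing to the objective.

```python
import math

def mdpde_location(data, a, grid):
    """Grid-search minimum density power divergence estimate of the
    location of a N(theta, 1) model.  For this model, minimising (2.2)
    is equivalent to maximising sum_i f_theta(X_i)^a, i.e. (up to
    constants free of theta) sum_i exp(-a * (x_i - theta)^2 / 2)."""
    def score(theta):
        return sum(math.exp(-a * (x - theta) ** 2 / 2.0) for x in data)
    return max(grid, key=score)

data = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, 50.0]  # six inliers, one gross outlier
grid = [i / 100.0 for i in range(-300, 1200)]  # theta candidates in [-3, 12)

sample_mean = sum(data) / len(data)            # the a -> 0 (maximum likelihood) answer
robust_est = mdpde_location(data, a=0.5, grid=grid)

# The mean is dragged toward the outlier; the a = 0.5 estimate is not,
# because exp(-a * (50 - theta)^2 / 2) is essentially zero near theta = 0.
assert sample_mean > 5.0
assert abs(robust_est) < 0.5
```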
To construct the new criterion for goodness of fit we shall consider the quantity

(2.5) $W_{\vartheta} = \int \left\{ f_{\vartheta}^{1+a}(z) - \left( 1 + \dfrac{1}{a} \right) g(z) f_{\vartheta}^{a}(z) \right\} dz, \quad a > 0,$

which can equivalently be written as

(2.6) $W_{\vartheta} = \int f_{\vartheta}^{1+a}(z)\,dz - \left( 1 + \dfrac{1}{a} \right) E_g\left[ f_{\vartheta}^{a}(Z) \right], \quad a > 0.$

Our target theoretical quantity is
(2.7) $E\left[ W_{\hat{\theta}} \right],$

where $\hat{\theta}$ is the estimator of the parameter that minimizes (2.2). Observe that $E[W_{\hat{\theta}}]$ can be viewed as the average distance between $g$ and $f_{\vartheta}$ and is known as the expected overall discrepancy between $g$ and $f_{\vartheta}$. Note that the target quantity takes a different value for each candidate model $f_{\vartheta}$ used. Our purpose is to obtain unbiased estimates of the theoretical quantity (2.7) for each $f_{\vartheta}$, which will then be used as a new criterion for model selection, denoted by DIC (Divergence Information Criterion). The model $f_{\vartheta}$ selected is the one for which DIC is minimized. This is discussed in Section 3. The following Lemma provides the second derivative of (2.6); the first derivative of (2.6) is given in Lemma 2.2.
Lemma 2.4. If the true distribution $g$ belongs to the parametric family $\{f_{\vartheta}\}$, then the second derivative of (2.6) simplifies to

(2.8) $\dfrac{\partial^{2} W_{\vartheta}}{\partial \vartheta^{2}} = (1+a) \int u_{\vartheta}^{2}(z) f_{\vartheta}^{1+a}(z)\,dz = (1+a) J,$

since the remaining term, under the above assumption, is equal to 0.

Proof. If the true distribution $g$ belongs to the parametric family $\{f_{\vartheta}\}$, then

(2.9) $\dfrac{\partial W_{\vartheta}}{\partial \vartheta} = (1+a) \left[ \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\,dz - \int g(z)\, u_{\vartheta}(z) f_{\vartheta}^{a}(z)\,dz \right],$

and it is obvious that the first derivative is 0 for $g = f_{\vartheta}$. Differentiating (2.9) once more gives

(2.10) $\dfrac{\partial^{2} W_{\vartheta}}{\partial \vartheta^{2}} = (1+a) \left[ \int \left( \dfrac{\partial u_{\vartheta}}{\partial \vartheta} + (1+a) u_{\vartheta}^{2} \right) f_{\vartheta}^{1+a}\,dz - \int g \left( \dfrac{\partial u_{\vartheta}}{\partial \vartheta} + a\, u_{\vartheta}^{2} \right) f_{\vartheta}^{a}\,dz \right],$

which for $g = f_{\vartheta}$ reduces to $(1+a) \int u_{\vartheta}^{2}(z) f_{\vartheta}^{1+a}(z)\,dz$, as claimed.
3. THE NEW CRITERION DIC
In this section we introduce the new criterion, which we show to be an approximately unbiased estimator of (2.7). First we have to estimate (2.6); since the true distribution $g$ is unknown, we replace the expectation $E_g$ by its empirical counterpart and define $Q_{\vartheta}$ to be

(3.1) $Q_{\vartheta} = \int f_{\vartheta}^{1+a}(z)\,dz - \left( 1 + \dfrac{1}{a} \right) \dfrac{1}{n} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i).$

The following Lemma provides the derivatives of $Q_{\vartheta}$.

Lemma 3.1. The first derivative of (3.1) is

$\dfrac{\partial Q_{\vartheta}}{\partial \vartheta} = (1+a) \left[ \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\,dz - \dfrac{1}{n} \sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) \right],$

and the second derivative of (3.1) is

(3.2) $\dfrac{\partial^{2} Q_{\vartheta}}{\partial \vartheta^{2}} = (1+a) \left[ \int \left( \dfrac{\partial u_{\vartheta}}{\partial \vartheta} + (1+a) u_{\vartheta}^{2} \right) f_{\vartheta}^{1+a}\,dz - \dfrac{1}{n} \sum_{i=1}^{n} \left( \dfrac{\partial u_{\vartheta}}{\partial \vartheta}(X_i) + a\, u_{\vartheta}^{2}(X_i) \right) f_{\vartheta}^{a}(X_i) \right].$
Proof. The proof is similar to the proofs of Lemma (2.2) and Lemma (2.3). The following theorem has been proved by Basu et al. (1998).
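The first derivative in Lemma 3.1 is easy to verify numerically; the sketch below (ours, with hypothetical helper names) does so for the $N(\vartheta,1)$ family, for which $\int f_{\vartheta}^{1+a}\,dz = (2\pi)^{-a/2}(1+a)^{-1/2}$ and the integral term of the derivative vanishes by symmetry.

```python
import math

def q_objective(theta, data, a):
    """Q_theta of (3.1) for the N(theta, 1) family, where
    integral(f_theta^{1+a}) = (2*pi)**(-a/2) * (1+a)**(-1/2)."""
    const = (2 * math.pi) ** (-a / 2) / math.sqrt(1 + a)
    f_a = [(math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)) ** a
           for x in data]
    return const - (1 + 1 / a) * sum(f_a) / len(data)

def q_prime(theta, data, a):
    """First derivative from Lemma 3.1.  For the location model the
    integral term vanishes by symmetry, leaving
    -(1 + a) * (1/n) * sum_i u_theta(X_i) * f_theta(X_i)^a,
    with score u_theta(x) = x - theta."""
    n = len(data)
    s = 0.0
    for x in data:
        u = x - theta
        f = math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)
        s += u * f ** a
    return -(1 + a) * s / n

data = [0.2, -1.1, 0.7, 1.5, -0.4]
a, theta, h = 0.5, 0.3, 1e-6

# Central finite difference of (3.1) agrees with the Lemma 3.1 formula.
fd = (q_objective(theta + h, data, a) - q_objective(theta - h, data, a)) / (2 * h)
assert abs(fd - q_prime(theta, data, a)) < 1e-7
```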
Theorem 3.1 [Basu et al. (1998)]. Under certain regularity conditions, for $\hat{\theta}$ which minimizes (2.2), we have, as $n \to \infty$:

(i) $\hat{\theta}$ is consistent for $\theta$, and

(ii) $\sqrt{n}\,( \hat{\theta} - \theta )$ is asymptotically normal with mean zero and variance $J^{-2} K$, where $J = J(\theta)$ and $K = K(\theta)$, under the assumption that the true distribution $g$ belongs to the parametric family $\{f_{\vartheta}\}$ with $\theta$ being the true value of the parameter, are given by

(3.3) $J = \int \left[ u_{\theta}(z) \right]^{2} f_{\theta}^{1+a}(z)\,dz, \qquad K = \int \left[ u_{\theta}(z) \right]^{2} f_{\theta}^{1+2a}(z)\,dz - \xi^{2}, \qquad \xi = \int u_{\theta}(z)\, f_{\theta}^{1+a}(z)\,dz.$

By the weak law of large numbers,
(3.4) $Q_{\vartheta} \xrightarrow{P} W_{\vartheta},$

and under the assumption that the true distribution $g$ belongs to the parametric family $\{f_{\vartheta}\}$ and from Lemma (2.4), we have, for the estimator $\hat{\theta}$ which minimizes Basu's discrepancy, the following approximation.

Theorem 3.2. $\quad W_{\theta} \approx E\left[ Q_{\hat{\theta}} \right] + \dfrac{(1+a)}{2}\, E\left[ \left( \hat{\theta} - \theta \right)^{2} \right] J.$

Proof. A Taylor expansion of $Q_{\vartheta}$ around $\hat{\theta}$, at which the first derivative of $Q_{\vartheta}$ vanishes, together with (3.4) and Lemma (2.4), gives $Q_{\vartheta} \approx Q_{\hat{\theta}} + \dfrac{(1+a)}{2} \left( \vartheta - \hat{\theta} \right)^{2} J$. Substituting the true value $\theta$ for $\vartheta$ and taking expectations on both sides we have the desired result.

Theorem 3.3. The expected overall discrepancy evaluated at $\hat{\theta}$ is given by

$E\left[ W_{\hat{\theta}} \right] = E\left[ Q_{\hat{\theta}} \right] + (1+a)\, E\left[ \left( \hat{\theta} - \theta \right)^{2} \right] J.$

Proof. Observe that the expectation of (3.1) for $\vartheta = \theta$ yields $E\left[ Q_{\theta} \right] = W_{\theta}$. Combining the above relation with the result of Theorem (3.2) and using equation (2.9), we obtain the desired result for the expected overall discrepancy.

The above result for a p-dimensional
$\hat{\theta}$ can be expressed as

$E\left[ W_{\hat{\theta}} \right] = E\left[ Q_{\hat{\theta}} \right] + (1+a)\, E\left[ \left( \hat{\theta} - \theta \right)' J \left( \hat{\theta} - \theta \right) \right].$

Taking into consideration [see Basu et al. (1998)] that the asymptotic covariance matrix $J^{-1} K J^{-1}$ of $\hat{\theta}$ reduces, as $a \to 0$, to the asymptotic covariance matrix of the maximum likelihood estimator of the p-dimensional parameter $\theta$, the new criterion DIC is defined by

(3.6) $\mathrm{DIC} = Q_{\hat{\theta}} + \dfrac{(1+a)}{n}\, \mathrm{tr}\left( \hat{K} \hat{J}^{-1} \right),$

where $\hat{J}$ and $\hat{K}$ denote the matrices of (3.3) evaluated at $\hat{\theta}$.
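The scalar versions of the quantities $J$, $K$ and $\xi$ in (3.3) are plain integrals and can be checked numerically. The sketch below (ours, with hypothetical helper names) evaluates them for the $N(\theta,1)$ model, whose score is $u_{\theta}(z) = z - \theta$; as $a \to 0$ both $J$ and $K$ reduce to the Fisher information (equal to 1 here) and $\xi = 0$, recovering the classical maximum likelihood asymptotics.

```python
import math

def normal_pdf(z, theta):
    return math.exp(-(z - theta) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def jk_xi(theta, a, lo=-12.0, hi=12.0, n=200001):
    """Trapezoidal evaluation of J, K and xi of (3.3) for the
    N(theta, 1) model, with score u_theta(z) = z - theta."""
    h = (hi - lo) / (n - 1)
    J = K = xi = 0.0
    for i in range(n):
        z = lo + i * h
        w = h if 0 < i < n - 1 else h / 2.0   # trapezoid endpoint weights
        u = z - theta
        f = normal_pdf(z, theta)
        J += w * u ** 2 * f ** (1.0 + a)
        K += w * u ** 2 * f ** (1.0 + 2.0 * a)
        xi += w * u * f ** (1.0 + a)
    return J, K - xi ** 2, xi

# a -> 0: J = K = Fisher information = 1 for N(theta, 1), xi = 0.
J0, K0, xi0 = jk_xi(theta=0.3, a=0.0)
assert abs(J0 - 1.0) < 1e-6 and abs(K0 - 1.0) < 1e-6 and abs(xi0) < 1e-8

# For a > 0, closed forms are J = (2*pi)**(-a/2) * (1+a)**(-1.5) and
# K = (2*pi)**(-a) * (1+2*a)**(-1.5), with xi = 0 by symmetry.
J5, K5, _ = jk_xi(theta=0.3, a=0.5)
assert abs(J5 - (2 * math.pi) ** -0.25 * 1.5 ** -1.5) < 1e-6
assert abs(K5 - (2 * math.pi) ** -0.5 * 2.0 ** -1.5) < 1e-6
```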
Note that the underlying family of divergences is indexed by a single parameter $a$ which controls the trade-off between robustness and asymptotic efficiency of the parameter estimators, which are the minimizers of this family of divergences. When $a \to 0$, Basu's density power divergence becomes the Kullback-Leibler divergence and the method reduces to maximum likelihood estimation; when $a = 1$, the divergence is the $L_2$ distance, and a robust but inefficient minimum mean squared error estimator ensues. The index $a$ thus determines to what extent the resulting methods become statistically more robust than the maximum likelihood methods, and should be thought of as an algorithmic parameter. The robustness of the proposed method can be easily understood in the case of the location model where, for $a > 0$, the estimating equations given in (2.4) provide a downweighting for observations wildly discrepant with respect to the underlying model. One way of selecting the parameter $a$ is to fix the efficiency loss, at the ideal parametric model employed, at some low level, like five or ten percent. In some practical applications, other ways could involve prior notions of the extent of contamination of the model.
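The two limits just described (Kullback-Leibler as $a \to 0$, $L_2$ at $a = 1$) can be verified directly from the definition (2.1). The sketch below (ours, not the paper's) evaluates $d_a(g,f)$ for $g = N(0,1)$ and $f = N(1,1)$ by numerical integration; in that case the Kullback-Leibler distance equals $1/2$.

```python
import math

def npdf(z, mu):
    return math.exp(-(z - mu) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def basu_divergence(a, lo=-15.0, hi=15.0, n=120001):
    """Trapezoidal evaluation of d_a(g, f) in (2.1) with
    g = N(0,1) (true) and f = N(1,1) (fitted)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        w = h if 0 < i < n - 1 else h / 2.0
        g, f = npdf(z, 0.0), npdf(z, 1.0)
        total += w * (f ** (1 + a)
                      - (1 + 1 / a) * g * f ** a
                      + (1 / a) * g ** (1 + a))
    return total

def l2_sq(lo=-15.0, hi=15.0, n=120001):
    """Squared L2 distance between the same two densities."""
    h = (hi - lo) / (n - 1)
    return sum((h if 0 < i < n - 1 else h / 2.0)
               * (npdf(lo + i * h, 0.0) - npdf(lo + i * h, 1.0)) ** 2
               for i in range(n))

# a -> 0: the Kullback-Leibler distance KL(N(0,1), N(1,1)) = 1/2.
assert abs(basu_divergence(1e-4) - 0.5) < 1e-3
# a = 1: d_1(g, f) = integral of (f - g)^2, the squared L2 distance.
assert abs(basu_divergence(1.0) - l2_sq()) < 1e-9
```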
This criterion could be used in applications where outliers or contaminated observations are involved. Preliminary simulations with a contamination proportion of approximately 10% show that DIC has a tendency of underestimation, selecting the true model as well as smaller models, in contrast with AIC, which tends to overestimate the true model.
Acknowledgments
The authors wish to thank Tasos Christofides for fruitful conversations and an anonymous referee for insightful comments and suggestions that greatly improved the quality of the paper.
ΠΕΡΙΛΗΨΗ (Summary in Greek)
The aim of this work is the development of a new model selection criterion using a general technique based on measures of "discrepancy" (or divergence) between the true and the fitted model. The new criterion is constructed using a measure of "discrepancy", the Power Divergence introduced by Basu et al. (1998), and is shown to be an asymptotically unbiased estimator of the expected overall "discrepancy" between the true model and the candidate model.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory. (B. N. Petrov and F. Csaki, eds.), 267-281, Akademiai Kaido, Budapest.
Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85, no. 3, 549–559.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Ann. Statist., 5, 445–463.
Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. R. Statist. Soc., B 46, 440–454.
Cressie, N. and Read, T. R. C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer Verlag, New York.