Chapter 1

Introduction and Summary

1.1 Introduction

The problem of estimating Shannon’s entropy has received considerable attention from researchers in different areas of science and technology. In the theory of communication, entropy refers to the uncertainty content in a message to be transmitted through a communication channel. The concept of entropy is also widely referred to as a measure of biological disorder in biology and as a measure of diversity indices of different species in ecology. Molecular biologists use the concept of Shannon’s entropy in the analysis of patterns in gene sequences. It is also used in hydrology and water resources to understand, characterize and forecast various phenomena. For a detailed description of entropy one may refer to Cover and Thomas (2006), Robinson (2008) and Broadbridge and Guttmann (2009). Some of the important properties of entropy are given below:

Let H(p1, . . . , pn) denote the entropy of a system containing n components with probabilities of occurrence p1, . . . , pn.

Nonnegativity:

Entropy is always non-negative, that is, H(p) ≥ 0, where p = (p1, . . . , pn).

Continuity:

Entropy H(p) is a continuous function of p, which means that a small change in the values of the probabilities produces only a small change in the entropy.

Symmetry:

Entropy is a symmetric function of the probabilities, that is, the entropy does not change if


we change the order of the probabilities. Thus

H(p1, . . . , pn) = H(pn, . . . , p1) = . . . = H(pi1, . . . , pin) for every permutation (i1, . . . , in) of (1, . . . , n).

Maximality:

Entropy is maximum when all outcomes are equally likely, that is, when the probabilities of all outcomes are equal, which means

H(p1, . . . , pn) ≤ H(1/n, . . . , 1/n).

Equality holds when pi = 1/n for all i = 1, . . . , n.

Concavity:

Entropy H(p) is a concave function of p.

Additivity:

Entropy is additive. If a system consists of two independent sub-systems having n and m components with probabilities of occurrence (p11, . . . , p1n) and (p21, . . . , p2m) respectively, then

H(p11p21, . . . , p11p2m, p12p21, . . . , p12p2m, . . . , p1np21, . . . , p1np2m) = H(p11, . . . , p1n) + H(p21, . . . , p2m).
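These properties are easy to verify numerically. The following sketch (an illustration, not part of the original text; it assumes NumPy is available) checks the maximality and additivity properties for a small example distribution.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum p_i log p_i, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

p = np.array([0.5, 0.25, 0.125, 0.125])  # a system with n = 4 components
q = np.array([0.7, 0.3])                 # an independent sub-system with m = 2 components

# Maximality: H(p) <= H(1/n, ..., 1/n) = log2(n)
assert entropy(p) <= np.log2(len(p)) + 1e-12

# Additivity: the entropy of the combined system (probabilities p_i * q_j)
# equals the sum of the entropies of the two sub-systems
joint = np.outer(p, q).ravel()
assert np.isclose(entropy(joint), entropy(p) + entropy(q))
```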

1.2 Entropy in Thermodynamics

The concept of entropy was first introduced in 1865 in the second law of thermodynamics by Clausius. It states that the entropy of an isolated system not in equilibrium tends to increase over time. The entropy attains its maximum value when the system is in equilibrium. He defined entropy as dS = dQ/T, where dS is the small change in entropy, dQ is the quantity of heat and T is the temperature at which heat is supplied to the system.

In 1872, Boltzmann made the connection between thermodynamical entropy and statistical entropy with the equation S = K ln W, where K is the Boltzmann constant and W is the number of realizations that a system has. Specifically, we can say that entropy is a logarithmic measure of the density of states.

The probabilistic nature of the concept of entropy was introduced by Gibbs in 1880, who defined entropy as H = K ∑_{i=1}^{n} pi ln(1/pi), where pi is the probability of occurrence of the ith event of a system containing n events.


1.3 Some Measures of Information

Whenever we make some statistical observations or design and conduct statistical experiments, we seek information. The question often arises: how much information can we get from a particular set of statistical observations or experiments about the population? The answer to this question has been given by several authors by defining measures of information in different ways, which are given below.

Fisher (1925) was probably the first to study the measure of the information supplied by data about an unknown parameter of a distribution. Let X be a vector valued random variable with values in a space T and having probability density fθ(x), θ ∈ Θ, with respect to a σ-finite measure υ. Further, let fθ(x) be differentiable with respect to θ and, for any C ⊂ T,

(d/dθ) ∫_C fθ(x) dυ = ∫_C (dfθ(x)/dθ) dυ.

Fisher’s information measure on θ contained in the random variable X is defined by

I(θ) = E[(d ln fθ(X)/dθ)^2].
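For instance, for the exponential density fθ(x) = θ exp(−θx), the score is d ln fθ(x)/dθ = 1/θ − x and I(θ) = 1/θ². This is easy to check by Monte Carlo; the sketch below is purely illustrative and assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                              # rate of the exponential density f(x) = theta * exp(-theta * x)
x = rng.exponential(scale=1.0 / theta, size=200_000)

score = 1.0 / theta - x                  # d/dtheta of ln f_theta(x) = ln(theta) - theta * x

# Monte Carlo estimate of I(theta) = E[(d ln f_theta(X)/dtheta)^2]; exact value is 1/theta^2 = 0.25
print(np.mean(score ** 2))
```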

Hartley (1928) introduced a logarithmic measure of information for communication.

When one message is chosen from a finite set of equally likely choices, then the number of possible choices, or any monotonic function of the number of choices, can be regarded as a measure of information. Hartley pointed out that the logarithmic function is the most natural measure. Usually the logarithm with base two is chosen. Therefore one relay or flip-flop, which can be in either of two stable positions, holds one bit of information; n such devices can store n bits, since the total number of possible states is 2^n and I = log2(2^n) = n. Hartley defined the information contained in an outcome U as

I(U) = log2(1/P{U}).

Shannon’s entropy is a broader and more general concept. It was originally derived by Shannon (1948) to study the amount of information in a transmitted message. Entropy is expressed in terms of a discrete set of probabilities pi, i = 1, . . . , n. Shannon defined entropy as

S = ∑_{i=1}^{n} pi log2(1/pi).


In the case of transmitted messages, these probabilities are the probabilities that a particular message is transmitted. The entropy of the message system was a measure of how much information was in the message. It can be shown that Shannon’s entropy is a weighted average of the numbers −log2 p1, . . . , −log2 pn, where the weights are the probabilities of the various values of X. This suggests that Shannon’s entropy may be interpreted as the expectation of a random variable which assumes the values −log2 pi with probabilities pi, i = 1, . . . , n. He also introduced the concept of mutual information, which measures the mutual dependence of two random variables. It measures the reduction in the uncertainty of one random variable due to the knowledge of the other. The mutual information I(X;Y) between two random variables X and Y is defined as

I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log2 [p(x, y)/(p(x)p(y))], if (X, Y) is of the discrete type,

I(X;Y) = ∫_X ∫_Y f(x, y) log2 [f(x, y)/(f(x)f(y))] dx dy, if (X, Y) is of the continuous type,

where p(x, y) and f(x, y) denote the joint probability mass and density functions of (X, Y) respectively, p(x) and p(y) denote the marginal probability mass functions of X and Y, and f(x) and f(y) denote the marginal density functions of X and Y.
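In the discrete case, the mutual information can be computed directly from a joint probability table. The sketch below (illustrative only, assuming NumPy) uses the standard identity I(X;Y) = H(X) + H(Y) − H(X, Y), where H denotes Shannon’s entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A joint pmf p(x, y) on a 2 x 3 alphabet: rows index X, columns index Y
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.05, 0.30]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)     # marginal distributions of X and Y

# I(X;Y) = H(X) + H(Y) - H(X, Y); nonnegative, zero iff X and Y are independent
mi = entropy(px) + entropy(py) - entropy(pxy.ravel())
print(mi)
```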

Kullback and Leibler (1951) defined the relative entropy as

D(p||q) = ∑_{x∈X} p(x) log2 [p(x)/q(x)], if p(x) and q(x) are probability mass functions,

D(p||q) = ∫_{−∞}^{∞} f(x) log2 [f(x)/g(x)] dx, if f(x) and g(x) are probability density functions.

It is also known as the Kullback-Leibler distance, cross entropy or information divergence. It is a measure of the distance between two distributions. The relative entropy D(p||q) is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p.
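A minimal numerical sketch of the discrete case is given below (illustrative only, assuming NumPy); note in particular that D(p||q) is not symmetric in p and q.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; requires q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])    # the true distribution
q = np.array([1/3, 1/3, 1/3])    # the assumed (uniform) distribution

print(kl_divergence(p, q))       # > 0, and in general different from kl_divergence(q, p)
```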

Rényi (1961) defined entropy as Rα = (1/(1 − α)) log2 ∑_i p_i^α for α > 0, α ≠ 1. Rényi’s entropy is a more general entropy than Hartley’s entropy and Shannon’s entropy. In particular, as α tends to 0, Rényi’s entropy converges to Hartley’s entropy, and as α tends to 1, Rényi’s entropy converges to Shannon’s entropy.
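This limiting behaviour is easy to observe numerically; the following sketch (illustrative only, assuming NumPy) evaluates Rα near α = 1 and near α = 0 for a four-point distribution.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy R_alpha(p) in bits, for alpha > 0 and alpha != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.6, 0.25, 0.1, 0.05])
shannon = -np.sum(p * np.log2(p))

print(renyi_entropy(p, 1.0001), shannon)   # near alpha = 1: close to Shannon's entropy
print(renyi_entropy(p, 1e-6), np.log2(4))  # near alpha = 0: close to Hartley's entropy log2(4)
```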


1.4 Units of Entropy

Entropy is independent of the units of the data. The unit of entropy depends on the base of the logarithmic function used in the definition of entropy. If the base of the logarithm is 2, e or 10, then the unit is the bit, nat or dit, respectively.

1.5 A Review of the Problems of Estimating the Entropy in Statistics

In this section we give a brief review of nonparametric and parametric entropy estimation problems. Section 1.5.1 reviews the nonparametric entropy estimation problem and Section 1.5.2 reviews the parametric entropy estimation problem.

1.5.1 Nonparametric Entropy Estimation

For a good review of some known results on nonparametric entropy estimation one may refer to Beirlant et al. (1997). They discussed the plug-in, sample-spacing and nearest neighbour distance methods used for the nonparametric entropy estimation of a continuous random variable. Several applications, such as entropy-based tests for goodness of fit, entropy-based parameter estimation, etc., are given. A very rich list of references has also been included.

Let X1, . . . , Xn be a random sample drawn from a population with associated probability density function f(x). Shannon’s entropy is S(f) = −∫ f(x) ln f(x) dx. Ahmad and Lin (1976) proposed a nonparametric estimate of S(f) given by Ŝ(f) = −(1/n) ∑_{i=1}^{n} ln f̂(Xi), where f̂(x) is the kernel estimate of f(x) due to Rosenblatt (1956) and Parzen (1962). They obtained regularity conditions under which the first and second mean consistencies of Ŝ(f) are established.
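The following sketch illustrates this resubstitution idea. The Gaussian kernel density estimator from SciPy is used here only as a convenient stand-in for the kernel estimate f̂; it is not the specific kernel or bandwidth analyzed by Ahmad and Lin.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(size=2000)                        # sample from N(0, 1)

f_hat = gaussian_kde(x)                          # kernel density estimate of f
entropy_estimate = -np.mean(np.log(f_hat(x)))    # -(1/n) * sum ln f_hat(X_i)

true_entropy = 0.5 * np.log(2 * np.pi * np.e)    # exact entropy of N(0, 1) in nats
print(entropy_estimate, true_entropy)
```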

Joe (1989) explained how a functional of a multivariate density f(x) of the form I(f) = ∫ J(f) f dµ, where J(·) is a smooth, real valued function, can be represented by an empirical process based on the kernel density method. For the entropy functional −∫ f ln f dµ, it is also possible to find non-unimodal kernels satisfying certain conditions which can reduce the bias and mean squared error.


Mokkadem (1989) proposed a method for estimating the entropy of absolutely continuous random vectors. He gave an upper bound on the mean risks of the proposed estimators under strong mixing conditions.

Hall and Morton (1993) introduced a class of estimators whose construction does not need numerical integration. They studied in detail the effect of the tail properties of the underlying distributions on the behaviour of the estimators. Both histogram and kernel estimators of entropy were studied.

Singh et al. (2003) defined estimators of entropy based on the k-th nearest neighbour distances. Asymptotic properties of these estimators were established. For small samples, the bias and mean squared error performances of these estimators were also compared with those of m-spacing estimators.
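A sketch of a nearest neighbour estimator of this kind for a one-dimensional sample is given below. The constant follows the commonly cited Kozachenko-Leonenko form, which may differ slightly from the normalizations used in the papers discussed here; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.special import digamma

def knn_entropy_1d(x, k=1):
    """k-th nearest neighbour entropy estimate (in nats) for a one-dimensional sample."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # distance from each point to its k-th nearest neighbour (simple O(n^2) scan, fine for a sketch)
    eps = np.array([np.sort(np.abs(x - xi))[k] for xi in x])
    return digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(eps))

rng = np.random.default_rng(2)
sample = rng.normal(size=3000)
print(knn_entropy_1d(sample), 0.5 * np.log(2 * np.pi * np.e))   # estimate vs exact N(0, 1) entropy
```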

Schraudolph (2004) derived a family of differential learning rules that optimize Shannon’s entropy at the output of an adaptive system via kernel density estimation. He developed a normalized entropy estimate which is invariant with respect to affine transformations, facilitating optimization of the shape rather than the scale of the output density.

Mnatsakanov et al. (2008) proposed estimators of entropy based on Euclidean distances between n sample points and their Kn-nearest neighbours. The proposed estimators were shown to be asymptotically unbiased and consistent.

Misra et al. (2010) studied the problem of estimating the entropies of circular random vectors. They introduced nonparametric estimators based on circular distances between n sample points and their kth nearest neighbours, where k (≤ n − 1) is a fixed positive constant. The nearest neighbour estimators are also shown to be asymptotically unbiased and consistent. Finally, they compared the performance of one of the circular-distance estimators with that of the existing Euclidean-distance nearest neighbour estimator using Monte Carlo samples drawn from an analytical distribution of six circular variables with exactly known entropy.

1.5.2 Parametric Entropy Estimation

Lazo and Rathie (1978) derived expressions for Shannon’s entropy for various univariate continuous probability distributions.


Ahmed and Gokhale (1989) derived expressions for the entropy of several multivariate distributions such as the multivariate normal, multivariate log-normal, multivariate logistic, multivariate Pareto of Type II and multivariate exponential. In particular, they studied the problem of estimating the entropy of a multivariate normal distribution in detail. Let X1, . . . , Xn be a random sample drawn from a p variate normal distribution Np(µ, Σ) (n ≥ p + 1), where the mean vector µ ∈ Rp and Σ, a p order positive definite covariance matrix, are assumed to be unknown. The entropy expression for this distribution, denoted by H(NORM), is given by

H(NORM) = p/2 + (p/2) log(2π) + (1/2) log|Σ|.

Ahmed and Gokhale proposed a consistent estimator and a uniformly minimum variance unbiased estimator (UMVUE) of the entropy, denoted by Hc(NORM) and Ĥ(NORM) respectively, which are given by

Hc(NORM) = (1/2) log|V| + (p/2) log(2eπ/(n − 1))

and

Ĥ(NORM) = (p/2) log(eπ) + (1/2) log|V| − (1/2) ∑_{i=1}^{p} ψ((n + 1 − i)/2),

where V is a p order symmetric and positive definite matrix defined by V = XX′, X′ being the transpose of X, and ψ(x) = d(ln Γ(x))/dx is known as the Euler psi (digamma) function. It is also proved that both Hc(NORM) and Ĥ(NORM) converge in probability to H(NORM) as n → ∞. Numerically, they studied the convergence properties of the estimators Hc(NORM) and Ĥ(NORM) for the case of the bivariate normal distribution. Finally, it is noticed that the estimator Ĥ(NORM) converges slightly faster than Hc(NORM) and the variance of Ĥ(NORM) is smaller than that of Hc(NORM).
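A numerical sketch of these quantities is given below. Reading V as the centred sum-of-squares matrix of the sample is an assumption on our part; the code (assuming NumPy and SciPy) simply evaluates the expressions quoted above for a simulated bivariate normal sample.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)
p, n = 2, 500
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)

# True entropy H(NORM) of N_p(mu, Sigma)
H_true = p / 2 + (p / 2) * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma))

# V taken as the centred sum-of-squares matrix (an assumed reading of its definition)
Xc = X - X.mean(axis=0)
V = Xc.T @ Xc

H_c = 0.5 * np.log(np.linalg.det(V)) + (p / 2) * np.log(2 * np.e * np.pi / (n - 1))
H_u = (p / 2) * np.log(np.e * np.pi) + 0.5 * np.log(np.linalg.det(V)) \
      - 0.5 * sum(digamma((n + 1 - i) / 2) for i in range(1, p + 1))

print(H_true, H_c, H_u)   # the three values should be close for moderately large n
```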

Darbellay and Vajda (2000) derived analytical expressions for entropy of different mul- tivariate continuous distributions such as multivariate Pareto distribution of type-IV, multivariate logistic distribution, multivariate Burr distribution etc.

Misra et al. (2005) investigated the problem of estimating the entropy of a p variate normal distribution having an unknown mean vector and covariance matrix with respect to the squared error loss function. Here the statistic (X, S) is minimal sufficient, where X = √n X̄, S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)′ and X̄ = (1/n) ∑_{i=1}^{n} Xi. They obtained the best affine equivariant estimator (BAEE) as

δBAE = log|S| − p log 2 − ∑_{i=1}^{p} ψ((n − i)/2).

It is shown that the BAEE is unbiased and generalized Bayes with respect to some noninformative prior distribution. In the class of estimators depending on the determinant of the sample covariance matrix alone, they established the BAEE to be admissible. Further, it is proved that the BAEE is inadmissible, being dominated by Stein-type and Brewster-Zidek-type estimators. It is also shown that the Brewster-Zidek-type estimator is generalized Bayes. Finally, they obtained some risk improvements of the BAEE over the maximum likelihood estimator (MLE), given by

δMLE = log|S| − p log n.

Frigyik et al. (2008) proposed a data dependent Bayesian estimate of the entropy of a multivariate Gaussian distribution using an inverted Wishart prior with respect to the Bregman loss function, given by dφ(z, z̃) = φ(z) − φ(z̃) − φ′(z̃)(z − z̃) for z, z̃ ∈ R, where φ : R → R is a twice differentiable, strictly convex function.

Gupta and Srivastava (2010) derived Bayesian estimates of the differential entropy and relative entropy of the uniform, Gaussian, Wishart and inverse Wishart distributions by minimizing the expected Bregman loss function.

1.6 Summary of the Results in the Thesis

This section presents the summary of the results contained in this thesis.

In Chapter 2, some basic definitions and fundamental results of decision theory have been discussed which are useful in the subsequent chapters.

In Chapter 3, we consider the problem of estimating the entropy of an exponential distribution with a scale parameter. In Section 3.2, the estimation problem for one population is studied, whereas in Section 3.3 it is investigated for several populations. In Section 3.2, we obtain the MLE and the UMVUE. Two loss functions are taken. The problem of estimating the entropy with respect to the squared error loss function is described in Section 3.2.1. The best scale equivariant estimator (BSEE) is obtained, which turns out to be the UMVUE and a generalized Bayes estimator (GBE). A Bayes estimator is also obtained. The BSEE is proved to be admissible and minimax. The problem with respect to the linex loss function is considered in Section 3.2.2. The BSEE is obtained, which also turns out to be a GBE. A Bayes estimator with respect to the linex loss function is also derived. Under the linex loss function the BSEE is shown to be admissible and minimax. In Section 3.3, we derive the MLE and the UMVUE when samples are taken from k exponential populations. Section 3.3.1 focuses on the problem of estimating the entropy with respect to the squared error loss function. The UMVUE is shown to be a GBE with respect to a noninformative prior distribution. The Bayes estimator is obtained. We prove that the UMVUE is admissible and minimax. Further, a general inadmissibility result for the scale equivariant estimators is proved. Section 3.3.2 describes the entropy estimation problem with respect to the linex loss function. Bayes and generalized Bayes estimators are obtained. Under the linex loss function, the GBE is proved to be admissible and minimax. Finally, a general inadmissibility result is established.

Chapter 4 presents the problem of estimating the entropy of shifted exponential populations. Several models of exponential populations are considered. In Section 4.2, we take one population. The MLE and the UMVUE are obtained. In Sections 4.2.1 and 4.2.2, we consider the entropy estimation problem with respect to the squared error and linex loss functions respectively. The BAEE is obtained in Section 4.2.1. It turns out to be a GBE. The admissibility of the BAEE in a larger class of estimators is proved. It is also minimax in this class. A general inadmissibility result for the scale equivariant estimators is established. As a consequence, an improved estimator over the BAEE is obtained. Further, the problem of estimating the entropy is also considered when restrictions are placed on the parameter space. When µ is non-negative, a new estimator improving the BAEE is proposed. A GBE is also obtained. When µ is negative we propose an MLE, which we call the restricted MLE. An improved estimator over the BAEE is obtained. Further, we derive a GBE with respect to a noninformative prior distribution. In Section 4.2.2 we obtain the BAEE, which is also a GBE. A general inadmissibility result for the scale equivariant estimators is proved. Analogous to this result, an improved estimator over the BAEE is obtained. We also consider the problem of estimating the entropy when the parameter space is restricted. Improved estimators are proposed when µ is non-negative or negative.

GBEs are also obtained. Finally, numerical comparisons of the values of the risk of several estimators with respect to both loss functions are presented in Section 4.2.3. In Section 4.3, we study the problem of estimating the entropy of k exponential populations with a common scale parameter σ and unequal location parameters µ1, . . . , µk. The MLE and the UMVUE are obtained. Section 4.3.1 focuses on the problem with respect to the squared error loss function. We obtain the BAEE, which is also the UMVUE. A general inadmissibility result for the scale equivariant estimators is established and, consequently, an estimator improving the BAEE is obtained. The problems of estimating the entropy when all µi's are non-negative or negative, i = 1, . . . , k, are also considered. Improved estimators dominating the BAEE are obtained for both cases. The restricted MLE is also obtained when all µi's are negative. The linex loss function is taken in Section 4.3.2. The BAEE is obtained. We prove a general inadmissibility result for the scale equivariant estimators. As a consequence, an improved estimator over the BAEE is proposed. We also obtain improved estimators dominating the BAEE when all µi's are non-negative as well as negative. A numerical comparison of the values of the risk of different estimators is carried out in Section 4.3.3. In Section 4.4, we investigate the problem of estimating the entropy of several exponential populations with a common location parameter µ and different scale parameters σ1, . . . , σk. We obtain the MLE and the UMVUE. In Section 4.4.1, the estimation with respect to the squared error loss function is considered. The invariance with respect to the group of affine transformations is introduced. A general inadmissibility result for the affine equivariant estimators is proved. As a consequence of the theorem, the MLE is shown to be inadmissible. Section 4.4.2 deals with the problem of estimating the entropy with respect to the linex loss function. A general inadmissibility result for the affine equivariant estimators is established. Using this result, the MLE and the UMVUE are shown to be inadmissible. Section 4.4.3 gives a numerical comparison of the risk values of the improved estimators with those of the MLE and the UMVUE.

In Chapter 5, we investigate the problem of estimating the entropy of a bivariate exponential population due to Freund (1961) with the density function given by

f(x, y) = αβ′ exp[−β′y − (α + β − β′)x], if 0 < x < y,

f(x, y) = βα′ exp[−α′x − (α + β − α′)y], if 0 < y < x,

where x > 0, y > 0, α > 0, β > 0, α′ > 0 and β′ > 0. The MLE and the method of moments estimators (MME) are derived in Section 5.1. Section 5.2 describes the problem with respect to the squared error loss function. A general inadmissibility result for the scale equivariant estimators is proved. We also consider the problem of estimating the entropy of two sub-families of Freund's bivariate exponential model: (i) α = β = θ1 and α′ = β′ = θ2, and (ii) α = α′ = φ1 and β = β′ = φ2. We observe that these two problems are particular cases of the problems solved in Section 3.3.1 of Chapter 3 for k = 2. In Section 5.3, a general inadmissibility result for the scale equivariant estimators is established when the loss function is taken to be linex. It is also shown that the problems of estimating the entropy for the two sub-families can be reduced to the problems studied in Section 3.3.2 of Chapter 3.

Chapter 6 focuses on the problem of estimating the entropy of a Pareto population. In Section 6.2, we consider the estimation problem for two cases: (i) the shape parameter α is unknown and the scale parameter β is known, and (ii) both the shape parameter α and the scale parameter β are unknown. We take the squared error loss function. The MLE and the UMVUE are obtained. The Bayes and generalized Bayes estimators are also obtained. We consider a class of estimators of the form δc = cY + ln Y − ψ(n), where c is real, Y = ∑_{i=1}^{n} Xi, ψ(n) is the Euler psi (digamma) function and X1, . . . , Xn is a random sample drawn from an exponential population with an unknown scale parameter σ. A class of admissible estimators in the class δc is obtained. The UMVUE and the GBE are shown to be inadmissible. Further, we consider a larger class of estimators δc1,c2 = c1Y + ln Y + c2, where c1 and c2 are real. In the class of estimators of the form δc1,c2, we obtain the following results.

(i) The estimator δc1,c2* improves δc1,c2 for c1 < 1/n and c2 < (n − 1)ψ(n) − nψ(n + 1), and for c1 > 1/n and c2 > nψ(n + 1) − (n + 1)ψ(n), where c2* = nψ(n + 1) − (n + 1)ψ(n).

(ii) The estimator δc1*,c2 dominates the estimator δc1,c2 if c1 > 2/(n + 1), c2 > −ψ(n + 1), or c1 < 0, c2 < −ψ(n + 1), where c1* = 0.

The MME, MLE and the UMVUE are obtained when both the shape parameter α and the scale parameter β are unknown. A GBE is also obtained. A class of estimators of the form δd = X(1) + ln T + dT − ψ(n − 1), where d is real, is considered. An admissible class of estimators is also obtained in the class δd. The UMVUE and the GBE are shown to be inadmissible. A larger class of estimators of the form δd1,d2 = X(1) + ln T + d1T + d2, where d1 and d2 are real, is considered. We obtain improvement results in the class of estimators of the form δd1,d2 similar to those for δc1,c2 described above.
