Exponential Family and Generalized Linear Models 3.1 Introduction - Chapter3 An Itroduction to generalized linear models

(1)

3 Exponential Family and Generalized

Linear Models

3.1 Introduction

Linear models of the form

E(Yi) =µi =xTiβ; Yi∼N(µi, σ2) (3.1)

where the random variables Yi are independent are the basis of most analyses of continuous data. The transposed vectorxT_i represents theith row of the design matrix X. The example about the relationship between birth-weight and gestational age is of this form, see Section 2.2.2. So is the exercise on plant growth where Yi is the dry weight of plants and Xhas elements to identify the treatment and control groups (Exercise 2.1). Generalizations of these examples to the relationship between a continuous response and several explanatory variables (multiple regression) and comparisons of more than two means (analysis of variance) are also of this form.

Advances in statistical theory and computer software allow us to use meth-ods analogous to those developed for linear models in the following more general situations:

1. Response variables have distributions other than the Normal distribution – they may even be categorical rather than continuous.

2. Relationship between the response and explanatory variables need not be of the simple linear form in (3.1).

One of these advances has been the recognition that many of the ‘nice’ properties of the Normal distribution are shared by a wider class of distribu-tions called the exponential family of distributions. These distributions and their properties are discussed in the next section.

A second advance is the extension of the numerical methods to estimate the parameters β from the linear model described in (3.1) to the situation where there is some non-linear function relating E(Yi) =µi to the linear component

xT_iβ, that is

g(µi) =xTi β

(2)

com-putation involving numerical optimization of non-linear functions. Procedures to do these calculations are now included in many statistical programs.

This chapter introduces the exponential family of distributions and deﬁnes generalized linear models. Methods for parameter estimation and hypothesis testing are developed in Chapters 4 and 5, respectively.

3.2 Exponential family of distributions

Consider a single random variable Y whose probability distribution depends on a single parameterθ. The distribution belongs to the exponential family if it can be written in the form

f(y;θ) =s(y)t(θ)ea(y)b(θ) (3.2)

wherea,b,sand tare known functions. Notice the symmetry betweeny and θ. This is emphasized if equation (3.2) is rewritten as

f(y;θ) = exp[a(y)b(θ) +c(θ) +d(y)] (3.3)

where s(y) = expd(y) and t(θ) = expc(θ).

If a(y) =y, the distribution is said to be in canonical (that is, standard)

form and b(θ) is sometimes called the natural parameter of the distribu-tion.

If there are other parameters, in addition to the parameter of interest θ, they are regarded as nuisance parameters forming parts of the functions a, b,c and d, and they are treated as though they are known.

Many well-known distributions belong to the exponential family. For exam-ple, the Poisson, Normal and binomial distributions can all be written in the canonical form – see Table 3.1.

3.2.1 Poisson distribution

The probability function for the discrete random variable Y is

f(y, θ) = θ y_e−θ

y!

Table 3.1 Poisson, Normal and binomial distributions as members of the exponential family.

Distribution Natural parameter c d

Poisson logθ −θ −logy!

Normal µ

σ2 −

µ2 2σ2 −

1 2log

2πσ2

− y 2

2σ2

Binomial log

π 1−π

nlog (1−π) logn_y

(3)

where y takes the values 0,1,2, . . . . This can be rewritten as

f(y, θ) = exp(ylogθ−θ−logy!)

which is in the canonical form because a(y) =y. Also the natural parameter is logθ.

The Poisson distribution, denoted by Y ∼ P oisson(θ), is used to model count data. Typically these are the number of occurrences of some event in a deﬁned time period or space, when the probability of an event occurring in a very small time (or space) is low and the events occur independently. Exam-ples include: the number of medical conditions reported by a person (Example 2.2.1), the number of tropical cyclones during a season (Example 1.6.4), the number of spelling mistakes on the page of a newspaper, or the number of faulty components in a computer or in a batch of manufactured items. If a random variable has the Poisson distribution, its expected value and variance are equal. Real data that might be plausibly modelled by the Poisson distri-bution often have a larger variance and are said to be overdispersed, and the model may have to be adapted to reﬂect this feature. Chapter 9 describes various models based on the Poisson distribution.

3.2.2 Normal distribution

The probability density function is

f(y;µ) = 1

(2πσ2₎1/2exp

−₂_σ1₂(y−µ)2

where µis the parameter of interest and σ2 is regarded as a nuisance param-eter. This can be rewritten as

f(y;µ) = exp other terms in (3.3) are

c(µ) =− µ

(4)

worthwhile trying to identify a transformation, such asy′ _{= log}_y _or_y′₌√_y_,

which produces datay′ _{that are approximately Normal.}

3.2.3 Binomial distribution

Consider a series of binary events, called ‘trials’, each with only two possible outcomes: ‘success’ or ‘failure’. Let the random variable Y be the number of ‘successes’ in n independent trials in which the probability of success, π, is the same in all trials. Then Y has the binomial distribution with probability density function

f(y;π) =

n y

πy(1−π)n−y

whereytakes the values 0,1,2, . . . , n. This is denoted byY ∼binomial(n, π). Hereπis the parameter of interest and nis assumed to be known. The prob-ability function can be rewritten as

f(y;µ) = exp

ylogπ−ylog(1−π) +nlog(1−π) + log

n y

which is of the form (3.3) with b(π) = logπ−log(1−π) = log [π/(1−π)]. The binomial distribution is usually the model of ﬁrst choice for observa-tions of a process with binary outcomes. Examples include: the number of candidates who pass a test (the possible outcomes for each candidate being to pass or to fail), or the number of patients with some disease who are alive at a speciﬁed time since diagnosis (the possible outcomes being survival or death).

Other examples of distributions belonging to the exponential family are given in the exercises at the end of the chapter; not all of them are of the canonical form.

3.3 Properties of distributions in the exponential family

We need expressions for the expected value and variance ofa(Y). To find these we use the following results that apply for any probability density function provided that the order of integration and differentiation can be interchanged. From the definition of a probability density function, the area under the curve is unity so

f(y;θ)dy= 1 (3.4)

where integration is over all possible values of y. (If the random variableY is discrete then integration is replaced by summation.)

If we diﬀerentiate both sides of (3.4) with respect to θ we obtain d

dθ

f(y;θ)dy= d

dθ.1 = 0 (3.5)

If the order of integration and diﬀerentiation in the ﬁrst term is reversed

(5)

then (3.5) becomes

df(y;θ)

dθ dy= 0 (3.6)

Similarly if (3.4) is diﬀerentiated twice with respect to θ and the order of integration and diﬀerentiation is reversed we obtain

d2f(y;θ)

dθ2 dy= 0. (3.7)

These results can now be used for distributions in the exponential family. From (3.3)

f(y;θ) = exp [a(y)b(θ) +c(θ) +d(y)]

so

df(y;θ)

dθ = [a(y)b

′₍_θ_{) +}_c′₍_θ_)]_f₍_y_;_θ₎_.

By (3.6)

[a(y)b′₍_θ_{) +}_c′₍_θ_)]_f₍_y_;_θ₎_dy_{= 0}_.

This can be simpliﬁed to

b′₍_θ_)E[_a₍_y_{)] +}_c′₍_θ_{) = 0} _(3.8)

because

a(y)f(y;θ)dy =E[a(y)] by the deﬁnition of the expected value and

c′₍_θ₎_f₍_y_;_θ₎_dy₌_c′₍_θ_{) by (3.4). Rearranging (3.8) gives}

E[a(Y)] =−c′₍_θ₎_/b′₍_θ₎_. _(3.9)

A similar argument can be used to obtain var[a(Y)].

d2_f₍_y_;_θ₎

dθ2 = [a(y)b′′(θ) +c′′(θ)]f(y;θ) + [a(y)b′(θ) +c′(θ)] 2

f(y;θ) (3.10)

The second term on the right hand side of (3.10) can be rewritten as

[b′₍_θ_)]2_{_a₍_y₎₋_E_[_a₍_Y_)]_}2_f₍_y_;_θ₎ using (3.9). Then by (3.7)

d2_f₍_y_;_θ₎

dθ2 dy=b

′′₍_θ_)E[_a₍_Y_{)] +}_c′′₍_θ_{) + [}_b′₍_θ_)]2_var[_a₍_Y_{)] = 0} _(3.11) because

{a(y)−E[a(Y)]}2_f₍_y_;_θ₎_dy_{= var[}_a₍_Y_{)] by deﬁnition.} Rearranging (3.11) and substituting (3.9) gives

var[a(Y)] = b′′(θ)c′(θ)−c′′(θ)b′(θ)

[b′₍_θ_)]3 (3.12)

(6)

We also need expressions for the expected value and variance of the deriva-tives of the log-likelihood function. From (3.3), the log-likelihood function for a distribution in the exponential family is

l(θ;y) =a(y)b(θ) +c(θ) +d(y).

The derivative of l(θ;y) with respect toθ is

U(θ;y) = dl(θ;y)

dθ =a(y)b

′₍_θ_{) +}_c′₍_θ₎_.

The functionU is called the score statisticand, as it depends ony,it can be regarded as a random variable, that is

U =a(Y)b′₍_θ_{) +}_c′₍_θ₎_. _(3.13)

Its expected value is

E(U) =b′₍_θ_)E[_a₍_Y_{)] +}_c′₍_θ₎_.

From (3.9)

E(U) =b′₍_θ₎

−c

′₍_θ₎

b′₍_θ₎ +c

′₍_θ_{) = 0}_. _(3.14)

The variance ofU is called theinformationand will be denoted byI_.

Us-ing the formula for the variance of a linear transformation of random variables (see (1.3) and (3.13))

I_{= var(}_U_{) =}_b′₍_θ₎2

var[a(Y)].

Substituting (3.12) gives

var(U) = b

′′₍_θ₎_c′₍_θ₎

b′₍_θ₎ −c

′′₍_θ₎_. _(3.15)

The score statistic U is used for inference about parameter values in gen-eralized linear models (see Chapter 5).

Another property of U which will be used later is

var(U) = E(U2) =−E(U′₎_. _(3.16)

The ﬁrst equality follows from the general result

var(X) = E(X2)−[E(X)]2

for any random variable, and the fact that E(U) = 0 from (3.14). To obtain the second equality, we diﬀerentiateU with respect toθ; from (3.13)

U′₌ dU

dθ =a(Y)b

′′₍_θ_{) +}_c′′₍_θ₎_.

Therefore the expected value ofU′ _is

E(U′₎ ₌ _b′′₍_θ_)E[_a₍_Y_{)] +}_c′′₍_θ₎

= b′′₍_θ₎

−c′(θ)

b′₍_θ₎ +c

′′₍_θ₎ _(3.17)

= −var(U) =−I

(7)

by substituting (3.9) and then using (3.15).

3.4 Generalized linear models

The unity of many statistical methods was demonstrated by Nelder and Wed-derburn (1972) using the idea of a generalized linear model. This model is deﬁned in terms of a set of independent random variables Y1, . . . , YN each with a distribution from the exponential family and the following properties:

1. The distribution of eachYihas the canonical form and depends on a single parameterθi (theθi’s do not all have to be the same), thus

f(yi;θi) = exp [yibi(θi) +ci(θi) +di(yi)].

2. The distributions of all the Yi’s are of the same form (e.g., all Normal or all binomial) so that the subscripts onb,c andd are not needed.

Thus the joint probability density function of Y1, . . . , YN is

The parameters θi are typically not of direct interest (since there may be one for each observation). For model speciﬁcation we are usually interested in a smaller set of parameters β1, . . . , βp (where p < N ). Suppose that E(Yi) = µi where µi is some function of θi. For a generalized linear model there is a transformation of µi such that

g(µi) =xTi β.

In this equation

g is a monotone, diﬀerentiable function called the link function; xi is a

p ×1 vector of explanatory variables (covariates and dummy variables for levels of factors),

ith column of the design matrix X.

(8)

1. Response variables Y1, . . . , YN which are assumed to share the same dis-tribution from the exponential family;

2. A set of parameters β and explanatory variables

X=

3. A monotone link function g such that

g(µi) =xTiβ

where

µi = E(Yi).

This chapter concludes with three examples of generalized linear models.

3.5 Examples

3.5.1 Normal Linear Model

The best known special case of a generalized linear model is the model

E(Yi) =µi=xTiβ; Yi∼N(µi, σ2)

whereY1, ..., YN are independent. Here the link function is the identity func-tion,g(µi) =µi. This model is usually written in the form

⎦ and the ei’s are independent, identically distributed

ran-dom variables withei∼N(0, σ2) for i= 1, ..., N.

In this form, the linear component µ = Xβ represents the ‘signal’ and e

represents the ‘noise’, random variation or ‘error’. Multiple regression, analysis of variance and analysis of covariance are all of this form. These models are considered in Chapter 6.

3.5.2 Historical Linguistics

Consider a language which is the descendent of another language; for example, modern Greek is a descendent of ancient Greek, and the Romance languages are descendents of Latin. A simple model for the change in vocabulary is that if the languages are separated by time t then the probability that they have cognate words for a particular meaning is e−θt _where _θ _{is a parameter (see}

Figure 3.1). It is believed thatθis approximately the same for many commonly used meanings. For a test list ofN diﬀerent commonly used meanings suppose that a linguist judges, for each meaning, whether the corresponding words in

(9)

Latin word

Modern French word

Modern Spanish word

time

Figure 3.1 Schematic diagram for the example on historical linguistics.

two languages are cognate or not cognate. We can develop a generalized linear model to describe this situation.

Deﬁne random variables Y1, . . . , YN as follows:

Yi=

1 if the languages have cognate words for meaningi, 0 if the words are not cognate.

Then

P(Yi= 1) =e−θt and

P(Yi = 0) = 1−e−θt.

This is a special case of the distributionbinomial(n, π) withn= 1 and E(Yi) =

π=e−θt_{. In this case the link function}_g _{is taken as logarithmic}

g(π) = logπ=−θt

so that g[E(Y)] is linear in the parameter θ. In the notation used above,

xi= [−t] (the same for all i) and β= [θ].

3.5.3 Mortality Rates

For a large population the probability of a randomly chosen individual dying at a particular time is small. If we assume that deaths from a non-infectious disease are independent events, then the number of deaths Y in a population can be modelled by a Poisson distribution

f(y;µ) = µ y_e−µ

y!

where ycan take the values 0,1,2, . . . andµ= E(Y) is the expected number of deaths in a speciﬁed time period, such as a year.

The parameterµwill depend on the population size, the period of observa-tion and various characteristics of the populaobserva-tion (e.g., age, sex and medical history). It can be modelled, for example, by

(10)

Table 3.2 Numbers of deaths from coronary heart disease and population sizes by 5-year age groups for men in the Hunter region of New South Wales, Australia in 1991.

Age group Number of Population Rate per 100,000 men

(years) deaths, yi size,ni per year,yi/ni×10,000

30 - 34 1 17,742 5.6

35 - 39 5 16,554 30.2

40 - 44 5 16,059 31.1

45 - 49 12 13,083 91.7

50 - 54 25 10,784 231.8

55 - 59 38 9,645 394.0

60 - 64 54 10,706 504.4

65 - 69 65 9,933 654.4

30-34 40-44 50-54 60-64 1.5

2.5 3.5 4.5 5.5 6.5

Age (years) log(death rate)

Figure 3.2 Death rate per 100,000 men (on a logarithmic scale) plotted against age.

wherenis the population size and λ(xTβ) is the rate per 100,000 people per year (which depends on the population characteristics described by the linear componentxTβ).

Changes in mortality with age can be modelled by taking independent ran-dom variablesY1, . . . , YN to be the numbers of deaths occurring in successive age groups. For example, Table 3.2 shows age-speciﬁc data for deaths from coronary heart disease.

Figure 3.2shows how the mortality rateyi/ni×100,000 increases with age. Note that a logarithmic scale has been used on the vertical axis. On this scale the scatter plot is approximately linear, suggesting that the relationship between yi/ni and age group i is approximately exponential. Therefore a

(11)

possible model is

E(Yi) =µi=nieθi ; Yi∼P oisson(µi),

where i= 1 for the age group 30-34 years, i= 2 for 35-39, ..., i= 8 for 65-69 years.

This can be written as a generalized linear model using the logarithmic link function

g(µi) = logµi = logni+θi

which has the linear componentxT_iβ withxT_i =

logni i

andβ=

1 θ .

3.6 Exercises

3.1 The following relationships can be described by generalized linear models. For each one, identify the response variable and the explanatory variables, select a probability distribution for the response (justifying your choice) and write down the linear component.

(a) The eﬀect of age, sex, height, mean daily food intake and mean daily energy expenditure on a person’s weight.

(b) The proportions of laboratory mice that became infected after exposure to bacteria when ﬁve diﬀerent exposure levels are used and 20 mice are exposed at each level.

(c) The relationship between the number of trips per week to the super-market for a household and the number of people in the household, the household income and the distance to the supermarket.

3.2 If the random variable Y has the Gamma distribution with a scale pa-rameterθ,which is the parameter of interest, and a known shape parameter φ, then its probability density function is

f(y;θ) = y

φ−1_θφ_e−yθ

Γ(φ) .

Show that this distribution belongs to the exponential family and ﬁnd the natural parameter. Also using results in this chapter, ﬁnd E(Y) and var(Y). 3.3 Show that the following probability density functions belong to the

expo-nential family:

(a) Pareto distributionf(y;θ) =θy−θ−1_. (b) Exponential distributionf(y;θ) =θe−yθ_.

(c) Negative binomial distribution

f(y;θ) =

y+r−1 r−1

θr(1−θ)y

(12)

3.4 Use results (3.9) and (3.12) to verify the following results:

(a) For Y ∼P oisson(θ), E(Y) = var(Y) =θ. (b) For Y ∼N(µ, σ2), E(Y) =µ and var(Y) =σ2.

(c) For Y ∼binomial(n, π), E(Y) =nπ and var(Y) =nπ(1−π).

3.5 Do you consider the model suggested in Example 3.5.3 to be adequate for the data shown in Figure 3.2? Justify your answer. Use simple linear regression (with suitable transformations of the variables) to obtain a model for the change of death rates with age. How well does the model ﬁt the data? (Hint: compare observed and expected numbers of deaths in each groups.)

3.6 Consider N independent binary random variablesY1, . . . , YN with

P(Yi= 1) =πi andP(Yi= 0) = 1−πi .

The probability function ofYi can be written as

πyi

i (1−πi) 1−yi

whereyi = 0 or 1.

(a) Show that this probability function belongs to the exponential family of distributions.

(b) Show that the natural parameter is

log

πi 1−πi

.

This function, the logarithm of theodds πi/(1−πi), is called thelogit function.

(c) Show that E(Yi) =πi. (d) If the link function is

g(π) = log

π 1−π

=xTβ

show that this is equivalent to modelling the probabilityπ as

π= e

xTβ

1 +exTβ.

(e) In the particular case wherexTβ=β1+β2x, this gives

π= e

β1+β2x

1 +eβ1+β2x

which is the logistic function.

(f) Sketch the graph of π against x in this case, taking β1 and β2 as con-stants. How would you interpret this graph if x is the dose of an insec-ticide andπ is the probability of an insect dying?

(13)

3.7 Is theextreme value (Gumbel)distribution, with probability density function

f(y;θ) = 1 φexp

(y−θ)

φ −exp

(y−θ) φ

(whereφ >0 regarded as a nuisance parameter) a member of the exponen-tial family?

3.8 SupposeY1, ..., YN are independent random variables each with the Pareto distribution and

E(Yi) = (β0+β1xi)2.

Is this a generalized linear model? Give reasons for your answer. 3.9 Let Y1, . . . , YN be independent random variables with

E(Yi) =µi=β0+ log (β1+β2xi) ; Yi∼N(µ, σ2)

for alli= 1, ..., N. Is this a generalized linear model? Give reasons for your answer.

3.10 For the Pareto distribution ﬁnd the score statistics U and the information