
3.6. Distribution for Count Data

3.6.6. Regression Model for Count Data

In classical terms, linear regression is also known as least-squares regression. It is a statistical method that allows us to summarize and study the relationship between two continuous variables.

One variable, denoted as X, is regarded as a predictor, explanatory, or independent variable, and the other, denoted as Y, is regarded as the response, outcome, or dependent variable. A linear regression model with a single predictor variable is known as a simple linear regression model (Kutner et al., 2004; Seber and Lee, 2012; Montgomery et al., 2015; Williams, 1959).

In matrix form, the linear regression model is written as:

$$Y = X\beta + \epsilon \tag{3.23}$$

where

$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

Y is the response variable, β₀ is the intercept, β is the vector of coefficients (unknown parameters that need to be estimated) and ε is the error term (or residual) used to capture the deviation of the data from the model. The aim is to find the values of the parameters β_a (a = 0, 1, …, k) that provide the best fit for the data. The regression must satisfy the following assumptions: a linear relationship, independence of the errors from each other and from the covariates, multivariate normality, no or little multicollinearity, no autocorrelation, and homoscedasticity (the errors must have zero mean and constant variance) (Kutner et al., 2004; Seber and Lee, 2012; Montgomery et al., 2015).

Applying the zero-mean assumption of the errors in equation 3.23, the expectation of the random matrix is defined as:

$$E[Y] = \left[\,E\{Y_{ij}\}\,\right] \tag{3.24}$$

where i = 1, …, n and j = 1, …, p. Least-squares regression describes the behaviour of the location of the conditional distribution, using the mean of the distribution to represent its central tendency.

The residuals ε_i are defined as the differences between the observed and the fitted values.

Minimizing the sum of the squared residuals,

$$\sum_{i=1}^{n} r\!\left(y_i - x_i^{T}\hat{\beta}\right) = \sum_{i=1}^{n} \left(y_i - x_i^{T}\hat{\beta}\right)^{2} \tag{3.25}$$

where $r(\mu) = \mu^{2}$ is the quadratic loss function, gives the least-squares estimator $\hat{\beta}$:

$$\hat{\beta} = \left(X^{T}X\right)^{-1}X^{T}Y \tag{3.26}$$

provided that $X^{T}X$ is invertible.

Further, the additional assumption that the errors ε follow a Gaussian distribution,

$$\epsilon \sim N\!\left(0, \sigma^{2}I_n\right) \tag{3.27}$$

where $I_n$ is the n × n identity matrix, provides a framework for testing the significance of the coefficients found in equation 3.26. Under this assumption, the least-squares estimator is also the maximum likelihood estimator. By taking expectations with respect to ε in equations 3.25 and 3.26, and noting that a linear function of a normally distributed random variable is itself normally distributed, we can rewrite the model as:

$$Y \sim N\!\left(\mu, \sigma^{2}I_n\right), \quad \text{where } \mu = X\beta \tag{3.28}$$

Therefore, the model in equation 3.28 represents the relationship between the mean of 𝑦ᵢ, for i = 1, 2, …, n, and the covariates linearly.
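To make the estimator in equation 3.26 concrete, the following minimal sketch computes β̂ = (XᵀX)⁻¹XᵀY on simulated data; the variable names, sample size, and coefficient values are illustrative, not from the source.

import numpy as np

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=(n, 2))              # two covariates
X = np.column_stack([np.ones(n), x])     # design matrix with an intercept column
beta_true = np.array([1.0, 0.5, -2.0])   # arbitrary values for the simulation
y = X @ beta_true + rng.normal(scale=0.3, size=n)  # equation 3.23 with Gaussian errors (3.27)

# Least-squares estimator, equation 3.26 (solve is preferred over an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual variance and standard errors, usable for the significance tests
# mentioned under the Gaussian-error assumption
residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat, se)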

3.6.6.2. Generalized Linear Model

The family of generalized linear models (GLMs) provides a collection of models extending basic concepts from linear regression to applications where the error terms follow a wide range of distributions, including the binomial and Poisson for modelling count data (Waller and Gotway, 2004).

Thus, equation 3.28 refers to data that are normally distributed, but it can be generalized to any distribution in the exponential family (Nelder and Baker, 1972; McCullagh and Nelder, 1989). GLMs consist of three components:

1. a probability distribution that belongs to the exponential family of distributions (known as a random component which defines the distribution of error terms)

2. a linear predictor ρᵢ = β₀ + β₁xᵢ₁ + ⋯ + β_k x_ik = xᵢᵀβ (also known as the systematic component, defining the linear combination of explanatory variables), and

3. a link function ϑ which defines the relationship between the systematic and random components, given as E[Yᵢ] = μᵢ = ϑ⁻¹(ρᵢ).

Estimating GLM parameters generally requires an iterative procedure rather than the closed-form solutions available for linear models (McCullagh and Nelder, 1989; Waller and Gotway, 2004). A GLM can be used for data that are not normally distributed and for cases where the relationship between the mean of the response variable and the covariates is not linear. The GLM framework includes many important distributions such as the Gaussian, Poisson, gamma, and inverse Gaussian (Cameron and Trivedi, 2013; Kutner et al., 2004; Seber and Lee, 2012; Montgomery et al., 2015; Waller and Gotway, 2004).
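As an illustration of the iterative fitting such models require, the sketch below implements iteratively reweighted least squares (IRLS) for a Poisson GLM with the canonical log link. This is only one common choice of fitting procedure; the data, tolerance, and iteration cap are made up for the example, and production code would normally call a library routine instead.

import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
y = rng.poisson(np.exp(X @ np.array([0.3, 1.2])))  # counts from a log-linear mean

beta = np.zeros(X.shape[1])
for _ in range(25):                      # IRLS iterations
    eta = X @ beta                       # linear predictor (systematic component)
    mu = np.exp(eta)                     # inverse of the log link
    W = mu                               # working weights; for the canonical log link, W = mu
    z = eta + (y - mu) / mu              # working response
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new
print(beta)                              # should be close to (0.3, 1.2)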

3.6.6.3. Poisson Regression

This is a special case of the GLM, commonly used to model count data. Poisson regression has been used for modelling count data in many fields, such as public health (Arslan et al., 2013; Duncan et al., 2002; Xiang and Song, 2016), epidemiology (Best et al., 2000; Frome and Checkoway, 1985; Zou, 2004; Gartner et al., 2016; Hanewinckel et al., 2010; Sobngwi et al., 2001), insurance (Boucher and Denuit, 2006; Christiansen and Morris, 1997; Ismail and Jemain, 2007) and many other research areas. The canonical link function is the logarithm.

The model is specified as:

$$\Pr(Y = y) = \frac{e^{-\lambda}\,\lambda^{y}}{y!} \tag{3.29}$$

For λ > 0, the mean and variance of a Poisson distribution are

𝐸(𝑌) = 𝑉𝑎𝑟(𝑌) = 𝜆 (3.30)

The likelihood function is given as

$$L(\beta \mid y, x) = \prod_{i=1}^{N} \Pr(y_i \mid \mu_i) = \prod_{i=1}^{N} \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!} \tag{3.31}$$

The key assumption is that the response Y has a Poisson distribution, Y ~ Pois(λ), with E(Y) = λ and Var(Y) = λ.

With the assumption that the mean is equal to the variance, any factor that affects one also affects the other. This poses a problem when the data exhibit a different behaviour, that is, when the variance exceeds the mean. Thus, the usual assumption of homoscedasticity would not be appropriate for Poisson data (Preston, 2005). Statistically, an important limitation of the Poisson distribution therefore lies in this fixed relationship between the mean and the variance. Most of the approaches proposed for this problem focus on over-dispersion (Ismail and Jemain, 2007; Berk and MacDonald, 2008).
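In practice, the model in equations 3.29 to 3.31 can be fitted with standard software. The minimal sketch below uses the statsmodels package on simulated counts; the variable names, coefficients, and data are illustrative only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 2, size=n)
X = sm.add_constant(x)                   # adds the intercept column
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # counts with a log-linear mean

# Poisson GLM with the canonical log link
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.summary())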

3.6.6.4. Negative Binomial

One way to handle the over-dispersion problem posed by Poisson regression is to fit a parametric model that is more dispersed than the Poisson. A natural choice is the negative binomial (NB), given as:

$$P(Y = y_i \mid \mu_i, k) = \frac{\Gamma\!\left(\frac{1}{k} + y_i\right)}{\Gamma\!\left(\frac{1}{k}\right)\, y_i!} \left(\frac{k\mu_i}{1 + k\mu_i}\right)^{y_i} \left(\frac{1}{1 + k\mu_i}\right)^{1/k} \tag{3.32}$$

$$\log\{\mu_i\} = x_i^{T}\beta \tag{3.33}$$

where the parameters μᵢ and k represent the mean and the dispersion of the negative binomial distribution. The respective mean and variance of this model are:

$$E[Y_i] = \exp\{x_i^{T}\beta\} \tag{3.34}$$

$$V[Y_i] = \exp\{x_i^{T}\beta\} + k\,\exp\{x_i^{T}\beta\}^{2} \tag{3.35}$$

The variance of a negative binomial is thus a quadratic function of its mean, and the negative binomial approaches the Poisson(μᵢ) model as k → 0.

The negative binomial PDF can be described as the probability of observing y failures before the kth success in a series of Bernoulli trials. Under this description, k is a positive integer (Hilbe, 2011).

However, there is no compelling mathematical reason to limit this parameter to integers.

The negative binomial is a generalization of Poisson regression: it loosens the highly restrictive assumption that the variance equals the mean. It is based on a Poisson-gamma mixture distribution, and the model is popular because it captures Poisson heterogeneity with a gamma distribution.
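The Poisson-gamma mixture can be checked numerically: drawing a gamma-distributed rate for each observation and then a Poisson count given that rate should reproduce the negative binomial's mean and quadratic variance from equations 3.34 and 3.35. A small simulation sketch follows; the parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
mu, k = 4.0, 0.5                 # NB mean and dispersion as in equation 3.32
n = 500_000

# lambda_i ~ Gamma with mean mu and variance k*mu^2 (shape 1/k, scale k*mu)
lam = rng.gamma(shape=1.0 / k, scale=k * mu, size=n)
y = rng.poisson(lam)             # Poisson counts given the heterogeneous rates

# Empirically approx. mu and mu + k*mu^2, i.e. about 4 and 12 here
print(y.mean(), y.var())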

Given the negative binomial PDF with parameters (μ, k), where μᵢ now denotes the success probability of the underlying Bernoulli trials rather than the mean:

$$f(y_i \mid \mu, k) = \binom{y_i + k - 1}{k - 1}\, \mu_i^{k}\, (1 - \mu_i)^{y_i} \tag{3.36}$$

or, equivalently,

$$f(y_i \mid \mu, k) = \frac{(y_i + k - 1)!}{y_i!\,(k - 1)!}\, \mu_i^{k}\, (1 - \mu_i)^{y_i} \tag{3.37}$$
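Equation 3.36 matches the standard negative binomial pmf, so it can be verified against scipy.stats.nbinom, whose parameterization nbinom.pmf(y, n, p) corresponds to n = k successes with success probability p = μᵢ here. The values below are arbitrary and chosen only for the check.

import numpy as np
from math import comb
from scipy.stats import nbinom

k, mu = 3, 0.4                    # size and success probability in equation 3.36
y = np.arange(6)

# Equation 3.36: C(y + k - 1, k - 1) * mu^k * (1 - mu)^y
pmf_336 = np.array([comb(yi + k - 1, k - 1) * mu**k * (1 - mu)**yi for yi in y])
print(np.allclose(pmf_336, nbinom.pmf(y, k, mu)))   # True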

Converting the NB PDF into exponential-family form results in

$$f(y_i \mid \mu_i, k) = \exp\left\{ y_i \ln(1 - \mu_i) + k \ln(\mu_i) + \ln\binom{y_i + k - 1}{k - 1} \right\} \tag{3.38}$$

where $y_i \ln(1 - \mu_i)$ contains the canonical link and $k \ln(\mu_i) + \ln\binom{y_i + k - 1}{k - 1}$ is the cumulant.

Thus, the canonical link and cumulant can easily be extracted from a PDF when it is expressed in exponential-family form. This gives:

$$\theta_i = \ln(1 - \mu_i) \;\Rightarrow\; \mu_i = 1 - \exp(\theta_i), \qquad b(\theta_i) = -k \ln(\mu_i) = -k \ln\!\left(1 - \exp(\theta_i)\right), \qquad \alpha_i(\phi) = 1 \;\;(\text{scale}) \tag{3.39}$$

Therefore, the first and second derivatives of the cumulant with respect to θ respectively yield the mean and variance functions, given as:

$$E[Y_i] = b'(\theta_i) = \frac{\delta b}{\delta \mu_i}\,\frac{\delta \mu_i}{\delta \theta_i} = -\frac{k}{\mu_i}\left(-(1 - \mu_i)\right) = \frac{k(1 - \mu_i)}{\mu_i}$$

$$V[Y_i] = b''(\theta_i) = \frac{\delta^{2} b}{\delta \mu_i^{2}}\left(\frac{\delta \mu_i}{\delta \theta_i}\right)^{2} + \frac{\delta b}{\delta \mu_i}\,\frac{\delta^{2} \mu_i}{\delta \theta_i^{2}} = \frac{k}{\mu_i^{2}}(1 - \mu_i)^{2} + \frac{k(1 - \mu_i)}{\mu_i} = \frac{k(1 - \mu_i)}{\mu_i^{2}} \tag{3.40}$$

The variance function $V(\mu)$ therefore equals $k(1 - \mu_i)/\mu_i^{2}$. Assume we now parameterize μᵢ and k in terms of πᵢ and γ:

$$\frac{1 - \mu_i}{\mu_i} = \gamma \pi_i, \qquad \gamma = \frac{1}{k}$$

Given these defined values, the negative binomial PDF can then be re-parameterized in terms of πᵢ and γ, which makes it identical to equation 3.32, the form derived via the Poisson-gamma mixture.
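Finally, the mean-dispersion form in equations 3.32 and 3.33 can be fitted directly in software. A sketch using statsmodels' NegativeBinomial model, which estimates the dispersion (called alpha there, corresponding to k in this section) by maximum likelihood; the simulated data and parameter values are illustrative only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0, 1, size=n)
X = sm.add_constant(x)
mu = np.exp(0.2 + 1.0 * x)                     # log-linear mean, equation 3.33
k = 0.7                                        # dispersion
lam = rng.gamma(shape=1.0 / k, scale=k * mu)   # gamma heterogeneity per observation
y = rng.poisson(lam)                           # NB counts via the Poisson-gamma mixture

result = sm.NegativeBinomial(y, X).fit(disp=False)
print(result.params)   # intercept, slope, and alpha (the estimated dispersion)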