6.4 Study Methods: An Overview of Survival Analysis

Survival analysis is another name for time-to-event analysis. The engineering sciences have contributed to its development; in that field it is called “reliability analysis” or “failure time analysis”, since the main focus is on modeling the time it takes for machines or electronic components to break down. Likewise, survival analysis has long been a cornerstone of biomedical research.

The analysis of duration data has come fairly recently to the social science literature, arguably with a more significant impact on the applied economics literature. In fact, economists have recently applied the same body of techniques to strike duration, length of unemployment spells, time until business failure, and so on (for applications of survival analysis in economics, see, among others, Greene 2000; Hosmer and Lemeshow 1999; Kieffer 1998; Lancaster 1990). This section borrows heavily from Cleves et al. (2002).

Table 6.2 Selected descriptive statistics

| Variables | Observations | Frequency (%) | Average Stay (days) | Median Stay (days) | Std. Dev. Stay (days) | Average Age (years) |
|---|---|---|---|---|---|---|
| (1) Socio-demographic Profiles | | | | | | |
| Portugal (Mainland) | 150 | 37.50 | 11 | 10 | 11 | 34 |
| Sweden | 95 | 23.75 | 9 | 7 | 3 | 57 |
| Other Nordic Countries | 49 | 12.25 | 9 | 7 | 3 | 47 |
| Germany | 21 | 5.25 | 14 | 8 | 24 | 41 |
| Other Countries | 85 | 21.25 | 17 | 14 | 14 | 44 |
| Male | 203 | 50.75 | 11 | 8 | 11 | 44 |
| Marital Status (Married) | 264 | 66.00 | 11.8 | 9.5 | 12 | 49 |
| Azorean Ascendancy | 70 | 17.50 | 18.7 | 15 | 13.2 | 43 |
| Education 1. Secondary | 127 | 31.75 | 10.9 | 10 | 6.8 | 40 |
| Education 2. Tertiary | 183 | 45.75 | 10.6 | 7 | 10.3 | 45 |
| Education 3. Technical | 8 | 2.00 | 8.6 | 7 | 3 | 48 |
| Education 4. Lesser | 82 | 20.5 | 16.8 | 13 | 17 | 47 |
| High Level Profession | 127 | 31.75 | 10.2 | 8 | 6.5 | 48 |
| (2) Trip Attributes | | | | | | |
| Leisure | 294 | 73.50 | 10.6 | 8 | 8.6 | 45 |
| Visit Friends/Relatives | 57 | 14.25 | 17.4 | 15 | 14 | 44 |
| Business | 35 | 8.75 | 15.6 | 10 | 22 | 36 |
| Other motive | 14 | 3.50 | 8.5 | 6 | 6.2 | 37 |
| Repeat visitor | 141 | 35.25 | 16.5 | 14 | 16.7 | 41 |
| Charter flight | 150 | 37.50 | 11.2 | 7 | 6.9 | 52 |
| Total | 400 | 100.00 | 11 | 9 | 11 | 43 |

Certain features of survival data, such as censoring and non-normality, create great difficulty when the data are analyzed with traditional statistical models such as the linear regression model. The variable of interest in the analysis of duration is the length of time that elapses from the beginning of some event either until its end or until the measurement is taken, which may precede termination. Hence, durations (the so-called spells) are sometimes censored, in the sense that the researcher does not observe the termination of the event.

This framework of analysis lends itself naturally to the study of length of stay, as one is interested in the determinants of the length of time that elapses between the tourist’s arrival at a given destination and his or her departure. The data set employed in the present article was collected at airports from tourists who were departing at the end of their trips. Hence, there is no censoring in the data, since all interviewees reported their completed length of stay. The discussion that follows therefore assumes away censoring.

Spell length is, by construction, a non-negative variable. Let spell length be represented by a random variable $T$ with continuous probability density function $f(t)$, where $t$ is a realization of $T$. One is usually interested in the probability that the spell is of length at least $t$, which is given by the survival function $S(t) = \Pr(T \geq t)$. The hazard rate, $\lambda(t)$, in turn, answers the following question: given that the spell has lasted until time $t$, what is the probability that it will end in the next short interval of time, $\Delta$? More formally:

$$\lambda(t) = \lim_{\Delta \to 0} \frac{\Pr(t \leq T < t + \Delta \mid T \geq t)}{\Delta} = \frac{f(t)}{S(t)} \tag{6.1}$$

Intuitively, the hazard rate is the rate at which spells are completed after duration $t$, given that they last at least until $t$. Armed with the hazard rate, one recovers the survival function by integration, $S(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right)$. Hence, as a matter of convenience, one usually focuses on estimating the hazard function directly.
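The link between the hazard rate and the survival function can be checked numerically. The sketch below is illustrative only: it assumes a Weibull hazard with made-up parameter values (`lam=0.1`, `p=1.5`), integrates the hazard with a simple midpoint rule, and compares the result with the closed-form survival function. None of the function names come from the source.

```python
import math

def weibull_hazard(t, lam=0.1, p=1.5):
    """Hazard p * lam * t**(p - 1); parameter values are illustrative."""
    return p * lam * t ** (p - 1)

def weibull_survival(t, lam=0.1, p=1.5):
    """Closed-form Weibull survival function S(t) = exp(-lam * t**p)."""
    return math.exp(-lam * t ** p)

def weibull_density(t, lam=0.1, p=1.5):
    """Density f(t) = hazard(t) * S(t), i.e. equation (6.1) rearranged."""
    return weibull_hazard(t, lam, p) * weibull_survival(t, lam, p)

def survival_from_hazard(hazard, t, steps=10_000):
    """S(t) = exp(-integral of the hazard over [0, t]), via the midpoint rule."""
    width = t / steps
    cumulative_hazard = sum(hazard((i + 0.5) * width) for i in range(steps)) * width
    return math.exp(-cumulative_hazard)

t = 9.0  # e.g. a nine-day stay
s_numeric = survival_from_hazard(weibull_hazard, t)
s_closed = weibull_survival(t)
hazard_from_ratio = weibull_density(t) / weibull_survival(t)

print(f"S(t) from integrated hazard: {s_numeric:.6f}")
print(f"S(t) in closed form:         {s_closed:.6f}")
print(f"f(t)/S(t) = {hazard_from_ratio:.6f}, hazard = {weibull_hazard(t):.6f}")
```

The two survival values agree up to integration error, which is why specifying the hazard is enough to pin down the whole duration distribution.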

Two frequently used models for adjusting survival functions for the effects of covariates are the accelerated failure-time (AFT) model and the multiplicative or proportional hazards (PH) model. In the AFT model, the natural logarithm of the survival time, $\ln t$, is expressed as a linear function of the $(1 \times k)$ vector of time-invariant covariates $x$, yielding the linear model $\ln t = x\beta + z$, where $\beta$ is a $(k \times 1)$ vector of regression coefficients to be estimated and $z$ is the error term with density $f(\cdot)$. The distributional form of the error term determines the regression model.
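A small simulation may help fix ideas about the AFT form. The sketch below assumes a normally distributed error term (which gives the lognormal AFT model), an intercept plus one hypothetical binary covariate, and made-up coefficient values. With no censoring, regressing ln t on x by ordinary least squares recovers the coefficients in this lognormal case; this is an illustration, not the estimation procedure used in the chapter.

```python
import math
import random

random.seed(0)

# Hypothetical values: intercept, slope on a binary covariate, error std. dev.
beta0, beta1, sigma = 2.0, 0.5, 0.6
n = 5000

# x could stand for, say, a repeat-visitor dummy (an assumption for illustration).
x = [random.randint(0, 1) for _ in range(n)]

# AFT model: ln t = beta0 + beta1 * x + z, with z ~ Normal(0, sigma^2).
log_t = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]

# Closed-form OLS for a single regressor with an intercept.
x_bar = sum(x) / n
y_bar = sum(log_t) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, log_t))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# exp(b1) is the time ratio: the factor by which x = 1 stretches survival time.
time_ratio = math.exp(b1)

print(f"estimated intercept: {b0:.3f} (true {beta0})")
print(f"estimated slope:     {b1:.3f} (true {beta1})")
print(f"time ratio exp(b1):  {time_ratio:.3f}")
```

In the AFT metric a positive coefficient lengthens the expected spell, which is the opposite sign convention from a hazard-metric coefficient.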

In the proportional hazards model, the concomitant covariates have a multiplicative effect on the hazard function:

$$\lambda(t, x_i) = \lambda_0(t)\exp(x_i\beta) \tag{6.2}$$

where $\lambda_0(t)$ is the baseline hazard function. Intuitively, the baseline hazard function $\lambda_0(t)$ summarizes the pattern of duration dependence and is common to all persons, while $\lambda = \exp(x_i\beta)$ is a non-negative function of the person-specific covariates $x_i$ that scales the baseline hazard, thereby controlling for the effect of individual heterogeneity.

The PH property implies that absolute differences in $x$ imply proportionate differences in the hazard rate at each $t$. For some $t = \bar{t}$, and for two persons $i$ and $j$ identical in all respects except that the $k$th covariate is one unit higher for person $i$, the ratio of the hazard rates is:

$$\frac{\lambda(\bar{t}, x_i)}{\lambda(\bar{t}, x_j)} = \exp(\beta_k) \tag{6.3}$$

The above expression lends a natural interpretation to $\beta_k$, namely the log hazard ratio, $\beta_k = \partial \log \lambda(t, x)/\partial x_k$, which is easily recognized as a semi-elasticity, or as an elasticity when $x_k$ is measured in logarithms.
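The proportionality property in equation (6.3) can be verified directly. The sketch below assumes a Weibull-type baseline hazard and a hypothetical coefficient vector; all numerical values are made up for illustration. Whatever the evaluation time, the hazard ratio for a unit increase in the kth covariate equals exp(βk), because the baseline hazard cancels.

```python
import math

def baseline_hazard(t, p=1.3):
    """Illustrative baseline hazard p * t**(p - 1); any positive function would do."""
    return p * t ** (p - 1)

def ph_hazard(t, x, beta):
    """Proportional hazards form: baseline hazard times exp(x . beta)."""
    linear_index = sum(xk * bk for xk, bk in zip(x, beta))
    return baseline_hazard(t) * math.exp(linear_index)

beta = [0.4, -0.25]  # hypothetical coefficients
x_j = [1.0, 2.0]     # person j
x_i = [1.0, 3.0]     # person i: second covariate one unit higher

# The ratio is evaluated at several durations to show it does not depend on t.
ratios = [ph_hazard(t, x_i, beta) / ph_hazard(t, x_j, beta) for t in (1.0, 5.0, 20.0)]

print("hazard ratios:", [round(r, 6) for r in ratios])
print("exp(beta_k):  ", round(math.exp(beta[1]), 6))
```

Here exp(-0.25) is below one, so under these made-up values a unit increase in the second covariate lowers the hazard of ending the stay at every duration, which implies longer stays.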

The baseline function $\lambda_0(t)$ may be left unspecified, yielding the Cox PH model, or it may take a specific parametric form, which, provided that the correct distributional form is chosen, leads to more efficient estimates.

The choice of a particular distribution matters because it determines the shape of the hazard function. Each distribution yields a particular hazard function, which may feature duration dependence, in the sense that the probability that a stay ends in the next short interval of time may depend on the length of the stay so far.

Since there is scant, or virtually no, empirical evidence on the shape of the hazard function of lengths of stay, this paper takes an agnostic view and entertains a variety of possible shapes. Hence, the hazard function of stays is estimated under six alternative distributions: the exponential, Weibull and Gompertz, the three most popular PH models; and the generalized gamma, lognormal and log-logistic, the most widely employed AFT models. Together these accommodate, ex ante, several possible shapes of the hazard function. It should be noted that this approach of letting the data speak makes it possible to test some models formally against others and, hence, to select the model formally deemed most appropriate.
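One way to test a model formally against another is a likelihood ratio test between nested models. The sketch below is a hypothetical illustration, not the chapter's estimation code: it simulates 400 uncensored durations from a Weibull distribution with made-up parameters, fits the exponential model and the Weibull model (hazard pλt^(p−1), of which the exponential is the special case p = 1) by maximum likelihood without covariates, and compares twice the log-likelihood difference with the 5% chi-square critical value for one degree of freedom (3.84).

```python
import math
import random

def exponential_loglik(times):
    """Maximized log-likelihood of the exponential model (no censoring)."""
    n = len(times)
    lam = n / sum(times)  # MLE of the constant hazard
    return n * math.log(lam) - n

def weibull_mle(times, lo=0.01, hi=20.0, iters=100):
    """Weibull MLE for hazard p*lam*t**(p-1), solving the profile score in p by bisection."""
    n = len(times)
    logs = [math.log(t) for t in times]
    mean_log = sum(logs) / n

    def score(p):
        powered = [t ** p for t in times]
        weighted = sum(a * b for a, b in zip(powered, logs)) / sum(powered)
        return 1.0 / p + mean_log - weighted  # decreasing in p

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    lam = n / sum(t ** p for t in times)  # MLE of lam given p
    loglik = n * math.log(p) + n * math.log(lam) + (p - 1) * sum(logs) - n
    return p, lam, loglik

random.seed(1)
# Simulated stays: scale 10 and shape 1.5 are illustrative values, not estimates.
stays = [random.weibullvariate(10.0, 1.5) for _ in range(400)]

ll_exp = exponential_loglik(stays)
p_hat, lam_hat, ll_weibull = weibull_mle(stays)
lr_stat = 2.0 * (ll_weibull - ll_exp)

print(f"estimated shape p: {p_hat:.3f}")
print(f"LR statistic:      {lr_stat:.2f} (5% critical value: 3.84)")
print("reject exponential" if lr_stat > 3.84 else "do not reject exponential")
```

Non-nested pairs, such as the lognormal against the log-logistic, cannot be compared this way; an information criterion is the usual alternative in that case.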

The exponential distribution yields a constant hazard rate, $\lambda_0(t) = \lambda$, and hence is suitable for modeling length of stay when the probability that a stay ends in the next short interval of time does not depend on the length of the stay. The Weibull distribution, in turn, is a generalization of the exponential distribution and is suitable for modeling data with monotone hazard rates that either increase or decrease with time. The corresponding baseline function is $\lambda_0(t) = p\lambda t^{p-1}$, where $p$ is an ancillary parameter to be estimated from the data: the hazard increases when $p > 1$ and decreases when $p < 1$. Note that when $p = 1$ the Weibull model collapses to the exponential model. The Gompertz distribution yields the baseline function $\lambda_0(t) = \exp(\gamma t)$, where $\gamma$ is an ancillary parameter to be estimated from the data. Like the Weibull distribution, the Gompertz distribution is suitable for modeling data with monotone hazard rates, which in this case increase or decrease exponentially over time.

Unlike the PH models, namely the exponential, Weibull and Gompertz models, the lognormal and log-logistic are two AFT models that tend to produce similar results and are indicated for data exhibiting non-monotonic hazard rates, specifically rates that initially increase and then decrease. Finally, the generalized gamma, another AFT model, yields an extremely flexible hazard function that allows for a large number of possible shapes and includes the Weibull, the exponential and the lognormal models as special cases. The generalized gamma model is, therefore, commonly used for evaluating and selecting an appropriate model for the data. Note that the lognormal, log-logistic and generalized gamma models are estimated in AFT form, whilst the exponential, Weibull and Gompertz models are estimated in PH form; the resulting regression coefficients $\beta$ are therefore not directly comparable (see Cleves et al. 2002 for more details on survival analysis models). The remainder of this section deals with model estimation and model selection.
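The hazard shapes described above can be tabulated with a few lines of code. The sketch below uses textbook forms of each hazard with made-up parameter values; the log-logistic parameterization (a scale and a shape parameter) and the helper `classify` are assumptions introduced here for illustration and do not come from the source. The generalized gamma is omitted because its hazard has no simple closed form.

```python
import math

def exponential_hazard(t, lam=0.1):
    return lam  # constant hazard

def weibull_hazard(t, lam=0.1, p=0.7):
    return p * lam * t ** (p - 1)  # decreasing because p < 1

def gompertz_hazard(t, lam=0.05, gamma=0.05):
    return lam * math.exp(gamma * t)  # increasing because gamma > 0

def loglogistic_hazard(t, scale=8.0, shape=2.0):
    ratio = (t / scale) ** shape
    return (shape / t) * ratio / (1.0 + ratio)  # hump-shaped when shape > 1

def lognormal_hazard(t, mu=2.0, sigma=0.5):
    z = (math.log(t) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))
    return density / survival  # hazard = f(t) / S(t)

def classify(hazard, grid, tol=1e-12):
    """Describe the hazard's shape on the grid from the signs of its increments."""
    values = [hazard(t) for t in grid]
    signs = []
    for previous, current in zip(values, values[1:]):
        if abs(current - previous) <= tol:
            continue  # treat negligible changes as flat
        sign = 1 if current > previous else -1
        if not signs or signs[-1] != sign:
            signs.append(sign)
    labels = {
        (): "constant",
        (1,): "increasing",
        (-1,): "decreasing",
        (1, -1): "increasing then decreasing",
    }
    return labels.get(tuple(signs), "other")

grid = [0.5 * i for i in range(1, 121)]  # durations from 0.5 to 60 days
models = {
    "exponential": exponential_hazard,
    "Weibull (p < 1)": weibull_hazard,
    "Gompertz (gamma > 0)": gompertz_hazard,
    "log-logistic": loglogistic_hazard,
    "lognormal": lognormal_hazard,
}
shapes = {name: classify(hazard, grid) for name, hazard in models.items()}
for name, shape in shapes.items():
    print(f"{name:22s} {shape}")
```

Changing the ancillary parameters flips the monotone cases: a Weibull shape above one or a negative Gompertz gamma reverses the direction of duration dependence.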