KAPLAN–MEIER SURVIVAL CURVE - Biostatistical Methods in Epidemiology

CHAPTER 9

Kaplan–Meier and Actuarial Methods for Censored Survival Data

In this chapter we describe the Kaplan–Meier and actuarial methods of estimating a survival function from censored survival data. An important feature of these methods is that, aside from uninformative censoring, they are based on relatively few assumptions. In the case of the Kaplan–Meier method, nothing is assumed about the functional form of either the survival curve or the hazard function. An approach to comparing survival curves is presented which is based on the stratified odds ratio techniques of Chapter 5. The MH–RBG methods are shown to be especially useful in this regard. References for this chapter are those given at the beginning of Chapter 8.

those who are censored at $\tau_j$, and those who die at $\tau_j$. Defining the first two groups of subjects to be at risk at $\tau_j$ is reasonable since, if death occurs, it will happen some time after $\tau_j$. However, including subjects who die at $\tau_j$ in the risk set is not intuitive.

This convention has its origins in the theoretical development of the Kaplan–Meier method, which can be viewed as a limiting case of the actuarial method discussed below. Indeed, another name for the Kaplan–Meier approach to censored survival data is the product-limit method. Loosely speaking, the $j$th risk set consists of all subjects who are alive “just prior to” $\tau_j$. Let $r_j$ denote the number of subjects in the $j$th risk set ($j = 0, 1, \ldots, J$), and denote by $r_{J+1}$ the number of subjects who survive to $\tau_{J+1}$. Note that $c_J$ includes the $r_{J+1}$ subjects who survive to the end of the study. We need to separate this group of individuals from those who are censored for other reasons. Define $c'_j$ as follows: $c'_j = c_j$ for $j < J$, and $c'_J = c_J - r_{J+1}$. Since subjects exit the cohort only by death or censoring, it follows that

$$r_{j+1} = r_j - a_j - c'_j \tag{9.1}$$

($j = 0, 1, \ldots, J$). Therefore,

$$r_0 - r_{J+1} = \sum_{j=0}^{J} (r_j - r_{j+1}) = \sum_{j=0}^{J} a_j + \sum_{j=0}^{J} c'_j = a_\bullet + c'_\bullet$$

and so $r_{J+1} = r_0 - a_\bullet - c'_\bullet$. This identity says that the number of subjects surviving to the end of the study equals the number who started minus those who died or were censored prior to $\tau_{J+1}$.

We now derive an estimate of $S(t)$ based on the above partition and certain conditional probabilities. For brevity we denote $S(\tau_j)$ by $S_j$ so that, in particular, $S_0 = 1$.

For $j > 0$, consider the interval $[\tau_j - \varepsilon, \tau_j + \varepsilon)$, where $\varepsilon$ is a positive number that is small enough to ensure that the interval does not contain any death times other than $\tau_j$ or any censoring times. For $j = 0$, we use the interval $[\tau_0, \tau_0 + \varepsilon)$ and make some obvious modifications to the following arguments. Let $p_j$ denote the conditional probability of surviving to $\tau_j + \varepsilon$, given survival to $\tau_j - \varepsilon$ ($j = 0, 1, \ldots, J$). That is, $p_j$ is the conditional probability of surviving to just after $\tau_j$, given survival to just prior to $\tau_j$. Let $q_j = 1 - p_j$ be the corresponding conditional probability of dying. Since there are $r_j$ subjects at risk at $\tau_j - \varepsilon$, and $a_j$ of them die prior to $\tau_j + \varepsilon$, the binomial estimates are

$$\hat{q}_j = \frac{a_j}{r_j} \tag{9.2}$$

and

$$\hat{p}_j = 1 - \hat{q}_j = \frac{r_j - a_j}{r_j} \tag{9.3}$$

($j = 0, 1, \ldots, J$). These estimates are valid regardless of how small we take $\varepsilon$. In the limit, as $\varepsilon$ goes to 0, $[\tau_j - \varepsilon, \tau_j + \varepsilon)$ shrinks to the single point $\tau_j$, in keeping with the product-limit approach. Since $a_0 = 0$, it follows that $\hat{q}_0 = 0$ and $\hat{p}_0 = 1$.

The probability of surviving from $\tau_0$ to $\tau_1$ is $p_1$. For those who survive to $\tau_1$, the conditional probability of surviving to $\tau_2$ is $p_2$, and so the probability of surviving from $\tau_0$ to $\tau_2$ is $S_2 = p_1 p_2$. Likewise, the probability of surviving from $\tau_0$ to $\tau_3$ is $S_3 = p_1 p_2 p_3$. Proceeding in this way we obtain the probability $S_j = p_1 p_2 \cdots p_j$ of surviving from $\tau_0$ to $\tau_j$ ($j = 1, 2, \ldots, J$). The Kaplan–Meier estimate of $S_j$ is

$$\hat{S}_j = \hat{p}_1 \hat{p}_2 \cdots \hat{p}_j \tag{9.4}$$

($j = 1, 2, \ldots, J$). We define $\hat{S}_0 = 1$ and $\hat{S}_{J+1} = \hat{S}_J$. Since there are no deaths in the $j$th interval other than at $\tau_j$, $\hat{S}(t)$ equals $\hat{S}_j$ for all $t$ in the interval, and so the graph of $\hat{S}(t)$ is a step function.
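As a minimal sketch of (9.2)–(9.4), the following Python function computes the product-limit estimate from a list of observations. The `(time, died)` encoding, the function name `kaplan_meier`, and the toy data are assumptions made for illustration and do not come from the text.

```python
from collections import Counter

def kaplan_meier(observations):
    """Return one row (tau_j, a_j, r_j, p_hat_j, S_hat_j) per distinct death time.

    `observations` is an iterable of (time, died) pairs, with died = 1 for a
    death and died = 0 for a censored time (an assumed encoding).
    """
    observations = list(observations)
    deaths = Counter(t for t, died in observations if died)
    rows = []
    s_hat = 1.0  # S_hat_0 = 1
    for tau in sorted(deaths):
        a = deaths[tau]
        # Risk set: subjects alive "just prior to" tau, which includes those
        # who die at tau and those who are censored at tau.
        r = sum(1 for t, _ in observations if t >= tau)
        p_hat = (r - a) / r   # equation (9.3)
        s_hat *= p_hat        # equation (9.4)
        rows.append((tau, a, r, p_hat, s_hat))
    return rows

# Made-up data: deaths at 2, 3, 3, 5 and censoring at 1, 3, 4.
toy = [(1, 0), (2, 1), (3, 1), (3, 1), (3, 0), (4, 0), (5, 1)]
km_rows = kaplan_meier(toy)
for tau, a, r, p_hat, s_hat in km_rows:
    print(f"tau={tau} a={a} r={r} p_hat={p_hat:.3f} S_hat={s_hat:.3f}")
```

Between consecutive death times the estimate is held at its most recent value, which is what produces the step function.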

When the Kaplan–Meier estimate of the survival function is graphed as a function of time, it will be referred to as the Kaplan–Meier survival curve. A schematic representation of a Kaplan–Meier survival curve is shown in Figure 9.1. Note that each of the line segments making up the steps includes the left endpoint (indicated by a solid dot) but not the right endpoint (indicated by a circle), except for the final line segment that includes both endpoints. This is consistent with the way intervals were defined. Most software packages join the steps with vertical lines to enhance visual appearance. If $\tau_{J+1}$ is a death time, the final line segment shrinks to a single point.

FIGURE 9.1 Schematic Kaplan–Meier survival curve

An estimate of $\operatorname{var}(\hat{S}_j)$ is given by “Greenwood’s formula,”

$$\widehat{\operatorname{var}}(\hat{S}_j) = (\hat{S}_j)^2 \sum_{i=1}^{j} \frac{\hat{q}_i}{\hat{p}_i r_i} \tag{9.5}$$

and a $(1-\alpha) \times 100\%$ confidence interval for $\hat{S}_j$ is

$$[\underline{S}_j, \overline{S}_j] = \hat{S}_j \pm z_{\alpha/2} \sqrt{\widehat{\operatorname{var}}(\hat{S}_j)}$$

($j = 1, 2, \ldots, J$) (Greenwood, 1926). The normal approximation can be improved by using the log–minus-log transformation $\log(-\log \hat{S}_j)$. This leads to the Kalbfleisch–Prentice estimate,

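$$\widehat{\operatorname{var}}[\log(-\log \hat{S}_j)] = \frac{1}{(\log \hat{S}_j)^2} \sum_{i=1}^{j} \frac{\hat{q}_i}{\hat{p}_i r_i}$$

($j = 1, 2, \ldots, J$) (Kalbfleisch and Prentice, 1980, p. 15). A $(1-\alpha) \times 100\%$ confidence interval for $\hat{S}_j$ is obtained from

$$[\log(-\log \underline{S}_j), \log(-\log \overline{S}_j)] = \log(-\log \hat{S}_j) \mp z_{\alpha/2} \sqrt{\widehat{\operatorname{var}}[\log(-\log \hat{S}_j)]} \tag{9.6}$$

by inverting the log–minus-log transformation, as illustrated in Example 9.1. Note the $\mp$ sign in (9.6) rather than the usual $\pm$ sign: because $\log(-\log S)$ is a decreasing function of $S$, subtracting the margin gives the upper bound $\overline{S}_j$ and adding it gives the lower bound $\underline{S}_j$.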

It is of interest to examine the estimates $\hat{S}_j$ and $\widehat{\operatorname{var}}(\hat{S}_j)$ under the assumption that there is no censoring except at $\tau_{J+1}$, that is, assuming $c'_j = 0$ ($j = 0, 1, \ldots, J$). With this assumption, (9.1) simplifies to $r_{j+1} = r_j - a_j$ ($j = 0, 1, \ldots, J$). From this identity, as well as (9.3)–(9.5), $\hat{p}_0 = 1$, and $\hat{q}_0 = 0$, it follows that

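$$
\begin{aligned}
\hat{S}_j &= \hat{p}_1 \hat{p}_2 \cdots \hat{p}_j = \hat{p}_0 \hat{p}_1 \cdots \hat{p}_j \\
&= \left(\frac{r_0 - a_0}{r_0}\right) \left(\frac{r_1 - a_1}{r_1}\right) \cdots \left(\frac{r_j - a_j}{r_j}\right) \\
&= \left(\frac{r_0 - a_0}{r_0}\right) \left(\frac{r_1 - a_1}{r_0 - a_0}\right) \cdots \left(\frac{r_{j+1}}{r_{j-1} - a_{j-1}}\right) \\
&= \frac{r_{j+1}}{r_0}
\end{aligned}
\tag{9.7}
$$

and

$$
\begin{aligned}
\widehat{\operatorname{var}}(\hat{S}_j) &= (\hat{S}_j)^2 \sum_{i=1}^{j} \frac{\hat{q}_i}{\hat{p}_i r_i} = (\hat{S}_j)^2 \sum_{i=0}^{j} \frac{\hat{q}_i}{\hat{p}_i r_i} \\
&= (\hat{S}_j)^2 \sum_{i=0}^{j} \left(\frac{1}{r_{i+1}} - \frac{1}{r_i}\right) \\
&= (\hat{S}_j)^2 \left(\frac{1}{r_{j+1}} - \frac{1}{r_0}\right) \\
&= \frac{\hat{S}_j (1 - \hat{S}_j)}{r_0}.
\end{aligned}
\tag{9.8}
$$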

Identities (9.7) and (9.8) show that when there is no censoring except at $\tau_{J+1}$, the Kaplan–Meier and Greenwood estimates simplify to estimates based on the binomial distribution. The numerator of $\hat{S}_j$ in (9.7) is $r_{j+1}$ because, in the absence of censoring, the risk set at $\tau_{j+1}$ is the same as the group of survivors to $\tau_j$.
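Identities (9.7) and (9.8) can be checked numerically. The sketch below uses a made-up cohort of 20 subjects with no censoring before the end of follow-up; the cohort size and death counts are hypothetical and serve only to exercise the algebra.

```python
# Hypothetical cohort: r_0 = 20 subjects, no censoring before the end of follow-up.
r0 = 20
deaths = [2, 1, 3, 1]  # made-up a_1, ..., a_4

s_hat = 1.0        # running Kaplan-Meier estimate, equation (9.4)
greenwood_sum = 0  # running sum of q_hat_i / (p_hat_i * r_i)
r = r0             # current risk-set size
for a in deaths:
    s_hat *= (r - a) / r
    greenwood_sum += a / ((r - a) * r)  # same as q_hat / (p_hat * r)
    r -= a                              # (9.1) with no censoring

survivors = r                                   # r_{j+1} after the last death time
greenwood_var = s_hat ** 2 * greenwood_sum      # Greenwood's formula (9.5)
binomial_var = s_hat * (1 - s_hat) / r0         # right-hand side of (9.8)

print(s_hat, survivors / r0)        # both sides of (9.7)
print(greenwood_var, binomial_var)  # both sides of (9.8)
```

Up to floating-point rounding, the two printed pairs agree: the Kaplan–Meier estimate is the observed proportion of survivors, and Greenwood's variance is the usual binomial variance estimate.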

TABLE 9.1 Survival Times: Stage–Receptor Level–Breast Cancer

| Stage | Receptor level | $r_0$ | Survival times |
|---|---|---|---|
| I | Low | 12 | 50∗ 51 51∗ 53∗(2) 54∗(2) 55∗ 56 56∗ 57∗ 60∗ |
| I | High | 57 | 10 34 34∗ 47(2) 49∗(2) 50∗(7) 51∗(6) 52∗(5) 53∗(6) 54∗(5) 55∗(2) 56∗(2) 57∗(5) 58∗(5) 59∗(4) 60∗(3) |
| II | Low | 23 | 4∗ 9 13 21 29(2) 40 46 49∗(2) 52∗(2) 53∗ 54∗ 55∗(2) 56∗ 57 57∗ 58∗(2) 59∗ 60∗ |
| II | High | 75 | 11 16 21 23(2) 24 33(2) 36(2) 36∗ 37 45 46 49∗(2) 50∗(6) 51∗(4) 52∗(5) 53∗(5) 54∗(4) 55∗(4) 56∗(6) 57∗(4) 58(2) 58∗(8) 59∗(5) 60∗(6) |
| III | Low | 15 | 9 12 14 15 15∗ 17 21 22 23(2) 31 34 35 53∗ 60∗ |
| III | High | 17 | 7∗ 9 17 21∗ 22(2) 34(2) 41 49∗ 52∗ 55 56∗ 58∗(2) 59∗(2) |

Example 9.1 (Receptor Level–Stage–Breast Cancer) The data in Table 9.1 are based on the cohort of 199 breast cancer patients described in Example 4.2. Recall that these individuals are a random sample of patients registered on the Northern Alberta Breast Cancer Registry during 1985. For the present example this cohort was followed to the end of 1989. Therefore the maximum observation times range between 4 and 5 years depending on the date of registration. Survival time was defined to be the length of time from registration until death from breast cancer or censoring. Survival times were first calculated to the nearest day and then rounded up to the nearest month to ensure that at least a few subjects had the same survival time.

So the maximum survival time is 60 months. As discussed in Example 4.2, Registry patients receive regular checkups and their vital status is monitored on an ongoing basis. It is therefore reasonable to assume that members of the cohort were alive at the end of 1989 in the absence of information to the contrary. Therefore the reasons for censoring in this cohort are death from a cause other than breast cancer and exiting the study alive at the end of 1989.

For present purposes, we interpret the survival times as continuous, so that, for example, $t = 50$ is to be read as $t = 50.0$. The asterisks in Table 9.1 denote censored survival times and so 50∗ means that the subject was censored at $t = 50$, while 51 indicates that the subject died (of breast cancer) at $t = 51$. Strictly speaking, we should refer to the entries in Table 9.1 as observations since, for example, 50∗ and 51 are actually shorthand for (50, 0) and (51, 1). In Table 9.1, numbers in parentheses denote the multiplicity of survival times, so that 53∗(2) represents two survival times of 53∗. Note that when death and censoring take place at the same time, the censoring time has been recorded to the right of the death time. This convention is adopted since, when there is a tie, the (unobserved) death time for the censored individual (when it occurs) will be larger than the (observed) censoring time.
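The shorthand used in Table 9.1 can be expanded mechanically into (time, indicator) observations. The following Python sketch is one way to do so; the function name `parse_survival_times` and the regular expression are assumptions for illustration, and the parser accepts either a plain `*` or the `∗` character as the censoring mark.

```python
import re

# One entry: a time, an optional censoring mark, an optional "(multiplicity)".
ENTRY = re.compile(r"^(\d+)([*∗]?)(?:\((\d+)\))?$")

def parse_survival_times(text):
    """Expand Table 9.1 shorthand into a list of (time, died) pairs.

    died is 1 for a death and 0 for a censored time, so "50*" becomes (50, 0),
    "51" becomes (51, 1), and "53*(2)" becomes two copies of (53, 0).
    """
    pairs = []
    for token in text.split():
        match = ENTRY.match(token)
        if match is None:
            raise ValueError(f"unrecognised entry: {token!r}")
        time = int(match.group(1))
        died = 0 if match.group(2) else 1
        multiplicity = int(match.group(3)) if match.group(3) else 1
        pairs.extend([(time, died)] * multiplicity)
    return pairs

# Stage I, low receptor level row of Table 9.1.
stage1_low = parse_survival_times("50* 51 51* 53*(2) 54*(2) 55* 56 56* 57* 60*")
print(len(stage1_low), sum(died for _, died in stage1_low))
```

For this row the expansion gives 12 observations, matching $r_0 = 12$ in the table, of which 2 are deaths.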

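TABLE 9.2 Kaplan–Meier Estimates and Kalbfleisch–Prentice 95% Confidence Intervals: Breast Cancer

| $j$ | $\tau_j$ | $a_j$ | $r_j$ | $\hat{p}_j$ | $\hat{S}_j$ | $\underline{S}_j$ | $\overline{S}_j$ |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 199 | 1.0 | 1.0 | — | — |
| 1 | 9 | 3 | 197 | .985 | .985 | .954 | .995 |
| 2 | 10 | 1 | 194 | .995 | .980 | .947 | .992 |
| 3 | 11 | 1 | 193 | .995 | .975 | .940 | .989 |
| 4 | 12 | 1 | 192 | .995 | .970 | .933 | .986 |
| 5 | 13 | 1 | 191 | .995 | .964 | .927 | .983 |
| 6 | 14 | 1 | 190 | .995 | .959 | .920 | .979 |
| 7 | 15 | 1 | 189 | .995 | .954 | .914 | .976 |
| 8 | 16 | 1 | 187 | .995 | .949 | .908 | .972 |
| 9 | 17 | 2 | 186 | .989 | .939 | .895 | .965 |
| 10 | 21 | 3 | 184 | .984 | .924 | .877 | .953 |
| 11 | 22 | 3 | 180 | .983 | .908 | .858 | .941 |
| 12 | 23 | 4 | 177 | .977 | .888 | .835 | .925 |
| 13 | 24 | 1 | 173 | .994 | .883 | .829 | .920 |
| 14 | 29 | 2 | 172 | .988 | .872 | .817 | .912 |
| 15 | 31 | 1 | 170 | .994 | .867 | .811 | .908 |
| 16 | 33 | 2 | 169 | .988 | .857 | .800 | .899 |
| 17 | 34 | 4 | 167 | .976 | .836 | .777 | .881 |
| 18 | 35 | 1 | 162 | .994 | .831 | .771 | .877 |
| 19 | 36 | 2 | 161 | .988 | .821 | .760 | .868 |
| 20 | 37 | 1 | 158 | .994 | .816 | .754 | .863 |
| 21 | 40 | 1 | 157 | .994 | .811 | .748 | .859 |
| 22 | 41 | 1 | 156 | .994 | .805 | .743 | .854 |
| 23 | 45 | 1 | 155 | .994 | .800 | .737 | .850 |
| 24 | 46 | 2 | 154 | .987 | .790 | .726 | .841 |
| 25 | 47 | 2 | 152 | .987 | .779 | .714 | .831 |
| 26 | 51 | 1 | 129 | .992 | .773 | .708 | .826 |
| 27 | 55 | 1 | 77 | .987 | .763 | .695 | .818 |
| 28 | 56 | 1 | 67 | .985 | .752 | .680 | .810 |
| 29 | 57 | 1 | 55 | .982 | .738 | .662 | .800 |
| 30 | 58 | 2 | 43 | .953 | .704 | .615 | .776 |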

Table 9.2 gives the Kaplan–Meier estimates of the survival probabilities as well as the Kalbfleisch–Prentice 95% confidence intervals. The first step in creating Table 9.2 was to list the death times in increasing order and then count the number of deaths at each death time. The first death time is $\tau_1 = 9$ and the number of deaths is $a_1 = 3$. In total, there were 30 distinct death times and 49 deaths in the cohort.

The next step was to determine the number of subjects in each risk set using (9.1). For this purpose, the survival times were listed in increasing order and (9.1) was applied in a recursive fashion, starting with $j = 0$. To illustrate, for $j = 0$ the interval is $[0, 9.0)$, $r_0 = 199$, $a_0 = 0$, and $c'_0 = c_0 = 2$. So $r_1 = 199 - 0 - 2 = 197$. Observe that the two subjects with censoring times prior to the first death time make no contribution to the Kaplan–Meier calculations, and so the effective size of the cohort is $r_1 = 197$.
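The recursive use of (9.1) can be sketched in Python as follows. The survival times are transcribed from Table 9.1 with the censoring mark written as a plain `*`; the variable names and the parsing pattern are assumptions for illustration, not the author's own procedure.

```python
import re
from collections import Counter

# Survival times transcribed from Table 9.1 ("*" marks a censored time).
TABLE_9_1 = [
    "50* 51 51* 53*(2) 54*(2) 55* 56 56* 57* 60*",
    "10 34 34* 47(2) 49*(2) 50*(7) 51*(6) 52*(5) 53*(6) 54*(5) 55*(2) 56*(2) "
    "57*(5) 58*(5) 59*(4) 60*(3)",
    "4* 9 13 21 29(2) 40 46 49*(2) 52*(2) 53* 54* 55*(2) 56* 57 57* 58*(2) 59* 60*",
    "11 16 21 23(2) 24 33(2) 36(2) 36* 37 45 46 49*(2) 50*(6) 51*(4) 52*(5) "
    "53*(5) 54*(4) 55*(4) 56*(6) 57*(4) 58(2) 58*(8) 59*(5) 60*(6)",
    "9 12 14 15 15* 17 21 22 23(2) 31 34 35 53* 60*",
    "7* 9 17 21* 22(2) 34(2) 41 49* 52* 55 56* 58*(2) 59*(2)",
]
ENTRY = re.compile(r"^(\d+)(\*?)(?:\((\d+)\))?$")

# Tally deaths and censored observations at each recorded time.
deaths, censored = Counter(), Counter()
for row in TABLE_9_1:
    for token in row.split():
        time, star, multiplicity = ENTRY.match(token).groups()
        tally = censored if star else deaths
        tally[int(time)] += int(multiplicity or 1)

r0 = sum(deaths.values()) + sum(censored.values())
taus = [0] + sorted(deaths)  # tau_0 = 0 followed by the distinct death times

rows = []  # (j, tau_j, a_j, r_j, S_hat_j)
r, s_hat = r0, 1.0
for j, tau in enumerate(taus):
    a = deaths[tau]  # a_0 = 0 because no death is recorded at time 0
    s_hat *= (r - a) / r
    rows.append((j, tau, a, r, s_hat))
    if j + 1 < len(taus):
        # Censoring times falling in [tau_j, tau_{j+1}).
        c = sum(n for t, n in censored.items() if tau <= t < taus[j + 1])
        r = r - a - c  # recursion (9.1)

for j, tau, a, r_j, s in rows:
    print(f"{j:2d} {tau:2d} {a} {r_j:3d} {s:.3f}")
```

The printed columns correspond to $j$, $\tau_j$, $a_j$, $r_j$, and $\hat{S}_j$ in Table 9.2. Counting subjects with survival time at least $\tau_j$ would give the same risk sets; the recursion is used here only because it mirrors the description above.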

To calculate the 95% confidence interval for $\hat{S}_j$ using the Kalbfleisch–Prentice method it is necessary to invert the log–minus-log transformation. This is illustrated for $S_2$. From $\hat{S}_2 = .980$ and

$$\sum_{i=1}^{2} \frac{\hat{q}_i}{\hat{p}_i r_i} = \frac{.015}{.985(197)} + \frac{.005}{.995(194)} = (.010)^2$$

we have $\widehat{\operatorname{var}}[\log(-\log \hat{S}_2)] = (.010)^2 / [\log(.980)]^2 = (.495)^2$ and $\log(-\log \overline{S}_2) = \log[-\log(.980)] - 1.96(.495) = -4.87$. To invert the log–minus-log transformation, the exp–minus-exp transformation is applied to obtain $\overline{S}_2 = \exp[-\exp(-4.87)] = .992$. Based on Greenwood’s formula, the upper 95% confidence bound for $\hat{S}_1$ is $\overline{S}_1 = 1.002$, which lies above 1. The Kalbfleisch–Prentice approach has the attractive property that the upper and lower bounds are always between 0 and 1.
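The arithmetic in this illustration can be reproduced with a few lines of Python. The sketch below takes $a_j$ and $r_j$ from the first two rows of Table 9.2 and applies (9.5) and (9.6); the variable names are assumptions, and the code carries full precision rather than the rounded intermediate values (.010, .495) shown in the text, so intermediate quantities differ slightly while the bounds agree to three decimals.

```python
import math

# First two death times in Table 9.2: deaths a_j and risk-set sizes r_j.
a = [3, 1]
r = [197, 194]
z = 1.96  # normal quantile for a 95% interval

s_hat = 1.0  # running Kaplan-Meier estimate
g = 0.0      # running sum of q_hat_i / (p_hat_i * r_i)
intervals = []
for a_j, r_j in zip(a, r):
    s_hat *= (r_j - a_j) / r_j
    g += a_j / ((r_j - a_j) * r_j)  # equals q_hat_j / (p_hat_j * r_j)

    # Greenwood interval based on (9.5).
    se = s_hat * math.sqrt(g)
    greenwood = (s_hat - z * se, s_hat + z * se)

    # Kalbfleisch-Prentice interval (9.6): work on the log-minus-log scale,
    # then invert with exp-minus-exp. Adding the margin gives the LOWER
    # bound because log(-log S) decreases as S increases.
    loglog = math.log(-math.log(s_hat))
    se_loglog = math.sqrt(g) / abs(math.log(s_hat))
    lower = math.exp(-math.exp(loglog + z * se_loglog))
    upper = math.exp(-math.exp(loglog - z * se_loglog))
    intervals.append((s_hat, greenwood, (lower, upper)))

for j, (s, gw, kp) in enumerate(intervals, start=1):
    print(f"j={j} S_hat={s:.3f} Greenwood=({gw[0]:.3f}, {gw[1]:.3f}) "
          f"KP=({kp[0]:.3f}, {kp[1]:.3f})")
```

The sketch assumes $0 < \hat{S}_j < 1$ at every step, since the log–minus-log transformation is undefined at 0 and 1.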

Figure 9.2 shows the Kaplan–Meier survival curve and Kalbfleisch–Prentice 95% confidence intervals for the breast cancer cohort. Strictly speaking, the endpoints of the confidence intervals should be plotted only for the death times rather than joined as they have been here. An appropriate alternative is to estimate what is referred to as a confidence band (Marubini and Valsecchi, 1995, §3.4.2; Hosmer and Lemeshow, 1999, §2.3). A confidence band places simultaneous upper and lower bounds on the entire survival curve, but there is the drawback that the computations are somewhat involved. Confidence bands tend to be wider than joined confidence intervals.

FIGURE 9.2 Kaplan–Meier survival curve and Kalbfleisch–Prentice 95% confidence intervals: Breast cancer cohort

There were seven deaths in the cohort from causes other than breast cancer, and these were treated as censored observations. The question as to whether these deaths might somehow be related to breast cancer needs to be decided on substantive grounds. This decision has implications for whether censoring is deemed to be informative or uninformative. Even if we assume that censoring is uninformative, it would be incorrect to interpret $\hat{S}_j$ as an estimate of the probability of not dying of breast cancer before $\tau_j$. In light of remarks made at the end of Section 8.1, the correct interpretation adds the caveat that breast cancer is assumed to be the only cause of death. This is a convenient fiction when cohorts are being compared, but in reality, competing risks are virtually always present. As a result, when evaluating the findings of a survival analysis it is crucial to consider the possible effects of informative censoring and competing risks.
