4.4 Methods
4.4.1 Efficient Two-Phase Estimation
effects. We assume that the cumulative hazard function of the event time $T$ conditional on covariates $X$ and $Z$ takes the form $\int_0^t \exp\{\beta(s)X+\gamma^{\mathrm{T}}Z\}\,\mathrm{d}\Lambda(s)$, where $\beta(\cdot)$ is the time-varying regression coefficient of $X$, $\gamma$ is the time-fixed regression coefficient of $Z$, and $\Lambda(\cdot)$ is an unspecified positive increasing function. In the presence of right censoring, we observe $Y$ and $\Delta$ instead of $T$, where $Y=\min(T,C)$, $\Delta=I(T\le C)$, $C$ is the censoring time on $T$, and $I(\cdot)$ is the indicator function.
Let $P(\cdot\mid\cdot)$ denote a conditional density function. If $(Y,\Delta,X,Z)$ is observed for all $n$ subjects in the study, then the inference on $\beta(\cdot)$, $\gamma$, and $\Lambda(\cdot)$ is typically based on the likelihood $\prod_{i=1}^n P(Y_i,\Delta_i\mid X_i,Z_i)$.
Under the two-phase design, however, only $(Y,\Delta,Z)$ is measured on all $n$ subjects in Phase I, and $X$ is measured for a sub-sample of size $n_2$ in Phase II. Let $R$ be the selection indicator for the measurement of $X$ in Phase II. We assume that the distribution of $(R_1,\ldots,R_n)$ depends on $(Y_i,\Delta_i,X_i,Z_i)$ $(i=1,\ldots,n)$ only through the Phase I data $(Y_i,\Delta_i,Z_i)$ $(i=1,\ldots,n)$. This assumption implies that the data on $X$ are missing at random, so that the joint distribution of $(R_1,\ldots,R_n)$ conditional on $(Y_1,\Delta_1,Z_1,\ldots,Y_n,\Delta_n,Z_n)$ can be disregarded in the likelihood inference on $\beta(\cdot)$, $\gamma$, and $\Lambda(\cdot)$. Following Zeng and Lin (2014), we further assume that the censoring time $C$ is independent of $T$ given $(X,Z)$ among subjects with $R=1$ and independent of $T$ and $X$ given $Z$ among subjects with $R=0$. In this situation, the observed-data log-likelihood takes the form
\[
\sum_{i=1}^n R_i\left\{\log P(Y_i,\Delta_i\mid X_i,Z_i)+\log P(X_i\mid Z_i)\right\}
+\sum_{i=1}^n (1-R_i)\log\int P(Y_i,\Delta_i\mid x,Z_i)\,P(x\mid Z_i)\,\mathrm{d}x, \tag{4.1}
\]
where
\[
P(Y_i,\Delta_i\mid X_i,Z_i)\propto
\left[\Lambda'(Y_i)\exp\{\beta(Y_i)X_i+\gamma^{\mathrm{T}}Z_i\}\right]^{\Delta_i}
\exp\left[-\int_0^{Y_i}\exp\{\beta(t)X_i+\gamma^{\mathrm{T}}Z_i\}\,\mathrm{d}\Lambda(t)\right],
\]
where $f'(x)=\mathrm{d}f(x)/\mathrm{d}x$. Our main interest lies in the inference on $\beta(\cdot)$.
We use non-parametric maximum likelihood estimation and sieve approximation to maximize expression (4.1). First, we approximate $\beta(\cdot)$ on $[0,\tau]$ using B-splines, where $\tau$ is the study duration; that is, we assume
\[
\beta(t)=\sum_{l=1}^{d_n}\alpha_l A_l^w(t),
\]
where $A_l^w(\cdot)$ is the $l$th B-spline basis function of order $w$ on $[0,\tau]$, $d_n$ is the total number of functions in this B-spline basis, and $\alpha_l$ is the coefficient of $A_l^w(\cdot)$ $(l=1,\ldots,d_n)$ in the B-spline approximation of $\beta(\cdot)$.
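As a concrete illustration of this sieve approximation, the Python sketch below evaluates $\beta(t)=\sum_l\alpha_l A_l^w(t)$ with cubic B-splines built by the Cox–de Boor recursion. The study duration, knot sequence, and coefficients are made up for illustration and are not prescribed by the method.

```python
import numpy as np

def bspline_basis(t, knots, degree):
    """All B-spline basis functions of the given degree at points t,
    via the Cox-de Boor recursion; returns shape (n_basis, len(t))."""
    t = np.asarray(t, dtype=float)
    # Degree-0 pieces: indicators of the half-open knot intervals.
    B = np.array([(knots[i] <= t) & (t < knots[i + 1])
                  for i in range(len(knots) - 1)], dtype=float)
    for d in range(1, degree + 1):
        nxt = np.zeros((len(knots) - d - 1, len(t)))
        for i in range(len(knots) - d - 1):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:                       # 0/0 terms are dropped
                nxt[i] += (t - knots[i]) / left * B[i]
            if right > 0:
                nxt[i] += (knots[i + d + 1] - t) / right * B[i + 1]
        B = nxt
    return B

tau = 10.0                                   # illustrative study duration
degree = 3                                   # order w = 4, i.e., cubic splines
interior = np.linspace(0.0, tau, 6)[1:-1]    # illustrative interior knots
knots = np.concatenate([[0.0] * (degree + 1), interior, [tau] * (degree + 1)])
d_n = len(knots) - degree - 1                # number of basis functions A_l

alpha = np.ones(d_n)                         # illustrative coefficients alpha_l
t_grid = np.linspace(0.0, tau, 9)[:-1]
beta_vals = alpha @ bspline_basis(t_grid, knots, degree)
```

With all coefficients set to one, the partition-of-unity property of the clamped basis makes $\beta(t)=1$ throughout $[0,\tau)$, which is a convenient sanity check of the basis construction.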
Second, we estimate $\Lambda(\cdot)$ by a step function with jumps only at the observed $Y_i$ with $\Delta_i=1$ $(i=1,\ldots,n)$. Let $\lambda_i$ be the jump size of $\Lambda(\cdot)$ at $Y_i$; then $\lambda_i>0$ when $\Delta_i=1$ and $\lambda_i=0$ when $\Delta_i=0$. Finally, if $Z$ is discrete, then for each distinct observed value of $Z$, we estimate $P(X\mid Z)$ by a discrete probability function on the distinct observed values of $X$, denoted by $x_1,\ldots,x_m$ $(m\le n_2)$, where $m$ is the total number of distinct values of $X$ (i.e., $m$ increases with $n_2$). If $Z$ contains continuous components, then this nonparametric estimation procedure becomes infeasible because only a small number of observations on $X$ are associated with each value of $Z$. In this situation, we approximate $P(x\mid Z_i)$ and $\log P(x\mid Z_i)$ in expression (4.1) by
\[
P(x\mid Z_i)=\sum_{k=1}^m I(x=x_k)\sum_{j=1}^{s_n}B_j^q(Z_i)\,p_{kj}
\]
and
\[
\sum_{k=1}^m I(x=x_k)\sum_{j=1}^{s_n}B_j^q(Z_i)\log p_{kj},
\]
respectively, where $B_j^q(\cdot)$ is the $j$th B-spline basis function of order $q$, $s_n$ is the total number of functions in this B-spline basis, and $p_{kj}$ is the coefficient of $B_j^q(Z_i)$ at $x_k$ $(k=1,\ldots,m;\ j=1,\ldots,s_n)$ in the B-spline approximation of $P(x\mid Z_i)$. Details about the construction of the B-spline bases $\{A_l^w(\cdot)\}_{l=1}^{d_n}$ and $\{B_j^q(\cdot)\}_{j=1}^{s_n}$ and guidelines about the choices of $(w,d_n)$ and $(q,s_n)$ can be found in Chen et al. (2013) and Tao et al. (2017), respectively. In practice, $w$ and $q$ are typically chosen to be at most four, with order four corresponding to cubic splines, and $d_n$ and $s_n$ are determined by the Phase I sample size $n$. We note that the $p_{kj}$ need to satisfy the constraints
\[
\sum_{k=1}^m p_{kj}=1 \quad\text{and}\quad p_{kj}\ge 0 \quad (k=1,\ldots,m;\ j=1,\ldots,s_n) \tag{4.2}
\]
because $P(x\mid Z_i)$ is a conditional probability function. Consequently, the observed-data log-likelihood can be rewritten as
\[
l_n(\theta,\{\lambda_i\},\{p_{kj}\})=
\sum_{i=1}^n R_i\left\{\log P(Y_i,\Delta_i\mid X_i,Z_i)+\sum_{k=1}^m\sum_{j=1}^{s_n}I(X_i=x_k)B_j^q(Z_i)\log p_{kj}\right\}
+\sum_{i=1}^n (1-R_i)\log\left\{\sum_{k=1}^m P(Y_i,\Delta_i\mid x_k,Z_i)\sum_{j=1}^{s_n}B_j^q(Z_i)\,p_{kj}\right\}, \tag{4.3}
\]
where
\[
P(Y_i,\Delta_i\mid x,Z_i)\propto
\left\{\lambda_i\exp\left(\sum_{l=1}^{d_n}\alpha_l A_l^w(Y_i)x+\gamma^{\mathrm{T}}Z_i\right)\right\}^{\Delta_i}
\times\exp\left\{-\sum_{i'=1}^n I(Y_{i'}\le Y_i)\,\lambda_{i'}\exp\left(\sum_{l=1}^{d_n}\alpha_l A_l^w(Y_{i'})x+\gamma^{\mathrm{T}}Z_i\right)\right\}.
\]
We aim to maximize expression (4.3) under the constraints (4.2).
It is difficult to maximize expression (4.3) directly because of the intractable form of the second term.
Following Tao et al. (2017), we solve this maximization problem by artificially creating a latent variable $U$ for subjects with $R=0$ such that $U$ takes values in $\{1/s_n,2/s_n,\ldots,1\}$ and satisfies the equations
\[
P(U=j/s_n\mid Z)=B_j^q(Z),
\]
\[
P(X=x_k\mid Z,U=j/s_n)=P(X=x_k\mid U=j/s_n)=p_{kj},
\]
\[
P(Y,\Delta\mid X,Z,U)=P(Y,\Delta\mid X,Z).
\]
This step is essential because it enables us to interpret $\sum_{j=1}^{s_n}B_j^q(Z)\,p_{kj}$ as $P(X=x_k\mid Z)$ for subjects with $R=0$. Hence, the second term in expression (4.3) is equivalent to the log-likelihood of $(Y_i,\Delta_i,Z_i)$, assuming that the complete data consist of $(Y_i,\Delta_i,X_i,Z_i,U_i)$ but with $X_i$ and $U_i$ missing.
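As a quick numerical sanity check of this interpretation, the sketch below (with made-up Dirichlet draws standing in for $B_j^q(Z)$ and $p_{kj}$) verifies that mixing the conditional probabilities over the latent $U$ yields a valid probability function for $X$:

```python
import numpy as np

rng = np.random.default_rng(5)
m, s_n = 3, 4
b = rng.dirichlet(np.ones(s_n))              # stands in for B_j(Z): nonnegative, sums to 1
p = rng.dirichlet(np.ones(m), size=s_n).T    # p[:, j] is a pmf over x_1, ..., x_m

# P(X = x_k | Z) = sum_j P(U = j/s_n | Z) P(X = x_k | U = j/s_n) = sum_j b_j p_kj
px = p @ b
```

Because each column of $p$ sums to one over $k$ and the $b_j$ sum to one over $j$, the mixture `px` is itself a probability function, which is exactly what lets the second term of (4.3) be read as a marginal log-likelihood.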
The maximization of expression (4.3) is carried out through an EM algorithm, where $(X,U)$ for subjects with $R=0$ are treated as missing data. The complete-data log-likelihood is
\[
\begin{aligned}
&\sum_{i=1}^n R_i\left\{\log P(Y_i,\Delta_i\mid X_i,Z_i)+\sum_{k=1}^m\sum_{j=1}^{s_n}I(X_i=x_k)B_j^q(Z_i)\log p_{kj}\right\}\\
&\quad+\sum_{i=1}^n(1-R_i)\left\{\log P(Y_i,\Delta_i\mid X_i,Z_i)+\log P(X_i\mid U_i)+\log P(U_i\mid Z_i)\right\}\\
&=\sum_{i=1}^n R_i\left\{\log P(Y_i,\Delta_i\mid X_i,Z_i)+\sum_{k=1}^m\sum_{j=1}^{s_n}I(X_i=x_k)B_j^q(Z_i)\log p_{kj}\right\}\\
&\quad+\sum_{i=1}^n(1-R_i)\left\{\sum_{k=1}^m I(X_i=x_k)\log P(Y_i,\Delta_i\mid x_k,Z_i)
+\sum_{k=1}^m\sum_{j=1}^{s_n}I(X_i=x_k,U_i=j/s_n)\log p_{kj}
+\sum_{j=1}^{s_n}I(U_i=j/s_n)\log B_j^q(Z_i)\right\}.
\end{aligned}
\]
Let $\theta=(\alpha_1,\ldots,\alpha_{d_n},\gamma^{\mathrm{T}})^{\mathrm{T}}$. We start with the following initial values: $\widehat\theta^{(0)}=0$, $\widehat\lambda_i^{(0)}=\Delta_i/\sum_{i'=1}^n\Delta_{i'}$, and
\[
\widehat p_{kj}^{(0)}=\left\{\sum_{i=1}^n R_i I(X_i=x_k)B_j^q(Z_i)\right\}\Big/\left\{\sum_{i=1}^n R_i B_j^q(Z_i)\right\}.
\]
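For illustration, the initial values $\widehat p^{(0)}_{kj}$ can be computed as below. All arrays are simulated stand-ins (e.g., Dirichlet draws playing the role of the B-spline basis values $B_j^q(Z_i)$), not output of an actual two-phase design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s_n = 200, 3, 4
R = rng.integers(0, 2, size=n)               # Phase II selection indicators
X = rng.integers(0, m, size=n)               # X_i coded as an index into x_1..x_m
Bq = rng.dirichlet(np.ones(s_n), size=n)     # Bq[i, j] stands in for B_j(Z_i)

# p0[k, j] = sum_i R_i I(X_i = x_k) B_j(Z_i) / sum_i R_i B_j(Z_i):
# among Phase II subjects, weight the indicator I(X_i = x_k) by the basis value.
num = np.array([((R == 1) & (X == k)).astype(float) @ Bq for k in range(m)])
p0 = num / ((R == 1).astype(float) @ Bq)
```

Since the numerators sum over $k$ to the denominator, each column of `p0` is automatically a probability function over $x_1,\ldots,x_m$, so the starting values satisfy the constraints (4.2).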
In the E-step of the $(t+1)$th iteration, we calculate the conditional expectations of $I(X_i=x_k,U_i=j/s_n)$ and $I(X_i=x_k)$ given $(Y_i,\Delta_i,Z_i)$ and $x_1,\ldots,x_m$, evaluated at $\widehat\theta^{(t)},\widehat\lambda_1^{(t)},\ldots,\widehat\lambda_n^{(t)},\widehat p_{11}^{(t)},\ldots,\widehat p_{ms_n}^{(t)}$, denoted as $\widehat\psi_{kji}^{(t+1)}$ and $\widehat q_{ik}^{(t+1)}$, respectively. That is,
\[
\widehat\psi_{kji}^{(t+1)}=
\begin{cases}
I(X_i=x_k)B_j^q(Z_i), & R_i=1,\\[6pt]
\dfrac{P(Y_i,\Delta_i\mid x_k,Z_i)\,B_j^q(Z_i)\,\widehat p_{kj}^{(t)}}
{\sum_{k'=1}^m P(Y_i,\Delta_i\mid x_{k'},Z_i)\sum_{j'=1}^{s_n}B_{j'}^q(Z_i)\,\widehat p_{k'j'}^{(t)}}, & R_i=0,
\end{cases}
\]
\[
\widehat q_{ik}^{(t+1)}=
\begin{cases}
I(X_i=x_k), & R_i=1,\\[6pt]
\dfrac{P(Y_i,\Delta_i\mid x_k,Z_i)\sum_{j'=1}^{s_n}B_{j'}^q(Z_i)\,\widehat p_{kj'}^{(t)}}
{\sum_{k'=1}^m P(Y_i,\Delta_i\mid x_{k'},Z_i)\sum_{j'=1}^{s_n}B_{j'}^q(Z_i)\,\widehat p_{k'j'}^{(t)}}, & R_i=0.
\end{cases}
\]
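The E-step weights for a single subject with $R_i=0$ can be sketched as follows, with made-up values standing in for the conditional densities $P(Y_i,\Delta_i\mid x_k,Z_i)$, the basis values $B_j^q(Z_i)$, and the current $\widehat p^{(t)}_{kj}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, s_n = 3, 4
lik = rng.uniform(0.1, 1.0, size=m)          # stands in for P(Y_i, Delta_i | x_k, Z_i)
b = rng.dirichlet(np.ones(s_n))              # stands in for B_j(Z_i)
p = rng.dirichlet(np.ones(m), size=s_n).T    # current p_kj; columns are pmfs over k

# Joint posterior weight of (X_i = x_k, U_i = j/s_n) for a subject with R_i = 0:
# numerator lik_k * b_j * p_kj, normalized over all (k, j).
num = lik[:, None] * b[None, :] * p
psi = num / num.sum()                        # psi[k, j]
q = psi.sum(axis=1)                          # marginal posterior weight of X_i = x_k
```

The weights $\widehat\psi_{kji}$ form a joint posterior distribution over $(x_k, j/s_n)$, and summing out $j$ recovers $\widehat q_{ik}$, matching the relation between the two displayed formulas.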
In the M-step of the $(t+1)$th iteration, we update $\widehat\theta^{(t+1)}$ and $\widehat\lambda_i^{(t+1)}$ $(i=1,\ldots,n)$ by maximizing
\[
\sum_{i=1}^n\sum_{k=1}^m \widehat q_{ik}^{(t+1)}\log P(Y_i,\Delta_i\mid x_k,Z_i), \tag{4.4}
\]
which is a weighted likelihood for the Cox model with time-varying effects. Let
\[
\theta_{ik}(t)=\bigl(A_1^w(t)x_k,\ldots,A_{d_n}^w(t)x_k,Z_i^{\mathrm{T}}\bigr)^{\mathrm{T}}
\]
and
\[
G_{ik}(t)=\exp\left(\sum_{l=1}^{d_n}\alpha_l A_l^w(t)x_k+\gamma^{\mathrm{T}}Z_i\right).
\]
We update $\widehat\theta^{(t+1)}$ by applying a one-step Newton–Raphson update to the score equation $S(\theta)=0$, where
\[
S(\theta)=\sum_{i=1}^n\sum_{k=1}^m \widehat q_{ik}^{(t+1)}\Delta_i
\left\{\theta_{ik}(Y_i)-\frac{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)\,\theta_{i'k'}(Y_i)}
{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)}\right\}.
\]
That is, we update $\widehat\theta^{(t+1)}$ by
\[
\widehat\theta^{(t+1)}=\widehat\theta^{(t)}+\Omega\bigl(\widehat\theta^{(t)}\bigr)^{-1}S\bigl(\widehat\theta^{(t)}\bigr),
\]
where
\[
\Omega(\theta)=\sum_{i=1}^n\sum_{k=1}^m \widehat q_{ik}^{(t+1)}\Delta_i
\left[\frac{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)\,\theta_{i'k'}(Y_i)^{\otimes 2}}{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)}
-\frac{\left\{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)\,\theta_{i'k'}(Y_i)\right\}^{\otimes 2}}{\left\{\sum_{i'=1}^n\sum_{k'=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k'}^{(t+1)}G_{i'k'}(Y_i)\right\}^{2}}\right]
\]
and $a^{\otimes 2}=aa^{\mathrm{T}}$. We update $\widehat\lambda_i^{(t+1)}$ by using the Breslow estimator such that
\[
\widehat\lambda_i^{(t+1)}=\frac{\Delta_i}{\sum_{i'=1}^n\sum_{k=1}^m I(Y_{i'}\ge Y_i)\,\widehat q_{i'k}^{(t+1)}G_{i'k}(Y_i)}.
\]
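A sketch of this jump-size update on simulated inputs is given below. For simplicity the factors $G_{i'k}$ are frozen in time here, whereas the actual update evaluates them at each $Y_i$ through the B-splines; all arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
Y = rng.exponential(size=n)                     # observed times
Delta = rng.integers(0, 2, size=n)              # event indicators
q = rng.dirichlet(np.ones(m), size=n)           # posterior weights q[i, k]
G = np.exp(rng.normal(scale=0.3, size=(n, m)))  # hazard factors, frozen in time

# denom[i] = sum over subjects i' at risk at Y_i of sum_k q[i', k] G[i', k]
w = (q * G).sum(axis=1)
at_risk = Y[None, :] >= Y[:, None]              # at_risk[i, i'] = I(Y_i' >= Y_i)
lam = Delta / (at_risk @ w)                     # jump sizes: positive iff Delta_i = 1
```

Because each subject belongs to its own risk set, the denominator is always positive, so the jump is well defined and vanishes exactly for the censored observations.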
We observe that $\widehat\lambda_i^{(t+1)}>0$ when $\Delta_i=1$ and $\widehat\lambda_i^{(t+1)}=0$ when $\Delta_i=0$. We update $\widehat p_{kj}^{(t+1)}$ $(k=1,\ldots,m;\ j=1,\ldots,s_n)$ by maximizing
\[
\sum_{i=1}^n\sum_{k=1}^m\sum_{j=1}^{s_n}\left\{R_i\,\widehat q_{ik}^{(t+1)}B_j^q(Z_i)+(1-R_i)\,\widehat\psi_{kji}^{(t+1)}\right\}\log p_{kj},
\]
such that
\[
\widehat p_{kj}^{(t+1)}=\frac{\sum_{i=1}^n\left\{R_i\,\widehat q_{ik}^{(t+1)}B_j^q(Z_i)+(1-R_i)\,\widehat\psi_{kji}^{(t+1)}\right\}}{\sum_{k'=1}^m\sum_{i=1}^n\left\{R_i\,\widehat q_{ik'}^{(t+1)}B_j^q(Z_i)+(1-R_i)\,\widehat\psi_{k'ji}^{(t+1)}\right\}}.
\]
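This closed-form update can be sketched as follows on simulated stand-in arrays (posterior weights and basis values drawn from Dirichlet distributions purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, s_n = 100, 3, 4
R = rng.integers(0, 2, size=n)                  # Phase II selection indicators
q = rng.dirichlet(np.ones(m), size=n)           # q[i, k]
Bq = rng.dirichlet(np.ones(s_n), size=n)        # stands in for B_j(Z_i)
psi = rng.dirichlet(np.ones(m * s_n), size=n).reshape(n, m, s_n)  # psi[i, k, j]

# Numerator: R = 1 rows contribute q_ik B_j(Z_i); R = 0 rows contribute psi_kji.
num = (R[:, None] * q).T @ Bq + ((1 - R)[:, None, None] * psi).sum(axis=0)
p_new = num / num.sum(axis=0, keepdims=True)    # normalize over k for each j
```

Normalizing the numerator over $k$ is exactly what the displayed denominator does, so each column of `p_new` is a probability function and the constraints (4.2) hold by construction.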
We observe that $\widehat p_{kj}^{(t+1)}$ satisfies the two constraints in expression (4.2).
We iterate between the E-step and M-step until
\[
\left\|\widehat\theta^{(t+1)}-\widehat\theta^{(t)}\right\|_1
+\sum_{i=1}^n\left|\widehat\lambda_i^{(t+1)}-\widehat\lambda_i^{(t)}\right|
+\sum_{k=1}^m\sum_{j=1}^{s_n}\left|\widehat p_{kj}^{(t+1)}-\widehat p_{kj}^{(t)}\right|<10^{-4}
\]
to obtain the sieve maximum likelihood estimator (SMLE) $\widehat\theta$, $\widehat\lambda_i$ $(i=1,\ldots,n)$, and $\widehat p_{kj}$ $(k=1,\ldots,m;\ j=1,\ldots,s_n)$.
To obtain the variance estimate of $\widehat\theta$, we use the profile likelihood method proposed by Murphy and van der Vaart (2000). By verifying the smoothness conditions of Theorem 1 in Murphy and van der Vaart (2000), it can be shown that the negative inverse of the Hessian matrix of the profile log-likelihood function $pl(\theta)=\max_{\{\lambda_i\},\{p_{kj}\}}l_n(\theta,\{\lambda_i\},\{p_{kj}\})$ is a consistent estimator of the limiting covariance matrix of $n^{1/2}(\widehat\theta-\theta)$. In practice, we obtain the value of $pl(\theta)$ by holding $\theta$ fixed in the EM algorithm and taking the value of $l_n(\theta,\{\lambda_i\},\{p_{kj}\})$ at convergence. We estimate the covariance matrix of $\widehat\theta$ by the negative inverse of the matrix whose $(k,l)$th element is
\[
h_n^{-2}\left\{pl(\widehat\theta+e_k h_n+e_l h_n)-pl(\widehat\theta+e_k h_n)-pl(\widehat\theta+e_l h_n)+pl(\widehat\theta)\right\},
\]
where $e_k$ is the $k$th canonical vector and $h_n$ is a constant of the order $n^{-1/2}$.
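This second-difference recipe can be sketched numerically. The quadratic stand-in for $pl(\cdot)$ below is hypothetical, chosen because its Hessian is known exactly (for a quadratic, the second difference is exact up to floating-point error):

```python
import numpy as np

def neg_inv_hessian(pl, theta_hat, h):
    """Second-difference approximation to the Hessian of pl at theta_hat,
    h^{-2} {pl(t + e_k h + e_l h) - pl(t + e_k h) - pl(t + e_l h) + pl(t)},
    returned as its negative inverse (the covariance estimate)."""
    d = len(theta_hat)
    E = np.eye(d)
    H = np.empty((d, d))
    for k in range(d):
        for l in range(d):
            H[k, l] = (pl(theta_hat + h * (E[k] + E[l]))
                       - pl(theta_hat + h * E[k])
                       - pl(theta_hat + h * E[l])
                       + pl(theta_hat)) / h ** 2
    return np.linalg.inv(-H)

# Quadratic stand-in for the profile log-likelihood: pl(t) = -0.5 t'At,
# for which the true negative inverse Hessian is inv(A).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
cov = neg_inv_hessian(lambda t: -0.5 * t @ A @ t, np.zeros(2), h=1e-4)
```

In the actual procedure, each `pl` evaluation is one run of the EM algorithm with $\theta$ held fixed, so the matrix costs $O(d^2)$ such runs with $d=d_n+\dim(\gamma)$.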