Logistic Regression Method - Incorporating Covariate Dependence

6.3 Incorporating Covariate Dependence

6.3.1 Logistic Regression Method

In this section we consider the model for incorporating the covariates that was proposed by Balasubramanian and Lagakos (2010). Some of the standard notations and expressions are similar to those used by Balasubramanian and Lagakos (2010).

We note that estimation of HIV incidence involves a two-stage algorithm. In stage one, a cross-sectional sample of size n is taken from a population, and each person is tested using an initial testing algorithm, usually based on a sensitive test (typically ELISA, denoted by E). Individuals testing negative (E⁻) are assumed to be uninfected at that time. In the second stage, individuals testing positive on the sensitive test (E⁺) are tested again using a less-sensitive test (usually the BED capture enzyme immunoassay, often referred to as the BED assay, B). Subjects who are positive on the sensitive test but negative on the less-sensitive test (E⁺B⁻) are considered to be recent HIV infections; otherwise (E⁺B⁺) they are categorized as non-recent or long-standing infections, as also described in Janssen et al. (1998).

Now suppose T ≥ 0 denotes the calendar time of HIV infection for someone born at time t₀. Furthermore, let f(t|t₀), F(t|t₀) and λ(t|t₀) denote, respectively, the density function (incidence density), cumulative distribution function and hazard function for becoming infected at time t for someone born at time t₀. Let L₂ denote the sojourn or residence time in the recent infection state, with corresponding cumulative distribution function G(·). We assume that L₂ has support in [0, L₂*], where L₂* < t, and that L₂ is independent of T. For simplicity, and to convey the idea, we assume that the incidence density function f(u) is constant over time, that is,

$$f(u) = f \quad \text{for } u \in [t - L_2^{*},\, t].$$

Generally, the hazard or incidence rate is given by

$$\lambda(t \mid t_0) = \frac{f(t \mid t_0)}{1 - F(t \mid t_0)}.$$

The prevalence probabilities for the 3-state model at time t are

$$\pi_1(t) = 1 - \theta, \qquad \pi_2(t) = f\mu, \qquad \pi_3(t) = \theta - f\mu,$$

where θ = F(t|t₀) and µ is the mean of L₂ (the mean window period). We are interested in a model that associates X, a vector of covariates, with the HIV incidence rate. To investigate this association, one can use the proportional hazards model

$$\lambda(t \mid X) = \lambda(t_0)\, e^{\beta X}. \tag{6.3.1}$$

Note that when X = 0 (implying there is no covariate structure on incidence),

$$\lambda(t \mid X) = \lambda(t_0). \tag{6.3.2}$$

When event times are observed, standard techniques can be used to estimate the unknown parameter vector β in Eq (6.3.1). However, because the event times are unavailable here, Balasubramanian and Lagakos (2010) proposed an alternative approach in which the regression coefficients, β, in Eq (6.3.1) are estimated by fitting a logistic regression model. The logistic regression model assumes that the logarithm of the odds of the outcome is a linear function of the predictors.

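Consider the odds of the outcome E⁺B⁻, given that the outcome is either E⁺B⁻ or E⁻B⁻ and given the vector of risk factors X. Because the conditional probability P[E⁺B⁻ | E⁺B⁻ or E⁻B⁻, X] equals π₂/(π₁ + π₂), these odds are

$$\frac{P[E^{+}B^{-} \mid E^{+}B^{-} \text{ or } E^{-}B^{-},\, X]}{1 - P[E^{+}B^{-} \mid E^{+}B^{-} \text{ or } E^{-}B^{-},\, X]} = \frac{\pi_2(t \mid t_0, X)}{\pi_1(t \mid t_0, X)} = \frac{f e^{\beta X} \mu}{1 - \theta} = \lambda(t_0)\, e^{\beta X} \mu,$$

where λ(t₀) = f/(1 − θ).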

Taking the logarithm of the odds, we get

$$\log\left[\frac{\pi_2(t \mid t_0, X)}{\pi_1(t \mid t_0, X)}\right] = \log\left[\lambda(t_0)\, e^{\beta X} \mu\right] = \log\left[\lambda(t_0)\mu\right] + \beta X = \alpha^{*} + \beta X,$$

where α* = log[λ(t₀)µ]. Hence one can estimate the regression coefficients in Eq (6.3.1) by fitting a standard logistic regression model, as proposed by Balasubramanian and Lagakos (2010). To fit a logistic regression model we must have a binary outcome. Following Balasubramanian and Lagakos (2010), we discard the n₁ individuals who are “non-recent” (state 3), and regard the n_r subjects in state 2 who are classified as “recent” as successes and the n₀ subjects in state 1 as failures. This is motivated by the way in which we defined the odds function, which uses information from state 1 and state 2 only. In effect this is a conditional logistic regression model: a sample enters the estimation set only if it is either recent or uninfected (not yet seroconverted), and non-recent samples are excluded.
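As a rough illustration of this construction, the sketch below (in Python, not the SAS code used for the analysis) discards the non-recent count and treats recent infections as successes and uninfected subjects as failures. For a single binary covariate the maximum likelihood logistic regression estimate of the odds ratio is the cross-product ratio of the resulting 2×2 table, so no iterative fitting is needed. The function name and the Wald-type interval are illustrative assumptions; the counts are the (n₀, n_r, n₁) triples for sex reported in Table 6.1.

```python
import math

def odds_ratio_recent_vs_uninfected(ref, grp, z=1.96):
    """Odds ratio of 'recent' vs 'uninfected' for grp relative to ref.

    ref and grp are (n0, nr, n1) = (uninfected, recent, non-recent) counts.
    The non-recent count n1 is discarded, as described in the text.
    Returns (OR, lower, upper) with a Wald interval on the log scale
    (the interval type is an assumption made for this sketch).
    """
    n0_ref, nr_ref, _ = ref
    n0_grp, nr_grp, _ = grp
    log_or = math.log((nr_grp / n0_grp) / (nr_ref / n0_ref))
    se = math.sqrt(1 / nr_grp + 1 / n0_grp + 1 / nr_ref + 1 / n0_ref)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

male = (5603, 55, 858)      # reference group, counts from Table 6.1
female = (6220, 94, 1461)

or_, lo, hi = odds_ratio_recent_vs_uninfected(male, female)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}, {hi:.2f})")
# prints: OR = 1.54 (95% CI 1.10, 2.15)
```

With several covariates the closed form no longer applies and the model would be fitted iteratively with standard logistic regression software.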

6.3.2 Incorporating Sensitivity and Specificity in the Logistic Regression

Assuming ELISA has perfect sensitivity and specificity, BED assay misclassification can arise when either recent infections are categorized as non-recent or non-recent infections are categorized as recent. Such misclassification will lead to biased estimates of the odds ratios and their standard errors when standard logistic regression is used to model the relationship between the HIV incidence rate and its risk factors, as noted by Magder and Hughes (1997).

In the current paper, we define sensitivity as the probability that the BED assay is positive (test is positive) given that the subject is in state 3 (in this case we regard being diseased as being a long-standing infection), while specificity is defined as the probability that the BED assay is negative (test is negative) given that the subject is in state 1 or 2 (in this case we regard being disease free as being a recent infection or uninfected). In addition, Gabaitiri and Mwambi (2012) have shown that if specificity is equal to one then sensitivity is similar to the proportion of assay progressors as in McWalter and Welte (2010) and Wang and Lagakos (2009). Thus the model proposed by Magder and Hughes (1997), which adjusts logistic regression with uncertain outcomes for sensitivity and specificity, can be used to estimate unbiased odds ratios and their standard errors in these settings.

Suppose Y_i denotes the outcome of interest for the i-th subject such that

$$Y_i = \begin{cases} 1 & \text{if the } i\text{-th subject is truly diseased,} \\ 0 & \text{if the } i\text{-th subject is truly nondiseased.} \end{cases}$$

Let ω_i be the classification indicator such that

$$\omega_i = \begin{cases} 1 & \text{if the } i\text{-th subject is classified as having the outcome,} \\ 0 & \text{otherwise.} \end{cases}$$

Suppose that Ŷ_i is the probability that the i-th subject truly has the disease condition, given ω_i and a vector of covariates X_i. It follows that if the i-th subject is classified as having the outcome (ω_i = 1),

$$\hat{Y}_i = \frac{P(Y_i = 1 \mid X_i, \beta) \cdot \text{sensitivity}}{P(Y_i = 1 \mid X_i, \beta) \cdot \text{sensitivity} + P(Y_i = 0 \mid X_i, \beta) \cdot (1 - \text{specificity})} \tag{6.3.3}$$

and if ω_i = 0,

$$\hat{Y}_i = \frac{P(Y_i = 1 \mid X_i, \beta) \cdot (1 - \text{sensitivity})}{P(Y_i = 1 \mid X_i, \beta) \cdot (1 - \text{sensitivity}) + P(Y_i = 0 \mid X_i, \beta) \cdot \text{specificity}} \tag{6.3.4}$$

where β is a k×1 vector of regression coefficients to be estimated. Once the linear predictor is specified, inverting the logit link function gives

$$P(Y_i = 1 \mid X_i, \beta) = \frac{\exp\left(\sum_{j=1}^{k} \beta_j X_{ij}\right)}{1 + \exp\left(\sum_{j=1}^{k} \beta_j X_{ij}\right)}. \tag{6.3.5}$$

The regression coefficients, β, are estimated using the EM algorithm when the sensitivity and specificity of the outcome measurement are incorporated. In the E step, the process starts by first setting β to an arbitrary value and computing Ŷ_i, the probability that Y_i = 1, for each subject.

For the maximization step, the data are duplicated so that each observation is included twice, once with the outcome variable set to one (diseased) and once with the outcome set to zero (non-diseased). A weighted logistic regression model is fitted with weights equal to Ŷ_i for the copy with outcome one and (1 − Ŷ_i) for the copy with outcome zero. The new parameter estimates obtained from the fitted weighted logistic model are used to recalculate Ŷ_i, and the process is repeated until the parameter estimates stop changing, that is, until convergence. The model was fitted using the Statistical Analysis System (SAS); the SAS macros were provided by Prof Laurence S. Magder, University of Maryland, United States of America, and we modified them where necessary.
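The E and M steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS macro used for the analysis: the model is restricted to an intercept and one covariate, the data are grouped into cells (x, ω, count), the weighted logistic regression is fitted by Newton-Raphson, and sensitivity and specificity are taken in the Magder and Hughes sense, P(ω = 1 | Y = 1) and P(ω = 0 | Y = 0), with "classified as recent" coded as ω = 1. All function names are hypothetical. The example cells use the sex counts from Table 6.1 after discarding non-recent subjects.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logit(rows, start=(0.0, 0.0), iters=100):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x to rows (x, y, weight)."""
    b0, b1 = start
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = expit(b0 + b1 * x)
            g0 += w * (y - p)            # score for the intercept
            g1 += w * (y - p) * x        # score for the slope
            v = w * p * (1.0 - p)        # information contributions
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

def em_misclassified_logit(cells, sens, spec, tol=1e-10, max_iter=500):
    """EM for logistic regression with a misclassified binary outcome.

    cells: list of (x, omega, count), omega being the observed classification.
    """
    b0, b1 = 0.0, 0.0                    # arbitrary starting value
    for _ in range(max_iter):
        rows = []
        for x, omega, n in cells:
            p = expit(b0 + b1 * x)
            # E step: probability that the subject truly has the outcome
            if omega == 1:
                yhat = p * sens / (p * sens + (1 - p) * (1 - spec))
            else:
                yhat = p * (1 - sens) / (p * (1 - sens) + (1 - p) * spec)
            # M step data: each cell enters twice, weighted by yhat and 1 - yhat
            rows.append((x, 1, n * yhat))
            rows.append((x, 0, n * (1 - yhat)))
        new_b0, new_b1 = fit_weighted_logit(rows, (b0, b1))
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# x = 0 for males, x = 1 for females; omega = 1 if classified recent
cells = [(0, 1, 55), (0, 0, 5603), (1, 1, 94), (1, 0, 6220)]

_, b1_perfect = em_misclassified_logit(cells, sens=1.0, spec=1.0)
_, b1_adj = em_misclassified_logit(cells, sens=0.99, spec=0.999)
print("OR, perfect test:   ", round(math.exp(b1_perfect), 2))
print("OR, sens/spec < 1:  ", round(math.exp(b1_adj), 2))
```

With sensitivity and specificity both equal to one, the weights collapse to 0 or 1 and the procedure reduces to ordinary logistic regression, which is a convenient check on any implementation.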

6.4 Data Analysis

6.4.1 Introduction

The Botswana AIDS Impact Survey (BAIS) III of 2008 is the third national population-level sexual behavioral survey. Beyond the traditional aims of assessing knowledge, attitude and behavior regarding HIV and AIDS, it included estimation of prevalence and incidence. BAIS III was a cross-sectional study. The objective here is to analyse the BAIS III data set in order to identify risk factors for HIV incidence in Botswana. However, owing to a large proportion of missing data, we investigated only two risk factors, namely sex and age, for which there was enough information.

In the analysis, we first consider the incidence rate ratio as a comparative measure. We also use the logistic regression model as proposed by Balasubramanian and Lagakos (2010) and extend the idea of Magder and Hughes (1997) to incorporate the uncertainty of the outcome of interest. See Gabaitiri et al. (2013) for a more detailed description of the data.

6.4.2 The Results

Table 6.1 shows the distribution of the sampled individuals into the 3-state disease model and the estimated incidence risk ratios (RR) with their corresponding 95% confidence intervals.

Table 6.1: Summary of BAIS III data stratified by age and sex, with the estimated incidence risk ratio (RR) and the corresponding 95% confidence intervals

                 Sample                        Estimates
Demographics     n       (n₀, n_r, n₁)         RR (95% CI)
Sex
  Male           6516    (5603, 55, 858)       1
  Female         7775    (6220, 94, 1461)      1.54 (0.99, 2.39)
Age (y)
  ≤19            5703    (5491, 28, 184)       1
  20-29          2967    (2401, 43, 523)       3.18 (1.82, 5.54)
  30-39          2142    (1287, 33, 822)       3.47 (1.74, 6.93)
  40+            3479    (2644, 45, 790)       2.72 (1.53, 4.84)

Table 6.2 shows the unadjusted logistic regression estimates of association between HIV incidence and the demographic variables (age and sex), together with the 95% confidence intervals for the estimated odds ratios.

Table 6.2: Unadjusted logistic regression estimates of association between HIV incidence and demographic variables (age and sex): estimated odds ratios (OR) and their corresponding 95% confidence intervals at different combinations of sensitivity (sen) and specificity (spec)

                 OR (95% CI)
Demographics     (sen=0.99, spec=0.999)    (sen=0.985, spec=1)    (sen=1, spec=1)
Sex
  Male           1                         1                      1
  Female         1.60 (1.11, 2.31)         1.54 (1.10, 2.15)      1.54 (1.10, 2.15)
Age (y)
  ≤19            1                         1                      1
  20-29          4.13 (2.35, 7.24)         3.51 (2.18, 5.67)      3.51 (2.18, 5.67)
  30-39          6.01 (3.35, 10.80)        5.03 (3.03, 8.35)      5.03 (3.03, 8.35)
  40+            3.91 (2.24, 6.83)         3.34 (2.08, 5.36)      3.34 (2.08, 5.36)

Table 6.3 presents the adjusted logistic regression estimates of association between HIV incidence and the demographic variables (age and sex), together with the 95% confidence intervals for the estimated odds ratios.

Table 6.3: Adjusted logistic regression estimates of association between HIV incidence and demographic variables (age and sex): estimated odds ratios (OR) and their corresponding 95% confidence intervals

                 OR (95% CI)
Demographics     (sen=0.99, spec=0.999)    (sen=0.99, spec=1)     (sen=1, spec=1)
Sex
  Male           1                         1                      1
  Female         1.57 (1.08, 2.28)         1.49 (1.06, 2.08)      1.49 (1.06, 2.08)
Age (y)
  ≤19            1                         1                      1
  20-29          4.16 (2.36, 7.34)         3.49 (2.16, 5.63)      3.49 (2.16, 5.63)
  30-39          6.09 (3.37, 11.00)        5.02 (3.02, 8.34)      5.02 (3.02, 8.34)
  40+            3.81 (2.17, 6.69)         3.22 (2.00, 5.18)      3.22 (2.00, 5.18)

Overall, the results in Table 6.1 show that females have a higher estimated risk of HIV infection than males (RR = 1.54; 95% CI: 0.99, 2.39), although the confidence interval marginally includes 1. The risk of infection is also higher for those aged 20 years and above than for those aged ≤ 19 years. For example, comparing those aged 20-29 years with those aged ≤ 19 years gives RR = 3.18 (95% CI: 1.82, 5.54). These findings are supported by the results from the logistic regression model (Table 6.2), and we observe similar findings in the adjusted logistic regression model (Table 6.3). Furthermore, when we take into account the uncertainty of the outcome of interest, as determined by the diagnostic test, the estimated odds ratios and their standard errors increase (as revealed by wider confidence intervals) in settings where we adjust for imperfect sensitivity and specificity simultaneously. But when we adjust for sensitivity only, assuming perfect specificity, consistent with the model proposed by McWalter and Welte (2010) and Wang and Lagakos (2009), there is almost no change in the parameter estimates and their standard errors compared to perfect sensitivity and specificity.
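One way to see why imperfect specificity matters more than imperfect sensitivity here is a back-of-envelope correction for a single binary covariate. If p is the true probability of the outcome and the outcome is rare, the observed probability is sens·p + (1 − spec)·(1 − p), so even a small false-positive rate is comparable in size to p itself, whereas sensitivity only rescales p. The Python sketch below inverts this relation for the sex counts of Table 6.1 (non-recent subjects discarded). It assumes sensitivity and specificity are P(classified recent | truly recent) and P(classified not recent | truly not recent), and the function name is hypothetical; it is not the fitting procedure used for the tables, but it gives values close to the sex row of Table 6.2.

```python
def corrected_odds_ratio(ref, grp, sens, spec):
    """Odds ratio after correcting observed proportions for misclassification.

    ref and grp are (n0, nr) = (classified uninfected, classified recent).
    Inverts p_obs = sens * p + (1 - spec) * (1 - p) within each group.
    """
    def true_odds(n0, nr):
        p_obs = nr / (n0 + nr)
        p = (p_obs + spec - 1.0) / (sens + spec - 1.0)
        return p / (1.0 - p)
    return true_odds(*grp) / true_odds(*ref)

male = (5603, 55)       # reference group, counts from Table 6.1
female = (6220, 94)

for sens, spec in [(0.99, 0.999), (0.985, 1.0), (1.0, 1.0)]:
    or_ = corrected_odds_ratio(male, female, sens, spec)
    print(f"sen={sens}, spec={spec}: OR = {or_:.2f}")
```

Lowering sensitivity alone leaves the odds ratio essentially unchanged, while a specificity of 0.999 moves it from about 1.54 to about 1.60.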

6.5 Discussion

We looked at methods for incorporating the effect of covariates. In particular, we focussed on the logistic regression method proposed by Balasubramanian and Lagakos (2010). We extended the method of Magder and Hughes (1997) to incorporate the uncertainty of determining the outcome of interest, which in this case is HIV incidence. We also considered a comparative measure, namely the incidence rate ratio, for settings where we are interested in comparing the incidence rate between two groups. However, this method requires stratification by covariates, which can be inefficient because cell frequencies shrink when there are many covariates with many levels of stratification. In general, our analysis suggests that risk reduction programs and other intervention measures should be structured according to predictive risk factors, for example, according to sex and age.

We note that although data on other variables were available in the BAIS III data set, we used only sex and age as risk factors in our analysis, simply because each of the other variables on its own had more than 25% of the information missing, making it impossible to include them in the analysis. The situation was worse when we tried to investigate them in combination with others. Methods such as multiple imputation are usually recommended for settings where less than 20% of the data are missing (Little and Rubin, 2002). Furthermore, missing data was not the focus of this research. More research is needed in this area, particularly on incorporating information on the imperfect sensitivity and specificity of the diagnostic test used to measure the outcome of interest in the presence of missing data.

General Conclusion

Accurate estimates of HIV incidence are crucial for planning and assessing the impact of interventions. The use of biomarkers offers a lot of promise, and it circumvents some of the problems associated with estimating HIV incidence from longitudinal cohort studies. However, there are often methodological challenges concerning how reliable the estimated incidence is. Most of the existing and newly developed statistical methods have their advantages and disadvantages.

The major objective of this thesis is to develop statistical methods that can be used to estimate the HIV incidence rate. However, most of the methods developed in this thesis can be used to estimate the incidence of other diseases, provided data similar to those used here are available. The methods developed take advantage of the statistical framework developed by Balasubramanian and Lagakos (2010) to derive likelihood estimators for the incidence rate.

We proposed a method to improve incidence estimation, as well as the precision of the estimated incidence rate, by incorporating information on the immediate past prevalence. The advantage of this method is that it uses the available data on prevalence, in particular the immediate past prevalence, to improve the incidence estimate and its precision. However, the main disadvantage of the method is that there is a need to account for the uncertainty in the past prevalence in addition to the uncertainties in the mean window period and the proportion of assay progressors.

We further proposed a method for estimating incidence by relaxing the assumption of constant incidence density. In particular, we assumed a linear incidence density. The advantage of this method is that it can be used in settings where changes in HIV incidence occur over a long period of time and the period between two successive surveys, as proposed by Gabaitiri et al. (2013), is large. The disadvantage of the method is that the assumed linear incidence density function may not be suitable for the data; hence further research that investigates other forms of incidence density is needed. The problem, though, will be identifiability of parameters, which further research can explore.

We also proposed a method for adjusting the estimates of incidence when a proportion of subjects tested using an antibody sensitive test were not tested using a less sensitive test, resulting in missing data. In particular, we considered how adjustments can be made in the log-likelihood when the missing data are assumed to be missing at random. This is a flexible approach to addressing the problem of missing data; however, problems arise when this assumption does not hold and the more complicated assumption of missing not at random is the appropriate one.

A method that simultaneously makes adjustments for sensitivity and specificity was also considered in this thesis. The advantage of this method, compared to the one described by Wang and Lagakos (2009) and McWalter and Welte (2010), is that we are able to simultaneously adjust for the false recent rate (1 − sensitivity) and the false non-recent rate (1 − specificity). But unlike the method of Wang and Lagakos (2009) and McWalter and Welte (2010), where one has to account for uncertainties in the mean window period and the proportion of assay progressors, this method adds an additional parameter, thus bringing in additional uncertainty, which will result in larger estimates of the standard errors and hence wider confidence intervals.

We also showed how sensitivity is similar to the proportion of subjects who eventually transit from the “recent infection” state to the “non-recent” infection state, as described by Wang and Lagakos (2009). We also described a method for incorporating risk factors for HIV incidence and extended the method of Magder and Hughes (1997) to incorporate the uncertainty of determining the outcome of interest, which in this case is HIV incidence. Logistic regression, which assumes that the logarithm of the odds of the outcome is a linear function of the predictors, was used to estimate the effect of various risk factors on HIV incidence. This is a standard method that is widely used in many areas, including medical research, because it is simple to fit using most statistical software and the estimated parameters are easy to interpret as odds ratios.

Another point worth noting is that Wang and Lagakos (2009) have shown that the adjusted and unadjusted estimators of incidence they proposed can be biased when the wrong underlying model is assumed. The estimators of incidence proposed in this thesis are not immune to this bias, especially since the accuracy of our estimators depends largely on the accuracy of the mean window period, sensitivity, specificity and the past prevalence, which are always unknown. More research is therefore needed to assess the robustness of the estimators to model mis-specification.

We note, however, as in most of the earlier chapters, that there are still challenges in the estimation of the HIV incidence rate. One is how to obtain reliable estimators of the mean window period, sensitivity and specificity. In addition, for the proposed estimators, research is necessary to investigate how we can account for the uncertainties in the estimates of the unknown parameters. We have highlighted in most chapters that the proposed methods do not take mortality into account, and thus the state prevalence probabilities that we developed may be biased. This is because, as noted by Balasubramanian and Lagakos (2010), the risk of death is higher for subjects in the late stage of HIV. However, the introduction of anti-retroviral therapy (ART) has improved survival amongst individuals with late-stage HIV, thus lowering HIV-related mortality. Further research is needed in this area, particularly on the effect of non-HIV (and HIV) related mortality on incidence estimation.

We used data from the Botswana AIDS Impact Survey (BAIS) III of 2008 to illustrate the proposed methodology, and only age and sex were used to illustrate the methods. Each of the other variables on its own had more than 25% of the information missing, making it impossible to include them in the analysis. As we highlighted, the situation was worse when we tried to investigate them in combination with others. Therefore methods for handling missing data such as multiple imputation could not be used, since they are usually recommended for settings where less than 20% of the data are missing (Little and Rubin, 2002). The situation is further exacerbated by the fact that only 67% of the sampled individuals provided a blood specimen for HIV testing. Although the focus of this thesis was not on missing data, more research is still needed in this area.

A simulation study to compare the unadjusted incidence estimators with known and unknown previous prevalence under constant incidence density function

### This R function, "simulationkno1", is a simulation study that
### evaluates the performance of the proposed estimators of the
### incidence rate for the model when the previous prevalence is known.
## In this simulation, we compare unadjusted estimators of the
## incidence rate for the standard model and the proposed model
## when past prevalence is known.
## The input variables are
##   "timet"    = time between two prevalence studies,
##   "muL"      = mean window period,
##   "fhat.tau" = the previous prevalence,
##   "lam.true" = not described in the original comments; from its use
##                below it appears to be the true incidence rate,
##   "nsim"     = number of simulations.
## We define "f.true" = true value of the incidence density function,
## "phi.one", "phi.two", "phi.three" are the prevalence probabilities,
## "n" = overall sample size.
## Then we generate "n00, n10, n11" about 1000 times.
## Estimate "theta.hat" = "fhat.tau" + f2*timet.
## Finally it estimates the incidence rate and its corresponding
## standard errors and coverage probabilities.

simulationkno1 <- function(timet, muL, lam.true, fhat.tau, nsim) {
  # Solve lam.true = f / (1 - theta) with theta = fhat.tau + f * timet for f
  f.true <- lam.true * (1 - fhat.tau) / (1 + lam.true * timet)
  theta.true <- fhat.tau + f.true * timet
  fhat.tau <- fhat.tau
