TAKING ADVANTAGE OF NEW TOOLS AND TECHNIQUES 9
[Figure 3-5 diagram: confounders are divided into measured and unmeasured. Measured confounders are handled by design (restriction, matching) or by analysis (standardization, stratification, regression, propensity scores, marginal structural models). Unmeasured confounders that are measurable in a sub-study are handled by two-stage sampling, external adjustment, or imputation. Unmeasurable confounders are handled by design (crossover, active comparator (restriction), instrumental variable) or by analysis (proxy analysis, sensitivity analysis).]
FIGURE 3-5 Dealing with unmeasured confounding factors in claims data analyses.
SOURCE: Schneeweiss, S. 2006. Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiology and Drug Safety 15:291-303. Reprinted with permission from Wiley-Blackwell, Copyright © 2006.
ing treatment and outcome, researchers model the instrument—which is unrelated to patient characteristics and therefore unconfounded—and then rescale the estimate for the correlation between the instrumental variable and the actual treatment.
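The rescaling step described above can be illustrated with a minimal sketch of the Wald estimator for a binary instrument: the instrument's effect on the outcome is divided by its effect on the treatment actually received. The simulated data, the effect sizes, and the function names below are illustrative assumptions, not values or methods taken from the studies cited in this chapter.

```python
import random

def mean(values):
    return sum(values) / len(values)

def wald_iv_estimate(z, x, y):
    """Wald estimator for a binary instrument z, binary treatment x, outcome y."""
    # Effect of the instrument on the outcome (the "reduced form").
    reduced_form = (mean([yi for zi, yi in zip(z, y) if zi == 1])
                    - mean([yi for zi, yi in zip(z, y) if zi == 0]))
    # Effect of the instrument on the treatment received (the "first stage").
    first_stage = (mean([xi for zi, xi in zip(z, x) if zi == 1])
                   - mean([xi for zi, xi in zip(z, x) if zi == 0]))
    # Rescale the instrument's effect by how strongly it moves treatment.
    return reduced_form / first_stage

def naive_risk_difference(x, y):
    """Unadjusted risk difference between treated and untreated patients."""
    treated = [yi for xi, yi in zip(x, y) if xi == 1]
    untreated = [yi for xi, yi in zip(x, y) if xi == 0]
    return mean(treated) - mean(untreated)

def simulate(n=200_000, true_effect=-0.05, seed=1):
    """Hypothetical data: u is an unmeasured confounder, z a valid instrument."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.random() < 0.5        # unmeasured confounder (assumed)
        zi = rng.random() < 0.5       # instrument, independent of u
        xi = rng.random() < 0.2 + 0.5 * zi + 0.2 * u      # treatment received
        yi = rng.random() < 0.10 + true_effect * xi + 0.10 * u  # outcome
        z.append(int(zi))
        x.append(int(xi))
        y.append(int(yi))
    return z, x, y

z, x, y = simulate()
naive_rd = naive_risk_difference(x, y)
iv_rd = wald_iv_estimate(z, x, y)
print(f"naive risk difference: {naive_rd:.3f}")
print(f"IV (Wald) risk difference: {iv_rd:.3f}")
```

In this simulated setting the unadjusted estimate is pulled toward zero by the unmeasured confounder, while the IV estimate lies close to the risk difference built into the simulation. That holds only because the simulated instrument satisfies the assumptions by construction.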
One of the key assumptions is that the instrumental variable is not associated with either the measured or unmeasured confounders and is not related to the outcome directly other than through the actual treatment. This is necessary for instrumental variables to produce valid results.
Consequently, in working with such instruments, researchers have to identify a sort of quasi-random treatment assignment in the real world.
For the sake of this paper, two are readily identifiable:
Interruption in Medical Practice
This quasi-random treatment assignment can be caused by sudden and massive interruptions of treatment patterns, for example by regulatory changes. An example might be the FDA aprotinin advisory that reduced the medication’s use by 50 percent, a massive shift. For the same patient candidates for aprotinin, a cardiac surgeon would likely choose a different course of treatment before and after the advisory. A similar example is found in the evolution of coronary stents: a patient coming for a percutaneous procedure on one day might be treated with a bare metal stent, but a year later, after the rapid adoption of drug-eluting stents, that same patient might be given a drug-coated stent.
Strong Treatment Preference
Several papers have contributed to our understanding of this valuable instrument for evaluating the comparative effectiveness of therapeutics, which considers such instruments as distance to a specialist, geographic area, physician prescribing preference, and hospital formularies (Brookhart et al., 2006; McClellan et al., 1994; Stukel et al., 2007). A valid preference-based instrument would be the observation of a quasi-random treatment choice mechanism. For example, some hospitals have certain drugs on formulary and others do not, but patients do not choose one hospital over another based on whether a particular medication is on formulary.
Figure 3-6 presents an example in which the causal relationship of interest is between the use of coxibs versus nsNSAIDs and GI complications, and the instrument is physician preference to prescribe coxibs versus nonselective NSAIDs (Schneeweiss et al., 2006). This nightmare for everyone writing treatment guidelines might be the dream of an epidemiologist: The same patients get treated differently by different physicians; some physicians always prescribe coxibs and some physicians never prescribe coxibs to patients who need pain therapy (Schneeweiss et al., 2005; Solomon et al., 2003).
[Figure 3-6 diagram: the instrumental variable (IV), physician preference to prescribe coxibs, acts on the treatment (coxibs vs. nsNSAIDs), which acts on the outcome (GI complication in 180 days). Measured confounders (C): prior GI events, warfarin, steroids, age, proton-pump inhibitor use, comorbidities, etc. Unmeasured confounders (U): frailty, smoking, alcohol, body mass index, over-the-counter drug use, etc.]
FIGURE 3-6 IV estimation of the association between NSAIDs and GI complication.
SOURCE: Adapted by permission from Macmillan Publishers, Ltd. Clinical Pharmacology & Therapeutics 82:143-156, Copyright © 2007.
Some confounders such as the use of steroids and other medications can be measured with information that we can draw from claims data.
However, there will remain unmeasured confounders, for example, body mass index and the use of over-the-counter drugs. Such information is usually not available in claims data, leading one to ask what happens when one compares the conventional multivariate-adjusted analysis to an instrumental variable analysis based on physician preference. Data not shown here indicate that the risk difference estimate for GI complications for coxibs in a conventional multivariate analysis is around 0, meaning “no association.” What we would expect, of course, is a protective effect. When we did the instrumental variable analysis on coxibs and reduced GI toxicity (not shown), we saw a negative risk difference, indicating a protective effect of the coxibs as compared to nsNSAIDs. This is an example where the confounding is strong and the confounding factor is either not measured in claims data or is measured only to a small extent.
Let us consider three core assumptions about instrumental variables (Angrist, 1996). One assumption is that the instrument is related to the actual exposure (otherwise it can’t be an instrument) and is a strong predictor of treatment. Here the assumption is that physician prescribing preference strongly predicts future choices of treatments. This assumption is empirically testable. In comparison with IV analyses from economics, the strength of the physician prescribing preference IV is greater than most but not all published examples (Rassen, 2008).
A second assumption is that the instrument should not be associated with any measured or unmeasured patient care characteristics. To prove such an assumption (a more difficult exercise than proving the first assumption), one must consider the extent to which one achieves balance in the measured covariates between the two treatment groups. This involves summarizing all of the measurable individual covariates into a summary metric called the Mahalanobis distance, which accounts for the covariance between individual patient factors. In this case the physician preference for a variety of instrument definitions has led to substantial reduction in imbalance among observed patient characteristics (Rassen, 2008). The hope is that when improvement in balance in the measured covariates can be achieved by the instrument, there will be a corresponding improvement in the unmeasured covariates. This is different from the balance achieved by propensity score matching, which is limited to the measured patient characteristics and their correlates (Seeger et al., 2005).
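To make the balance metric concrete, the following is a minimal sketch of a Mahalanobis distance between the covariate means of two groups. It is restricted to two covariates so that the covariance matrix can be inverted by hand, and it uses the covariance of the combined sample. Both choices, and the function names, are simplifying assumptions for illustration and are not necessarily the definition used in the cited work.

```python
import math

def mean_vector(rows):
    """Column means of a list of (covariate_1, covariate_2) tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def covariance_2x2(rows):
    """Sample variances and covariance of two covariates."""
    m = mean_vector(rows)
    n = len(rows)
    sxx = sum((r[0] - m[0]) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - m[1]) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_imbalance(group_a, group_b):
    """Distance between group means, scaled by the covariance structure."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    # Assumption: covariance is estimated from both groups combined.
    sxx, sxy, syy = covariance_2x2(list(group_a) + list(group_b))
    d0 = mean_a[0] - mean_b[0]
    d1 = mean_a[1] - mean_b[1]
    det = sxx * syy - sxy ** 2
    # d' S^{-1} d, with the 2x2 inverse written out explicitly.
    quadratic_form = (syy * d0 * d0 - 2 * sxy * d0 * d1 + sxx * d1 * d1) / det
    return math.sqrt(quadratic_form)

# Hypothetical covariate pairs for two treatment groups.
group_a = [(0, 0), (0, 2), (2, 0), (2, 2)]
group_b = [(1, 0), (1, 2), (3, 0), (3, 2)]
print(mahalanobis_imbalance(group_a, group_b))
print(mahalanobis_imbalance(group_a, group_a))  # identical groups give 0.0
```

A smaller distance when patients are grouped by the instrument than when they are grouped by actual treatment would indicate that the instrument improves balance in the measured covariates.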
A third assumption is that there should be no direct relationship with outcome other than through actual treatment. One can attempt to test this assumption empirically in the case of the treatment preference instrument through what is colloquially called the “good doc/bad doc” model, which suggests that treatment preference may be correlated with other physician characteristics that relate to better outcomes. For example, some physicians who generally practice medicine better might have a preference for coxibs versus other NSAIDs. This creates a physician-level correlation and therefore introduces confounding. To test this assumption, other quality of care measures, such as prescribing long-acting benzodiazepines or problematic tricyclic antidepressant prescribing, could be assessed in a study of the effectiveness of antipsychotics. The result was that among general practitioners there was no association between treatment preference and quality of prescribing, and thus a reduced chance of violations of this third assumption (Brookhart et al., 2007).
Another example used regional variation in heart catheterization rates in patients with acute myocardial infarctions as an instrument (Stukel et al., 2007). As seen in Figure 3-7, patients in this study were arranged by quintile of regional cardiac catheterization rates. In the first quintile, 43 percent of patients received a heart catheter; in the highest quintile group, 65 percent received heart catheterization.
One could argue that there shouldn’t be anything different between these populations because patients did not select their residence according to whether their regional cardiac catheterization rate is high. If this argument holds, then there are some patients not receiving catheterization who would receive catheterization if they happened to live in another region.
Thus there is quasi-random treatment assignment for these patients.
Looking at the effect estimates in Figure 3-7 we find that the protective effect of heart catheterization in patients with acute myocardial infarction
[Figure 3-7 panel title: Decrease in effect size with better adjustment for measured and unmeasured confounders; effect measure: risk difference (RD).]
FIGURE 3-7 Regional variation in cardiac catheterization and risk of death.
SOURCE: Journal of the American Medical Association 297(3):278-285. Copyright © 2007 American Medical Association. All rights reserved.
FIGURE 3-8 Time as an instrumental variable.
SOURCE: Johnston, K. M., P. Gustafson, A. R. Levy, and P. Grootendorst. 2008. Use of instrumental variables in the analysis of generalized linear models in the presence of unmeasured confounding with applications to epidemiological research. Statistics in Medicine 27(9):1539-1556. Reproduced with permission of John Wiley & Sons, Ltd.
in an unadjusted analysis, 24 fewer deaths per year per 100 patients, shrinks in the multivariate-adjusted regression to only 16 deaths prevented and, with the instrumental variable regression, to only 5.
One final example (Figure 3-8) uses time as an instrumental variable. The question here concerns the use of beta-blockers after heart failure hospitalization and 1-year mortality, and whether beta-blocker use is correlated with reduced mortality. After some landmark trials had been published, beta-blocker use in patients with heart failure increased substantially. The investigators defined the binary instrument as either before or after this increase in beta-blocker use. As the figure shows, the estimated odds ratio using standard logistic regression was 0.68, whereas the instrumental variable odds ratio was 0.23. Without suggesting which is “right,” we see that there is a considerable difference between the two estimates.
The most frequently mentioned limitation of instrumental variables is that two critical assumptions are not testable and must instead be argued using context knowledge. Several empirical tests have been suggested to partially evaluate IV assumptions, but ultimately we cannot prove that the assumptions are fully valid. However, readers may be reminded that conventional regression analyses are also based on assumptions, including that the model is specified correctly, i.e., that all confounders are measured and included in the model, an assumption that is inherently untestable. The lower statistical efficiency that results from the two-stage estimation process is another limitation. In large databases with tens of thousands of people exposed to drug therapy, that is usually a minor issue.
Comparative effectiveness research should routinely explore whether a valid instrumental variable is identifiable in settings where important confounders remain unmeasured. One should search for random components in the treatment choice process, which will sometimes lead to a valid instrument. We have found that the physician prescribing preference instrument is worth considering in many situations of drug effectiveness research. We have further recommended that instrumental variable analyses should be secondary to conventional regression modeling until we better understand the qualities of preference-based instruments and how best to test IV assumptions empirically. We further suggest performing sensitivity analyses to assess how much violation of IV assumptions may change the primary effect estimate (Brookhart, 2007).
In conclusion, instrumental variable analyses are currently underutilized but very promising approaches for comparative effectiveness research using nonrandomized data. Instrumental variable analyses can lead to substantial improvements, particularly in situations with strong unmeasured confounding. The prospect of reducing residual confounding comes at the price of somewhat untestable assumptions for valid estimation. Plenty of research is ahead, particularly developing better methods to empirically assess the validity of IV assumptions and systematic screens for instrument candidates.
ADAPTIVE AND BAYESIAN APPROACHES TO STUDY DESIGN
Donald A. Berry, Ph.D.
Head, Division of Quantitative Sciences
Professor and Frank T. McGraw Memorial Chair for Cancer Research
Chairman, Department of Biostatistics
The University of Texas M.D. Anderson Cancer Center
Modern clinical studies are subject to the most rigorous of scientific standards. In particular, modern research relies heavily on the randomized clinical study that was introduced by A. Bradford Hill in the 1940s (MRC Streptomycin in Tuberculosis Studies Committee, 1948). Applying randomization in a clinical research setting was an enormous advance and it revolutionized the notion of treatment comparisons. For a variety of reasons, mostly coincidence, the RCT became tied to the frequentist approach to statistical inference. In this approach the inferential unit is the study itself, and the conventional measure of inference is the level of statistical significance. In the early days of the RCT the sample size was fixed in advance.
Over time, preplanned interim analyses were incorporated to allow for stopping the study early for sufficiently conclusive results.
Randomization will continue to be important in clinical research. However, randomization is difficult and expensive to effect, and there are legitimate ways of learning without randomizing. Moreover, learning can take place at any time during a study and not just when accrual is stopped and sufficient follow-up information obtained. The goal of this chapter is to describe an approach to clinical study design that improves on randomization in two ways. One way is to make RCTs more flexible, with data accrued during the study used to guide the study’s course. The other improvement is incorporating different sources of information to enable better conclusions about comparative effectiveness. Both use the Bayesian approach to statistics (Berry, 1996, 2006). This approach is ideal for both purposes. As regards the first, Bayes rule provides a formalism for updating knowledge with each new piece of information that is obtained, with updates occurring at any time. As regards the second, the Bayesian approach is inherently synthetic. Its principal measures of inference are the probabilities of hypotheses based on the totality of information available at the time.
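As a small illustration of updating with each new piece of information, the sketch below applies Bayes rule to a response rate with a Beta prior and binary outcomes. The Beta-binomial model, the uniform prior, and the patient counts are assumptions chosen for simplicity; they are not taken from any study discussed here.

```python
def update(prior, successes, failures):
    """Bayes rule for a Beta(alpha, beta) prior and binary outcomes."""
    alpha, beta = prior
    return (alpha + successes, beta + failures)

def posterior_mean(params):
    """Mean of a Beta(alpha, beta) distribution."""
    alpha, beta = params
    return alpha / (alpha + beta)

prior = (1, 1)                              # uniform prior (assumed)
after_first = update(prior, 3, 1)           # first 4 patients: 3 responses
after_second = update(after_first, 2, 4)    # next 6 patients: 2 responses
all_at_once = update(prior, 5, 5)           # the same 10 patients in one step

print(after_first, posterior_mean(after_first))
print(after_second, posterior_mean(after_second))
print(all_at_once)
```

Updating patient by patient, in batches, or all at once gives the same posterior, which is why the timing of an interim look does not change what the data say in this framework.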
Précis for Frequentist Statistics
Historically, the standard statistical measures used in clinical research have been frequentist. Frequentist conclusions are tailored to and driven by the study’s design. Probability calculations are restricted to the so-called “sample space,” the set of outcomes possible for the design used. Making these calculations requires assuming a particular mechanism that produces the observations. An especially important assumption is that the experimental treatment being evaluated is ineffective, the “null hypothesis.”
Other hypotheses can be assumed as well, including that the experimental treatment has a particular specified advantage.
The most familiar frequentist inferential measure is the “p-value,” or observed statistical significance level. This is the probability of observations in the sample space as extreme or more extreme than the results actually observed, calculated assuming the null hypothesis. To make this calculation requires finding the probabilities (under the null hypothesis) of results that are potentially observable. It also requires ordering the possible results of the experiment so that “more extreme results” can be identified to enable adding probabilities over these results.
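The calculation described above can be sketched for the simplest case: a fixed number of patients with a binary response, where "more extreme" is taken to mean more responses than observed. The one-sided ordering, the null response rate, and the counts are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k responses among n patients at response rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def one_sided_p_value(observed, n, p_null):
    """Sum null probabilities over results as extreme as or more extreme than observed."""
    # Ordering assumption: more responses count as more extreme.
    return sum(binomial_pmf(k, n, p_null) for k in range(observed, n + 1))

# Hypothetical study: 15 responses among 20 patients, null response rate 0.5.
p_value = one_sided_p_value(15, 20, 0.5)
print(f"one-sided p-value: {p_value:.4f}")
```

The sum runs over the sample space of a design with exactly 20 patients. A design with interim looks would have a different sample space and therefore a different p-value for the same data.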
An important frequentist calculation made in advance of a study is its statistical power. This is the probability of achieving statistical significance in the study (defined as having a p-value of 0.05 or smaller) when the truth is that the experimental treatment has some particular benefit.
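For the same simple setting, power can be computed exactly: find the smallest number of responses that would give a p-value of 0.05 or smaller, then compute the probability of reaching it under the assumed benefit. The one-sided test, the sample size, and the response rates of 0.5 and 0.7 are assumptions made for this sketch.

```python
from math import comb

def upper_tail(k, n, p):
    """Probability of k or more responses among n patients at response rate p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def critical_value(n, p_null, alpha=0.05):
    """Smallest response count whose one-sided p-value is alpha or smaller."""
    for k in range(n + 1):
        if upper_tail(k, n, p_null) <= alpha:
            return k
    return None  # no result can reach significance with this n

def exact_power(n, p_null, p_alt, alpha=0.05):
    """Probability of a significant result when the true rate is p_alt."""
    k = critical_value(n, p_null, alpha)
    return 0.0 if k is None else upper_tail(k, n, p_alt)

cutoff = critical_value(20, 0.5)
power = exact_power(20, 0.5, 0.7)
print(f"significant if at least {cutoff} of 20 respond; power = {power:.3f}")
```

Increasing the sample size in `exact_power` shows how the design, fixed in advance, determines the chance of reaching significance.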
In all of the above calculations the design must be completely described in advance, for otherwise the probabilities in the sample space and even the sample space itself will be unknown. And the study must be complete, having followed the design as specified in advance. The mathematics are easiest when the sample size is fixed and treatment assignments do not depend on the interim results. But frequentist measures can be calculated (perhaps only via simulation) for any prospective design, however complicated. One potential stumbling block in a complicated study is identifying an ordering of the study results. There is no natural way of ordering study results in the frequentist approach when the study has a complicated design. For example, there is no good frequentist approach to answer questions such as, “Given the current results of the study, how much credibility should I place in the null hypothesis as opposed to competing hypotheses?” That makes it difficult to alter the course of the study on the basis of those results.
Précis for Bayesian Statistics
There are many publications describing the Bayesian approach, for example, Berry (2006) and Spiegelhalter (2004). I will give a brief description here, highlighting some points of special importance in clinical study design. In the Bayesian approach, anything that is unknown, including hypotheses, has a probability. So the null hypothesis has a probability. And this probability can be calculated at any time: at the end of the study, during the study, and at the beginning of the study. The last of these is called a “prior probability.” Probabilities calculated during or after a study are based on whatever results are available at the time and are called “posterior probabilities.” For example, a Bayesian can always answer the question in the previous paragraph by giving the current (posterior) probability of the null hypothesis.
The Bayesian approach has a characteristic that is very important in designing clinical studies: It enables calculating probabilities of future observations based on previous observations. Frequentists can calculate probabilities of future observations only by assuming particular hypotheses.
In the Bayesian approach predictive probabilities do not require assuming a particular hypothesis because these probabilities are averages with respect to the current posterior probabilities of the various hypotheses.
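A minimal sketch of such a predictive probability, assuming a Beta posterior for a response rate: the probability of each possible number of future responses is averaged over the posterior rather than computed at one assumed response rate. The Beta-binomial model, the uniform prior, and the patient counts are illustrative assumptions.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_pmf(k, m, a, b):
    """Probability of k responses among m future patients, averaged over Beta(a, b)."""
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

def predictive_prob_at_least(needed, m, a, b):
    """Probability that at least `needed` of the next m patients respond."""
    return sum(predictive_pmf(k, m, a, b) for k in range(max(needed, 0), m + 1))

# Hypothetical interim data: 12 responses among 20 patients, uniform prior.
a, b = 1 + 12, 1 + 8
# Suppose 20 responses among 30 patients are wanted: 8 more from the next 10.
prob = predictive_prob_at_least(8, 10, a, b)
print(f"predictive probability of at least 8 of the next 10: {prob:.3f}")
```

A quantity of this kind can be recomputed after every patient and used, for example, to judge whether continuing a study is likely to be worthwhile.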
The online learning aspect of the Bayesian approach makes it ideal for building adaptive designs. If a study’s design is developed as the study is being conducted, which is possible in the Bayesian approach, it is impossible to calculate the study’s false-positive rate. This is why I insist on building designs prospectively. It is more work because one must consider many possibilities that will not arise in the actual trial: “What would I want to do if the data after 40 patients are as follows: . . .?” The various “operating characteristics” of any prospective study design, including its false-positive rate, can be calculated. Except in the simplest of adaptive designs, such calculation will require simulation.
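The sketch below shows what such a simulation might look like for a deliberately simple, fully prespecified design: a single-arm study with interim looks after every 10 patients, stopping for success when the posterior probability that the response rate exceeds 0.5 is above 0.975. The design, thresholds, uniform prior, and simulation size are all hypothetical choices made for illustration.

```python
import random
from math import comb

def posterior_prob_exceeds(successes, failures, p0):
    """P(rate > p0 | data) under a uniform prior.

    Uses the identity between the Beta and binomial distributions for
    integer parameters: P(Beta(s+1, f+1) > p0) = P(Bin(s+f+1, p0) <= s).
    """
    n = successes + failures + 1
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes + 1))

def run_trial(true_rate, rng, looks=(10, 20, 30, 40), p0=0.5, threshold=0.975):
    """Run one simulated trial; return True if it stops or ends with success."""
    successes = 0
    enrolled = 0
    for look in looks:
        while enrolled < look:
            successes += rng.random() < true_rate
            enrolled += 1
        if posterior_prob_exceeds(successes, enrolled - successes, p0) > threshold:
            return True
    return False

def success_rate(true_rate, n_sims=5000, seed=2024):
    """Fraction of simulated trials that declare success at a given true rate."""
    rng = random.Random(seed)
    return sum(run_trial(true_rate, rng) for _ in range(n_sims)) / n_sims

false_positive_rate = success_rate(0.5)   # treatment truly ineffective
power = success_rate(0.7)                 # treatment truly beneficial
print(f"false-positive rate: {false_positive_rate:.3f}, power: {power:.3f}")
```

Because every decision rule is fixed before the first simulated patient is enrolled, the false-positive rate and power are properties of the design and can be tuned, for example by changing the threshold, before the real study begins.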