2.5 COUNTERFACTUAL APPROACH TO CONFOUNDING
2.5.3 Counterfactual Definition of a Confounder
As intuition suggests, in order for a measure of effect to be confounded according to the counterfactual definition, the exposed and unexposed cohorts must differ on risk factors for the disease. A variable that is, in whole or in part, “responsible” for confounding is said to be a confounder (of the measure of effect). Note that according to the counterfactual approach, the fundamental concept is confounding, and confounders are defined secondarily as variables responsible for this phenomenon.
This is to be contrasted with the collapsibility definition, which first defines confounders and, when these have been identified, declares confounding to be present. As was observed in the previous section, the counterfactual definition of confounding is based on measurements taken at the population level. These usually represent the net effects of many interrelated variables, some of which may play a role in confounding.
An important part of the analysis of epidemiologic data involves identifying from among the possibly long list of risk factors those which might be confounders.
Below we discuss the relatively simple case of a single confounder, but in practice there will usually be many such variables to consider. To get a sense of how complicated the interrelationships can be, consider a cohort study investigating hypercholesterolemia (elevated serum cholesterol) as a risk factor for myocardial infarction (heart attack). Myocardial infarction most often results from atherosclerosis (hardening of the arteries). Established risk factors that would typically be considered in such a study are age, sex, family history, hypertension (high blood pressure), and smoking. A number of associations among these risk factors need to be taken into account: hypertension and hypercholesterolemia tend to increase with age; smoking is related to sex and age; hypercholesterolemia and atherosclerosis are familial; hypercholesterolemia can cause atherosclerosis, which can in turn lead to hypertension; hypertension can damage blood vessels and thereby provide a site for atherosclerosis to develop.
In order to tease out the specific effect, if any, that hypercholesterolemia might have on the risk of myocardial infarction, it is necessary to take account of potential confounding by the other variables mentioned above. As can be imagined, this presents a formidable challenge, in terms of both statistical analysis and pathophysiologic interpretation. Furthermore, the preceding discussion refers only to known risk factors. There may be unknown risk factors that should be considered but that, due to the current state of scientific knowledge, are not included in the study. This example also points out that according to the counterfactual approach (and in contrast to the collapsibility approach) a decision about confounding is virtually never made solely on the basis of study data. Instead, all available information is utilized, in particular whatever is known about the underlying disease process and the population being studied (Greenland and Neutra, 1980; Robins and Morgenstern, 1987).
We now formalize the counterfactual definition of a confounder. Let R be the complete set of risk factors for the disease, both known and unknown, and let S be a subset of R that does not include E, the exposure of interest. The stratification that results from cross-classifying according to all variables in S will be referred to as stratifying by S, and the resulting strata will be referred to as the strata of S. For example, let $S = \{F_1, F_2\}$, where $F_1$ is age group (five levels) and $F_2$ is sex. Then the strata of S are the 10 age group–sex categories obtained by cross-classifying $F_1$ and $F_2$. Within each stratum of S we form the 2×2 table obtained by cross-classifying by the exposure of interest and the disease. Associated with the actual exposed cohort in each stratum is a corresponding counterfactual unexposed cohort. We say there is no residual confounding in the strata of S if each stratum is unconfounded; that is, within each stratum the probability of disease in the actual unexposed cohort equals the probability of disease in the counterfactual unexposed cohort.
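The cross-classification in the age group and sex example can be sketched in a few lines of Python. This is only an illustration: the level labels below are hypothetical placeholders, since the text specifies the number of levels (five age groups, two sexes) but not their names.

```python
from itertools import product

# Hypothetical level labels; only the counts (5 and 2) come from the text.
age_groups = ["age1", "age2", "age3", "age4", "age5"]  # F1: five levels
sexes = ["female", "male"]                             # F2: two levels


def strata(*factors):
    """Cross-classify the given factors; each stratum is one tuple of levels."""
    return list(product(*factors))


strata_of_S = strata(age_groups, sexes)
print(len(strata_of_S))  # number of strata of S
for stratum in strata_of_S[:3]:
    print(stratum)
```

Each tuple identifies one stratum of S, within which a separate 2×2 exposure-by-disease table would be formed.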
Suppose that we have constructed the causal diagram relating the risk factors in R and the exposure E to the disease. Based on the “back-door” criterion, the causal diagram can be used to determine whether, after stratifying by S, there is residual confounding in the strata of S (Pearl, 1993, 1995, 2000, Chapter 3). When there is no residual confounding we say that S is sufficient to control confounding, or simply that S is sufficient. When S is sufficient but no proper subset of S is sufficient, S is said to be minimally sufficient. A minimally sufficient set can be determined by sequentially deleting variables from a sufficient set until no more variables can be dropped without destroying the sufficiency. Depending on the choices made at each step, this process may lead to more than one minimally sufficient set of confounders.
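The sequential deletion process can be sketched as follows. This is a minimal illustration under loud assumptions: the function `is_sufficient` is a hypothetical stand-in for applying the back-door criterion to a causal diagram, and the toy rule used here (a set is sufficient if it contains A, or contains both B and C) is invented purely to show how the order of deletion can change the result.

```python
def reduce_sufficient_set(start, is_sufficient, order):
    """Drop variables one at a time, in the given order, while sufficiency holds.

    `is_sufficient` is a hypothetical oracle; in practice it would be derived
    from a causal diagram, which this sketch does not model.
    """
    current = set(start)
    if not is_sufficient(current):
        raise ValueError("the starting set must be sufficient")
    changed = True
    while changed:  # repeat until no single variable can be dropped
        changed = False
        for variable in order:
            if variable in current and is_sufficient(current - {variable}):
                current.remove(variable)
                changed = True
    return current


def toy_is_sufficient(variables):
    # Invented rule for illustration only: {A} or {B, C} controls confounding.
    return "A" in variables or {"B", "C"} <= variables


print(reduce_sufficient_set({"A", "B", "C"}, toy_is_sufficient, ["A", "B", "C"]))
print(reduce_sufficient_set({"A", "B", "C"}, toy_is_sufficient, ["C", "B", "A"]))
```

Under the toy rule, deleting in the order A, B, C ends at the set {B, C}, whereas deleting in the order C, B, A ends at {A}, so the same sufficient set yields two different reduced sets.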
This shows that whether we view a risk factor as a confounder depends on the other risk factors under consideration.
It is possible for a minimally sufficient set of confounders to be empty, meaning that the crude measure of effect is unconfounded. A valuable and surprising lesson to be learned from causal diagrams is that confounding can be introduced by enlarging a sufficient set of confounders (Greenland and Robins, 1986; Greenland et al., 1999).
The explanation for this seeming paradox is that, as has been observed, confounding is a phenomenon that is determined by net effects at the population level. Just as stratification can prevent confounding by severing the connections between variables that were responsible for a spurious causal relationship, it is equally true that stratification can create confounding by interfering with the paths between variables that were responsible for preventing confounding.
Causal diagrams have the potential drawback of requiring detailed information on the possibly complex interrelationships among risk factors, both known and unknown. However, even if a causal diagram is based on incomplete knowledge, it can be useful for organizing what is known about established and suspected risk factors.
In Section 2.5.6 it is demonstrated that for an unknown confounder to produce significant bias it must be highly prevalent in the unexposed population, closely associated with the exposure, and a major risk factor for the disease. It is always possible that such an important risk factor might as yet be unknown, especially when a disease is only beginning to be studied, but for well-researched diseases this seems less likely.
Nevertheless, the impact of an unknown confounder must be kept in mind and so causal diagrams should be interpreted with an appropriate degree of caution.
We now specialize to the simple case of a single potential confounder F. The situation with multiple potential confounders is partially subsumed by the present discussion if we consider F to be formed by stratifying on a set of potential confounders. We make the crucial assumption that F is not affected by the exposure being studied. An instance where this assumption would fail is in the study of hypercholesterolemia and myocardial infarction presented earlier, with F taken to be hypertension. As was remarked above, hypercholesterolemia can lead to hypertension, which is a risk factor for myocardial infarction. When risk factors are affected by the exposure under consideration, the analysis of confounding and causality becomes much more complicated, requiring considerations beyond the scope of the present discussion (Rosenbaum, 1984b; Robins, 1989; Robins and Greenland, 1992; Robins et al., 1992; Weinberg, 1993; Robins, 1998; Keiding, 1999). With F assumed to be unaffected by exposure, it follows that F cannot be on the causal pathway between exposure and disease. Recall that this is one of the conditions that was specified in Section 2.3.2 as a requirement for a proper definition of a confounder.
Let $\pi_{1j}^*$ denote the counterfactual probability of disease in the $j$th stratum and let $p_{1j}^*$ denote the proportion of the counterfactual unexposed cohort in that stratum ($j = 1, 2, \ldots, J$). Consistent with (2.4), it follows that
\[ \pi_1^* = \sum_{j=1}^{J} p_{1j}^* \pi_{1j}^*. \]
We now assume that there is no residual confounding in the strata of F; that is, $\pi_{1j}^* = \pi_{2j}$ for all $j$. This assumption is related to the assumption of strong ignorability (Rosenbaum and Rubin, 1983; Rosenbaum, 1984a). According to Holland (1989), this is perhaps the most important type of assumption that is made in discussions of causal inference in nonrandomized studies. It was assumed above that F is not affected by E. This implies that if the exposed cohort had in fact been unexposed, the distribution of F would be unchanged; that is, $p_{1j}^* = p_{1j}$ for all $j$. Consequently,
\[ \pi_1^* = \sum_{j=1}^{J} p_{1j} \pi_{2j}. \tag{2.16} \]
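Formula (2.16) is a weighted average of the stratum-specific risks in the unexposed cohort, with weights given by the distribution of F in the exposed cohort. A minimal sketch of the computation follows; the function name `counterfactual_risk` and the two-stratum input values are hypothetical and are not taken from the text.

```python
from fractions import Fraction


def counterfactual_risk(p1, pi2):
    """Right-hand side of (2.16): sum over strata of p_1j * pi_2j.

    p1  : proportions of the exposed cohort in each stratum of F
    pi2 : probabilities of disease in the unexposed cohort, by stratum
    """
    if len(p1) != len(pi2):
        raise ValueError("p1 and pi2 must have one entry per stratum")
    if sum(p1) != 1:
        raise ValueError("the stratum proportions must sum to 1")
    return sum(p * risk for p, risk in zip(p1, pi2))


# Hypothetical inputs for illustration only (not from any table in the text).
p1 = [Fraction(3, 4), Fraction(1, 4)]
pi2 = [Fraction(1, 5), Fraction(2, 5)]

pi1_star = counterfactual_risk(p1, pi2)
print(pi1_star)  # (3/4)(1/5) + (1/4)(2/5) = 1/4
```

Exact fractions are used so that later equality comparisons between weighted averages are not affected by floating-point rounding.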
From (2.4) and (2.16), the criterion for no confounding, $\pi_1^* = \pi_2$, can be expressed as
\[ \sum_{j=1}^{J} p_{1j} \pi_{2j} = \sum_{j=1}^{J} p_{2j} \pi_{2j} \tag{2.17} \]
(Wickramaratne and Holford, 1987; Holland, 1989). Identity (2.17) is the same as identities (2.7) and (2.10), which were shown to be sufficient to guarantee averageability of the risk difference and risk ratio. Consequently, either condition (i) or condition (ii) is sufficient for $\pi_1^* = \pi_2$ to be true. So, provided F is not affected by E and assuming that there is no residual confounding within strata of F, the following are necessary conditions for F to be a confounder according to the counterfactual definition of confounding (Miettinen and Cook, 1981; Rothman and Greenland, 1998):
1. F is a risk factor for the disease in the unexposed population.
2. F is associated with exposure in the population.
These are the same necessary conditions for F to be a confounder of the risk difference and risk ratio that were obtained using the collapsibility definition of confounding. An important observation is that (2.16) was derived without specifying a particular measure of effect. This means that conditions 1 and 2 above are applicable to the odds ratio as well as the risk difference and risk ratio. This avoids the problem related to the odds ratio that was identified as a flaw in the collapsibility definition of confounding.
We return to an examination of confounding and effect modification in the hypothetical cohort studies considered earlier. Based on criterion (2.17), it is readily verified that F is a confounder (according to the counterfactual definition) in Table 2.2(e) but not in Table 2.2(d). We observe that in Table 2.2(d), F is an effect modifier of the risk ratio but not an effect modifier of the risk difference or odds ratio.
On the other hand, in Table 2.2(e), F is an effect modifier of the risk difference and the odds ratio but not an effect modifier of the risk ratio. This shows that confounding (according to the counterfactual definition) and effect modification are distinct characteristics that can occur in the presence or absence of one another.
When there are only two strata, (2.17) simplifies to
\[ p_{11}\pi_{21} + p_{12}\pi_{22} = p_{21}\pi_{21} + p_{22}\pi_{22}. \tag{2.18} \]
Substituting $p_{12} = 1 - p_{11}$ and $p_{22} = 1 - p_{21}$ in (2.18) and rearranging terms leads to $(\pi_{21} - \pi_{22})(p_{11} - p_{21}) = 0$. This identity is true if and only if either $\pi_{21} = \pi_{22}$ or $p_{11} = p_{21}$. The latter identities are precisely conditions (i) and (ii), respectively.
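For readers who wish to see the rearrangement of (2.18) written out, the intermediate steps can be sketched as follows, using only the substitutions stated above:

```latex
\begin{align*}
p_{11}\pi_{21} + (1 - p_{11})\pi_{22} &= p_{21}\pi_{21} + (1 - p_{21})\pi_{22} \\
p_{11}(\pi_{21} - \pi_{22}) + \pi_{22} &= p_{21}(\pi_{21} - \pi_{22}) + \pi_{22} \\
(p_{11} - p_{21})(\pi_{21} - \pi_{22}) &= 0 .
\end{align*}
```

The product vanishes exactly when at least one of its two factors is zero, which gives the two alternatives stated above.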
So, when there are only two strata, (2.17) holds if and only if condition (i) or condition (ii) holds. In other words, conditions 1 and 2 completely characterize a dichotomous confounder. In Table 2.2(f), F is a risk factor for the disease and F is associated with exposure, and yet (2.17) is satisfied:
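\[ \left(\frac{200}{600}\right)\left(\frac{50}{100}\right) + \left(\frac{300}{600}\right)\left(\frac{20}{200}\right) + \left(\frac{100}{600}\right)\left(\frac{90}{300}\right) = \frac{160}{600} = \left(\frac{100}{600}\right)\left(\frac{50}{100}\right) + \left(\frac{200}{600}\right)\left(\frac{20}{200}\right) + \left(\frac{300}{600}\right)\left(\frac{90}{300}\right). \]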
This illustrates that, when F has three or more strata, even if both conditions 1 and 2 are satisfied, F may not be a confounder. Therefore, when there are three or more strata, conditions 1 and 2 are necessary but not sufficient for confounding.
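The arithmetic for Table 2.2(f) can be checked with exact fractions. The sketch below reads the stratum-specific quantities off the displayed identity; it assumes that the first factor of each product is $p_{1j}$ (left side) or $p_{2j}$ (right side) and that the second factor is $\pi_{2j}$, following the term order of (2.17). The way conditions 1 and 2 are operationalized here (risks that differ across strata, and stratum distributions that differ between cohorts) is likewise an assumption made for illustration.

```python
from fractions import Fraction

# Quantities read off the displayed identity for Table 2.2(f).
p1 = [Fraction(200, 600), Fraction(300, 600), Fraction(100, 600)]  # F in exposed
p2 = [Fraction(100, 600), Fraction(200, 600), Fraction(300, 600)]  # F in unexposed
pi2 = [Fraction(50, 100), Fraction(20, 200), Fraction(90, 300)]    # risk, unexposed


def weighted_risk(weights, risks):
    """Sum over strata of weight_j * risk_j, as on each side of (2.17)."""
    return sum(w * r for w, r in zip(weights, risks))


lhs = weighted_risk(p1, pi2)  # left side of (2.17)
rhs = weighted_risk(p2, pi2)  # right side of (2.17)

risk_varies_with_f = len(set(pi2)) > 1  # condition 1, as operationalized here
f_associated_with_e = p1 != p2          # condition 2, as operationalized here
criterion_holds = lhs == rhs            # identity (2.17)

print(lhs, rhs)
print(risk_varies_with_f, f_associated_with_e, criterion_holds)
```

Both sides reduce to 160/600, so (2.17) holds even though both conditions are met, in agreement with the conclusion drawn above for three or more strata.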