7.2.4 Examples of criterion validity

For examples of applications of criterion validity, we turn to two studies in the psychiatric literature from the 1990s. The first study [13] provides a fairly typical example of the use of correlational techniques.

At issue was whether a self-report instrument can be used in populations of patients with schizophrenia to obtain valid ratings of depression. To examine this question, the authors compared self-report ratings obtained using the Beck Depression Inventory (BDI) with ratings of the Calgary Depression Scale (CDS), a semistructured interview designed to assess depression in schizophrenia patients. In this study, the CDS is the criterion because it makes use of informed judgements by trained clinicians, which form the current ‘gold standard’ for identifying depression in clinical populations. BDI and CDS scores were compared by calculating the Pearson product-moment correlation coefficient (e.g. see [14]), after creating scatterplots to examine the joint distribution of BDI and CDS scores as well as identifying any outliers.

The latter step was essential because the presence of even a single outlier (i.e. an extreme and atypical value) could easily distort the product-moment correlation (e.g. see [15]).
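For readers who wish to reproduce this kind of analysis, the following sketch (in Python, with illustrative variable names and simulated scores rather than the data of [13]) shows one way to plot the joint distribution, screen for extreme values and compute the Pearson product-moment correlation.

```python
# Minimal sketch of the correlational approach described above; the data,
# variable names and the 3-SD outlier rule are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# bdi and cds would hold one total score per patient; simulated data stand in here
rng = np.random.default_rng(0)
cds = rng.normal(8, 4, size=50)
bdi = 1.5 * cds + rng.normal(0, 3, size=50)

# Scatterplot to examine the joint distribution before computing r
plt.scatter(cds, bdi)
plt.xlabel("CDS total score (criterion)")
plt.ylabel("BDI total score")
plt.savefig("bdi_vs_cds.png")

# Simple outlier screen: flag points more than 3 SDs from either mean
z_bdi = np.abs(stats.zscore(bdi))
z_cds = np.abs(stats.zscore(cds))
keep = (z_bdi < 3) & (z_cds < 3)

r, p = stats.pearsonr(bdi[keep], cds[keep])
print(f"Pearson r = {r:.2f} (p = {p:.3g}, n = {keep.sum()})")
```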

Another important methodologic step employed by Addington et al. [13] was to compare correlations between the BDI and CDS in clinically distinct subgroups of patients with schizophrenia: inpatients vs. outpatients, and (within these subgroups) patients who either did or did not require assistance in completing the self-report instrument. In this particular study, the correlation between the BDI and CDS was stronger among inpatients than outpatients, regardless of whether the patients required assistance (r = 0.84 vs. r = 0.96). However, the substantially greater percentage of inpatients requiring assistance (34% of inpatients vs. 12% of the outpatients) led the authors to conclude that ‘depressed affect can be assessed in patients with schizophrenia by both self-report and structured interview, but the BDI poses difficulties with use with inpatients’ [13, p. 561].
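A comparable subgroup analysis can be sketched as follows; the DataFrame column names and grouping variables are assumptions for illustration, not those used in [13].

```python
# Hedged sketch: compute the BDI-CDS correlation separately within each
# clinically defined stratum (setting x assistance), as described above.
import pandas as pd
from scipy.stats import pearsonr

def subgroup_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson r between 'bdi' and 'cds' within each setting/assistance stratum."""
    rows = []
    for (setting, assisted), grp in df.groupby(["setting", "assisted"]):
        if len(grp) >= 3:  # require at least a few pairs for a meaningful r
            r, p = pearsonr(grp["bdi"], grp["cds"])
            rows.append({"setting": setting, "assisted": assisted,
                         "n": len(grp), "r": round(r, 2), "p": round(p, 3)})
    return pd.DataFrame(rows)
```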

For our purposes, however, the substantive findings of this study were less important than the fact that this study admirably illustrated the critical importance of selecting and describing validation samples that are clinically meaningful in the context of the measurement instrument of interest [8]. In particular, users of such instruments need to be aware that published validation studies might have used ‘samples of convenience’ (e.g. university students) that do not approximate the clinical population the user has in mind and that the results of such studies do not necessarily generalise to other samples.

Our second example of criterion validity in psychiatric research [16] also illustrates the critical importance of the validation sample. In this study, the validity of using a questionnaire (the Center for Epidemiologic Studies Depression Scale or CES-D) [17] as a case identification tool in studies of mood disorders among Native Americans was investigated. CES-D scores were compared with DSM-III-R diagnoses [18] based on a structured psychiatric interview (the Lifetime Version of the SADS [7]). The authors had concerns about the cross-cultural applicability not only of the screening instrument but also of the criterion itself (e.g. DSM-III-R diagnoses of affective disorders). For purposes of the study, however, it was assumed that DSM-III-R diagnoses would be relevant among Native Americans.

Although the CES-D, like the BDI in the above example, yields a numerical score, its proposed use as a screening instrument for depression was for the purpose of identifying not the degree of depression, but the presence of a particular clinical syndrome, namely, DSM-III-R major depression. The criterion was therefore a categorical (i.e. qualitative) rating rather than a numerical (i.e. quantitative) rating, making it inappropriate to use correlational procedures. Instead, to evaluate the validity of the instrument for case identification, the authors employed statistical methods that have been expressly developed for qualitative data, including sensitivity, specificity and receiver operating characteristic (ROC) analysis.

Table 7.2 Schematic representation of the calculation of indices of criterion validity and predictive value(a).

                          Criterion (gold standard)(b)
Test result               Present      Absent      Total
Positive                  a            b           a + b
Negative                  c            d           c + d
Total                     a + c        b + d       N

(a) a, b, c, d and N are frequencies (e.g. numbers of persons rated).
(b) The gold standard is assumed to represent the ‘true’ value and thus to be free of error.

Sensitivity and specificity are both calculated using data that have been summarised in a 2×2 table of frequencies (see Tables 7.2 and 7.3 for definitions and computational formulas). In the example at hand, a 2×2 table was used to cross-classify the numbers of screened persons with and without the criterion (e.g. a DSM-III-R diagnosis of major depression) who either did or did not score above the cut-off for depression in the screening instrument, the CES-D. (ROC analysis was used to determine the optimal cut-off value for the CES-D.) As an illustrative finding, the sensitivity for DSM-III-R major depression was 100% (i.e. all three persons in the sample with a diagnosis of major depression scored above the cut-off on the CES-D).

The corresponding value of specificity was 82% (i.e. 82% of those persons in the sample who did not have diagnoses of major depression scored below the CES-D cut-off for depression). It follows directly from the reported specificity value of 82% that 18% (100% − 82%) of the persons in the sample with no psychiatric diagnoses or with DSM-III-R diagnoses other than major depression scored above the CES-D cut-off and would have been classified as depressed by that screening instrument.
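The following sketch reproduces these two indices from a 2×2 table laid out as in Table 7.2. Only the three true cases and the total sample size of 120 are reported above, so the remaining cell counts are an approximate reconstruction consistent with the stated sensitivity and specificity, not the study's published cross-tabulation.

```python
# Illustrative calculation of sensitivity and specificity from a 2x2 table in
# the layout of Table 7.2; cell counts are an approximate reconstruction.
a, b = 3, 21   # screen positive: criterion present / absent
c, d = 0, 96   # screen negative: criterion present / absent

sensitivity = a / (a + c)   # proportion of true cases scoring above the cut-off
specificity = d / (b + d)   # proportion of non-cases scoring below the cut-off

print(f"sensitivity = {sensitivity:.0%}")   # 100%
print(f"specificity = {specificity:.0%}")   # 82%
```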

Whether or not this degree of misclassification error (or invalidity) is considered to be an unacceptably high ‘false-positive rate’ depends on the proposed use of the instrument and on the comparable ‘operating characteristics’ of alternative instruments. For example, a higher CES-D cut-off value could be expected to decrease the false-positive rate (via increased specificity), but at the expense of sensitivity. In this particular study, a higher CES-D cut-off actually increased specificity without decreasing sensitivity, but this was probably attributable to the small number of cases with DSM-III-R diagnoses of major depression.

Table 7.3 Statistical indices for evaluating qualitative data in the assessment of validity.

Term                                  Definition                                                          Formula(a)
Sensitivity(b)                        Proportion of those a test correctly identifies as having          a/(a + c)
                                      the disease (or characteristic) of interest
Specificity(b)                        Proportion of those a test correctly identifies as not having      d/(b + d)
                                      the disease (or characteristic) of interest
False negative rate                   Probability of misclassifying a true positive as a negative        1 − (a/(a + c)), or 1 − sensitivity
False positive rate                   Probability of misclassifying a true negative as a positive        1 − (d/(b + d)), or 1 − specificity
Positive predictive value (PPV)(c)    Proportion of true positives among individuals who test positive   a/(a + b)
Negative predictive value (NPV)(c)    Proportion of true negatives among individuals who test negative   d/(c + d)
Prevalence                            Proportion of true positives in the population                     (a + c)/N

(a) Refer to Table 7.2.
(b) Sensitivity and specificity of a test are theoretically independent of disease/exposure prevalence, as both are conditioned on the bottom, or ‘true’, totals of Table 7.2.
(c) In addition to depending on the sensitivity and specificity of a test, the PPV and NPV are also a function of the disease/exposure prevalence.
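The formulas in Table 7.3 translate directly into code; the helper below is an illustrative transcription (the function name and return structure are not from the text).

```python
# Transcription of the Table 7.3 formulas, applied to the 2x2 layout of Table 7.2.
def validity_indices(a: int, b: int, c: int, d: int) -> dict:
    """Compute the Table 7.3 indices from the four cell frequencies."""
    n = a + b + c + d
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,
        "false_positive_rate": 1 - specificity,
        "ppv": a / (a + b),            # positive predictive value
        "npv": d / (c + d),            # negative predictive value
        "prevalence": (a + c) / n,
    }

# Example with the reconstructed counts used earlier (a=3, b=21, c=0, d=96)
print(validity_indices(3, 21, 0, 96))
```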

In most studies there is a systematic trade-off between sensitivity and specificity, and for that reason both of these indices of criterion validity must be considered together in determining whether a particular instrument is more valid than the available alternatives. ROC analysis provides a useful framework for making such comparisons (e.g. see [19]). In the present example, the non-negligible false-positive rate was consistent with the investigators’ concerns (based on previous research by a number of researchers using other samples) that the CES-D might be reflecting symptoms of not only major depression but also increased levels of anxiety, demoralisation or even physical ill health [16].
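A minimal ROC sketch of this kind of comparison is given below; the diagnoses and scores are simulated stand-ins, and scikit-learn is used only to illustrate how sensitivity and specificity move in opposite directions across candidate cut-offs.

```python
# Hedged sketch of ROC analysis for choosing a screening cut-off; data simulated.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
diagnosis = rng.binomial(1, 0.05, size=200)            # 1 = criterion present
scores = np.where(diagnosis == 1,
                  rng.normal(30, 6, size=200),         # cases tend to score higher
                  rng.normal(12, 8, size=200)).clip(0, 60)

fpr, tpr, thresholds = roc_curve(diagnosis, scores)
print(f"AUC = {roc_auc_score(diagnosis, scores):.2f}")
for thr, sens, fp in zip(thresholds, tpr, fpr):
    print(f"cut-off {thr:5.1f}: sensitivity {sens:.2f}, specificity {1 - fp:.2f}")
```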

The study by Somervell et al. [16] also illustrates the difference between criterion validity and the related, but nevertheless distinct, concept of predictive value. Positive predictive value is literally the predictive value of a positive rating, that is, the probability of having the criterion of interest given a positive rating on the instrument under investigation. (Formulas for calculating positive predictive value, and the related index, negative predictive value, are given in Table 7.3.) Since the criterion (e.g. DSM-III-R major depression) is frequently of more direct clinical importance than the rating (e.g. a particular CES-D score), positive and negative predictive values are often more clinically meaningful than sensitivity and specificity. For example, most clinicians would probably be more interested in the usefulness of the CES-D for predicting major depression than the other way around. However, positive predictive value is a joint function of sensitivity, specificity and prevalence, such that low prevalence values can severely constrain the values of positive predictive value that can be realistically attained, even with very high sensitivity and specificity values [20, 21]. (Negative predictive value is similarly constrained by high prevalence values.)

In the study by Somervell et al. [16], the prevalence of major depression can be estimated from the rate of major depression in the sample as 3/120 = 0.025. Using a cut-off value of 16 on the CES-D, the reported specificity value of 82.1% therefore corresponds to a positive predictive value of 0.125. In other words, even though sensitivity was perfect (100%) and specificity was very high, only one of every eight persons who scored above the CES-D cut-off of 16 would be expected actually to have major depression. Even increasing the CES-D cut-off to improve specificity would not dramatically change this result. Again, this is due to the constraint imposed by the low estimated prevalence of major depression in the study population. (With the CES-D cut-off set at 28, the reported specificity value of 96.6% corresponds to a positive predictive value of 0.429.) In conclusion, this example shows that even though an instrument may have excellent criterion validity as assessed using standard indices (namely, sensitivity and specificity), the actual predictive value of the instrument could be much more limited, depending on the prevalence of the disorder of interest, which in turn may vary with the composition of the validation sample.
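These figures can be checked directly: positive predictive value follows from sensitivity, specificity and prevalence via Bayes' theorem, and the sketch below (an illustration, not the authors' own calculation) reproduces the values quoted above.

```python
# Positive predictive value as a function of sensitivity, specificity and prevalence.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

prev = 3 / 120                              # estimated prevalence, 0.025
print(round(ppv(1.0, 0.821, prev), 3))      # cut-off 16: ~0.125
print(round(ppv(1.0, 0.966, prev), 3))      # cut-off 28: ~0.43
```

As the second call shows, raising the cut-off improves specificity substantially yet moves the predictive value only from roughly one in eight to well under one in two, consistent with the prevalence constraint discussed above.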
