Unreliability is a measurement problem that can often be rectified by improving interview procedures, or by using statistical sums or averages of replicate measures. Determining the extent to which unreliability is a problem, however, can be challenging.
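As an illustration of the second remedy, the Spearman–Brown formula [22, 23] predicts the reliability of the average of k parallel replicate measures from the reliability of a single measure. The sketch below is illustrative only: the function name and the example reliability of 0.60 are ours, and the formula assumes the replicates are parallel (equally reliable, with equal error variances).

```python
# Illustrative sketch (not from the chapter): the Spearman-Brown formula
# [22, 23] for the reliability of the mean of k parallel replicates.
# Assumes parallel replicates: equal reliabilities and error variances.

def spearman_brown(r1: float, k: int) -> float:
    """Reliability of the average of k parallel replicates,
    given the reliability r1 of a single replicate."""
    return k * r1 / (1 + (k - 1) * r1)

if __name__ == "__main__":
    # A single rating with reliability 0.60 (hypothetical value):
    for k in (1, 2, 4):
        print(f"k={k}: reliability of mean = {spearman_brown(0.60, k):.3f}")
    # k=1 -> 0.600, k=2 -> 0.750, k=4 -> 0.857
```

Averaging two such replicates raises a reliability of 0.60 to 0.75, and four replicates raise it to about 0.86.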
There are various designs for estimating reliability, but virtually all have some biases and shortcomings.
Studies of sampling variability of reliability statistics [9, 39, 47] have suggested that sample sizes in pilot studies are often not adequate to give stable estimates of the reliability of key measurement procedures. It is important that reliability studies be considered critically in the search for ways to improve measurement procedures. Specifically, if the reliability of a measure appears to be very good, ask whether there are features of the reliability design that might bias the results optimistically. Were the respondents sampled in the same way in the reliability study as they will be in the field study? Was the respondent given the chance to be inconsistent, or did the replication make use of archived information? If serious biases are not found, and the reliability study produced stable estimates, then one can put the issue of reliability to rest, at least for the population at hand.
If the reliability of a measure appears to be poor, one should also look for biases in the reliability design. How similar were the replications? Could the poor reliability results be an artifact of legitimate changes over time, heterogeneous items within a scale, or artificially different measurement conditions? Was the sample size large enough to be sure that reliability is in fact bad? Be especially suspicious if you have evidence of validity of a measure that is purported to be unreliable. Rather than dismissing a measure with apparently poor reliability, ask whether it can be improved to eliminate noise.
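To make the sample size question concrete, the following sketch uses the standard Fisher z approximation (one of several approaches, not a procedure prescribed by this chapter) to compute an approximate 95% confidence interval for a test-retest reliability correlation. The observed correlation of 0.70 and the sample sizes are hypothetical.

```python
# Illustrative sketch: approximate 95% confidence interval for a
# test-retest reliability correlation, via the Fisher z transform.
# The observed r = 0.70 and the sample sizes are hypothetical.

import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for a correlation r estimated from n subjects."""
    z = math.atanh(r)               # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

if __name__ == "__main__":
    for n in (20, 50, 200):
        lo, hi = fisher_ci(0.70, n)
        print(f"n = {n:3d}: 95% CI = ({lo:.2f}, {hi:.2f})")
```

With n = 20, the interval runs from roughly 0.37 to 0.87, spanning "poor" to "excellent" reliability; only around n = 200 does it narrow (to about 0.62 to 0.77) enough to judge the measure with much confidence.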
References
[1] American Psychiatric Association (1980) Diagnostic and Statistical Manual of Mental Disorders, 3rd edn, American Psychiatric Association, Washington, DC.
[2] American Psychiatric Association (1994) Diagnostic and Statistical Manual of Mental Disorders, 4th edn, American Psychiatric Association, Washington, DC.
[3] Cochran, W.G. (1968) Errors of measurement in statistics. Technometrics, 10, 637–666.
[4] Snedecor, G.W. and Cochran, W.G. (1967) Statistical Methods, 6th edn, Iowa State University Press, Ames.
[5] Bollen, K.A. (1989) Structural Equations with Latent Variables, John Wiley & Sons, Inc., New York.
[6] Borm, G.F., Munneke, M., Lemmers, O. et al. (2007) An efficient test for the analysis of dichotomized variables when the reliability is known. Stat. Med., 26, 3498–3510.
[7] Shrout, P.E. (1998) Measurement reliability and agreement in psychiatry. Stat. Methods Med. Res., 7, 301–317.
[8] Lord, F.M. and Novick, M.R. (1968) Statistical Theories of Mental Test Scores, Addison-Wesley, Reading.
[9] Dunn, G. (1989) Design and Analysis of Reliability Studies, Oxford University Press, New York.
[10] Endicott, J., Spitzer, R.L., Fleiss, J.L. et al. (1976) The global assessment scale: a procedure for measuring overall severity of psychiatric disturbance. Arch. Gen. Psychiatry, 33, 766–771.
[11] Spitzer, R.L. (1983) Psychiatric diagnosis: are clinicians still necessary? Compr. Psychiatry, 24, 399–411.
[12] Cronbach, L.J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
[13] McDonald, R.P. (1999) Test Theory: A Unified Treatment, Erlbaum, Mahwah.
[14] Sijtsma, K. (2009) On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74(1), 107–120.
[15] Raykov, T. (1997) Scale reliability, Cronbach's coefficient alpha, and violations of essential tau-equivalence with fixed congeneric components. Multivariate Behav. Res., 32, 329–353.
[16] Kraemer, H.C., Shrout, P.E. and Rubio-Stipec, M. (2007) Developing the diagnostic and statistical manual V: what will 'statistical' mean in DSM-5. Soc. Psychiatry Psychiatr. Epidemiol., 42(4), 259–267.
[17] Cronbach, L.J., Gleser, G.C., Nanda, H. et al. (1972) The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles, John Wiley & Sons, Inc., New York.
[18] Anthony, J.C., Folstein, M., Romanoski, A.J. et al. (1985) Comparison of the lay Diagnostic Interview Schedule and a standardized psychiatric diagnosis. Arch. Gen. Psychiatry, 42, 667–675.
[19] Brennan, R.L. (2001) Generalizability Theory, Springer, New York.
[20] Cranford, J.A., Shrout, P.E., Iida, M. et al. (2006) A procedure for evaluating sensitivity to within-person change: can mood measures in diary studies detect change reliably? Pers. Soc. Psychol. Bull., 32(7), 917–929.
[21] Jannarone, R.J., Macera, C.A. and Garrison, C.Z. (1987) Evaluating interrater agreement through 'case-control' sampling. Biometrics, 43, 433–437.
[22] Spearman, C. (1910) Correlation calculated from faulty data. Br. J. Psychol., 3, 271–295.
[23] Brown, W. (1910) Some experimental results in the correlation of mental abilities. Br. J. Psychol., 3, 296–322.
[24] Kraemer, H.C. (1979) Ramifications of a population model for kappa as a coefficient of reliability. Psychometrika, 44, 461–472.
[25] Fleiss, J.L. and Shrout, P.E. (1989) Reliability considerations in planning diagnostic validity studies, in The Validity of Psychiatric Diagnoses (ed. L. Robins), Guilford Press, New York, pp. 279–291.
[26] Carey, G. and Gottesman, I.I. (1978) Reliability and validity in binary ratings: areas of common misunderstanding in diagnosis and symptom ratings. Arch. Gen. Psychiatry, 35, 1454–1459.
[27] Grove, W.M., Andreasen, N.C., McDonald-Scott, P. et al. (1981) Reliability studies of psychiatric diagnosis: theory and practice. Arch. Gen. Psychiatry, 38, 408–413.
[28] Guggenmoos-Holzmann, I. (1993) How reliable are chance-corrected measures of agreement? Stat. Med., 12, 2191–2205.
[29] Spitznagel, E.L. and Helzer, J.E. (1985) A proposed solution to the base rate problem in the kappa statistic. Arch. Gen. Psychiatry, 42, 725–728.
[30] Shrout, P.E., Spitzer, R.L. and Fleiss, J.L. (1987) Quantification of agreement in psychiatric diagnosis revisited. Arch. Gen. Psychiatry, 44, 172–177.
[31] Kraemer, H.C. (1992) Measurement of reliability for categorical data in medical research. Stat. Methods Med. Res., 1, 183–199.
[32] Cohen, J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Meas., 20, 37–46.
[33] Blackman, N.J.-M. and Koval, J.J. (2000) Interval estimation for Cohen's kappa as a measure of agreement. Stat. Med., 19, 723–741.
[34] Walter, S.D., Eliasziw, M. and Donner, A. (1998) Sample size and optimal designs for reliability studies. Stat. Med., 17, 101–110.
[35] SPSS Inc. (2009) SPSS for Windows (Version 16), SPSS Inc., Chicago.
[36] Shrout, P.E. and Fleiss, J.L. (1979) Intraclass correlations: uses in assessing rater reliability. Psychol. Bull., 86, 420–428.
[37] Fleiss, J.L. and Cohen, J. (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ. Psychol. Meas., 33, 613–619.
[38] Fleiss, J.L., Levin, B. and Paik, M.C. (2003) Statistical Methods for Rates and Proportions, 3rd edn, John Wiley & Sons, Inc., New York.
[39] Donner, A. (1998) Sample size requirements for the comparison of two or more coefficients of interobserver agreement. Stat. Med., 17, 1157–1168.
[40] Donner, A. and Eliasziw, M. (1992) A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat. Med., 11, 1511–1519.
[41] Donner, A. and Eliasziw, M. (1994) Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics, 50, 550–555.
[42] Donner, A. and Eliasziw, M. (1997) A hierarchical approach to inferences concerning interobserver agreement for multinomial data. Stat. Med., 16, 1097–1106.
[43] Donner, A., Eliasziw, M. and Klar, N. (1996) Testing the homogeneity of kappa statistics. Biometrics, 52, 176–183.
[44] Donner, A., Shoukri, M.M., Klar, N. et al. (2000) Testing the equality of two dependent kappa statistics. Stat. Med., 19, 373–387.
[45] Embretson, S.E. and Reise, S.P. (2000) Item Response Theory for Psychologists, Erlbaum, Mahwah.
[46] Gregorich, S.E. (2006) Do self-report instruments allow meaningful comparisons across diverse population groups? Med. Care, 44(11), S78–S94.
[47] Cantor, A.B. (1996) Sample-size calculations for Cohen's kappa. Psychol. Methods, 1, 150–155.
6 Moderators and mediators: towards the genetic and environmental bases of psychiatric disorders
Helena Chmura Kraemer
Department of Psychiatry, Stanford University, Stanford, CA, USA and University of Pittsburgh, Pittsburgh, PA, USA