Reporting Ethical Violations - for Cross-Cultural

If an apparent ethical violation has substantially harmed or is likely to substantially harm a person or organization and is not appropri

ate for informal resolution ... or is not resolved properly in that fash

ion, psychologists take further action appropriate to the situation.

Such action might include referral to state or national committees on professional ethics, to state licensing boards, or to the appropri

ate institutional authorities. This standard does not apply when an intervention would violate confidentiality rights or when psycholo

gists have been retained to review the work of another psychologist whose professional conduct is in question.

Example. The adjudication of alleged ethical violations re

quires the accused person to belong to a professional association that (a) has an established code of ethics that details standards rele

vant to the alleged violation and (b) provides an operational system of review and enforcement. Upon finding an unethical behavior, possible outcomes may range from letters of warning to removal of one's license to practice and expulsion from membership in the as

sociation. Civil legal action also may be initiated. All groups (Table 3.1) may be affected by this standard.

A FINAL NOTE

This volume discusses various issues important to adapting tests and their use. This chapter reviewed six ethical principles and 25 stan

dards from APA'S Ethical Principles of Psychology and Code of Con

duct (2002) in light of various practices that may be associated with adapting tests and their use.

Ethical issues form one of the important cornerstones of profes

sional practice, including that associated with adapting tests and using them. Thus, further discussion of ethical issues associated with the important work in this area seems warranted.

Further discussions at conferences, in journals, and in other fo

rums may shed light on a need to further promote knowledge and application of ethical issues that impact test development and use, including that associated with adapted tests. In addition, discussions should consider a need to develop an ethics code that transcends na

tional boundaries and addresses broad and prevailing issues impor

tant to testing practices. Research that identifies important and prevailing behaviors that either may pose ethical dilemmas or clearly violate ethical behavior is warranted before developing a behavior- based ethics code.

The profession is fortunate to have various forms of scholarship that discuss ethics. These include the following: American Psycho

logical Association. (2002), British Psychological Society (1998a, 1998b), British Psychological Society Steering Committee on Test Standards (1999), Canadian Psychological Association (1987), Eyde et al. (1988), International Test Commission (2000), Joint Commit

tee on Testing Practices (1988), Kendall et al. (1997), Koene, (1997), Koocher and Keith-Spiegel (1998), Lindsay (1996), and the National Council on Measurement in Education (1995). Further efforts to ex

amine ethics issues can use these and other existing resources as a springboard for this work.

AUTHOR'S NOTES

Thomas Oakland is University of Florida Research Foundation Pro

fessor, past president of the International Test Commission, past president of the International School Psychology Associations, and President of the International Foundation for Children's Education.

Appreciation is expressed to Drs. Ronald Hambleton and Fons van de Vijver for their constructive comments on an earlier draft of this manuscript.

REFERENCES

American Educational Research Association, American Psychological Associa

tion, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington DC: American Edu

cational Research Association.

American Psychological Association. (1994). Publication manual of the Ameri

can Psychological Association: Fourth Edition. Washington, DC: Author American Psychological Association. (2002). Ethical principles of psychology

and code of conduct. American psychologist, 57, 1060-1073.

Anastasi, N., & Urbina, S. (1997).Psychological Testing: Seventh Edition. Upper Saddle River, NJ: Prentice Hall.

Bartram, D., & Coyne, I. (1998a). The ITC/EFPPA survey of testing and test use in countries world-wide. Technical report for the ITC Council.

Bartram, D., & Coyne, I. (1998b). The ITC/EFPPA survey of testing and test use in countries within Europe. Technical report for the EFPPA Task Force.

British Psychological Society. (1998a). Certificate and register of competence in occupational testing general information pack (Level A). Leicester, Eng

land: Author.

British Psychological Society (1998b). Code of conduct, ethical principles and guidelines. Leicester, England: Author.

British Psychological Society. (1999). Certificate and register of competence in occupational testing general information pack (Level B). Leicester, Eng

land: Author.

British Psychological Society Steering Committee on Test Standards. (1999).

Checklist of competencies in educational testing: Foundation level.

Leicester, England: Author

Byrne, B. (1998). Structural equation modeling with Lisrel, Prelis, and Simplis. Mahwah, NJ: Lawrence Erlbaum Associates.

Canadian Psychological Association. (1987). Guidelines for educational and psychological testing. Ottawa: Author.

Cummins, J. (1984). Bilingualism and special education: Issues in assessment and pedagogy. Clevedon, Avon, England: Multilingual Matters Ltd.

Embretson, S., & Hershberger, S. (1999). The new rules of measurement.

Mahwah, NJ: Lawrence Erlbaum Associates.

Eyde, L. D., Moreland, K. L., Robertson, G. J., Primoff, E. S., & Most, R. B.

(1988). Test user qualifications: A data-based approach to promoting good test use. Issues in Scientific Psychology. Washington, DC: American

Psychological Association.

Fitzgerald, C., & Ward, P. (1998). Computer-based testing: A Global perspective.

An address to the meeting of the International Congress of Applied Psychol

ogy, San Francisco, CA.

Haladyna, T. (1999). Developing and validating multiple-choice test items.

Mahwah, NJ: Lawrence Erlbaum Associates.

Hambleton, R. K. (1994). Guidelines for adapting educational and psychologi

cal tests: A progress report. European Journal of Psychological Assessment, 10, 229-244.

Hambleton, R. (2001). The next generation of ITC test translation and adaptation guidelines. European Journal of Psychological Assessment, 17(5), 164-172.

Herrnstein, D. J., & Murray, C. (1994). The Bell curve. NewYork: The Free Press.

Hu, S., & Oakland, T. (1991). Global and regional perspectives on testing chil

dren and youth: An international survey. InternationalJournal of Psychol

ogy, 26(3), 329-344.

International Test Commission (2000). International guidelines for test-use.

Punta Gorda, Florida: Author

Jensen, A. (1980). Bias in mental testing. New York: The Free Press.

Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington DC: American Psychological Association.

Kendall, I., Jenkinson, J., De Lemos, M., & Clancy, D. (1997). Supplement to guidelines for the use of psychological tests. Melbourne: Australian Psycho

logical Society.

Koene, C. J. (1997). Tests and professional ethics and values in European psy

chologists. European journal of Psychological Assessment, 13, 219-228.

Koocher, G. P., & Keith-Spiegel, P. (1998). Ethics in psychology (2nd ed.). New York: Oxford University Press.

Lindsay, G. (1996). Psychology as an ethical discipline and profession. Euro

pean Psychologist, 1, 79-88.

Loehlin, J. (1998). Latent variable models. Mahwah, NJ: Lawrence Erlbaum As

sociates.

Mays, V, Rubin, J., Saboruin, M., & Walker, L. (1996). Moving toward a global psychology. American Psychologist, 51, 485-487.

McDonald, R. (1999). Test theory. Mahwah, NJ: Lawrence Erlbaum Associates.

Mercer, J. (1973). Labeling the mentally retarded. Los Angeles: University of California Press.

Muniz, J., Prieto, G., Almeida, L., & Bartram, D. (1999). Test use in Spain, Portu

gal, and Latin American countries. European Journal of Psychological As

sessment, 15, 151-157.

National Council on Measurement in Education. (1995). Code of professional responsibilities in educational measurement. Washington, DC: Author.

Oakland, T. (Ed.). (1977).Psychological and educational assessment of minor

ity children. Larchmont, NY: Brunner-Mazel.

Oakland, T., Bernal, E., Holley, F., Natalicio, D., Leas, R., & Richard, L. (1980). As

sessing students with limited English speaking abilities. Texas Outlook, 64, 32-33.

Oakland, T., & Hambleton, R. (Eds.). (1995). International perspectives on as

sessment of academic achievement. Norwell, MA: Kluwer Academic.

Oakland, T., & Hu, S. (1991). Professionals who administer tests with children and youth: An international survey. Journal of Psychoeducational Assess

ment, 9(2), 108-120.

92 OAKLAND Oakland, T., & Hu, S. (1992). The top ten tests used with children and youth

worldwide. Bulletin of the International Test Commission, 99-120.

Oakland, T., & Hu, S. (1993). International perspectives on tests used with chil

dren and youth. Journal of School Psychology, 31, 501-517.

Reynolds, C., & Brown, R. T. (1984). Perspectives on bias in mental testing.

New York: Plenum.

Rosenzweig, M. (1999). Continuity and change in the development of psychol

ogy around the world. American Psychologist, 54, 252-259.

Sattler, J. (1988). Assessment of children: Cognitive applications. San Diego:

Author.

Schumacker, R., & Marcoulides, G. (1998). Interaction and nonlinear effects in structural equation modeling. Mahwah, NJ: Lawrence Erlbaum Associates.

Standards for educational and psychological testing. (1999). Washington DC:

American Educational Research Association.

Test security: Protecting the integrity of tests. [Editorial]. (1999). American Psy

chologist, 54, 1078.

Zhang, H. (1988). Psychological measurement in China. International Journal of Psychology, 23, 101-117.

4 Statistical Methods for Identifying Flaws

in the Test Adaptation Process

Stephen G. Sireci

University of Massachusetts at Amherst

Liane Patsula Educational Testing Service

Ronald K. Hambleton

University of Massachusetts at Amherst

In conductingcross-cultural research, it is imperative that research

ers examine the assessment instruments theyare using for anyprob

lematic translations/adaptations.¹ In particular, when there is interest in comparing test results obtained from different cultures, researchers should investigate the assessment instruments they are using for construct, method, and item bias (see, e.g., van de Vijver &

In cross-lingual assessment, the term adaptation is considered preferable to translation because the former term does not implya literal word-to-word translation. The test adaptation process is typically more flexible, allowing for more complex word substitutions so that the in

tended meaning is retained across languages, even though the translation is not completely lit

eral (Geisinger, 1994). In this chapter, we use the two terms interchangeably, because many readers new to this area may be unfamiliar with what we mean by adaptation.

Leung, 1997; van de Vijver & Tanzer, 1997). For if such biases exist and are not identified, comparative inferences across cultures will not be valid. For this reason, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Mea

surement in Education, 1999) and the Guidelines for Adapting Edu

cational and Psychological Tests (see chap. 1, this volume) require cross-cultural researchers to provide evidence of the comparability of different language versions of an assessment when scores from the different versions are intended to be comparable.

Although there are both judgmental and statistical strategies to identify and address bias, this chapter focuses on statistical tech

niques for addressing construct, method, and item biases that may be present in cross-lingual assessments. This chapter, which ex

pands on many of the ideas presented in chapter 2 by van de Vijver and Poortinga, is divided into three sections. The first section in

cludes descriptions of statistical techniques to assess construct equivalence. The second section presents strategies to assess and ad

dress method bias. In the third section, classical and modern meth

ods for identifying item bias are listed and discussed. Table 4.1 presents an overview of the methods and examples described in the chapter. This table follows the classification scheme provided in van de Vijver and Tanzer (1997), who partitioned common sources of bias in cross-cultural assessment into these three categories.

STATISTICAL TECHNIQUES

TO ASSESS CONSTRUCT EQUIVALENCE

Cross-cultural researchers can use statistical techniques both before and after field-testing to evaluate the construct equivalence of their assessment instruments. Given that there are no test or item score data before field-testing, researchers are limited in the amount of in

formation they can gather. Therefore, the majority of research on the construct equivalence of translated assessment instruments has been conducted on field-test or operational data.

Before Field-Testing

In the absence of item response data, the construct equivalence of dif

ferent language versions of an assessment can be examined by gather

ing data from subject matter experts representing the different languages and cultures of interest. Similar to content validity studies conducted on educational tests, items on an assessment can be rated

TABLE 4.1

Source of Bias in Test Adaptations and Some Seminal References

Source of Bias Description References

Construct Bias Construct not relevant in all Hambleton (1993, 1994) cultures (conceptual Hui & Triandis (1985) equivalence); construct is not Geisinger (1994)

operationally defined Reise, Widaman,& Pugh (1993) consistently across cultures; Sired, Bastari, &Allalouf (1998) measurement of the construct van de Vijver & Tanzer (1997) is not consistent across

cultures

Method Bias Biases in test administration van de Vijver & Tanzer (1997) conditions; unfamiliarity of

item formats in one or more cultures; differential response styles (e.g., social desirability);

incomparability of samples (selection bias); interviewer effects (e.g. halo effects)

Item Bias Faulty translation; differential Allalouf, Hambleton, & Sireci (1999) relevance of items across Angoff & Cook (1988)

cultures; nuisance factors Budgell, Raju, & Quartetti (1995) Ellis (1995)

Gierl, Rogers, & Klinger (1999) Hulin, Drasgow & Komocar (1982) Sireci & Berberoglu (2000) Swaminathan & Rogers (1990)

according to one or more criteria to shed light on the construct mea

sured. An innovative example of the use of subject matter experts to evaluate construct equivalence is the study conducted by Hui and Triandis (1985). In this study, small samples of judges from different cultures evaluated the "similarity in meaning of pairs of items to be used in an assessment" (p. 141). The authors used individual differ

ences multidimensional scaling to discover the characteristics the judges used in making their similarity ratings. If the characteristics used to rate item similarity are consistent across judges, preliminary evidence of construct equivalence is provided. If judges from differ

ent cultures use different characteristics, this information can be used to modify one or both versions of the assessment.

The Hui and Triandis (1985) study illustrates one means of gather

ing cross-cultural data on an assessment before it is administered. Al

though there is little research in this area, other designs using content experts from different language backgrounds, or bilingual content experts, are also possible. For example, bilingual experts could be asked to rate the similarity in difficulty of items from an achievement test. This procedure may identify items that would later be flagged for bias if studies of differential item functioning (DIF) were conducted after the test was administered.

After Field-Testing

After field-testing, when examinee response data are available, there exist at least four statistical approaches for assessing construct equivalence across assessment instruments: exploratory factor anal

ysis, confirmatory factor analysis, multidimensional scaling, and comparison of nomological networks. In this section, we briefly de

scribe each approach.

Exploratory Factor Analysis. Exploratory factor analysis is one of the oldest and most popular methods for evaluating whether different language versions of a test measure the same construct. In fact, van de Vijver and Poortinga (1991) and Poortinga (1991) de

scribed factor analysis as the most frequently used statistical tech

nique to assess whether a construct in one culture is found in the same form and frequency in another culture (e.g., Butcher & Garcia,

1978). The exploratory factor analysis approach involves factor ana

lyzing item or test score data separately for each cultural group. The resulting factor loading matrices are then inspected visually for con

sistency across groups. Although this approach is intuitively appeal

ing, comparing separate factor structures is difficult, and there are no commonly agreed upon rules for deciding when the structures can be considered equivalent. Therefore, statistical approaches, es

pecially those that can simultaneously accommodate multiple groups, are preferable. Confirmatory factor analysis and weighted multidimensional scaling are two such approaches.

Confirmatory Factor Analysis. In confirmatory factor analy

sis (CFA) the structure of an assessment is hypothesized a priori and examinee data are used to evaluate the viability of the hypothesized structure (see, e.g., Byrne, 1998, 2001, 2003). The hypothesized structure is incorporated into a structural equation model and is constrained to be equal across all groups. A typical construct equiva

lence hypothesis tested using CFA is whether the factor loading ma

trix is equivalent across all groups. The structure of the factor

loading matrix is usually an "independent clusters structure" (Mc

Donald, 1985), which specifies that: (a) each measured variable has a nonzero loading on only the factor it was designated to measure, (b) correlations among the factors (i.e., lower diagonal of the phi ma

trix) are freely estimated, and (c) the errors associated with the factor loadings (i.e., theta delta matrix) are uncorrelated (Marsh, 1994).

In the cross-lingual assessment arena, researchers have used CFA to evaluate whether the factor structure of an original version of an assessment is consistent across subsequent versions translated into another language (e.g., Brown & Marcoulides, 1996; Reise, Widaman, & Pugh, 1993; Robie & Ryan, 1996; Sireci, Bastari, &

Allalouf, 1998). CFA is an attractive option for evaluating construct equivalence across adapted instruments because it can handle multi

ple groups simultaneously, statistical tests of model fit are available, and descriptive indices of model fit are provided. When an assess

ment comprises items that are scored dichotomously, CFA may be problematic because the underlying models are linear and the rela

tionships among dichotomous items are often nonlinear (McDon

ald, 1982). However, this limitation may be overcome by grouping items together into parcels before the analysis. An example of using CFA to evaluate the construct equivalence of different language versions of a test is provided in a subsequent section.

Multidimensional Scaling. Multidimensional scaling (MDS) is another attractive approach for evaluating construct equivalence across different language versions of an assessment. Like exploratory factor analysis, MDS analysis does not require specifying test struc

ture a priori. However, like CFA, the data from multiple groups can be analyzed simultaneously. Using an individual differences MDS analysis, such as the INDSCAL model (Carroll & Chang, 1970), a common structure can be fit simultaneously to all groups, and then structural differences across groups can be assessed by looking at the

"group (subject) weights," which are used to modify the common structure to best fit the data for each group. MDS provides a means for discovering the dimensionality underlying examinees' response data, and for evaluating whether this dimensionality is consistent across all groups (or versions of the test) of interest. Another attrac

tive feature of MDS is that it does not require a linear model to derive the structure underlying the data.

Example of CFA and MDS Analysis of Construct Equivalence.

Sireci, Bastari, and Allalouf (1998) used CFA and MDS to evaluate the construct equivalence of items from the verbal reasoning section

of the Psychometric Entrance Test (PET), which is a test used by col

leges and universities in Israel for making admissions decisions (Beller, 1994; see also chap. 12, this volume). Figure 4.1 presents a two-dimensional representation of the items derived using MDS.

The items tended to be grouped together in the MDS space accord

ing to their content specifications (analogies, logic, reading compre

hension, sentence completion). Of more interest is Fig. 4.2, which presents the group weights on these same dimensions. Data from two groups of examinees who took the Hebrew versions of the test and two groups who took the Russian version (sample sizes were about 1,300 for each group) were used in the analysis. As can be seen in Fig. 4.2, the group weights were very similar, which suggests simi

larity of structure (construct equivalence) across groups.

Sireci, Bastari, and Allalouf (1998) also used CFA to evaluate the construct equivalence of these two different versions of this test. Fol

lowing the content specifications, they fit a four-factor model to the data for both groups. Four different CFA models were fit to the data.

The first model constrained the four underlying factors to be com-

FIG. 4.1. MDS configuration of PET verbal items. AL = analogy, LO = logic, RC = reading comprehension, SC = sentence completion.

FIG. 4.2. Group weights for PET data. Note: H = Hebrew, R = Russian.

mon across the Hebrew and Russian groups, the second model con

strained the factor loading matrix to be the same across groups, the third model constrained the errors associated with these factor load

ings to be the same, and the fourth specified the correlations among the factors to be equivalent. The results of their analysis are summa

rized in Table 4.2. In all four models, the goodness of fit indices were high (.96 or above) and the root mean square residuals were low (.076 or below). Although these findings were consistent with the MDS analyses, using data from another assessment, they illustrated this is not always the case. They recommended using both MDS and CFA to evaluate the construct equivalence of different language versions of an assessment.

Comparison of Nomological Networks. Construct equiva

lence is a very general term that states the same psychological con

struct is measured across all studied groups, and it is measured with equal fidelity in all groups. Exploratory factor analysis, CFA, and MDS can provide important evidence regarding the consistency of test structure across different language versions of an assessment. How

ever, equivalent structure does not necessarily imply equivalent con

structs. Structural equivalence is a necessary, but insufficient condition for construct equivalence. Therefore, many researchers

Dalam dokumen for Cross-Cultural (Halaman 97-200)