Figure 8.1 places validity in a framework of types of validity and the threats to that framework in a positivistic universe. In this framework, validity consists of internal and external validity, and measurement validity is segmented into questions of accuracy based on content, face value, criterion, and construct. The universe of validity is threatened by extraneous factors that affect internal validity, on the left side of the illustration, and external validity, on the right side. Campbell and Stanley (1963) presented the eight factors that threaten the internal validity and the four factors that threaten the external validity of experiments, based on Campbell's earlier work, "Factors relevant to the validity of experiments in social settings" (Psychological Bulletin, 1957). All threats to internal and external validity remain applicable 50 years later, and each will be presented with examples appropriate to public administration.
However, a distinction should be made at the outset of any discussion of validity: validity is not reliability. Notwithstanding the various tests for the reliability of empirical measures (the retest method, the alternative-form method, the split-halves method, and the internal-consistency method; Carmines and Zeller, 1979), a measurement can be reliable but not valid. A measurement tool can give reliable, consistent measurements yet fail to measure exactly what one wants it to measure, and therefore fail the test for validity.
For example, suppose the state highway patrol was monitoring car speed on the interstate and, unknown to the troopers, the radar gun they were using was defective and measured car speeds only up to 68 miles per hour. If a car passed through the speed zone at 55 miles per hour, the radar gun would read 55 miles per hour; but if a car passed through the zone at 75 miles per hour, the radar gun would still read only 68 miles per hour. The state police measured speed for 24 hours and obtained readings that were reliable and consistent, but they were not valid.
Our measurement tool may thus give us consistent, reliable measurements while its validity is compromised. For a measurement tool to be valid, it must measure what it is supposed to measure: the police radar speed gun must be able to measure all speeds, not just speeds up to 68 miles per hour.
To conclude the discussion of reliability, it is important to note that reliability is secondary to validity. If the measurement tool is not valid, its reliability is of little consequence.
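The radar gun example can be made concrete with a short sketch. The 68 mile-per-hour cap and the list of true speeds below are the invented numbers from the illustration above; the point is that a capped instrument repeats the same reading every time (reliable) while misreporting any speed above the cap (invalid).

```python
# Hypothetical illustration of a reliable-but-invalid measurement: a defective
# radar gun that caps its readings at 68 mph (cap and speeds are invented).

CAP_MPH = 68  # assumed defect: the gun cannot read above 68 mph

def defective_radar(true_speed_mph):
    """Return the gun's reading: accurate up to the cap, capped beyond it."""
    return min(true_speed_mph, CAP_MPH)

true_speeds = [55, 63, 68, 75, 82]

# Reliability: measuring the same car twice yields the identical reading.
reliable = all(defective_radar(s) == defective_radar(s) for s in true_speeds)

# Validity: the reading equals the true speed only at or below the cap.
valid = all(defective_radar(s) == s for s in true_speeds)

print(reliable)  # True: the readings are perfectly consistent
print(valid)     # False: 75 and 82 mph both read as 68
```

The sketch shows why the two properties are independent: nothing about the gun's consistency repairs its systematic error above 68 miles per hour.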
Campbell and Stanley (1963) describe two types of validity: internal and external validity.
At the beginning of Campbell and Stanley's discussion of validity it is clear that "internal validity is the sine qua non" (1963, p. 5), the essential validity, the essence of the experiment. Internal validity is examined when the researcher asks: Did the independent variable cause the expected corresponding change in the dependent variable? An example of internal validity using fire stations would be the answer to the question: Did an increase in fire stations cause a decrease in multiple-alarm fires in the new district? Or, did an increase in police on beat patrol cause a concomitant decrease in crime?
In contrast to internal validity, which is specific to the experiment, external validity asks the question of generalizability: to what extent, to which groups, settings, and subjects, and under what conditions can the findings of an experiment be generalized?

FIGURE 8.1 The validity framework and concomitant threats. (The figure depicts validity within a positivistic universe, divided into internal and external validity. Measurement validity comprises content, construct, criterion, and face validity, with criterion validity subdivided into concurrent and predictive. The eight threats to internal validity, on the left side, are history, maturation, testing, instrumentation, statistical regression, experimental mortality, selection-maturation interaction, and biases. The four threats to external validity, on the right side, are effects of testing, effects of selection, effects of experimental arrangements, and multiple treatment interference.)
Campbell and Stanley (1963) explain external validity by comparing it to inductive inference, in that it is never completely answerable (p. 5).
In the example of reading scores, an experimental finding may be as follows: students in New York City public high schools with an enrollment in excess of 3000 have lower reading scores than students in public high schools in the same city with enrollment below 3000. However, this experiment may not be replicable in Newark, Chicago, or Los Angeles. In short, although high enrollment in New York City public schools may cause lower reading scores, it might not have the same effect in another area of the country.
Furthermore, external validity does not rule out the possibility that although more police on beat patrol may reduce crime under one set of circumstances, less crime may reduce the number of police on beat patrol in another set of circumstances, a case where the independent variable in experiment A becomes the dependent variable in experiment B.
The important question that persists when one examines experimental validity is: Does the measurement tool measure what it is supposed to measure? This question is predicated on matters of precision and accuracy, and the accuracy of the measurement tool involves several types of validity questions.
Face validity: Face validity is the simplest type of validity. It answers the question: Does the measurement tool appear to measure what we want it to measure? For example, if we wanted to measure the effectiveness of the customer service department at the Internal Revenue Service, we would not measure the eating habits, secretarial skills, or number of graduates of accredited graduate schools of accounting in the customer service department, because "on the face of it" these items tell us little, if anything at all, about customer reaction to customer service.
Face validity, being a simple measure of validity, is also the most innocuous measure of validity. Face validity alone is not sufficient to meet accuracy tests of validity.
Content validity: Content validity asks the question: Is the measurement being taken a subset of a larger group of measurements that represent the focus of the study? Although similar to face validity, it is a more sophisticated test for validity. An example of content validity can be shown in our study of the Internal Revenue Service's customer service department.
In this study we want to determine whether the customer service representative was accommodating to the taxpayer. If a survey instrument were used to determine customer satisfaction, the survey could ask one question: Were you satisfied with your contact with the Internal Revenue Service's customer service department? Though this question may be adequate in some cases, it might attract many negative responses because the customers' needs might not be totally satisfied; conversely, affirmative answers might not give you the information you need to make changes in customer service. A better approach would be to measure responses to questions that come from a subset of customer satisfaction. For example, the IRS might ask whether the customer service representative:
• Answered the phone within a certain length of time after your connection to the department
• Identified themselves to you
• Inquired about your problem
• Gave you a satisfactory answer
• Said they would get back to you if they did not know the answer
• Returned with an answer in a timely manner
These are typical questions that would meet the test of content validity for customer service.
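A minimal sketch of how such an item subset might be scored, assuming each yes/no answer maps to 1/0 and the items jointly sample the customer-satisfaction domain. The item names and the single caller's responses are invented for illustration.

```python
# Hypothetical scoring of a content-valid item subset for customer service.
# Item names and responses are invented; 1 = yes, 0 = no.

ITEMS = [
    "answered_within_time_limit",
    "identified_themselves",
    "asked_about_problem",
    "gave_satisfactory_answer",
    "promised_follow_up_if_unsure",
    "returned_answer_promptly",
]

def satisfaction_score(responses):
    """Fraction of content-domain items answered 'yes' (1.0 = fully satisfied)."""
    return sum(responses.get(item, 0) for item in ITEMS) / len(ITEMS)

# One caller's responses: everything went well except the follow-up call.
responses = {item: 1 for item in ITEMS}
responses["returned_answer_promptly"] = 0

print(round(satisfaction_score(responses), 2))  # 0.83: 5 of 6 items satisfied
```

Scoring several items from the domain, rather than one global satisfaction question, is exactly what gives the instrument its content validity: each item pinpoints which part of the service succeeded or failed.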
In the same example, a question that would not meet the criteria of content validity would be: Did you submit your income tax return in a timely fashion? Not only does this question fail the content validity criteria for customer service, but if used in a survey of the IRS's customer service department, it may elicit negative responses to the relevant questions.
Criterion validity: There are two types of criterion validity, concurrent and predictive. Concurrent validity questions the validity of a subset of questions drawn from an instrument already verified by content validity. This subset may be created to save time during the actual questioning in a survey. Consider, for example, a survey given to motorists at a bridge toll booth. The motorist can bring the survey home and return it on the next pass over the bridge. However, the decision makers would like a faster and more immediate response to the survey instrument. They decide that the bridge police will set up a safe area past the toll booths and before the entrance to the interstate. Police will assist surveyors in detaining random cars so that motorists can be asked the survey questions. Any anxiety caused by the police detaining the motorist is immediately relieved when the motorist finds that he is only being detained to answer a few questions. To secure their cooperation, the motorists are told that their names will be entered in a raffle for a free dinner-for-two at a local restaurant. Before this plan can be initiated, the survey planners realize that the motorists cannot be detained to answer the current questionnaire: it would delay traffic, slow down the process, limit the number of motorists who could be questioned, and possibly incur the wrath of the detained motorists. The survey planners decide to create a significantly shorter survey instrument from the original questionnaire that will meet face and content validity questions and give them the information they need to meet the criteria of the survey.
Predictive validity: This validity asks the question: Does the test being administered have some predictive relationship to a future event that can be related back to the test? In the fire station experiment of determining alarm response time to newly developed areas of a township, we can determine that fire stations within a certain radius of housing developments decrease response time to alarms, whereas fire stations outside this radius increase response time. In this instance, the fire station experiment has predictive validity if we use its results as a predictor of future fire station placement in the community. The future placement of fire stations relates the result of the experiment back to the test, and the test can be related to the placement of the fire stations.
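The planning use of the experiment can be sketched as a simple prediction rule. The radius, base response time, and per-mile penalty below are invented numbers standing in for whatever the experiment actually found; the point is that the experimental finding becomes a predictor applied to proposed station sites.

```python
# Hypothetical sketch of predictive validity: the fire station experiment's
# finding (fast responses inside some radius) used to predict response times
# for future placements. All numbers are invented for illustration.

RADIUS_MILES = 2.5      # assumed radius identified by the experiment
BASE_MINUTES = 4.0      # assumed response time inside the radius
PENALTY_PER_MILE = 2.0  # assumed extra minutes per mile beyond the radius

def predicted_response_minutes(distance_miles):
    """Predict alarm response time from the station-to-development distance."""
    if distance_miles <= RADIUS_MILES:
        return BASE_MINUTES
    return BASE_MINUTES + PENALTY_PER_MILE * (distance_miles - RADIUS_MILES)

# Planning use: a proposed site is acceptable only if every new development
# falls inside the radius and therefore gets the base response time.
developments = {"Oak Hill": 1.8, "River Bend": 2.4, "Far Meadow": 4.0}
acceptable = all(d <= RADIUS_MILES for d in developments.values())
print(acceptable)  # False: Far Meadow lies outside the 2.5-mile radius
```

The test has predictive validity to the extent that response times observed after the station is built match what this rule predicted; a mismatch would send the planners back to the original experiment.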
Construct validity: This validity relates back to the general theories being tested; aptitude tests should relate to general theories of aptitude, intelligence tests to general theories of intelligence, and so on. For example, in the bridge repair experiment, the county engineers realize that certain heavy equipment must be operated by mechanics hired by the county. They want to give aptitude tests to potential hires to reduce their liability during construction. The assumption is that the engineers, or those creating the aptitude test for using heavy equipment, understand what constitutes aptitude for using heavy equipment during bridge construction. The test to measure aptitude, the construct being validated, must relate back to general theories of aptitude, measuring the individual's capacity to operate heavy equipment, not general theories of heavy equipment.