3.1 A Little History †
3.1.4 Characteristics of Aviation HF Research
3.1.3.4 Subjective Measures
Subjective measures (whatever one may think about their validity and reliability) have always been and still are integral parts of aviation HF measurement. As mentioned previously, during World War I, ace fighter pilots like Guynemer and von Richthofen were employed to evaluate the handling qualities of prototype aircraft. Ever since the first aviation school was established, expert pilots have been used not only to train, but also to evaluate the performance of their students. Even with the availability of sophisticated, computerized instrumentation in the test aircraft, the pilot is routinely asked to evaluate handling qualities. Automated performance measurement methods, although highly desirable, cannot entirely replace subjective techniques (Vreuls & Obermayer, 1985).
Muckler (1977) pointed out that all measurement is subjective at some point in test development; the objective/subjective distinction is a false issue. Therefore, the problem is to find ways to enhance the adequacy of the subjective techniques. There is a need for more research to develop more adequate methods and to train and calibrate expert observers.
The subjective techniques described in the research literature include interviews, questionnaire surveys, ratings and rankings, categorization, and communications analyses. Subjective data, particularly ratings, are characteristically used to indicate pilot preference, performance evaluations, task difficulty, estimates of distance traveled or velocity, and, in particular, workload, which is one of the “hot” topics in aviation HF research.
Owing to the variability in these subjective techniques, efforts have been made to systematize them quantitatively in scales of various sorts (for a discussion of scales, see Meister, 1985 or Gawron, 2000).
The Likert 5-point scale (e.g., none, some, much, very much, all) is a very common scale that can be created in moments, even by someone who is not a psychometrician. However, the validity of such self-created scales may be suspect. Development of valid and reliable scales requires prior research on the dimensions of the scale, and empirical testing and analysis of the test results. Most complex phenomena cannot be scaled solely on the basis of a single dimension, because most behavior of any complexity is multidimensional. The interest in measurement of workload, for example, has created a number of multidimensional scales: SWAT, which has been used extensively in simulated and actual flight (see American Institute of Aeronautics and Astronautics, 1992, pp. 86–87), has three scalar dimensions: time load, mental effort load, and psychological stress. The scales, either individually or as a part of questionnaire surveys, have probably been used more frequently as a subjective measurement device than any other technique, as it is difficult to quantify interviews, except as part of formal surveys, in which case they turn into rating/ranking scales.
3. Measures employed, such as tracking error, reaction time
4. Measurement venue, such as laboratory, simulator, operational flight
5. Type of subject, such as pilot, air-traffic controllers, nonflying personnel
6. Methodology, such as experiment, questionnaire, observation, incident reports
7. Statistical analysis employed, such as analysis of significance of differences, correlation, factor analysis, etc.
Owing to space constraints and their large number, the taxonomic categories employed are not all listed here. The categories were developed on the basis of the individual papers themselves.
The numbers of categories are: general topic (47); specific topic (71); measures (44); measurement venue (8); subject type (12); methodology (16); and statistical analysis (16). The categories were not mutually exclusive: every category that could describe a particular paper was counted. For example, if a paper dealt with instrument scanning and, in the process, described the visual factors involved in the scanning, both categories were counted. Thus, the categories overlapped, but the procedure produced a more detailed picture of measurement than would otherwise be the case. Only those categories that described 5% or more of the total number of papers are listed in the following tables. As the number of these categories is small compared with the total number of categories reported, it is apparent that although aviation HF measurement is extensive in its subjects and its tools, it is not very intensive, except in relatively few areas. These presumably are the areas that most excite the funding agencies and individual researchers.
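The counting procedure just described (every applicable category is counted for each paper, and only categories reaching 5% of the papers are reported) can be illustrated with a minimal Python sketch. The function name `tally_categories`, the `threshold` parameter, and the four coded papers are hypothetical and are not taken from the review; the sketch only assumes that each paper has been coded as a list of category labels.

```python
from collections import Counter


def tally_categories(papers, threshold=0.05):
    """Count every category assigned to each paper and keep only the
    categories that describe at least `threshold` of all papers.

    `papers` is a list in which each element is the list of category
    labels assigned to one paper (categories are not mutually exclusive).
    Returns {category: (paper_count, share_of_papers)}.
    """
    total = len(papers)
    counts = Counter()
    for categories in papers:
        # A set ensures a category is counted at most once per paper.
        counts.update(set(categories))
    return {
        category: (n, n / total)
        for category, n in counts.most_common()
        if n / total >= threshold
    }


# Hypothetical coding of four papers (illustrative only, not review data).
papers = [
    ["instrument scanning", "visual factors"],
    ["workload"],
    ["visual factors", "training"],
    ["workload", "visual factors"],
]

# A 50% cutoff is used here only because the toy sample is so small.
for category, (n, share) in tally_categories(papers, threshold=0.5).items():
    print(f"{category}: {n} papers ({share:.0%})")
```

Because a paper can carry several labels, the category counts sum to more than the number of papers, which is why the percentages in the tables that follow need not add to 100%.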
An analysis was performed to ensure that the two data sources (HFES and OSU) were not so different that they could not be combined. Roughly the same data patterns could be discerned (broad but not intensive), although there were some differences of note. For example, the OSU sample dealt much more with flight-related topics than HFES (OSU 72%, HFES 35%). Such differences could be expected, because the two sources were drawn from different venues (e.g., OSU is international, HFES almost exclusively American; OSU preselects its topic areas, HFES does not). Therefore, the differences were not considered sufficient to make combination impossible.
Of the 47 categories under “general topic,” 13 met the 5% criterion. These are listed in Table 3.3, which indicates that most of the research was basic; that is, it dealt with general principles rather than specific applications. Applied research (see Table 3.3) made up only 11% of the total. Basic and applied research together accounted for 91% of the papers. The fact that the figures do not add to 100% simply indicates that a small number of papers, although dealing with measurement, did not involve empirical research. A second point is that only half the papers presented dealt directly with flight-related topics; the others involved activities incident to or supportive of flight, but not flight itself. For example, 10% of the papers dealt with ATC, which is of course necessary for aviation, but which has its own problems.
TABLE 3.3 General Topic Categories
1. Military or commercial flight 50% 113 papers
2. Design 10% 23 papers
3. Workload/stress 8% 17 papers
4. Air-traffic control 10% 23 papers
5. Training 14% 32 papers
6. Automation 8% 18 papers
7. Basic research 80% 189 papers
8. Instrument scanning 7% 16 papers
9. Visual factors 9% 20 papers
10. Evaluation 6% 13 papers
11. Accidents 6% 14 papers
12. Applied research 11% 25 papers
13. Pilot personality 5% 12 papers
Table 3.4 lists the 16 specific topics that were most descriptive of the papers reviewed. As one can see, only 16 categories out of the 71 met the 5% criterion. Although the table reveals a wide assortment of research interests, only three, namely, display design/differences, perceptual cues (related to display design), and workload, are described in a relatively large number of papers.
Table 3.5 describes the measures employed by researchers. Of the 44 measures found, only 10 satisfied the 5% criterion. Of course, many studies included more than one type of measure. Error and time are clearly the most common measures. Frequency and percentage were the most common statistical treatment of these measures. The relatively large number of ratings of, for example, attributes, performance, preferences, similarity, difficulty, and so on, attests to the importance of subjective measures, particularly when these are used in a workload measurement context (e.g., SWAT, TLX).
Table 3.6 describes where the measurements took place. Of the nine categories, five met the 5% criterion. The venues differ in how closely they approximate flight: a laboratory does not simulate any of the characteristics of flight, whereas a full-scale simulator with at least two degrees of motion may do so. A part-task simulator or simulated display reproduces some part of the cockpit environment. In addition, some measures were taken in flight. Where the measurement venue was irrelevant, the situation was usually one in which questionnaire surveys were administered by mail or elsewhere.
There is great reliance on flight simulators, both full-scale and part-task, but in many cases, there is no relationship to flight at all (e.g., the laboratory). The fact that only 26 of the 231 papers dealt with the actual flight environment in the air is somewhat surprising, because measurements taken outside that environment are inevitably artificial to a greater or lesser extent. Of the 12 categories describing the type of subject used in these studies, only three were significant: 60% of the subjects were pilots (140 papers), 33% (75 papers) of the subjects were nonflying personnel (college students, government workers, the general public), and 9% (20 papers) were air-traffic controllers. The fact that the largest proportion of the subjects is pilots is not at all surprising, but the relatively large number of nonflying personnel is somewhat daunting.
TABLE 3.4 Specific Topic Categories
1. Display design/differences 21% 50 papers
2. Transfer of training 5% 11 papers
3. Personnel error 6% 14 papers
4. Personnel demographics 5% 12 papers
5. Perceptual cues 16% 36 papers
6. Decision-making 6% 13 papers
7. Workload 14% 33 papers
8. Communications 6% 14 papers
9. Coding 5% 11 papers
10. Tracking 9% 21 papers
11. Crew coordination 5% 12 papers
12. Incidents 6% 14 papers
13. Head-up displays (HUD)/helmet-mounted displays (HMD) 5% 12 papers
14. Mental model 8% 17 papers
15. Dual tasks 6% 13 papers
16. Cognition 6% 13 papers
TABLE 3.5 Measures Employed
1. Reaction time 13% 31 papers
2. Response duration 16% 48 papers
3. Response error 33% 76 papers
4. Tracking error 12% 29 papers
5. Frequency, percentage 33% 80 papers
6. Ratings 30% 66 papers
7. Interview data 5% 11 papers
8. Workload measure 8% 18 papers
9. Flight performance variables 10% 22 papers
10. Categorization 8% 17 papers
Nine of the 16 categories under the heading of methodology (Table 3.7) met the 5% criterion. As one would expect, more than half the number of papers published were experimental in nature. What was somewhat less expected was the large number of studies that were not experimental, although there was some overlap, because some of the experimental studies did make use of nonexperimental methodology in addition to the experiment. There was heavy reliance on subjective techniques, observation, questionnaires, interviews, and self-report scales. Pilot opinion was, as it has always been, extremely important in aviation.
Of the 16 statistical analysis categories, 4 were most frequently employed (Table 3.8). Again, as one would expect, the tests of the significance of differences between the conditions or groups were observed in most of the analyses. The percentage might have even been greater if one included such tests as multiple regression, discriminant analysis, or factor analysis in this category. Although the categories in this content area tend to overlap, the relatively large number of studies in which the analysis stopped at frequency and percentage should be noted.
TABLE 3.6 Measurement Venue
1. Laboratory (not simulator) 16% 36 papers
2. Full-scale simulator 23% 52 papers
3. Part-task simulator or simulated displays 27% 63 papers
4. Operational flight 11% 26 papers
5. Irrelevant 16% 46 papers
TABLE 3.7 Methodology
1. Experiment 54% 126 papers
2. Observation 12% 29 papers
3. Questionnaire survey 16% 48 papers
4. Rating/ranking scale 30% 65 papers
5. Performance measurement (general) 21% 50 papers
6. Interviews 10% 22 papers
7. Physical/physiological data recording 8% 17 papers
8. Analysis of incident reports 8% 17 papers
9. Verbal protocol analysis 5% 11 papers
TABLE 3.8 Statistical Analysis
1. Tests of significance of differences 67% 155 papers
2. Correlation 10% 22 papers
3. Frequency, percentage 24% 56 papers
4. None 5% 12 papers
What does this review tell us about the nature of aviation HF research? The large number of topics, both general and specific, ranging from information processing to geographical orientation, electroencephalography, and pilot attitudes (to name only a few topics taken at random), indicates that many areas have been examined, but very few have been studied intensively. The major concerns are basic research as it relates to flight and displays. Although automation (the “glass cockpit”), situational awareness, and workload are presumably all “hot” topics in the aviation research community, they received only a modest degree of attention. If one adds up all the topics that deal with sophisticated mental processes (e.g., decision-making, mental models, and cognition) along with crew coordination, it can be observed that a fair bit of attention is being paid to the higher-order behavioral functions. This represents some change from the earlier research areas.
Most of the behavioral research in aviation is conducted on the ground, for obvious reasons: nonavailability of aircraft and cost of flights. Another reason is perhaps that much of the research deals with cockpit or display variables, which may not require actual flight. Reliance on opinion expressed in questionnaires, incident/accident reports, and full-scale simulators diminishes the need to measure in actual flight. It may also reflect the fact that behavioral research, in general (not only in aviation), rarely takes place in the operational environment, which is not conducive to sophisticated experimental designs and instrumentation. However, this leaves us with the question of whether results achieved on the ground (even with a high degree of simulation) are actually valid with respect to flight conditions.
Case studies comparing the ground and in-flight evaluations have been carried out by Gawron and Reynolds (1995). The issue of generalizability to flight is compounded by the fact that one-third of all the subjects employed in these studies were not flying personnel.
The HF research in aviation is not completely devoted to an experimental format; only half the studies reported were of this type. It is remarkable that with a system whose technology is so advanced, there is so much reliance on nonexperimental techniques and subjective data.