LANGUAGE COMPREHENSION
2. SOME FUNDAMENTALS OF PERCEPTION
2.1. The Inverse Problem
For many years, much of the study of speech perception was conducted in isolation from the study of perception more generally, mostly to ill effect. In part, this state of affairs was encouraged by the focus of language researchers (linguists and psycholinguists) seeking to know more about elemental aspects of language use. Consistent with appreciation for apparently unique characteristics of human language, early speech researchers were encouraged to believe that perception of speech may be as unique as language itself. For this and other historical reasons, research in speech perception was often naïve to developments in related areas of perception.
An enduring distraction for investigators studying speech perception has concerned the extent to which articulatory gestures (e.g., Fowler, 1986; Liberman & Mattingly, 1985), acoustic patterns, patterns of sensory stimulation (e.g., Diehl & Kluender, 1989), or some combination (e.g., Nearey, 1997; Stevens & Blumstein, 1981) serve as proper objects of speech perception. Controversies concerning appropriate objects of perception generated a fair bit more heat than light. However, debates concerning objects of perception cannot be resolved because the question itself is ill-posed, if not outright misleading. There are no objects of perception, either for speech or for perception in general. There is an objective for perception, which is to maintain adequate agreement between an organism and its world in order to facilitate adaptive behavior. Success with this objective does not require objects of perception.
Within this functional framework, perceptual success does not require recovery or representations of the world per se. Perceivers’ subjective impressions may be of objects and events in the world, and the study of perceptual processes may lead to inspection of real-world objects and events, patterns of light or sound pressure waves, transduction properties, or neural responses. By and large, however, viewing perception with a focus toward either distal or proximal properties falls short of capturing the essential functional characteristic of perception – the relationship between an organism’s environment and its actions.
This depiction of success in perception as essentially functional, discarding any sense of perceiving true reality, might seem novel to some readers. However, these ideas are classic, having become so broadly accepted that most mention seems to have lapsed in instruction to modern students of perception. Beginning at least with Helmholtz (e.g., 1866/1969), it has been understood that perceiving the true state of the world is impossible. Helmholtz himself was led to this understanding by British Empiricist philosophers (e.g., Hume, 1748/1963; Berkeley, 1837/1910). Nevertheless, contemporary discourse in the field of perception often betrays this fact.2
2 One may question whether denying the possibility of veridical recovery is arguing against a straw man. In some ways, this is trivially correct because there are limitations upon biological sensors in both range and precision of transduction (e.g., hearing across only 20–20,000 Hz with limitations of precision in both frequency and amplitude). Consequently, humans cannot hear environmental sounds that are heard by elephants (lower frequencies) and bats (higher frequencies). A more interesting criticism is that it is always impossible to know Truth (with a capital ‘T’). Instead, all one can hope is to evaluate function, whether or not something works. Within this tradition of pragmaticism (e.g., Peirce, 1878; James, 1897), one can assign ‘truth’ (with a lower-case ‘t’) on functional grounds. The parallel here is that perception cannot provide Veridical (with a capital ‘V’) recovery of the environment, but it can supply veridical recovery as measured by whether perception gets the job done for the organism. The present approach is intended to be consistent with the pragmatic rendition of ‘truth’ and of ‘veridical’ inasmuch as the only evaluative measure is whether perception is successful for the organism. Here, these terms are avoided in the interest of being faithful to the vernacular within which ‘truth’ and ‘veridical’ are taken to imply some real portrayal of the world.
Much contemporary work in perception is concerned, in one way or another, with addressing the inverse problem (Figure 1). The inverse problem emerges from the simple fact that information available to sensory transducers (eyes, ears, etc.) is inadequate to authentically reconstruct a unique distal state of affairs. In vision, for any 2-dimensional projection, there are an infinite number of possible 3-dimensional objects that could give rise to exactly the same 2-D image. In audition, for any sound-pressure wave, there are an infinite number of sound producing events that could give rise to that waveform. These are facts of physical optics and acoustics, not theory. Information available to sensory transducers is inadequate to reconstruct an authentic optical or acoustic distal environment.
Figure 1. An infinite number of external 3-dimensional objects give rise to the same 2-dimensional retinal image (top, left). An infinite number of sound producing sources (characterized on right as resonator shapes) give rise to the same waveform available to the ear (bottom, left).
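The many-to-one character of the visual case can be made concrete with a minimal sketch. It assumes an idealized pinhole projection, which is a textbook simplification and not a model of the eye; the function names `project` and `scaled`, the focal length, and the sample coordinates are all hypothetical choices made for illustration.

```python
def project(point, focal_length=1.0):
    """Idealized pinhole projection of a 3-D point (x, y, z) onto a 2-D image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


def scaled(point, k):
    """Slide a point along its line of sight by a factor k > 0 (hypothetical helper)."""
    return tuple(k * c for c in point)


near = (1.0, 2.0, 4.0)
far = scaled(near, 8.0)  # a different 3-D point on the same line of sight

print(near, "->", project(near))
print(far, "->", project(far))
```

Both points yield the same image coordinates, and the same holds for every positive value of `k`. The forward mapping from source to projection is lawful, whereas the image alone cannot say which of the infinitely many candidate points produced it.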
For speech perception, the inverse problem presents one of the two major reasons why appeals to articulatory gestures cannot, in principle or in practice, make one’s theory of speech perception simpler or more successful. There is a lawful mapping from characteristics of physical sound sources to the waveforms they produce. The inverse mapping, from waveforms to sound sources, is indeterminate. There are very limited cases for which it is theoretically possible to solve the inverse problem in acoustics. For example, Jenison (1997) has demonstrated that characteristics of movement of a sound source could be derived from conjoint detection of interaural time delay, Doppler shifts, and sound intensity. However, it is unlikely that this theoretical possibility has biological plausibility, because biological transducers lack the precision required for the three variables, and because the extreme environmental conditions required to approach biological limits of detection (e.g., extremely fast moving objects to yield sufficient Doppler shifts) fall outside the domain of normal perceptual experience. More typical is the case of attempting to solve the inverse from waveform to simpler 2-D surfaces (e.g., the shape of a drum). Mathematicians have formally proved that even this relatively simple translation from waveform to plane geometry is impossible (Gordon, Webb, & Wolpert, 1992).
Because multiple sound sources yield the same waveform, waveforms can never be more complex than characteristics of physical sources. Researchers within the field of speech perception have long been familiar with appeals to perception via articulatory gestures as a simplifying construct, and there has been a series of efforts to extract gestures in order to facilitate machine speech recognition, albeit with very limited success. What physics demands, however, is that depiction of speech in terms of articulatory gestures can give only the illusion of simplicity. Because scientists are much better at measuring details of sounds than they are at measuring details of articulator activity, articulatory gestures appear simpler only because they are defined more abstractly and are measured with less precision. Because multiple resonator configurations can give rise to the same waveform, the acoustic waveform available to listeners always underestimates variability in articulation.
For all of the discussion that follows regarding specific issues concerning speech perception, speech typically will be described as sounds. This is not because sounds are legitimate objects of perception. This is because, along the chain of events from creating patterns of sound-pressure to encoding these patterns in some collection of neural firings to eliciting behavior, waveforms are public, easily measurable, and simpler than alternatives.
2.2. Why Perception Seems Veridical
If perceiving the true state of the world is impossible, one might ask why phenomenal experience is not fuzzy and uncertain. To effectively guide behavior, and not leave the organism pondering multiple possibilities, all that is required is that the perceptual system come to the same adaptive output every time it receives functionally the same input. It is this deterministic nature of perception that prevents being paralyzed among myriad alternatives. Phenomenal experience of certain reality does not depend on authentic rendering of the world. Instead, phenomenal experience of a clear and certain world is the consequence of perceptual systems reliably arriving at deterministically unique outputs. It is this reliability that encourages certainty (Hume, 1748/1963), but reliability is not validity.
On rare occasions, perceptual systems do not converge on a unique output and are left oscillating between equally fitting outputs when sensory inputs are not singly determinate (usually in response to impoverished stimuli). Many readers are familiar with bistability when viewing Necker cubes. One such auditory experience is encountered when listening to a repeating synthesized syllable intermediate between [da] and [ta] or any other pair of similar speech sounds. When two perceptual outputs fit the input equally well, phenomenal experience oscillates between two percepts (Tuller, Case, Ding, & Kelso, 1994).
2.3. Information for Perception
If there are no objects of perception, how should one think about information for perception? Information for perception does not exist in the objects and events in the world, nor does it exist in the head of the perceiver. Instead, information exists in the relationship between an organism and its world. It may be useful to consider the contrast between information about and information for. When one discusses objects of perception, it is information about that is typically inferred. Implicit in such efforts is the notion that one needs to solve the inverse problem. By contrast, if the objective of a successful perceptual system is to maintain adequate agreement between an organism and its world in order to facilitate adaptive behavior, then information for successful perception is nothing more or less than information that resides in this relationship (or agreement).
This way of viewing information as a relationship is consistent with one of the fundamental characteristics of Shannon information theory (Shannon, 1948; Wiener, 1948). Some readers may be familiar with Fletcher’s pioneering applications of information theory to speech (Fletcher, 1953/1995). However, the application here will be more akin to the approach of Attneave (1954, 1959) and Barlow (1961), an approach that remains highly productive (e.g., Barlow, 1997, 2001; Simoncelli & Olshausen, 2001; Schwartz & Simoncelli, 2001). One important point of Shannon’s information theory is that information exists only in the relationship between transmitters and receivers; information does not exist in either per se, and it does not convey any essential characteristics about either transmitters or receivers. Within this information-theoretic sense, perceptual information exists in the agreement between organisms and their environments. This agreement is the objective of perception (Figure 2).
Within a sea of alternative perceptual endpoints, agreement between the organism and environment is arriving at the alternative that gives rise to adaptive behavior.
Information is transmitted when uncertainty is reduced and agreement is achieved between organism and environment. The greater the number of alternatives (uncertainty, unpredictability, variability, or entropy) there are, the greater the amount of information that potentially can be transmitted (Figure 2a). There is no information when there is no variability. When there is no variability, there is total predictability and hence, no information transmitted. There is much that stays the same in the world from time to time and place to place, but there is no information in stasis. Uncertainty is reduced consequent to the perceiver’s current experiences (context) as well as past experiences with the environment (learning).
Figure 2. (a) The greater the number of alternatives (uncertainty, unpredictability, variability, or entropy) there are, the greater the amount of information that potentially can be transmitted. There is no new information in what stays the same or is predictable. (b) Relative power of energy flux in natural environments approximates 1/f. (c) Information transmission optimized relative to energy flux in the environment. A sensorineural system should optimize dynamic range about this maximum.
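The relation between number of alternatives and potential information can be illustrated with a short sketch using Shannon's entropy formula, H = -Σ p log2 p. The function names `entropy_bits` and `uniform` and the sample distributions are assumptions introduced here for illustration; they are not drawn from any of the cited work.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; outcomes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def uniform(n):
    """n equally likely alternatives (hypothetical helper)."""
    return [1.0 / n] * n


# More equally likely alternatives means more potential information.
for n in (1, 2, 8, 64):
    print(n, "alternatives:", entropy_bits(uniform(n)), "bits")

# A highly predictable source carries far less than its number of outcomes suggests.
predictable = [0.97, 0.01, 0.01, 0.01]
print("predictable source:", round(entropy_bits(predictable), 3), "bits")
```

With a single alternative there is total predictability and entropy is zero, which corresponds to the claim that there is no information in stasis. With four outcomes, one of which is nearly certain, entropy falls well below the two bits available from four equally likely outcomes.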
Shannon and his Bell Telephone Laboratories engineer colleagues were concerned with evaluating equipment, not listeners. Answers to questions about what equipment can do are different from answers to questions about what biological perceivers naturally do (Licklider & Miller, 1951). Although the amount of theoretical potential information transmitted is maximized at maximum entropy (total unpredictability or randomness), it is not advantageous for biological systems to shift dynamic range as far as possible toward this maximum. In natural environments, this would result in diminishing returns if the system adjusts to register the last bits of near-random energy flux. Instead, biological systems should optimize the efficiency with which they capture information relative to the distribution of energy flux in real environments. The best estimate of statistics of natural environments is 1/f (pink) noise (Figure 2b). This simple power law with a negative exponent (f^−1) is scale-invariant, and it is a ubiquitous characteristic across many systems from radioactive decay to fluid dynamics, biological systems, and astronomy. As one would expect, spectral density of fluctuations in acoustic power of music and speech varies as 1/f (Voss & Clarke, 1975, 1978). Efficient information transmission for sensorineural systems with limited dynamic range may be depicted best as the product of the positive exponential growth in information and the negative exponential of 1/f. This yields the quadratic function shown in Figure 2c describing optimal transmission of information relative to energy flux in the environment.
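One way to see what scale invariance of a 1/f power law means is to integrate an idealized density c/f over successive octaves. The sketch below assumes an exact 1/f density with an arbitrary constant `c`; the function name `band_power` and the choice of octave bands starting at 125 Hz are hypothetical and serve only as illustration.

```python
import math


def band_power(f_low, f_high, c=1.0):
    """Closed-form integral of the density c/f between f_low and f_high."""
    return c * math.log(f_high / f_low)


# Six successive octave bands: 125-250 Hz, 250-500 Hz, and so on.
octaves = [(125 * 2 ** k, 250 * 2 ** k) for k in range(6)]

for low, high in octaves:
    print(f"{low:>5}-{high:<5} Hz: {band_power(low, high):.4f}")
```

Under this idealization every octave carries the same power, c ln 2, no matter where it sits on the frequency axis. Rescaling frequency by any constant factor therefore leaves the distribution of power across octaves unchanged.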
2.4. Sensory Systems Respond to Change (and little else)
Given these facts about information, it is true and fortunate that sensorineural systems operate as they do. Sensorineural systems respond only to change relative to what is predictable or does not change. Perceptual systems do not record absolute levels, whether loudness, pitch, brightness, or color. Relative change is the coin of the realm for perception, a fact that has been known at least since Ernst Weber in the mid-19th century and that has been demonstrated perceptually in every sensory domain. Humans have a remarkable ability to make fine discriminations, or relative judgments, about frequency and intensity. The number of discriminations that can be made numbers in the hundreds or thousands before full dynamic range is exhausted. Yet, most humans are capable of reliably categorizing, or making absolute judgments about, only a relatively small number of stimuli regardless of physical dimension (e.g., Miller, 1956; Gardner & Hake, 1951). This sensory encoding of change, and not absolute characteristics, is another major reason why veridical recovery is biologically impossible.
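The contrast between many relative judgments and few absolute ones can be sketched numerically. The example assumes the common textbook summary of Weber's observation, in which a just-noticeable change is a constant fraction of the current level. The Weber fraction of 0.1 and the intensity range of one million to one are illustrative assumptions, not measured values, and `discriminable_steps` is a hypothetical name.

```python
import math


def discriminable_steps(i_min, i_max, weber_fraction):
    """Count successive just-noticeable steps from i_min up to i_max,
    assuming each step multiplies the level by (1 + weber_fraction)."""
    return math.floor(math.log(i_max / i_min) / math.log(1.0 + weber_fraction))


# Illustrative values only: a 10% Weber fraction over a million-to-one range.
steps = discriminable_steps(1.0, 1e6, 0.1)
print("relative steps across the range:", steps)

# A finer Weber fraction yields many more discriminable steps.
print("with a 1% fraction:", discriminable_steps(1.0, 1e6, 0.01))
```

Under these assumptions the number of relative steps runs to well over a hundred, and into the thousands for a finer fraction. That is the order of magnitude the text describes for discrimination, and it stands in contrast to the small number of categories available to absolute judgment.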
Sacrifice of absolute encoding has enormous benefits along the way to maximizing information transmission. Although biological sensors have impressive dynamic range given their evolution via borrowed parts (e.g., gill arches to middle ear bones), this dynamic range is always a fraction of the physical range of absolute levels available from the environment and essential to organisms’ survival. This is true whether one is considering optical luminance or acoustic pressure. The beauty of sensory systems is that, by responding to relative change, a limited dynamic range shifts upward and downward to optimize the amount of change that can be detected in the environment at a given moment.
The simplest way that sensory systems adjust dynamic range to optimize sensitivity is via processes of adaptation. Following nothing, even a subtle sensory stimulus can trigger a strong sensation. However, when a level of sensory input is sustained over time, constant stimulation loses impact. This sort of sensory attenuation due to adaptation is ubiquitous, and has been documented in vision (Riggs, Ratliff, Cornsweet, & Cornsweet, 1953), audition (Hood, 1950), taste (Urbantschitsch, 1876, cf. Abrahams, Krakauer & Dallenbach, 1937), touch (Hoagland, 1933), and smell (Zwaardemaker, 1895, cf. Engen, 1982). There are increasingly sophisticated mechanisms supporting sensitivity to change with ascending levels of processing, and several will be discussed in this chapter. Most important for now is the fundamental principle that perception of any object or event is always relative – critically dependent on its context.
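The qualitative behavior of adaptation described above can be imitated with a toy sketch in which the response at each moment is the difference between the input and a running estimate of the recent level. This is an illustrative abstraction, not a model of any particular physiological mechanism; the function name `adapting_response`, the adaptation rate, and the stimulus values are all hypothetical.

```python
def adapting_response(stimulus, rate=0.5):
    """Respond to each sample relative to a running estimate of the recent level."""
    level = 0.0
    responses = []
    for sample in stimulus:
        responses.append(sample - level)   # response is change relative to the adapted level
        level += rate * (sample - level)   # the adapted level drifts toward the current input
    return responses


# Silence, then a sustained level of 10, then a small step up to 12.
stimulus = [0.0] * 3 + [10.0] * 8 + [12.0] * 4
responses = adapting_response(stimulus)

for sample, response in zip(stimulus, responses):
    print(f"input {sample:5.1f} -> response {response:7.3f}")
```

Following nothing, onset of the sustained input evokes the largest response, and the response then fades toward zero while the input stays constant. After adaptation, the small step from 10 to 12 evokes a clear response again, so the same absolute level can produce different responses depending on its context.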