
Measuring speech


Until quite recently, the phonetician’s ears were the only instruments available for the analysis of sound, and they are probably still the most finely tuned and reliable. Beginning in the 1900s, however, phoneticians began to discover ways to make sound visible. Early devices, such as oscilloscopes and sound spectrographs, used a microphone to transfer patterns of vibration in the air into patterns of variation in electrical current.

These variations could then be displayed – on paper or a screen – or passed through banks of capacitors and resistors for further measurement and study.

In the first part of the twenty-first century, speech analysis is done by computer. Microphones still convert the vibration of the membrane into variations in electrical current. But then computers convert a continuously varying sound wave into a series of numbers; this is called an analog-to-digital (A-to-D) conversion. (Working in the opposite direction, getting a computer, digital video disk, or digital audio player to make sound, is D-to-A conversion.)
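As a rough illustration of A-to-D conversion, the Python sketch below (a hypothetical example, not the software behind Figure 1.8) samples a continuous 440 Hz tone at 16,000 samples per second and rounds each sample to a 16-bit integer, the same kind of number series a computer stores for recorded speech.

```python
import numpy as np

# A-to-D conversion in miniature: sample a continuous signal at discrete
# times, then quantize each sample to a 16-bit integer.
sample_rate = 16000                      # samples per second (Hz)
duration = 0.01                          # 10 milliseconds of signal
t = np.arange(0, duration, 1 / sample_rate)

# A "continuous" 440 Hz tone, standing in for the voltage from a microphone.
analog = 0.8 * np.sin(2 * np.pi * 440 * t)

# Quantization: map the -1..1 voltage range onto 16-bit integer steps.
digital = np.round(analog * 32767).astype(np.int16)

print(digital[:10])   # the first few numbers in the digitized series
```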

Once represented and stored in a digital format, sound files can be mathematically analyzed to separate out the different frequencies, and the results of the analysis displayed onscreen in various formats. Three common and useful displays are shown in Figure 1.8. All three show the utterance “A phoneme?” (These figures were created on a laptop computer, using software that was downloaded free from the internet.)
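One standard way to separate out the component frequencies is the Fourier transform. The hedged sketch below builds a signal out of two known tones and recovers their frequencies with numpy's FFT; it illustrates the general idea, not necessarily the analysis method used by any particular speech program.

```python
import numpy as np

sample_rate = 16000
t = np.arange(0, 0.05, 1 / sample_rate)          # 50 ms of signal

# A signal containing two component tones, 200 Hz and 1200 Hz, mixed together.
mixture = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

# The FFT expresses the signal as a sum of sinusoids at different frequencies.
spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(len(mixture), 1 / sample_rate)

# The two largest peaks fall at (approximately) the component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))    # roughly [200.0, 1200.0]
```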

Figure 1.8A shows a waveform; the varying amplitude (corresponding to loudness) of the vibrations (on the y-axis) is plotted over time (on the x-axis). The duration of the utterance is one second. Above the waveform are the segments (roughly) corresponding to each part of the waveform. Note that the vowels [o] and [i] have the greatest amplitude, the obstruent [f] the least, and the nasals and unstressed [ə] have an intermediate amplitude.

FIGURE 1.8 Waveforms for the utterance A phoneme?

The figure also shows a close-up view (magnified 12 times) of portions of the waveform. At this magnification, you can see the random noisiness of the fricative, and the complex, repeating pattern of vibration that is characteristic of each vowel. The patterns for [o] and [i] are different, and thus the vowel qualities are different. The waveform is a useful representation for seeing the overall structure of an utterance, identifying the different types of sounds (stop, vowel, nasal consonant), and measuring the durations of different aspects of the speech signal.
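Duration measurements fall straight out of the digital representation: the time between two points in the waveform is just the number of samples between them divided by the sampling rate. A minimal sketch, with made-up sample indices, is given below.

```python
# Measuring duration from a waveform: the time between two points is the
# number of samples between them divided by the sampling rate.
sample_rate = 16000

# Hypothetical sample indices where the fricative [f] begins and ends,
# e.g. read off a waveform display by hand or by a segmentation tool.
f_start, f_end = 3200, 5120

duration_seconds = (f_end - f_start) / sample_rate
print(f"[f] lasts about {duration_seconds * 1000:.0f} ms")   # 120 ms
```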

Because of all the communicative information carried by pitch changes, linguists often want to measure fundamental frequency over the course of an utterance. Figure 1.8B shows the result of such a computation, a pitch track. In this representation, the y-axis shows frequency, and time is on the x-axis. Notice that no frequency is calculated during the [f].

Since there is no vocal fold vibration during this segment, there is no frequency to measure. This pitch track shows that this speaker’s voice has a baseline fundamental frequency of about 200 Hz, but that this frequency rises toward the end of the utterance, in a contour typical of a question.
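Pitch tracks can be computed in several ways; one common method is autocorrelation, which looks for the lag at which a short stretch of speech best matches a shifted copy of itself. The sketch below is a simplified illustration of that idea (not the algorithm behind Figure 1.8B): it estimates F0 for a voiced frame and reports nothing for an aperiodic one, just as the display shows no frequency during [f].

```python
import numpy as np

def estimate_f0(frame, sample_rate, fmin=75, fmax=400):
    """Estimate fundamental frequency of one frame by autocorrelation.

    Returns None for frames with no clear periodicity (e.g. voiceless [f]).
    """
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + np.argmax(corr[lo:hi])
    # Weak periodicity: treat the frame as unvoiced and report no pitch.
    if corr[lag] < 0.3 * corr[0]:
        return None
    return sample_rate / lag

# A voiced frame: 30 ms of a 200 Hz "vowel-like" signal.
sr = 16000
t = np.arange(0, 0.03, 1 / sr)
voiced = np.sin(2 * np.pi * 200 * t)
noise = np.random.randn(len(t))          # a "fricative-like" frame

print(estimate_f0(voiced, sr))           # about 200.0
print(estimate_f0(noise, sr))            # usually None (unvoiced)
```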

Neither of these figures, however, has told us much about the quality of the vowel sounds. We can see from the waveforms that the vowel qualities in [o] and [i] are different, because the patterns look different, but the complex squiggles don’t tell us much about the component overtones and thus about the vocal tract shape that made them. The computer can further analyze the sound wave to tease apart its component frequencies. The result of this analysis is a spectrogram, shown in Figure 1.8C.

As with a pitch track, the x-axis represents time, and the y-axis shows frequency. But instead of a single line graph, we see a complicated pattern of the many frequencies present in each sound. We want our analysis to tell us which frequencies are the loudest, because once we know which frequencies have been amplified, we can use known formulas to figure out the (approximate) shape of the vocal tract that made them. This crucial third dimension, amplitude, is represented by the darkness of the lines. A dark bar at a certain frequency means that that frequency is strongly represented in the sound. Each vowel has a pattern of two or three most prominent frequencies, which are called formants, above the fundamental frequency of the speaker’s vocal folds. For example, during the [o] vowel, we see a formant at about 1200 Hz. The [i] vowel has a different formant structure, with the second formant at about 3000 Hz.
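Spectrograms of this kind are typically produced by a short-time Fourier transform: the signal is cut into brief overlapping windows, each window is analyzed into its component frequencies, and the amplitudes are plotted as darkness. The following sketch, using scipy and a synthetic stand-in for a vowel (a 200 Hz fundamental plus stronger energy near two hypothetical formant frequencies), shows the general recipe rather than the exact settings behind Figure 1.8C.

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

sr = 16000
t = np.arange(0, 1.0, 1 / sr)

# A crude stand-in for a vowel: a 200 Hz "fundamental" plus stronger
# energy near two "formant" frequencies (here 500 Hz and 1200 Hz).
speech_like = (np.sin(2 * np.pi * 200 * t)
               + 2.0 * np.sin(2 * np.pi * 500 * t)
               + 1.5 * np.sin(2 * np.pi * 1200 * t))

# Short-time Fourier analysis: the frequency content of each brief window.
freqs, times, sxx = signal.spectrogram(speech_like, fs=sr, nperseg=512)

# Darkness encodes amplitude: the loudest frequencies show up as dark bands.
plt.pcolormesh(times, freqs, 10 * np.log10(sxx + 1e-12), cmap="Greys")
plt.ylim(0, 4000)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```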

Because the basic relationships between vocal tract shape and formant structure are known, we can infer the position of the tongue, which we can’t see, from the spectrogram, which technology has made visible for us. Across speakers, the general relationships between speech sounds remain the same: for example, the second formant is always higher in [i] than in [o]. But because every person’s vocal tract size and shape is unique, every person’s formant structure is unique too. We recognize familiar voices, regardless of what they’re saying, and in the hands of an expert, a spectrographic voice print is as unique as a fingerprint.
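The simplest of these known relationships treats the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. The sketch below computes these neutral-vowel formants for a hypothetical 17.5 cm tract (the usual textbook figure for an adult male) and for a shorter one, showing why a shorter vocal tract yields uniformly higher formants.

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (lips): F_n = (2n - 1) * c / (4 * L).  This is the textbook first
# approximation of the vocal tract in a neutral, schwa-like position.
SPEED_OF_SOUND = 35000        # centimetres per second (roughly, in warm air)

def tube_formants(length_cm, n_formants=3):
    return [(2 * n - 1) * SPEED_OF_SOUND / (4 * length_cm)
            for n in range(1, n_formants + 1)]

print(tube_formants(17.5))    # about [500.0, 1500.0, 2500.0] Hz
print(tube_formants(14.0))    # a shorter tract: all formants shift upward
```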

What else could there be to say about the sounds of speech? Quite a lot.

Language isn’t just in the mouth or ears, but also in the brain. When we turn from analyzing the physical aspects of speech sounds to studying their cognitive organization, we move from phonetics to phonology.

Phonology can never be completely divorced from phonetics, since sound patterns can never be completely separated from how they are produced and heard, and production and perception are always influenced by the overarching linguistic organization.

All human beings have basically the same structures in their vocal tracts and in their ears. So why are languages so different? To some extent, it is because they use different sounds from the repertoire of possible human vocal tract noises. Arabic uses pharyngeal and uvular fricatives, while English does not. French selects front round vowels (such as [y] and [ø]), English selects lax high vowels ([ɪ] and [ʊ]), and Spanish sticks to basic [i, e, a, o, u]. Thus, to some extent, learning to speak a new language is about learning to make new sounds.

But there’s more to it than that. Languages differ not only in the sounds they use, but in how they organize those sounds into patterns. Consider, for example, the voiced obstruents of English and Spanish.
