Speech Perception was discussed briefly in the post on Language Comprehension (Psychology of Language page). Here, I will focus specifically on several theoretical models that try to explain how we distinguish separate phonemes and words in the continuous flow of human speech.
I will finish off by discussing the TRACE model: a computational system designed to replicate the way humans perceive speech. I will discuss it in more detail in the next post on Connectionism.
Acoustics of speech
Firstly, we need to understand what speech really is. In terms of acoustics (the physics of sound), it is moving air particles, and we measure the frequency of these movements in Hz (hertz = cycles per second). Human speech is a complex sound, which means that numerous frequencies are produced at the same time. The slowest one is called the fundamental frequency, and it determines the pitch of one's voice (80-200 Hz for men, up to 400 Hz for women). The other frequencies are harmonics: they determine timbre. Thus, what we perceive as speech is numerous waves of different frequencies.
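To make the 'numerous waves' idea concrete, here is a minimal Python sketch (my own illustration, not part of any model discussed in this post) that builds a complex, speech-like sound by adding a fundamental frequency and a few harmonics. The specific frequency and amplitude values are assumptions chosen purely for demonstration.

```python
# A minimal sketch (illustration only): a complex, speech-like sound built from
# a fundamental frequency plus harmonics. All values are assumed for demonstration.
import numpy as np

sample_rate = 16000                               # samples per second
t = np.arange(0, 0.5, 1 / sample_rate)            # half a second of signal

f0 = 200                                          # fundamental frequency in Hz (perceived pitch)
harmonic_amplitudes = [1.0, 0.6, 0.4, 0.25, 0.1]  # relative strengths of harmonics (shape the timbre)

# Each harmonic is an integer multiple of the fundamental: f0, 2*f0, 3*f0, ...
signal = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
             for k, a in enumerate(harmonic_amplitudes))

# The ear receives only this single summed pressure wave; pitch and timbre
# have to be recovered from it by the listener.
```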
Traditionally, psycholinguists organise speech into different levels: phonemes, morphemes, words, phrases and sentences. The question, however, is how our brain can possibly distinguish these levels in a raw frequency signal. Written language can be described as 'beads on a string', with each letter and word following the previous one and being perceived one by one. Distinguishing language levels in speech is much trickier for many reasons, and one of them is Lack of Invariance.
Lack of Invariance
Lack of invariance refers to the idea that there is no reliable connection between a phoneme and its acoustic manifestation in speech. The same word, or even a single phoneme, can sound completely different depending on many factors:
1) Individual differences. The acoustic structure of speech depends a lot on a speaker's accent and on their physical and psychological characteristics.
2) Speech conditions.
3) Coarticulation. This is the idea that more than one sound is articulated at once, so each of them is partly shaped by the sounds surrounding it. The articulators (jaw, tongue, lips) move continuously from sound to sound, allowing us to speak faster; as a result, the acoustic structure of each phoneme depends a lot on its 'neighbours'. Consider the following spectrograms of the sound /d/ in three different positions:
The articulation of the /d/ is shaped differently each time, producing several 'versions' of the same sound; however, as listeners we will still hear /d/ every time, despite the big differences in their acoustic structure.
Categorical Perception
We perceive speech in terms of discrete categories rather than a single wave of energy, and we differentiate much better between categories than within them. Liberman et al. (1957) conducted a study in which participants listened to synthetically created consonants that gradually changed from /b/ to /d/; they had to label each sound as they heard it. What they found was that even large changes (for example, from value 1 to value 4 on the continuum) did not result in a change of label if they did not cross the phoneme boundary, whereas a small change across the boundary from /b/ to /d/ did result in a different label. This finding supports the Categorical Perception theory: the human brain is not that concerned with differences between sounds per se, but is very good at spotting the boundaries between categories.
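As a rough illustration of what categorical perception predicts, here is a toy Python sketch (my own, not Liberman et al.'s actual procedure or data): labelling along the /b/-/d/ continuum is modelled as a sharp step, so two stimuli are easy to tell apart only when they fall on opposite sides of the phoneme boundary. The boundary position and slope values are assumptions.

```python
# A toy sketch of categorical perception (illustration only, not Liberman et al.'s data):
# labelling is a steep step function, so discrimination is good only across the boundary.
import math

def p_labelled_d(stimulus, boundary=4.0, slope=3.0):
    """Probability of hearing /d/ for a stimulus on a 1-7 /b/-/d/ continuum (assumed values)."""
    return 1 / (1 + math.exp(-slope * (stimulus - boundary)))

def predicted_discrimination(a, b):
    """Pairs are discriminable roughly to the extent that they receive different labels."""
    return abs(p_labelled_d(a) - p_labelled_d(b))

# Equal two-step distances on the continuum, very unequal discriminability:
print(predicted_discrimination(1, 3))  # within /b/: near 0, hard to tell apart
print(predicted_discrimination(3, 5))  # across the boundary: near 1, easy
print(predicted_discrimination(5, 7))  # within /d/: near 0 again
```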
One of the possible explanations is demonstrated by the famous McGurk Effect, which I am sure many of you are aware of. The key point is that the brain does not rely solely on the acoustics of a sound when distinguishing between stimuli. We also use visual information and, most importantly, context, thus combining bottom-up and top-down processing. I found a really good example of the McGurk effect, check it out:
The Cohort Model (Marslen-Wilson & Welsh, 1978)
The model suggests that when we start perceiving the beginning of a word, a whole cohort of stored words with the same beginning gets activated; for example, 'WORD' will activate 'world', 'work', 'watermelon' etc. As we perceive the rest of the word, the cohort of 'candidates' gets smaller and smaller until only one remains - and so the right word is recognised.
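A minimal sketch of that narrowing process might look like the following (my own illustration, using ordinary spelling instead of phonemic transcription and a tiny made-up lexicon):

```python
# A minimal sketch of the Cohort Model's narrowing process (illustration only).
# The lexicon and the use of letters instead of phonemes are simplifying assumptions.
lexicon = ["word", "world", "work", "wonder", "watermelon", "table"]

def cohort_over_time(spoken_word):
    """Show how the cohort of candidates shrinks as each segment arrives."""
    for i in range(1, len(spoken_word) + 1):
        heard_so_far = spoken_word[:i]
        cohort = [w for w in lexicon if w.startswith(heard_so_far)]
        print(f"heard '{heard_so_far}': cohort = {cohort}")
        if len(cohort) == 1:          # recognition point: only one candidate left
            break

cohort_over_time("world")
# heard 'w': cohort = ['word', 'world', 'work', 'wonder', 'watermelon']
# heard 'wo': cohort = ['word', 'world', 'work', 'wonder']
# heard 'wor': cohort = ['word', 'world', 'work']
# heard 'worl': cohort = ['world']
```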
Segmentation problem
The Cohort Model raises an important issue of word segmentation: surely, we need to know where a word starts for the appropriate cohort to be activated. In continuous speech, words are not always separated acoustically; have a look at the spectrogram below, which corresponds to the phrase /to stand against/:
Moreover, words can often be 'hiding' within longer words, and it can be tricky to work out which word we are actually perceiving: listening to 'he spied a deer', we may at first hear 'he spider ear', for example.
McQueen et al. (1994) suggested that the larger the activated cohort is, the smaller the share of activation each word gets, and therefore the harder it is to find the right one as they compete. Frequency also matters: a more frequent word is more likely to win. Here, 'spider' is probably a more frequent word than 'spied', so it is more likely to 'win'. Once again, however, top-down processing helps, as context plays a crucial role in comprehending speech.
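Here is a toy sketch of that kind of competition (my own illustration, loosely in the spirit of McQueen et al., with invented frequency counts and a made-up mini-lexicon): every word consistent with the input so far gets a share of activation weighted by its frequency, so a large cohort dilutes everyone, and frequent words tend to dominate.

```python
# A toy sketch of frequency-weighted lexical competition (illustration only;
# the mini-lexicon and frequency counts are invented for demonstration).
word_frequency = {"spied": 2, "spider": 30, "spin": 25, "spill": 15}  # hypothetical counts per million

def activations(cohort):
    """Share of activation for each cohort member, weighted by word frequency."""
    total = sum(word_frequency[w] for w in cohort)
    return {w: word_frequency[w] / total for w in cohort}

# After hearing 'spi...' the whole cohort competes and each word gets only a little:
print(activations(["spied", "spider", "spin", "spill"]))

# After the 'd' arrives only 'spied' and 'spider' remain consistent; 'spider'
# dominates on frequency alone, which is why 'he spider ear' may be heard at
# first, until context (top-down information) pushes 'spied' back up.
print(activations(["spied", "spider"]))
```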
Modelling
McClelland & Elman (1986) implemented these ideas on a computer, creating the TRACE model:
Each node represents a hypothesis that a particular word is present in the stimulus; the nodes compete with each other until only one 'winner' remains. An important point here is that consistent features do not just confirm each other; inconsistent ones inhibit each other (you can see that the arrows point in both directions). So, for example, when we hear WORLD, words such as WORK, WORD and WONDER become activated. Then the emerging sound /r/ inhibits the activation of WONDER, and then /l/ and /d/ inhibit WORK and WORD, at the same time confirming WORLD. The same process happens at the phoneme level, where sound features (such as power, burst etc.) confirm or inhibit phoneme activation.
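To give a feel for these interactive dynamics, here is a very small Python sketch in the spirit of TRACE (not McClelland & Elman's actual implementation): word nodes are excited by the input heard so far and inhibit one another, so one candidate gradually pulls ahead. The mini-lexicon, the parameter values, and the use of letters instead of phonemes are all simplifying assumptions.

```python
# A very small interactive-activation sketch in the spirit of TRACE (illustration only).
WORDS = ["work", "word", "world", "wonder"]
EXCITE, INHIBIT, DECAY = 0.1, 0.05, 0.9   # assumed parameter values

def step(activation, heard):
    """One update cycle given the segments (here: letters) heard so far."""
    new = {}
    for w in WORDS:
        # Bottom-up support: how many positions of the input match this word.
        support = sum(1 for i, seg in enumerate(heard) if i < len(w) and w[i] == seg)
        # Lateral inhibition: every other word's activation pushes this one down.
        inhibition = sum(a for v, a in activation.items() if v != w)
        new[w] = max(0.0, DECAY * activation[w] + EXCITE * support - INHIBIT * inhibition)
    return new

activation = {w: 0.0 for w in WORDS}
for heard in ["w", "wo", "wor", "worl", "world"]:
    activation = step(activation, heard)
    print(heard, {w: round(a, 2) for w, a in activation.items()})
# By the final cycles 'world' has pulled clearly ahead of its competitors.
```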
Next time, I will discuss connectionism further, and show how - and for what reason - models of speech perception are constructed.