
Visions of Voices

SAIL scientists decode the mechanisms of speech using sophisticated MRI systems
Eric Mankin
May 25, 2012

How do thoughts turn into words? How does the flow of information from a speaker's brain operate and orchestrate the complex apparatus that makes up the vocal tract?

The beginnings of answers to these questions are emerging from a specially dedicated, highly interdisciplinary study center within the USC Viterbi School of Engineering’s Signal Analysis and Interpretation Laboratory (SAIL).

SAIL’s Speech Production and Articulation kNowledge Group (SPAN) brings together faculty and students from the Viterbi School of Engineering and the Departments of Linguistics and Computer Science, in collaboration with researchers in the Department of Psychology and the Keck School of Medicine, and with other research institutions, including Haskins Laboratories. SPAN uses sophisticated magnetic resonance imaging (MRI), tailored to the upper airway, to record 2D and 3D motion of the body’s sound machinery, precisely synchronized with recordings of the spoken words captured by a specially designed audio system.
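To give a flavor of the bookkeeping this synchronization involves, here is a minimal Python sketch that maps one MRI frame to the audio samples recorded during it. The frame rate and sample rate below are placeholder assumptions, not SPAN's actual acquisition parameters, and the function name is hypothetical.

```python
import numpy as np

def audio_segment_for_frame(frame_index, frame_rate_hz, audio, sample_rate_hz):
    """Return the audio samples recorded during one MRI frame.

    Assumes audio capture and imaging share a common start time; in
    practice a hardware trigger or alignment signal establishes this.
    """
    start_t = frame_index / frame_rate_hz        # frame start, in seconds
    end_t = (frame_index + 1) / frame_rate_hz    # frame end, in seconds
    start_s = int(round(start_t * sample_rate_hz))
    end_s = int(round(end_t * sample_rate_hz))
    return audio[start_s:end_s]

# Placeholder values: ~23 frames/s imaging and 20 kHz audio are assumptions.
audio = np.zeros(20000 * 5)                      # 5 s of silent stand-in audio
segment = audio_segment_for_frame(10, 23.0, audio, 20000)
print(len(segment), "audio samples fall within MRI frame 10")
```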

These detailed records allow the researchers – an interdisciplinary team led by SAIL’s director, Professor Shrikanth Narayanan – to visualize and quantify the dynamic production of speech using multiple slice images through the volume of the moving vocal tract. They help define the specific contribution of each part of the vocal instrument in creating the rich tapestry of sounds familiar from linguists’ descriptions. Nasal sounds, for example, are created by an intricate coordination of tongue-tip closure behind the teeth with the opening of the velar port.
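As a rough illustration of how such coordination might be quantified from imaging data (a sketch under assumed inputs, not the group's published analysis), the timing lag between two articulator gestures can be estimated by cross-correlating per-frame measurements extracted from the images:

```python
import numpy as np

def gesture_lag_frames(tongue_tip, velum):
    """Estimate the lag (in frames) between two gesture trajectories
    via cross-correlation; positive means the velar gesture trails the
    tongue-tip gesture. Inputs are assumed to be 1-D arrays of
    per-frame constriction/opening degree measured from MRI frames.
    """
    a = tongue_tip - tongue_tip.mean()
    b = velum - velum.mean()
    xcorr = np.correlate(b, a, mode="full")
    return int(np.argmax(xcorr)) - (len(a) - 1)

# Synthetic gestures: velar opening trails tongue-tip closure by 3 frames.
t = np.arange(60)
tip = np.exp(-((t - 25) ** 2) / 30.0)   # tongue-tip closure peaks at frame 25
vel = np.exp(-((t - 28) ** 2) / 30.0)   # velar opening peaks at frame 28
print(gesture_lag_frames(tip, vel))     # -> 3
```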

Members of Narayanan's group include Professors Krishna Nayak, Louis Goldstein, Dani Byrd, Richard Leahy, and Sungbok Lee; Drs. Michael Proctor and Yoon-Chul Kim; and a number of Ph.D. students from Electrical Engineering, Computer Science, and Linguistics including Adam Lammert, Jangwon Kim, Vikram Ramanarayanan, Christina Hagedorn, and Yinghua Zhu.

One way to describe the technique is as ultra-high-tech, enhanced "lipreading" – but with the luxury of a view of all the hidden articulators, such as the tongue and the velum.

This technology can detect differences subtle enough to test complex hypotheses about how speech is produced. For example, pauses are often grammatically built into the linguistic structure of an utterance – a verbal comma, as it were. But pauses also occur ungrammatically at times, such as when a speaker stops to remember a name or a word.

Imaging a speaker's vocal tract with the synchronized MRI and audio system shows that these two forms of silence – grammatical and ungrammatical – are produced with distinctly different movement patterns. For the grammatical pause, the vocal apparatus slows as it approaches the break; for the ungrammatical pause, it simply stops.
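One way to picture that distinction: track an articulator's speed over time and look at how it behaves just before the silence begins. The sketch below is not the group's published method, just an illustration of the idea on synthetic velocity traces; the slope threshold and window length are arbitrary assumptions.

```python
import numpy as np

def classify_pause(speed, pause_onset, window=10):
    """Label a pause by how the articulators behave just before it.

    speed: per-frame articulator speed; pause_onset: frame where the
    silence begins. The slope threshold is an arbitrary assumption.
    """
    pre = speed[max(0, pause_onset - window):pause_onset]
    # A steady negative slope into the pause suggests planned
    # deceleration; a flat run ending in a sudden drop suggests an
    # unplanned stop.
    slope = np.polyfit(np.arange(len(pre)), pre, 1)[0]
    return "grammatical" if slope < -0.02 else "ungrammatical"

# Synthetic traces: one ramps down toward the pause, one cuts off abruptly.
ramp = np.concatenate([np.linspace(1.0, 0.1, 30), np.zeros(10)])
cut = np.concatenate([np.full(30, 1.0), np.zeros(10)])
print(classify_pause(ramp, 30))  # -> grammatical
print(classify_pause(cut, 30))   # -> ungrammatical
```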

The researchers have also examined what makes an accent. Imaging vocal tracts reveals a range of subtle differences between speakers of different languages. For example, the MRI data show differences between American English spoken by Americans and the same words spoken by people whose first language is German, Tamil, or Hindi.

Another project investigated the production of 'beat boxing' – using the voice to create the sounds of percussion instruments. The researchers have also explored whether sung words are produced by the same processes as words in conversation or reading: one SAIL study brought the vocal tracts of four sopranos under the MRI scanner, along with talkers and readers.

The research, funded by the National Institutes of Health and other sources, is still in its early stages. But SPANers say it may lead to machine vocal tracts that produce far more lifelike, human-sounding words, to techniques that help speech therapists, and to other advances. Keep listening and watching closely.

Voice capture in action