
"The Doctor Will Understand You Now"

NSF funds English/Spanish speech-to-speech translation system to help health care providers and patients
Eric Mankin
September 22, 2009

In medical facilities around the country, care is delayed, complicated and even jeopardized because doctors and patients don't speak the same language -- a situation particularly dire in diverse megacities like Los Angeles and New York.

Photo: Shrikanth Narayanan, right, with Panayiotis Georgiou. "We need to go beyond literal translation."
Now, USC computer scientists, communication specialists and health professionals hope to create a cheap, robust and effective speech-to-speech (S2S) translation system for clinics, emergency rooms and even ambulances.

The initial SpeechLinks system will translate between English and Spanish. Professor Shrikanth Narayanan, who directs the Signal Analysis and Interpretation Laboratory at the USC Viterbi School of Engineering, hopes to test and deliver a working prototype within the four-year window of a recently awarded $2.2 million NSF grant for "An Integrated Approach to Creating Context Enriched Speech Translation Systems."

Narayanan, who holds appointments in the USC departments of electrical engineering, computer science, linguistics and psychology, will collaborate on the project with fellow engineering faculty member Panayiotis Georgiou, Professor Margaret McLaughlin of the Annenberg School for Communication, and researchers and clinicians from the Keck School of Medicine at USC.

The project will also include investigators from two corporations, BBN and AT&T, who will not only collaborate on the research but also serve as mentors to the students working on the project.

The detailed prospectus for the effort begins by explaining the need: "While large medical facilities and hospitals in urban centers such as Los Angeles tend to have dedicated professional language interpreters on their staff (a plan which still suffers from cost and scalability issues), multitudes of smaller clinics have to rely on other ad hoc measures including family members, volunteers or commercial telephone translation services. Unfortunately, human resources for in-person or phone-based interpretation are typically not easily available, tend to be financially prohibitive or raise privacy issues (such as using family members or children as translators)."

Photo: Margaret McLaughlin

Filling these needs, Narayanan says, will require a system that can perceive and interpret not just words but a wide range of human communication, an improvement on current, limited "pipeline" translation technology. "We want to let people communicate," he says. "We need to go beyond literal translation" -- heavily based on translating written texts rather than spoken language -- "to rich expressions in speech and nonverbal cues. We want to enhance human communication capabilities."

The additional cues to be analyzed and incorporated into the translation mix include, according to the plan:

  • Prosodic information: Spoken language uses word prominence, emphasis and contrast, as well as intonational cues that signal, for example, whether an utterance is a statement or a question. Speech also divides subjects and thoughts in ways that aren't always clear in a word-by-word literal translation. Prosodic cues will serve as an important information source for robust intelligence in the proposed work.
  • Discourse information: Capturing contextual cues of dialog becomes especially important since the target goal is to enable interpersonal interactions, in contrast to applications where the end-result is just translated text. The group plans to model and track the cross-lingual dialog flow to improve the information transfer between the interlocutors.
  • User state information: User state information such as affect and attitude is also a critical part of interpersonal communication. The group plans to investigate markers of valence (positive/negative) and activation (strong/weak) conveyed in spoken language. Specifically, the effort plans to capture this meta-information and transfer these source utterance characteristics and speaker attitudes to the target language through augmented expressive synthesis schemes.
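
As a rough illustration only, and not the SpeechLinks design, the three kinds of cues listed above could be pictured as extra fields that travel with each utterance from the source language to the target language. Every name in this Python sketch (EnrichedUtterance, carry_cues, the field names and the sample values) is hypothetical and does not come from the project.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnrichedUtterance:
    """One spoken turn plus illustrative cue fields (all hypothetical)."""
    text: str          # the words themselves
    language: str      # e.g. "en" or "es"
    is_question: bool  # prosodic cue: intonation marks a question
    dialog_act: str    # discourse cue, e.g. "ask_symptom"
    valence: str       # user state: "positive" or "negative"
    activation: str    # user state: "strong" or "weak"


def carry_cues(source: EnrichedUtterance, translated_text: str,
               target_language: str) -> EnrichedUtterance:
    """Return a new utterance with translated words and the source's cues.

    The translation itself is assumed to happen elsewhere; this function
    only copies the cue fields so a speech synthesizer could use them.
    """
    return replace(source, text=translated_text, language=target_language)


doctor_turn = EnrichedUtterance(
    text="Where does it hurt?",
    language="en",
    is_question=True,
    dialog_act="ask_symptom",
    valence="negative",
    activation="weak",
)
spanish_turn = carry_cues(doctor_turn, "¿Dónde le duele?", "es")
print(spanish_turn)
```

A literal, text-only translation would pass along just the text field; the point of the sketch is that the question intonation, the dialog role and the speaker's state stay attached to the translated words.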

Another element of the mix is embedding these analyzed cues into speech synthesized from responses or questions typed into the interface.

The emphasis is not on hardware, but rather on creating a system that can make effective use of existing, low-cost computers and other electronic equipment.

The overall strategy involves building on existing translation techniques by expanding the reach of tools used earlier, with notable but limited success, to allow machine intelligences to perceive prosodic and other rich contextual cues.

So, for example, notes McLaughlin, while machine translation of text relied on analysis of masses of parallel written texts in the two languages, the new effort will compare and analyze bodies of oral text.

A big advantage here is that, unlike generalized conversation, doctor-patient interactions involve a limited, controlled context, making it easier to eliminate dead-end translation possibilities and false clues.

McLaughlin also emphasizes that in addition to linguistic information, the effort will incorporate cultural cues and information: "Our system will not only be bilingual, but bicultural," she says.

The effort will also take advantage of lessons learned in an earlier effort by Narayanan on S2S translation for doctors, the Transonics Spoken Dialog Translator, designed to allow English-speaking doctors to communicate with Farsi-speaking patients.

Besides Narayanan, McLaughlin and Georgiou, others working on the effort include Drs. Lourdes Baezconde-Garbanati and Win May of the Keck School of Medicine; Vivek Sridhar, Prem Natarajan and Rohit Prasad of BBN; and Srinivas Bangalore of AT&T.