If you’ve ever read a webpage in another language by using Google Translate, Bing Translator or SDL Language Weaver, you can partially thank Daniel Marcu, a research associate professor at the USC Viterbi Information Sciences Institute and Department of Computer Science. Along with collaborators Kevin Knight, Ulf Hermjakob, Aram Galstyan and Jose Luis Ambite, he researches artificial intelligence and natural language processing, a discipline that asks a big question: “How do we make computers deal with language the same way we do?”
For more than a decade, Marcu focused his efforts on a particular facet of natural language processing – statistical machine translation, the set of algorithms that take text in one language and churn it out in another. The field was in its infancy when Marcu began his work, and now it’s a prosperous industry.
“We took this field, statistical machine translation,” Marcu said, “from being something that was just an academic research topic with no scientific papers published in that area in 1998 to today being an area of great interest to a very diverse and large set of scientists around the globe with 500 papers published each year.”
“We took the research concept into something that I think really changed the world,” said Marcu. “Not only did we create Language Weaver, the first company to commercialize statistical machine translation technology, but then little by little other research labs decided that’s a good idea, and then SDL significantly expanded their use of machine translation to increase human translation productivity, and Google created a research group to study statistical machine translation, as did Microsoft.”
Marcu and his colleague Knight turned their ISI-incubated technology into a successful company called Language Weaver in 2002, which was acquired by SDL in July 2010. Marcu stayed with SDL until 2014 to ensure the successful integration of the technology into a variety of products, from analytics to content management to human translation productivity tools. In 2014, Marcu also became an Association for Computational Linguistics Fellow for his significant contributions to discourse parsing, summarization and machine translation, and for kick-starting the statistical machine translation industry.
Today there is a thriving industry where Google, Bing, SDL, eBay, Facebook and others translate hundreds of billions of words every day from one language to another. “This enables people to communicate across cultures and do things that were thought to be impossible just 10 years ago,” said Marcu.
Having checked machine translation off his list, as it were, Marcu is now looking for his next big research focus: “It’s time to move on and figure out which technology in 2014 looks like statistical machine translation did in 1999 when I looked at that problem for the first time.”
The next frontier for Marcu is a combination of so-called “Big Mechanisms” and human-machine collaboration.
The former deals with helping computers understand texts and data sets at a deeper level, creating large knowledge bases in a format that computers can process. You would be able to query this bank of knowledge and even reason with it, things you cannot do today because this knowledge is scattered across different systems and encoded in different forms. While today you can query a textual database with a keyword search and refine your results, in the future you could have a conversation with a bank of knowledge that draws from text, data sets, pictures, audio, video assets and more.
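The contrast between today's keyword search and querying a machine-readable knowledge base can be illustrated with a small sketch. This is purely a toy example, not a description of any actual system: the documents, triples and function names below are all hypothetical, invented to show how structured facts let a program answer relational questions rather than just match strings.

```python
# Toy illustration (hypothetical data): keyword search over raw text
# versus querying the same facts stored as structured triples.

documents = [
    "Gene TP53 regulates the cell cycle.",
    "Mutations in TP53 are found in many cancers.",
    "BRCA1 is involved in DNA repair.",
]

def keyword_search(query, docs):
    """Today's approach: return every document containing the term."""
    return [d for d in docs if query.lower() in d.lower()]

# A knowledge base encodes the same information as (subject, relation, object)
# triples, so software can filter by the *structure* of a fact.
knowledge = [
    ("TP53", "regulates", "cell cycle"),
    ("TP53", "implicated_in", "cancer"),
    ("BRCA1", "involved_in", "DNA repair"),
]

def query_knowledge(subject=None, relation=None):
    """Return all triples matching the given subject and/or relation."""
    return [t for t in knowledge
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)]
```

With text alone, `keyword_search("TP53", documents)` can only hand back sentences for a human to read; `query_knowledge(subject="TP53")` returns discrete facts a program can reason over and combine with facts from other sources.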
Marcu sees this being a huge step forward in the scientific process. Consider cancer research. Every day there are hundreds of papers published on molecular interactions in biology, and there are over 20 million published articles in PubMed alone. We have vast sources of information in textual form, database form and other representations that capture this knowledge, but we still don’t understand the mechanisms that make a healthy cell become a cancerous one. “We have all these big knowledge sources, but we don’t know how to put them all together,” said Marcu.
Paired with a researcher, an artificial intelligence – one that can pore through research papers and cover more than any one person could read in a lifetime – could be a valuable partner in the research process, even coming up with its own hypotheses. "It's not out of the question for the machine to come up with suggestions for experiments that you want to do to fill in the gaps," said Marcu.
This Big Mechanisms project that will essentially teach computers to understand language is called “Learn to Read to Know; Know to Learn to Read.” The name emphasizes the interrelationship between reading, knowing and learning – that you can’t have one without the other two.
“The computer is teaching itself to learn, read and know,” Marcu said. “We do teach the computer a lot now, but we insert lots of expertise in this triangle, and little by little we hope to get the computer to get sufficiently smart so it uses the knowledge it has in order to read better. As it reads, it increases the knowledge it has, and it learns continuously to do these processes better and better over time.”
This all points to a future where machines are more than passive repositories of knowledge. Rather, they will be sophisticated curators of these knowledge banks, working alongside us to improve and expand the entirety of human knowledge. This will be revolutionary for researchers, but just like machine translation changed the experience for average users of a global Internet with its many languages, Big Mechanisms means that a future incarnation of Siri, Cortana or other digital companions will be more like a knowledgeable colleague than a personal assistant.