A Better Search for Cures

In the past, genetics researchers had to find hundreds of people for every study they started. Now, the USC Information Sciences Institute has reduced that search to a few keystrokes.
By: Rosalie Murphy
October 14, 2013


From left to right: Jose Luis Ambite, Yigal Arens, Rajiv Mayani, Karan Vahi, Shefali Sharma


Do you have a family history of heart disease, diabetes or high cholesterol? Your doctor needs to know because these disorders are genetic; if a parent or grandparent has one, you may be at risk. But what about schizophrenia, bipolar disorder or Alzheimer’s disease?  

That’s what USC Keck Chair of Psychiatry and Behavioral Sciences Carlos Pato wants to know. He needs hundreds of families and thousands of participants to understand these disorders’ genetic origins. But schizophrenia is far less common than heart disease, and even fewer people with the disorder have a diagnosed parent, sibling or other relative.

“For genomic studies, of course, you need very large numbers of subjects,” Pato said. “Without collaboration, you cannot achieve these numbers.”

And so collaboration began. For one schizophrenia study, beginning in 1992, Pato and his wife and research partner, Michele Pato, gathered biosamples from 300 affected families. When the study was finished, he donated the subjects’ DNA to the Center for Collaborative Genomic Studies on Mental Disorders, a project of the National Institute of Mental Health. He also donated the data linked to each of those samples.

Now, the next time he needs to sample hundreds of families with schizophrenia, he only needs a few search terms. In partnership with NIMH, USC Viterbi’s Information Sciences Institute packages all that data – from 159,765 subjects, and counting – into a single searchable warehouse.

“You need to find hundreds of participants,” said José Luis Ambite, an ISI scientist, Research Assistant Professor in Computer Science and one of the project’s leaders. “Or, you can use this database to recruit a much larger group, which can be more accurate and much cheaper.”

Scientists seeking data about subjects first search a publicly accessible database. If a researcher needs men with Alzheimer’s disease who also have one parent who had the disease, for example, she can enter those search criteria and see how many subjects match. Then, she can prepare a research proposal and request full data for those subjects.
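
As a rough illustration of the kind of criteria search described above, the following Python sketch counts matching subjects in a tiny in-memory list. The record fields (`sex`, `diagnosis`, `affected_parents`) and the sample rows are hypothetical, invented for this example; they are not the repository's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    """Hypothetical, simplified subject record (not the real schema)."""
    subject_id: str
    sex: str               # "M" or "F"
    diagnosis: str         # e.g. "alzheimers"
    affected_parents: int  # number of parents with the same diagnosis


# Made-up sample rows, for illustration only.
SUBJECTS = [
    Subject("S1", "M", "alzheimers", 1),
    Subject("S2", "F", "alzheimers", 1),
    Subject("S3", "M", "alzheimers", 0),
    Subject("S4", "M", "schizophrenia", 1),
    Subject("S5", "M", "alzheimers", 1),
]


def count_matches(subjects, sex, diagnosis, affected_parents):
    """Count subjects that satisfy all three search criteria."""
    return sum(
        1
        for s in subjects
        if s.sex == sex
        and s.diagnosis == diagnosis
        and s.affected_parents == affected_parents
    )


# Men with Alzheimer's disease who have exactly one affected parent.
match_count = count_matches(SUBJECTS, "M", "alzheimers", 1)
print(match_count)
```

The sketch returns only a count, mirroring the article's description that a researcher first sees how many subjects match and requests the full data afterward.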

Before ISI joined the project, researchers could download the data as files from the NIMH website. However, there was limited standardization, with many different types of genetic and clinical file formats. Researchers needed to spend significant effort to make sense of the files, identify subjects with the required characteristics and compile their data. The process took months.

“We've created a web-based search tool that allows researchers to find individuals who meet their criteria very quickly,” Ambite said. “This is a valuable service to the research community.”

And ISI is just the latest USC link to this project. First, Pato himself helped start NIMH’s Genetics Initiative in 1988. Next, current Executive Director of Research Advancement Steven Moldin was an NIMH program official for 11 years and served as the Project Officer for the NIMH Human Genetics Initiative, making the Center for Collaborative Genomic Studies his signature effort. So, when Moldin joined USC in 2006, he recognized ISI’s value.

“When I came [to USC], I said, ‘The kind of computer science expertise at ISI would be perfect to augment the value of the repository for the research community’,” Moldin said. “The ISI has enabled different kinds of analyses that wouldn't be possible before, through developing innovative bioinformatics tools. They've enabled researchers who wouldn't have been able to do powerful analyses to elucidate the genetic basis of mental disorders.”

ISI started its second five-year phase of the project on June 1. The team, headed by Ambite, Yigal Arens and Ewa Deelman, hopes to improve the program’s quality control capabilities. Currently, researchers can submit their data in any format, and specialists at Washington University in St. Louis input the data, making it consistent with every other set – for example, making sure “female” and “male” are always represented by “F” and “M.” When they discover inconsistencies, they can send the data back to authors for correction.
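
The consistency step described above can be sketched in a few lines of Python. This is only an assumption about how such a check might look, not the Center's actual procedure: the accepted spellings in `SEX_CODES`, the `sex` field name and the function names are all hypothetical.

```python
# Hypothetical mapping from spellings seen in submissions to the
# standard codes mentioned in the article ("F" and "M").
SEX_CODES = {"female": "F", "f": "F", "male": "M", "m": "M"}


def normalize_sex(raw):
    """Return "F" or "M", or None if the value is not recognized."""
    if raw is None:
        return None
    return SEX_CODES.get(str(raw).strip().lower())


def normalize_records(records):
    """Split records into normalized ones and ones needing correction."""
    clean, rejected = [], []
    for record in records:
        code = normalize_sex(record.get("sex"))
        if code is None:
            # Unrecognized value: set aside to send back to the authors.
            rejected.append(record)
        else:
            clean.append({**record, "sex": code})
    return clean, rejected
```

For example, `normalize_records([{"id": 1, "sex": " Female "}, {"id": 2, "sex": "x"}])` would place the first record in the clean list with `"sex": "F"` and the second in the rejected list.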

The Center can also check for the correctness of the submitted data by detecting atypical values, Arens explained.
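
The article does not say how atypical values are detected. One common approach, shown here purely as an assumed example, is to flag numbers that sit far from the median, measured in units of the median absolute deviation (MAD). The function name, the 3.5 cutoff and the sample ages are illustrative choices, not details from the project.

```python
from statistics import median


def flag_atypical(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. If the MAD is
    zero (more than half the values are identical), nothing is flagged.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up ages; the last entry looks like a data-entry error.
ages = [34, 36, 35, 37, 33, 350]
suspect_indexes = flag_atypical(ages)
print(suspect_indexes)
```

A check like this only catches numeric outliers. It would not catch the plausible-looking but wrong diagnoses that Arens describes below, which is why expert review remains part of the process.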

“Diagnoses change over time. Sometimes an expert will look at it and say that the diagnosis looks wrong. The data won't look wrong, and as computer scientists, we certainly don't know, but experts in the field – looking at the questionnaires, the questions people were asked, information about them – can tell that the information was wrong,” Arens said. “We’re working with psychiatrists to figure out if there's some way for us to tell if there are potential errors here.”

In the future, ISI hopes to refine its program to correct some of those errors automatically. They also plan to link the website to broad genomic information for searches across disorders, since certain genes linked to one disorder may enable genes linked to another.

“I don't think I had any sense of the scale this would reach,” said Pato, who has now donated data from more than 30,000 subjects. “We see really clear advances in research over this time period. This is exactly the type of national and international collaboration that this type of work needs.”


Note: Ewa Deelman’s “Pegasus” program, software supporting the CGSMD quality control and analysis workflows, has also been used by astronomers, earthquake researchers and gravitational physicists.