Viterbi Computer Scientists Building a Genetics 'Information Explorer'

The goal: new search toolbox to help researchers make connections between medical records and genetic data
By: Eric Mankin
November 08, 2011

The good news for medical researchers seeking genetic roots for disease is raw material: the immense volume of information that has been collected over decades of patient genotyping. The bad news is the growing difficulty of zeroing in on relevant needles of information in these gigantic database haystacks.

Yigal Arens: helping medical researchers track the connection between genotypes and phenotypes
A computer research team based at the USC Viterbi School Information Sciences Institute (ISI) and funded by the National Institutes of Health (NIH) hopes to improve this situation by creating a utility called “Information Explorer.”

ISI Deputy Director Yigal Arens is a principal investigator on the project, along with Leslie Lange of the University of North Carolina, a genetic epidemiologist who has participated in a number of genetic analysis collaborations studying cardiovascular-related phenotypes.

Arens and the team hope that Information Explorer will contribute to an effort that is as old as genetics: finding and proving connections between visible or testable physical conditions (phenotypes) and underlying genetic makeup (genotypes). The project proposal spells out the current problems with using these records:

“While the amount of data made available has increased dramatically in recent years, relatively little has been done in order to facilitate phenotype harmonization across studies. Many genetic epidemiologic studies of cardiovascular disease have multiple variables related to any given phenotype, resulting from different definitions and multiple measurements or subsets of data.

“A researcher searching such databases for the availability of phenotype and genotype combinations is confronted with a veritable mountain of variables to sift through. This often requires visiting multiple websites to gain additional information about variables that are listed on databases, and examination of data distributions to assess similarities across cohorts. … This is a time-consuming process that may still miss the most appropriate variables. Moreover, every researcher that wants to compare the same datasets often needs to start from scratch since there are no tools to share the phenotype comparison results.”

Arens says that even for information as basic as demographics or medication, “there is no agreed-on standard describing what you are measuring.”

ISI has long been a center for research on integrating and representing complex data. ISI researchers recently began focusing on the genotype/phenotype connection, notably in work by the Institute’s Jose Luis Ambite and Chun-Nan Hsu on a project called Population Architecture using Genomics and Epidemiology (PAGE), which aims to understand associations of genetic variants with complex diseases and traits across a variety of populations.

Jose Luis Ambite

Another ISI effort in genetic analysis deals with the National Institute of Mental Health (NIMH) Center for Collaborative Genetic Studies on Mental Disorders (CGSMD), which is being pursued by Arens, Ambite and Hsu.

The Information Explorer toolbox will build on, expand, combine and refine the software insights from these earlier efforts to create a much more extensive and (it is hoped) more powerful framework. Rather than attempting to deliver a single finished product at the end, the project will use the “agile software development” model developed at the Institute (in cooperation with other centers), which emphasizes maintaining a working software system throughout the project period while incrementally improving its capabilities to address users’ evolving needs and requests.

The goal is to allow researchers working in the field to:

  1. Quickly obtain the information needed to assess whether a specific study will be useful for the hypothesis of interest;
  2. Exclude variables that do not meet research criteria;
  3. Ascertain which studies have combinations of phenotype and genetic information of interest; and
  4. More easily expand research questions beyond the most basic main effects to more complex analyses such as gene-by-environment interactions and multivariate tests incorporating multiple phenotypes.

The group’s work will be evaluated at the end of two years, at which time it may be extended for three more years at double the rate of funding.