
The Shape of Things That Come to Be Researched

ISI's Gully A.P.C. Burns applies scientific data mining and visualization techniques to the structure of science grants

June 21, 2011

Who’s researching what, and how is it being funded? Gully A.P.C. Burns, a project leader at the USC Information Sciences Institute, and his collaborators have applied modern data mining techniques to provide new perspectives on grants awarded by the National Institutes of Health (NIH), the world’s largest single source of funding for biomedical research. A website displaying the results is now open to users at http://nihmaps.org.

The Big Picture: The new website offers an unconventional look at the structure of combined NIH grant funding
The website uses statistical processing of the text from the grants to automatically learn categories used by researchers to describe their ideas, rather than using predetermined administrative or reporting categories. This allows users to compare and correlate grant awards, and also to understand how grants cluster in general thematic categories. These clusters are rendered in striking visual representations that can be probed and queried for specific grants of interest. The techniques may have wide applications to navigating other dense information complexes.

The project grew out of a poster presentation by Burns at a Society for Neuroscience conference, in which he analyzed the abstracts that had been presented at the previous year’s conference.

The Burns poster attracted the attention of Edmund M. Talley of the National Institute of Neurological Disorders and Stroke. Talley and Burns began a collaborative process that culminated in the recent opening of the new database. Along the way, this collaboration also engaged researchers from the University of Massachusetts and the University of California, Irvine, plus the startup ChalkLabs, which specializes in web interface design.

The Close-Up Picture: USC NIH funding

The questions the new site is designed to answer are important not just to working scientists, but to a wide range of others, including policy analysts, legislators, and of course the funders themselves.

The information containing these answers is currently filed in a bewilderingly complex set of departments and categories. A letter describing the work and the new website appears in the June 2011 issue of Nature Methods. As it notes, NIH alone makes some 80,000 awards each year through 25 different Institutes and Centers, “with distinct but overlapping missions, and relationship between these missions and the research they fund is multifaceted.”

As a result, notes the letter, “navigating the NIH funding landscape can be challenging.”

NIH digitizes the information and makes it available on websites where it can be searched by keyword tags and categories in a partially automated system. But probing the same data with new text mining techniques, developed for analyzing social network information among other uses, can yield much more user-friendly and useful results.

One of the tools is called "topic modeling", a statistical method that analyzes bodies of text for subject similarities, casting a wider and finer net than keyword searches. A second is a graphic tool that creates a two-dimensional image clustering subject matter according to textual similarities.
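To give a rough sense of what topic modeling does, here is a minimal sketch in Python. It is not the project's code, and the article does not say which topic model or software the site uses. The sketch assumes latent Dirichlet allocation (LDA), one common topic model, fitted with a toy collapsed Gibbs sampler. The function name `fit_lda`, its parameters, and the four tiny "abstracts" are all made up for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not the site's method).

    docs is a list of token lists. Returns (mixtures, top_words):
    mixtures[d][k] is the estimated share of topic k in document d,
    top_words[k] lists up to three of the most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []                                      # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic shares; each row sums to 1.
    mixtures = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in topic_word[k].most_common() if c > 0][:3]
        for k in range(num_topics)
    ]
    return mixtures, top_words


# Made-up miniature "abstracts"; real grant text would be far longer.
abstracts = [
    "neuron synapse cortex neuron axon".split(),
    "synapse cortex axon neuron synapse".split(),
    "tumor oncogene chemotherapy tumor biopsy".split(),
    "oncogene biopsy tumor chemotherapy oncogene".split(),
]

mixtures, top_words = fit_lda(abstracts, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {', '.join(words)}")
for d, mix in enumerate(mixtures):
    print(f"abstract {d}: " + ", ".join(f"{p:.2f}" for p in mix))
```

The per-document topic shares are what make comparison possible: two grants with similar shares are about similar things even if they share no keyword, and those shares could in principle feed a two-dimensional layout like the one the second tool draws.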

“Obtaining similar information in the absence of the current database would require extensive exploration of Institute websites, followed by time-consuming research on appropriate keywords. ... Our database offers an alternative approach that enables rapid and reproducible retrieval of meaningful categorical information,” conclude the authors.

Besides Talley and Burns, the researchers include David Newman of the University of California, Irvine; Hanna M. Wallach, Andrew McCallum and David Mimno of the University of Massachusetts, Amherst (Mimno is now at Princeton); A.G. Miriam Leenders of NIH; and Bruce W. Herr II of ChalkLabs, Bloomington, Indiana. NIH supported the project.