CS Colloq: Disambiguation of Textual and Web Data
Tue, Mar 18, 2008 @ 11:00 AM - 12:30 PM
Thomas Lord Department of Computer Science
Conferences, Lectures, & Seminars
Title: Disambiguation of Textual and Web Data
Speaker: Dr. Dmitri V. Kalashnikov (UCI)
Abstract:
The effectiveness of decision support, data exploration, and scientific discovery tools is closely tied to the *quality of data* to which such techniques are applied. It is well recognized that the outcome of an analysis is only as good as the data on which the analysis is performed. Organizations today spend a tangible percentage of their budgets on information quality tasks (e.g., removing duplicates, correcting errors, filling in missing values) to improve data quality before pushing data through the analysis pipeline. Forrester group estimates that the market for data quality will pass the $1 billion mark by 2008. Given the critical importance of data quality, many efforts in both industry and academia have explored systematic approaches to addressing information quality challenges. Solutions range from approaches that address specific problems (e.g., address resolution, merging product catalogs) to generic techniques for de-duplication, record linkage, and entity resolution that work across a wide range of domains.
This talk focuses primarily on the *Entity Resolution* challenge, which arises because objects in the real world are referred to using references or descriptions that are not always unique identifiers of those objects, leading to ambiguity. The problem is especially common when multiple data sources are fused to create a single unified data warehouse, or when data is derived from unstructured sources (e.g., text documents) or semi-structured sources (e.g., HTML Web pages).
The talk will summarize our ongoing disambiguation work. Specifically, it will cover our general-purpose, domain-independent disambiguation framework, which we refer to as the Graph-based Disambiguation Framework (GDF). GDF is based on the premise that many real-world datasets are relational in nature and contain not only information about entities but also relationships among them, knowledge of which can be used to disambiguate among representations more effectively.
The talk will also briefly cover our disambiguation work on creating spatial awareness from raw textual reports and our state-of-the-art algorithms for solving the Web People Search problem.
Key Words: Entity Resolution; Web People Search; Spatial Awareness from Text.
Biography:
Dmitri V. Kalashnikov received a diploma cum laude in Applied Mathematics and Computer Science from Moscow State University, Russia, in 1999 and a PhD in Computer Science from Purdue University in 2003. He is currently a researcher at the University of California, Irvine. He has received several scholarships, awards, and honors, including an Intel Fellowship and an Intel Scholarship. His current research interests are in the areas of entity resolution and disambiguation, web people search, spatial situational awareness, moving-object databases, spatial databases, and GIS.
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 248
Audiences: Everyone Is Invited
Contact: CS Colloquia