AI SEMINAR - Query-driven approach to entity resolution
Fri, Sep 26, 2014 @ 11:00 AM - 12:00 PM
Information Sciences Institute
Conferences, Lectures, & Seminars
Speaker: Dmitri V. Kalashnikov, UCI
Talk Title: Query-driven approach to entity resolution
Series: AISeminar
Abstract: The significance of data quality research is motivated by the observation that the effectiveness of data-driven technologies such as decision support tools, data exploration, analysis, and scientific discovery tools is closely tied to the *quality of data* to which such techniques are applied. It is well recognized that the outcome of the analysis is only as good as the data on which the analysis is performed. That is why today organizations spend a substantial percentage of their budgets on cleaning tasks such as removing duplicates, correcting errors, and filling missing values, to improve data quality prior to pushing data through the analysis pipeline.
Given the critical importance of the problem, many efforts, in both industry and academia, have explored systematic approaches to addressing the cleaning challenges. This talk focuses primarily on the *entity resolution* challenge that arises because objects in the real world are referred to using references or descriptions that are not always unique identifiers of the objects, leading to ambiguity.
Traditionally, data cleaning is performed as a preprocessing step when creating a data warehouse, prior to making it available for analysis -- an approach that works well under standard settings. Cleaning the entire data warehouse, however, can require a considerable amount of time and significant computing resources. Hence, this approach is often suboptimal for many modern query-driven and Big Data applications that need to analyze only small portions of the entire dataset and produce answers "on the fly" and in real time.
To address these new cleaning challenges, we have developed a *Query-Driven Approach (QDA)* to data cleaning. QDA exploits the specificity and semantics of the given SQL selection query to significantly reduce the cleaning overhead by resolving only those records that may influence the answer of the query. It computes answers that are equivalent to those obtained by first using a regular cleaning algorithm, and then querying on top of the cleaned data. However, in many cases QDA can compute these answers much more efficiently.
A key concept driving QDA is that of *vestigiality*. A cleaning step (i.e., a call to the resolve function for a pair of records) is called vestigial (redundant) if QDA can guarantee that it can still compute the correct final answer without knowing the outcome of this resolve. We formalize the concept of vestigiality in the context of a large class of SQL selection queries and develop techniques to identify vestigial cleaning steps. Technical challenges arise since vestigiality, as we will show, depends on several factors, including the specifics of the cleaning function (e.g., the merge function used if two objects are indeed duplicate entities), the predicate associated with the query, and the query answer semantics of what the user expects as the result of the query. We show that determining vestigiality is NP-hard and propose an effective approximate solution to test for vestigiality that performs very well in practice.
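To make the idea of a vestigial resolve concrete, here is a minimal toy sketch in Python. It is not the algorithm from the talk; it only illustrates one simple sufficient condition under assumptions chosen for the example. The record layout, the `resolve` stand-in, the `answer` helper, and the sample data are all hypothetical. The assumptions are: duplicates are merged by summing a non-negative citation count, the query predicate is `count >= threshold`, and the answer is the set of record ids whose merged entity satisfies the predicate. Under those assumptions, a resolve between two records that each already satisfy the predicate cannot change the answer, so it can be skipped.

```python
from itertools import combinations

# Hypothetical toy data: (record id, title, citation count).
RECORDS = [
    (1, "Query-driven ER", 50),
    (2, "query-driven er", 60),
    (3, "Query-Driven ER", 5),
    (4, "Spatial indexing", 20),
    (5, "spatial indexing", 30),
    (6, "Moving objects", 10),
]


def resolve(a, b, counter):
    """Stand-in for an expensive duplicate test; counts how often it runs."""
    counter[0] += 1
    return a[1].lower() == b[1].lower()


def answer(records, threshold, skip_vestigial):
    """Return (ids of records whose merged entity has count >= threshold,
    number of resolve calls made)."""
    parent = {r[0]: r[0] for r in records}  # union-find over record ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    calls = [0]
    for a, b in combinations(records, 2):
        # Assumed sufficient condition: both records already satisfy the
        # predicate, and merging only adds non-negative counts, so the
        # outcome of this resolve cannot change which ids are returned.
        if skip_vestigial and a[2] >= threshold and b[2] >= threshold:
            continue
        if resolve(a, b, calls):
            parent[find(a[0])] = find(b[0])

    # Merge function assumed here: sum the counts within each cluster.
    totals = {}
    for r in records:
        root = find(r[0])
        totals[root] = totals.get(root, 0) + r[2]
    ids = {r[0] for r in records if totals[find(r[0])] >= threshold}
    return ids, calls[0]


# Clean everything first, then query, versus skipping vestigial resolves.
full_ids, full_calls = answer(RECORDS, 45, skip_vestigial=False)
lazy_ids, lazy_calls = answer(RECORDS, 45, skip_vestigial=True)
```

In this sketch both runs return the same set of ids, while the second run makes fewer resolve calls. The skip rule is valid only for this particular combination of merge function, predicate, and answer semantics, which mirrors the point above that vestigiality depends on all three.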
The comprehensive empirical evaluation of the proposed approach demonstrates its significant advantage in terms of efficiency over traditional techniques for query-driven applications.
Biography: http://www.ics.uci.edu/~dvk/CV/dvk_bio.txt
Dmitri V. Kalashnikov is an Associate Adjunct Professor of Computer Science at the University of California, Irvine. He received his PhD degree in Computer Science from Purdue University in 2003. He received his diploma in Applied Mathematics and Computer Science from Moscow State University, Russia in 1999, graduating summa cum laude.
His general research interests include databases and data mining. Currently, he specializes in the areas of entity resolution & data quality, and real-time situational awareness. In the past, he has also contributed to the areas of spatial, moving-object, and probabilistic databases.
He has received several scholarships, awards, and honors, including an Intel Fellowship and Intel Scholarship. His work is supported by the NSF, DH&S, and DARPA.
Host: Greg Ver Steeg
Location: Information Sciences Institute (ISI) - 1135
WebCast Link: http://webcasterms1.isi.edu/mediasite/Viewer/?peid=dd8c0e0eef1749fdb4bc581af408d8561d
Audiences: Everyone Is Invited