PhD Defense - Jongwoo Lim
Thu, Sep 05, 2013 @ 12:00 PM - 01:30 PM
Thomas Lord Department of Computer Science
University Calendar
An Efficient Approach to Clustering Datasets with Mixed Type Attributes in Data Mining
PhD Candidate: Jongwoo Lim
Date and Time: 09/05/2013 (Thu), 12:00pm ~ 1:30pm
Location: SAL 322
Prof. Dennis McLeod (Chairperson)
Prof. Aiichiro Nakano
Prof. Larry Pryor (Outside Member)
We propose an efficient approach to clustering datasets with mixed-type attributes (both numerical and categorical) while minimizing information loss during clustering. Real-world datasets such as medical, biological, and transactional datasets, along with their ontologies, often contain mixed attribute types.
However, most conventional clustering algorithms have been designed for and applied to datasets containing a single attribute type (either numerical or categorical). Recently, approaches to clustering mixed attribute type datasets have emerged, but they are mainly based on transforming attributes so that conventional algorithms can be applied directly. The problem with such approaches is that results may be distorted by information loss, because a significant portion of attribute values can be removed in the transformation process without background knowledge of the datasets. This may result in lower clustering accuracy.
To address this problem, we propose a clustering framework for mixed attribute type datasets that does not transform attributes. First, we utilize an entropy-based measure of categorical attributes as our criterion function for similarity. Second, based on the results of the entropy-based similarity, we extract candidate cluster numbers and verify our weighting condition, which is based on the degree of balance among clusters, with pre-clustering results and the ground truth ratio from the given dataset. Finally, we cluster the mixed attribute type datasets with the extracted candidate cluster numbers and the weights.
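As a rough illustration of what an entropy-based criterion for categorical attributes can look like, the sketch below scores a candidate clustering by the size-weighted sum of per-attribute Shannon entropies within each cluster, where lower values indicate more homogeneous clusters. This is a generic formulation, not necessarily the measure used in the thesis; the function names `attribute_entropy` and `expected_entropy` and the toy records are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute's values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def expected_entropy(records, labels):
    """Size-weighted sum of within-cluster attribute entropies.

    Hypothetical criterion for illustration only: each record is a tuple
    of categorical values, and labels[i] is the cluster of records[i].
    """
    clusters = {}
    for record, label in zip(records, labels):
        clusters.setdefault(label, []).append(record)

    n = len(records)
    total = 0.0
    for members in clusters.values():
        # zip(*members) yields one column of values per attribute.
        cluster_entropy = sum(attribute_entropy(col) for col in zip(*members))
        total += (len(members) / n) * cluster_entropy
    return total


# Toy data (made up): two categorical attributes per record.
records = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
homogeneous = expected_entropy(records, [0, 0, 1, 1])
mixed = expected_entropy(records, [0, 1, 0, 1])
```

In this toy case the assignment that groups identical records scores lower than the one that mixes them, which is the behavior a criterion of this kind is meant to reward.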
We have conducted experiments with a heart disease dataset and a German credit dataset, for which an entropy function as a similarity measure and the proposed method of extracting the number of clusters were utilized. We also experimentally explore the relative degree of balance between the categorical and numerical attribute sub-datasets in the given datasets. Our experimental results demonstrate that the proposed framework effectively improved clustering accuracy for the given mixed attribute type datasets.
Location: Henry Salvatori Computer Science Center (SAL) - 322
Audiences: Everyone Is Invited
Contact: Lizsl De Leon