University of Southern California

Events Calendar


  • PhD Defense - Jongwoo Lim

    Thu, Sep 05, 2013 @ 12:00 PM - 01:30 PM

    Thomas Lord Department of Computer Science

    University Calendar



    An Efficient Approach to Clustering Datasets with Mixed Type Attributes in Data Mining

    PhD Candidate: Jongwoo Lim

    Date and Time: 09/05/2013 (Thu), 12:00pm ~ 1:30pm
    Location: SAL 322

    Prof. Dennis McLeod (Chairperson)
    Prof. Aiichiro Nakano
    Prof. Larry Pryor (Outside Member)

    We propose an efficient approach to clustering datasets with mixed-type attributes (both numerical and categorical) while minimizing information loss during clustering. Real-world datasets, such as medical, biological, and transactional datasets and their ontologies, often contain mixed attribute types.
    However, most conventional clustering algorithms have been designed for and applied to datasets containing a single attribute type (either numerical or categorical). Recently, approaches to clustering mixed attribute type datasets have emerged, but they mainly transform attributes so that conventional algorithms can be applied directly. The problem with such approaches is that results may be distorted by information loss, because a significant portion of attribute values can be removed in the transformation process when background knowledge of the datasets is unavailable. This may result in lower clustering accuracy.
    To address this problem, we propose a clustering framework for mixed attribute type datasets that does not transform attributes. First, we use an entropy-based measure of categorical attributes as our criterion function for similarity. Second, based on the results of the entropy-based similarity, we extract candidate cluster numbers and verify our weighting condition, which is based on the degree to which clusters are well balanced, using pre-clustering results and the ground truth ratio from the given dataset. Finally, we cluster the mixed attribute type datasets with the extracted candidate cluster numbers and the weights.
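    As an illustration of an entropy-based criterion for categorical attributes, the sketch below computes the size-weighted entropy of a partition, where lower values indicate more homogeneous clusters. This is a common textbook formulation and is only an assumption here; the announcement does not give the candidate's actual criterion function, and the names attribute_entropy, cluster_entropy, and expected_entropy are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy (in bits) of a single categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_entropy(rows):
    """Sum of per-attribute entropies over the rows of one cluster.

    Assumption: attributes are treated as independent, so the cluster
    entropy is approximated by summing the entropy of each column.
    """
    if not rows:
        return 0.0
    return sum(attribute_entropy(list(col)) for col in zip(*rows))


def expected_entropy(clusters):
    """Size-weighted average of cluster entropies for a whole partition."""
    total = sum(len(rows) for rows in clusters)
    return sum(len(rows) / total * cluster_entropy(rows) for rows in clusters)


# Toy data: two categorical attributes per row (hypothetical values).
rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
good = [rows[:2], rows[2:]]                      # homogeneous clusters
bad = [[rows[0], rows[2]], [rows[1], rows[3]]]   # mixed clusters
print(expected_entropy(good), expected_entropy(bad))
```

    In this toy example the homogeneous partition scores lower than the mixed one, which is the behavior a criterion of this kind relies on when comparing candidate clusterings.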
    We have conducted experiments with a heart disease dataset and a German credit dataset, using an entropy function as the similarity measure together with the proposed method of extracting the number of clusters. We also experimentally explore the relative degree of balance between the categorical and numerical attribute sub-datasets in the given datasets. Our experimental results demonstrate that the proposed framework improves accuracy for the given mixed-type attribute datasets.
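    The announcement does not say how clustering accuracy was measured. One common choice when ground-truth class labels are available is purity, sketched below under that assumption; the function name purity and the toy labels are hypothetical and not taken from the experiments.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of points whose true label is the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one data point is required")

    # Count true labels inside each cluster.
    label_counts = {}
    for cid, label in zip(cluster_ids, true_labels):
        label_counts.setdefault(cid, Counter())[label] += 1

    # Each cluster contributes the size of its majority class.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(true_labels)


# Toy assignment: six points, two clusters, two classes (hypothetical data).
assignments = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
print(purity(assignments, labels))
```

    Purity rewards clusters dominated by a single class, but it also rises as the number of clusters grows, so it is usually compared only between clusterings with the same number of clusters.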

    Location: Henry Salvatori Computer Science Center (SAL) - 322

    Audiences: Everyone Is Invited

    Contact: Lizsl De Leon

