Multi-view Learning of Speech Features Using Articulatory Measurements
Fri, Dec 07, 2012 @ 10:30 AM - 12:00 PM
Ming Hsieh Department of Electrical and Computer Engineering
Conferences, Lectures, & Seminars
Speaker: Karen Livescu, Toyota Technological Institute at Chicago
Talk Title: Multi-view Learning of Speech Features Using Articulatory Measurements
Abstract: Articulatory information has been used in automatic speech recognition in a number of ways. For example, phonetic recognition can be improved if articulatory measurements are available at test time. However, it is usually not feasible to measure articulation at test time, due to the expense and inconvenience of the machinery involved. In this work, we ask whether it is possible to use articulatory measurements that are available only at training time to help learn which aspects of the acoustic feature vector are useful. We apply ideas from multi-view learning, in which multiple "views" of the data are available for training but possibly not for prediction (testing). In our case, the views are acoustics on the one hand and articulatory measurements on the other. In particular, we use canonical correlation analysis (CCA) and kernel CCA (KCCA), which find projections of vectors in each view that are maximally correlated with projections of vectors in the other view.
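As a concrete illustration of the two-view setup (a minimal sketch, not the speaker's actual pipeline; the dimensions and random data here are made up), CCA can be fit on paired acoustic and articulatory frames at training time, after which only the acoustic-side projection is needed at test time:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical toy data: 1000 aligned frames with 39-dim acoustic
# features (view 1) and 14-dim articulatory measurements (view 2).
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((1000, 39))
articulatory = rng.standard_normal((1000, 14))

# Fit CCA: learn projections of each view that are maximally
# correlated with projections of the other view.
cca = CCA(n_components=10)
cca.fit(acoustic, articulatory)

# At test time only acoustics are available; apply the learned
# acoustic-side transform and use the result as features.
acoustic_features = cca.transform(acoustic)
```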
A typical approach to acoustic feature vector generation in speech recognition is to first construct a very high-dimensional feature vector by concatenating multiple consecutive frames of raw features (MFCCs, PLPs, etc.), and then to reduce dimensionality using either an unsupervised transformation such as principal components analysis, a linear supervised transformation such as linear discriminant analysis and its extensions, or a nonlinear supervised transformation (e.g. using neural networks). Our approach here is unsupervised transformation learning, but using the second view (the articulatory measurements) as a form of "soft supervision". The approach we take, using CCA and KCCA, avoids some of the disadvantages of other unsupervised approaches, such as PCA, which are sensitive to noise and data scaling, and possibly of supervised approaches, which are more task-specific.
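The frame-concatenation step described above might look as follows (a sketch under assumed dimensions: 13-dim MFCCs with a hypothetical +/-3 frames of context; the resulting high-dimensional vectors would then be reduced by PCA, LDA, or the CCA projection sketched earlier):

```python
import numpy as np

def splice_frames(frames: np.ndarray, context: int = 3) -> np.ndarray:
    """Concatenate each frame with its +/- `context` neighbors
    (edges padded by repetition), yielding one high-dimensional
    spliced feature vector per original frame."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = frames.shape[0]
    window = 2 * context + 1
    return np.stack([padded[i:i + window].reshape(-1) for i in range(n)])

# E.g., 13-dim MFCCs spliced with +/-3 frames of context -> 91-dim vectors.
mfccs = np.zeros((100, 13))
spliced = splice_frames(mfccs, context=3)
assert spliced.shape == (100, 91)
```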
This talk will cover the basic techniques, as well as several issues that come up in their application, such as large-scale data issues, speaker-independence, and combination of the learned features with standard ones. The talk will include our results to date, showing that the approach can be used to improve performance on tasks such as phonetic classification and recognition.
Joint work with Raman Arora (TTIC), Sujeeth Bharadwaj (UIUC), and Mark Hasegawa-Johnson (UIUC)
Biography: Karen Livescu is an Assistant Professor at the Toyota Technological Institute at Chicago (TTIC). She completed her PhD in 2005 at MIT and spent the next two years as a post-doctoral lecturer in the MIT EECS department. Karen's interests are in speech and language processing, with a slant toward combining machine learning with knowledge from linguistics and speech science. Her recent work has been on articulatory models, multi-view learning, nearest-neighbor approaches, and automatic sign language recognition.
Host: Kartik Audhkhasi, Prof. Shrikanth Narayanan
Location: Ronald Tutor Hall of Engineering (RTH) - 320
Audiences: Everyone Is Invited
Contact: Mary Francis