Tue, May 08, 2018 @ 02:00 PM - 04:00 PM
Title: Multimodal Representation Learning of Affective Behavior
PhD Candidate: Sayan Ghosh
Date: Tuesday, May 8th, 2018, 2:00 PM PST
Venue: Charles Lee Powell Hall (PHE) 223
Committee: Prof. Stefan Scherer (Chair), Prof. Louis-Philippe Morency, Prof. Kevin Knight, Prof. Panayiotis Georgiou (EE)
With the ever-increasing abundance of multimedia data available on the Internet and in crowd-sourced datasets and repositories, there has been renewed interest in machine learning approaches for solving real-life perception problems. However, such techniques have only recently made inroads into research problems relevant to understanding human emotion and behavior. The primary research challenges addressed in this defense talk pertain to unimodal and multimodal representation learning, and to the fusion of emotional and non-verbal cues for language modeling. The dissertation makes three primary contributions:
(1) Unimodal Representation Learning: In the visual modality, a novel multi-label CNN (Convolutional Neural Network) is proposed for learning AU (Action Unit) occurrences in facial images. The multi-label CNN learns a joint representation over AU occurrences, obtains competitive detection results, and is robust across different datasets. For the acoustic modality, denoising autoencoders and RNNs (Recurrent Neural Networks) are trained on temporal frames from speech spectrograms, and it is observed that representations learned from the glottal flow signal (the component of the speech signal with vocal tract influence removed) are applicable to speech emotion recognition.
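To make the multi-label formulation concrete, here is a minimal sketch of the kind of objective a multi-label AU detector optimizes: one sigmoid output per action unit, with the per-AU binary cross-entropies summed so that a single network jointly models all AU occurrences. The function name and plain-Python form are illustrative, not the dissertation's implementation.

```python
import math

def multilabel_bce(probs, labels):
    """Sum of per-AU binary cross-entropies (illustrative sketch).

    probs: predicted P(AU present) for each action unit (sigmoid outputs).
    labels: ground-truth occurrences, 0 or 1, one per action unit.
    """
    eps = 1e-12  # guard against log(0)
    loss = 0.0
    for p, y in zip(probs, labels):
        loss -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return loss
```

Unlike a single softmax over classes, this loss lets multiple AUs be active at once in the same face image, which is what makes the representation "joint" across labels.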
(2) Multimodal Representation Learning: An importance-based multimodal autoencoder (IMA) model is introduced which learns joint multimodal representations along with importance weights for each modality. The IMA model improves performance over baseline approaches on the tasks of digit recognition and emotion understanding from spoken utterances.
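The core idea of importance-weighted fusion can be sketched as follows: each modality contributes a representation vector plus an importance score, the scores are normalized to weights, and the joint representation is the weighted combination. This is a hedged, minimal sketch under those assumptions; the function names and the linear-fusion form are hypothetical, not the IMA architecture itself.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(modality_reprs, importance_scores):
    """Combine per-modality vectors using learned importance scores.

    modality_reprs: one representation vector per modality (equal length).
    importance_scores: one raw score per modality; softmax turns them
    into weights that sum to 1.
    Returns the weighted joint representation and the weights.
    """
    weights = softmax(importance_scores)
    dim = len(modality_reprs[0])
    joint = [0.0] * dim
    for w, rep in zip(weights, modality_reprs):
        for i, v in enumerate(rep):
            joint[i] += w * v
    return joint, weights
```

A noisy or uninformative modality can then be down-weighted automatically rather than corrupting the joint representation.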
(3) Non-verbal and Affective Language Models: This dissertation studies deep multimodal fusion in the context of neural language modeling by introducing two novel approaches: Affect-LM and Speech-LM. These models obtain perplexity reductions over a baseline language model by integrating verbal affective and non-verbal acoustic cues with the linguistic context when predicting the next word. Affect-LM can also generate text in different emotions at varying levels of intensity; the generated sentences are emotionally expressive while maintaining grammatical correctness, as evaluated through a crowd-sourced perception study.
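One way to picture affect-conditioned word prediction is an additive bias on the language model's next-word scores: an affect term, scaled by an intensity parameter, shifts probability mass toward emotionally relevant words before the softmax. This sketch is a simplified illustration of that conditioning idea; the variable names and plain-list form are assumptions, not the published model.

```python
import math

def next_word_probs(lm_scores, affect_scores, beta):
    """Bias next-word prediction by an affect term (illustrative).

    lm_scores: baseline language-model score per vocabulary word.
    affect_scores: affect-relevance score per word for the target emotion.
    beta: intensity parameter; beta = 0 recovers the plain LM distribution.
    Returns a normalized probability distribution over the vocabulary.
    """
    combined = [l + beta * a for l, a in zip(lm_scores, affect_scores)]
    m = max(combined)  # stabilize the softmax
    exps = [math.exp(c - m) for c in combined]
    s = sum(exps)
    return [e / s for e in exps]
```

Raising beta increases emotional intensity of generated text while leaving the underlying LM scores, and hence much of the grammatical structure, intact.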
Audiences: Everyone Is Invited
Contact: Lizsl De Leon