  PhD Dissertation Defense - Arka Sadhu

    Tue, Apr 23, 2024 @ 02:00 PM - 03:30 PM

    Thomas Lord Department of Computer Science

    Title: Grounding Language in Images and Videos  
    Location: SAL 213  
    Time: 2 pm on April 23, 2024  
    Committee Members: Ram Nevatia (Chair), Xiang Ren, Toby Mintz  
    Abstract: My thesis investigates the problem of grounding language in images and videos -- the task of associating linguistic symbols with perceptual experiences and actions -- which is fundamental to developing multi-modal models that can understand and jointly reason over images, videos, and text. The overarching goal of my dissertation is to bridge the gap between language and vision as a means to a "deeper understanding" of images and videos, enabling models that can reason over longer time horizons such as hour-long movies, collections of images, or even multiple videos. In this thesis, I will introduce the vision-language tasks developed during my Ph.D.: grounding unseen words, spatiotemporal localization of entities in a video, video question answering, visual semantic role labeling in videos, reasoning across more than one image or video, and finally, weakly-supervised open-vocabulary object detection. Each of these tasks aims to investigate, in isolation, a particular phenomenon inherent in image or video understanding. For each task, I will discuss the development of the corresponding datasets and model frameworks, and outline evaluation protocols that are robust to data priors.  
    The resulting models can be used for other downstream tasks, such as obtaining common-sense knowledge graphs from instructional videos, or to drive end-user applications such as retrieval, question answering, and captioning.  
    Location: Henry Salvatori Computer Science Center (SAL) - 213

