PhD Dissertation Defense - Avi Thawani
Tue, May 21, 2024 @ 01:30 PM - 03:30 PM
Thomas Lord Department of Computer Science
Title: Aggregating Symbols for Language Modeling
Date and Time: Tuesday, May 21, 2024, 1:30 PM - 3:30 PM
Committee: Jay Pujara (Chair), Swabha Swayamdipta, Dani Yogatama, Aiichiro Nakano, Gerard Hoberg
Abstract: Natural language is a sequence of symbols. Language Models (LMs) are powerful at learning sequence patterns. The first step for large language models (LLMs) like ChatGPT is to convert text (which humans understand) into indices (which models do). This crucial phase of the language modeling pipeline has unfortunately been understudied and is currently handled by subword segmentation, a manually engineered set of heuristics. I will take a deep dive into case studies where these heuristics fail, for example when representing numbers and multi-word phrases in text, and present my recommended improvements. I present an end-to-end tokenized language model that understands both words and numbers better than subword models, without any manually engineered heuristics. It also outperforms character-level tokenization, promising up to 4x and 6x speedups in inference and training, respectively.
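To make the failure case concrete, the short sketch below (an illustration added for this announcement, not taken from the dissertation) runs an off-the-shelf GPT-2 BPE tokenizer from the Hugging Face transformers library over a long number and a two-word phrase; the manually engineered subword heuristics fragment both into pieces that carry little of their meaning.

    # Illustrative only: a standard BPE subword tokenizer (GPT-2's) splits
    # numbers and multi-word phrases into arbitrary fragments. Requires the
    # Hugging Face `transformers` package and its public "gpt2" vocabulary.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    for text in ["I ate 1234567 apples", "ice cream"]:
        # The learned merge rules decide the segmentation: a seven-digit
        # number becomes a few meaningless multi-digit chunks, and a
        # two-word phrase is never represented as a single unit.
        print(text, "->", tokenizer.tokenize(text))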
I show the benefits of aggregating symbols for language modeling and investigate key aspects of symbol use in LMs:
1. Aggregating on the number line improves both the numeracy and literacy of language models (see the sketch after this list)
2. We can learn to aggregate symbols from a corpus, with improved language modeling and approximately 4x/6x faster inference and training
3. Learning to aggregate symbols helps downstream performance in applications such as neural machine translation of non-concatenative languages
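As a rough illustration of point 1, the sketch below aggregates numeric tokens by their order of magnitude, so nearby numbers share a single symbol. The bucketing scheme and the number_bin helper are assumptions made for this sketch, not the dissertation's exact method.

    # Illustrative only: one simple way to "aggregate on the number line" is
    # to map each numeric token to a bucket for its order of magnitude.
    import math

    def number_bin(token: str) -> str:
        """Replace a numeric token with a coarse magnitude symbol."""
        try:
            value = float(token)
        except ValueError:
            return token  # non-numeric tokens pass through unchanged
        if value == 0:
            return "<NUM_ZERO>"
        sign = "NEG_" if value < 0 else ""
        exponent = math.floor(math.log10(abs(value)))
        return f"<NUM_{sign}1e{exponent}>"

    # 1200 and 1300 aggregate to the same symbol; 42 and 1200000 do not.
    for tok in ["7", "42", "1200", "1300", "1200000", "0.05"]:
        print(tok, "->", number_bin(tok))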
Zoom Link: https://usc.zoom.us/j/96005480765?pwd=TXFUWU5KWjA1S3JtM3FNaWRQZVZOZz09
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 110
Audiences: Everyone Is Invited
Contact: Felante' Charlemagne