PhD Dissertation Defense - Avi Thawani
Tue, May 21, 2024 @ 01:30 PM - 03:30 PM
Thomas Lord Department of Computer Science
Title: Aggregating Symbols for Language Modeling
Date and Time: Tuesday, May 21, 2024, 1:30 PM - 3:30 PM
Committee: Jay Pujara (Chair), Swabha Swayamdipta, Dani Yogatama, Aiichiro Nakano, Gerard Hoberg
Abstract: Natural language is a sequence of symbols. Language Models (LMs) are powerful at learning sequence patterns. The first step for large language models (LLMs) like ChatGPT is to convert text (which humans understand) into indices (which models do). This crucial phase of the language modeling pipeline has unfortunately been understudied and is currently handled by subword segmentation, a manually engineered set of heuristics. I will take a deep dive into case studies where these heuristics fail, for example when representing numbers and multi-word phrases in text, and present my recommended improvements. I present an end-to-end tokenized language model that understands both words and numbers better than subword models, without any manually engineered heuristics. It also outperforms character-level tokenization, promising up to 4x and 6x speedups in inference and training, respectively.
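To make the failure case concrete, the short sketch below (an illustration added for this announcement, not taken from the dissertation) runs an off-the-shelf GPT-2 BPE tokenizer from the Hugging Face transformers library over a long number and a two-word phrase; the manually engineered subword heuristics fragment both into pieces that carry little of their meaning.

    # Illustrative only: a standard BPE subword tokenizer (GPT-2's) splits
    # numbers and multi-word phrases into arbitrary fragments. Requires the
    # Hugging Face `transformers` package and its public "gpt2" vocabulary.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    for text in ["I ate 1234567 apples", "ice cream"]:
        # The learned merge rules decide the segmentation: a seven-digit
        # number becomes a few meaningless multi-digit chunks, and a
        # two-word phrase is never represented as a single unit.
        print(text, "->", tokenizer.tokenize(text))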
I show the benefits of aggregating symbols for language modeling and investigate key aspects of symbol use in LMs:
1. Aggregating on the number line improves both the numeracy and literacy of language models (see the sketch after this list)
2. We can learn to aggregate symbols from a corpus, with improved language modeling and approximately 4x/6x faster inference and training
3. Learning to aggregate symbols helps downstream performance in applications such as neural machine translation of non-concatenative languages
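As a rough illustration of point 1, the sketch below aggregates numeric tokens by their order of magnitude, so nearby numbers share a single symbol. The bucketing scheme and the number_bin helper are assumptions made for this sketch, not the dissertation's exact method.

    # Illustrative only: one simple way to "aggregate on the number line" is
    # to map each numeric token to a bucket for its order of magnitude.
    import math

    def number_bin(token: str) -> str:
        """Replace a numeric token with a coarse magnitude symbol."""
        try:
            value = float(token)
        except ValueError:
            return token  # non-numeric tokens pass through unchanged
        if value == 0:
            return "<NUM_ZERO>"
        sign = "NEG_" if value < 0 else ""
        exponent = math.floor(math.log10(abs(value)))
        return f"<NUM_{sign}1e{exponent}>"

    # 1200 and 1300 aggregate to the same symbol; 42 and 1200000 do not.
    for tok in ["7", "42", "1200", "1300", "1200000", "0.05"]:
        print(tok, "->", number_bin(tok))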
Zoom Link: https://usc.zoom.us/j/96005480765?pwd=TXFUWU5KWjA1S3JtM3FNaWRQZVZOZz09
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 110
Audiences: Everyone Is Invited
Contact: Felante' Charlemagne