University of Southern California

Events Calendar



Events for the 4th week of May

  • PhD Dissertation Defense - Avi Thawani

    Tue, May 21, 2024 @ 01:30 PM - 03:30 PM

    Thomas Lord Department of Computer Science

    University Calendar


    Title: Aggregating Symbols for Language Modeling
     
    Date and Time: Tuesday, May 21, 2024, 1:30 PM - 3:30 PM
     
    Committee: Jay Pujara (Chair), Swabha Swayamdipta, Dani Yogatama, Aiichiro Nakano, Gerard Hoberg
     
    Abstract: Natural language is a sequence of symbols, and Language Models (LMs) are powerful at learning sequence patterns. The first step for large language models (LLMs) like ChatGPT is to convert text (which humans understand) into indices (which models understand). This crucial phase of the language modeling pipeline has unfortunately been understudied and is currently handled by subword segmentation, a manually engineered set of heuristics. I will examine case studies where these heuristics fail, for example when representing numbers in text or multi-word phrases, and present my recommended improvements. I present an end-to-end tokenized language model that understands both words and numbers better than subword-based models, without any manually engineered heuristics. It also outperforms character-level tokenization, promising speedups of up to 4x in inference and 6x in training.
     
    I show the benefits of aggregating symbols for language modeling, and investigate key aspects of symbol use in LMs:
     
    1. Aggregating on the number line improves both numeracy and literacy of language models
     
    2. We can learn to aggregate symbols given a corpus with improved language modeling and approximate 
     
    3. Learning to aggregate symbols helps downstream performance in certain application areas like neural machine translation of non-concatenative languages
     
    Zoom Link: https://usc.zoom.us/j/96005480765?pwd=TXFUWU5KWjA1S3JtM3FNaWRQZVZOZz09

    Location: Hughes Aircraft Electrical Engineering Center (EEB) - 110

    Audiences: Everyone Is Invited

    Contact: Felante' Charlemagne

    Event Link: https://usc.zoom.us/j/96005480765?pwd=TXFUWU5KWjA1S3JtM3FNaWRQZVZOZz09
