University of Southern California

Events Calendar


  • PhD Thesis Proposal - Avijit Thawani

    Tue, Oct 31, 2023 @ 03:00 PM - 04:00 PM

    Thomas Lord Department of Computer Science

    University Calendar


    Committee Members: Jay Pujara (advisor), Dani Yogatama, Swabha Swayamdipta, Aiichiro Nakano, Gerard Hoberg
     
    Title: Tokenisation in Language Models: numeracy and beyond
     
    Abstract: The first step for large language models (LLMs) like ChatGPT is to convert text (which humans understand) into indices (which models understand). This crucial phase of the language modeling pipeline is unfortunately understudied and is currently handled by subword segmentation, a manually engineered set of heuristics. We dive deep into case studies where these heuristics fail, for example when representing numbers in text and multi-word phrases, and present our proposed improvements. Finally, we present an end-to-end tokenized language model that understands both words and numbers better than subword-based models, without any manually engineered heuristics. It also outperforms character-level tokenisation, promising speedups of up to 4x in inference and 6x in training.

    Location: Hughes Aircraft Electrical Engineering Center (EEB) - 110

    Audiences: Everyone Is Invited

    Contact: Melissa Ochoa
