PhD Thesis Proposal - Avijit Thawani
Tue, Oct 31, 2023 @ 03:00 PM - 04:00 PM
Thomas Lord Department of Computer Science
University Calendar
Committee Members: Jay Pujara (advisor), Dani Yogatama, Swabha Swayamdipta, Aiichiro Nakano, Gerard Hoberg
Title: Tokenisation in Language Models: numeracy and beyond
Abstract: The first step for large language models (LLMs) like ChatGPT is to convert text (that humans understand) into indices (that models understand). This crucial phase in the language modeling pipeline has unfortunately been understudied and is currently achieved by subword segmentation, a manually engineered set of heuristics. We take a deep dive into case studies where these heuristics fail, such as representing numbers in text and handling multi-word phrases, and present our proposed improvements. Finally, we present an end-to-end tokenized language model that understands both words and numbers better than subword models, without any manually engineered heuristics. It also outperforms character-level tokenisation, promising up to 4x and 6x speedups in inference and training, respectively.
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 110
Audiences: Everyone Is Invited
Contact: Melissa Ochoa