NL Seminar: Fair Comparisons and Fundamental Ideas for Open-Vocabulary Generative Language and Translation Models
Thu, Aug 12, 2021 @ 11:00 AM - 12:00 PM
Information Sciences Institute
Conferences, Lectures, & Seminars
Speaker: Sabrina Mielke, Johns Hopkins University
Talk Title: Fair Comparisons and Fundamental Ideas for Open-Vocabulary Generative Language and Translation Models
Series: NL Seminar
Abstract: REMINDER: Meeting hosts only admit guests they know to the Zoom meeting, so you are highly encouraged to use your USC account to sign in to Zoom. If you are an outside visitor, please inform nlg DASH seminar DASH admin2 AT isi.edu beforehand so that we are aware of your attendance and can let you in.
How can we fairly compare the performance of generative language and translation models on multiple languages? We will see how to use probabilistic and information-theoretic measures, first to evaluate monolingual open-vocabulary language models by total bits and then, considering the case of Translationese, to ponder the meaning of "information" and how to use it to compare machine translation models. In both cases, we get a glimpse of which linguistic and non-linguistic factors might make languages easier or harder for models. The last part of the talk will, if time permits, propose some somewhat opinionated guidelines for open-vocabulary language modeling and show work in progress on taxonomizing tokenization methods and the literature around open-vocabulary modeling.
Biography: Sabrina is a PhD student at Johns Hopkins University and a part-time research intern at HuggingFace, currently researching open-vocabulary language modeling for unit discovery in a variety of typologically varying languages. While her pre-PhD work focused on formal language theory applied to parsing and translation, during her PhD she has published on morphology, fair language model comparison, stochastic romanization (at Google AI), and metacognition and calibration for chatbots (at Facebook AI Research). She has also co-organized workshops and shared tasks around morphology and typology, and is currently involved in the BigScience summer of large language models workshop.
Hosts: Jon May and Mozhdeh Gheini
More Info: https://nlg.isi.edu/nl-seminar/
WebCast Link: https://www.youtube.com/watch?v=zIP8XMCtHuM
Audiences: Everyone Is Invited
Contact: Pete Zamar