USC - Viterbi School of Engineering

AIF4S Seminar: Value of Pretraining Data: Scaling Laws for Downstream Task Performance of Large Language Models
Thu, Oct 31, 2024 @ 02:00 PM - 03:00 PM
Ming Hsieh Department of Electrical and Computer Engineering
Conferences, Lectures, & Seminars

Speaker: Dr. Berivan Isik, Research Scientist, Google, Inc.

Talk Title: Value of Pretraining Data: Scaling Laws for Downstream Task Performance of Large Language Models

Abstract: This talk explores the challenges and open questions surrounding the value of pretraining data for large language models (LLMs) in transfer learning settings. While scaling laws have provided valuable insights for LLM design, existing work has predominantly focused on pretraining loss. In contrast, this work investigates scaling behavior in a transfer learning setting where LLMs are finetuned for downstream tasks. Specifically, we examine how the choice and size of pretraining data impact downstream performance, as measured by cross-entropy and translation quality metrics such as BLEU and COMET. Our experiments reveal that the size of the finetuning dataset and the alignment between pretraining and downstream data significantly influence scaling behavior. With sufficient alignment, both cross-entropy and translation quality improve with increased pretraining data, and we demonstrate the ability to predict translation quality using a new log-law. However, in cases of moderate misalignment, we observe that translation quality can fluctuate or even deteriorate with more pretraining data, despite consistent improvements in cross-entropy. Through analysis of these findings, we provide insights for selecting appropriate pretraining data. The talk will conclude with a discussion of future research directions and remaining open questions in this area.

Biography: Berivan Isik is a research scientist at Google, working on efficient and trustworthy AI. Her current interests are efficient training and finetuning of large models, pretraining data valuation and scaling laws for LLMs, differential privacy, and unlearning. She earned her Ph.D. from Stanford University in 2024, where she was affiliated with the SAIL and StatsML groups. Her research was supported by a Stanford Graduate Fellowship (2019-2023), a Google Ph.D. Fellowship (2023-2026), and a Meta research grant.

Host: Dr. Mahdi Soltanolkotabi, soltanol@usc.edu

Webcast: https://usc.zoom.us/j/98648507063?pwd=kORhNLFVMLol7FYlHv6TsAmqcKqD7t.1
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 132
Audiences: Everyone Is Invited

Contact: Mayumi Thrasher

This event is open to all eligible individuals. USC Viterbi operates all of its activities consistent with the University's Notice of Non-Discrimination. Eligibility is not determined based on race, sex, ethnicity, sexual orientation, or any other prohibited factor.