AI Seminar - Evaluating Sparse Autoencoders with Board Game Models
Fri, Feb 21, 2025 @ 11:00 AM - 12:00 PM
Information Sciences Institute
Conferences, Lectures, & Seminars
Speaker: Adam Karvonen, Machine Learning Researcher with the ML Alignment & Theory Scholars program
Talk Title: Evaluating Sparse Autoencoders with Board Game Models
Join Zoom Meeting: https://usc.zoom.us/j/94409584905?pwd=Sm5LVkd0bndUdEluM3piK0NWTUQrUT09
Meeting ID: 944 0958 4905
Passcode: 822247
Abstract: Sparse Autoencoders (SAEs) have recently become one of the most popular approaches in interpretability. As a result, there has been a flurry of newly proposed SAE approaches. However, we struggle to evaluate these new approaches because there isn’t an underlying ground truth in natural language that we can use to create objective metrics for interpretability. We examine the setting of board games, using OthelloGPT and ChessGPT, and create two supervised metrics: “coverage” to assess individual feature quality and “board reconstruction” to measure overall state capture. Additionally, we propose a new SAE training approach called “p-annealing”. Our metrics reveal improvements that were hidden by existing proxy metrics, and the p-annealing approach performs the best on our metrics. While SAEs achieve high performance on board reconstruction (F1 scores of 0.85 and 0.95 on chess and Othello, respectively), they don’t match the performance of linear probes, suggesting current techniques may not capture all of a model’s board state information.
Papers:
Intro to Sparse Autoencoders: What are SAEs? How do they work? What are the next steps for the field to take? Similar to this blog post: https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html
Board Game Models: Covers this paper: https://arxiv.org/abs/2408.00113 and this blog post: https://adamkarvonen.github.io/machine_learning/2024/06/12/sae-board-game-eval.html
Biography: I am mostly interested in machine learning and software engineering. Lately, a lot of my focus has been on Large Language Models - both in using them as a tool when combined with formal methods, and in understanding and interpreting them. Outside of work, I race dirt bikes. I race A class in hard enduro, and B class in regular enduro and hare scrambles.
Host: Abel Salinas and Justina Gilleland
More Info: https://www.isi.edu/events/5368/evaluating-sparse-autoencoders-with-board-game-models/
Webcast: https://usc.zoom.us/j/94409584905?pwd=Sm5LVkd0bndUdEluM3piK0NWTUQrUT09
Location: Virtual Only
Audiences: Everyone Is Invited
Contact: Pete Zamar