USC - Viterbi School of Engineering

AI Seminar - Evaluating Sparse Autoencoders with Board Game Models
Fri, Feb 21, 2025 @ 11:00 AM - 12:00 PM
Information Sciences Institute
Conferences, Lectures, & Seminars

Speaker: Adam Karvonen, Machine Learning Researcher with the ML Alignment & Theory Scholars (MATS) program

Talk Title: Evaluating Sparse Autoencoders with Board Game Models

Join Zoom Meeting: https://usc.zoom.us/j/94409584905?pwd=Sm5LVkd0bndUdEluM3piK0NWTUQrUT09
Meeting ID: 944 0958 4905
Passcode: 822247

Abstract: Sparse Autoencoders (SAEs) have recently become one of the most popular approaches in interpretability. As a result, there has been a flurry of new proposed SAE approaches. However, we struggle to evaluate these new approaches because there isn’t an underlying ground truth in natural language that we can use to create objective metrics for interpretability.

We examine the setting of board games, using OthelloGPT and ChessGPT, and create two supervised metrics: “coverage” to assess individual feature quality and “board reconstruction” to measure overall state capture. Additionally, we propose a new SAE training approach called “p-annealing”. Our metrics reveal improvements that were hidden by existing proxy metrics, and the p-annealing approach performs the best on our metrics. While SAEs achieve high performance on board reconstruction (F1 scores of 0.85 and 0.95 on Chess and Othello), they don’t match the performance of linear probes, suggesting current techniques may not capture all of a model’s board state information.

Papers:
Intro to Sparse Autoencoders: What are SAEs? How do they work? What are the next steps for the field to take? Similar to this blog post: https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html
Board Game Models: Covers this paper: https://arxiv.org/abs/2408.00113 and this blog post: https://adamkarvonen.github.io/machine_learning/2024/06/12/sae-board-game-eval.html

Biography: I am mostly interested in machine learning and software engineering. Lately, a lot of my focus has been on Large Language Models - both in using them as a tool when combined with formal methods, and in understanding and interpreting them. Outside of work, I race dirt bikes. I race A class in hard enduro, and B class in regular enduro and hare scrambles.

Host: Abel Salinas and Justina Gilleland

More Info: https://www.isi.edu/events/5368/evaluating-sparse-autoencoders-with-board-game-models/

Webcast: https://usc.zoom.us/j/94409584905?pwd=Sm5LVkd0bndUdEluM3piK0NWTUQrUT09
Location: Virtual Only
Audiences: Everyone Is Invited

Contact: Pete Zamar

This event is open to all eligible individuals. USC Viterbi operates all of its activities consistent with the University's Notice of Non-Discrimination. Eligibility is not determined based on race, sex, ethnicity, sexual orientation, or any other prohibited factor.