As someone who has applied to take this class, I’ll suggest 10 papers: 4 from my own niche research interests and 6 from very recent eval-focused work that I think is interesting and would like an excuse to read and discuss.
Niche Interests
1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI Alignment. They’ve come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers: (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main ones so far, and are very recent.
2) I find Goodfire AI’s approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically-appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
Recent Eval Work
The METR doubling-time paper, Ai2’s SciArena, “LLMs Often Know When They’re Being Evaluated”, Anthropic’s SHADE-Arena, UK AISI’s STACK adversarial attack, and Cohere’s takedown of LMArena.