As someone who has applied to take this class, I’ll suggest 10 papers: 4 from my own niche research interests and 6 from very recent eval-focused work that I think is interesting and would like an excuse to read and discuss.
Niche Interests
1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI Alignment. They’ve come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers: (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main ones so far, and are very recent.
2) I find Goodfire AI’s approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically-appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
Recent Eval Work
The METR doubling-time paper, Ai2’s SciArena, “LLMs Often Know When They’re Being Evaluated”, Anthropic’s SHADE-Arena, UK AISI’s STACK adversarial attack, and Cohere’s takedown of LMArena.