I really liked @Sam Marks' recent post on downstream applications as validation for interp techniques, and I’ve been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I’ve written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these phenomena better, that’s great! Otherwise, I hope that seeing where our current techniques fail might help us develop better ones.
I’m also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren’t as useful as linear probing, or even careful black box experiments, that seems important to know!
Here’s the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0
Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.
Really helpful work, thanks a lot for doing it
Very happy you did this!
Thanks for doing this; I found it really helpful.