Interpretability
Views my own
Yeah, I think due to CLT stuff happening, less focus was on the single resid stream SAE (which was probably? a good idea)
Cool case study!
1. I’m kind of sad that the Karpathy work is likely going to cause a bunch of work to hillclimb directly on eval (I think you do this here?). This makes automated AI work sketchy IMO. In https://arxiv.org/abs/2601.11516 we note that e.g. “the large early drop in Fig. 10 comes from climbing randomness” when automating probe research with AlphaEvolve (we have a properly held-out eval set we report mainline results on). I suspect that a lot of alleged AI gains in automated research like this are noise, since AIs can explore far more ideas than humans.
Note that in https://xcancel.com/karpathy/status/2030371219518931079 the last improvement is literally tweaking the random seed! :/
2. It’s been so long since I’ve worked on this but FWIW these sorts of ancient dictionary learning algorithms were definitely in the water supply in 2024… for example here we note on our dictionary learning algorithm that a “possible application is actually replacing the encoder at test time, to increase the loss recovered of the sparse decomposition” and here we even used FISTA
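To make point 1 concrete, here is a toy simulation (not taken from either linked work) of hill-climbing on a noisy eval. Every candidate "idea" is assumed to have a true improvement of exactly zero, and both the dev eval and the held-out eval are assumed to return independent Gaussian noise around that zero. The function names, the noise model, and the convention that higher scores are better are all hypothetical choices for illustration.

```python
import random
import statistics


def hillclimb_on_noise(n_candidates, noise_std=1.0, seed=0):
    """Hill-climb over candidates whose true improvement is exactly zero.

    Returns (dev_score, heldout_score) of the candidate that was kept.
    Higher is better; both scores are pure noise by construction.
    """
    rng = random.Random(seed)
    best_dev = rng.gauss(0.0, noise_std)      # baseline, measured on the dev eval
    best_heldout = rng.gauss(0.0, noise_std)  # same baseline, held-out eval
    for _ in range(n_candidates):
        dev = rng.gauss(0.0, noise_std)
        heldout = rng.gauss(0.0, noise_std)
        if dev > best_dev:  # accept anything that "improves" the dev eval
            best_dev, best_heldout = dev, heldout
    return best_dev, best_heldout


def mean_scores(n_candidates, n_trials=200):
    """Average the selected dev and held-out scores over independent runs."""
    runs = [hillclimb_on_noise(n_candidates, seed=s) for s in range(n_trials)]
    mean_dev = statistics.mean(r[0] for r in runs)
    mean_heldout = statistics.mean(r[1] for r in runs)
    return mean_dev, mean_heldout


if __name__ == "__main__":
    for n in (10, 100, 1000):
        dev, heldout = mean_scores(n)
        print(f"candidates={n:5d}  dev gain={dev:+.2f}  held-out gain={heldout:+.2f}")
```

Under these assumptions the apparent dev-eval gain grows with the number of candidates explored, while the held-out gain stays near zero, which is the sense in which exploring far more ideas makes reported gains more likely to be noise.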
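For readers unfamiliar with the idea in point 2, here is a minimal sketch of what "replacing the encoder at test time" with FISTA could look like. It is not the code from the referenced work. It assumes a fixed decoder matrix `D` of shape `(d_model, n_latents)` and solves the standard lasso problem, minimising `0.5 * ||x - D z||^2 + lam * ||z||_1` over `z`. The names `fista_encode`, `lasso_objective`, `lam`, and `n_iters` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(x, D, z, lam):
    """0.5 * squared reconstruction error plus the L1 sparsity penalty."""
    return 0.5 * np.sum((x - D @ z) ** 2) + lam * np.sum(np.abs(z))


def fista_encode(x, D, lam=0.1, n_iters=200):
    """Infer sparse codes for x with FISTA instead of a learned encoder.

    x: activation vector, shape (d_model,)
    D: fixed decoder / dictionary, shape (d_model, n_latents)
    """
    # Lipschitz constant of the gradient of the quadratic term:
    # the squared largest singular value of D.
    lipschitz = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    y = z.copy()  # extrapolated point used for the gradient step
    t = 1.0
    for _ in range(n_iters):
        grad = D.T @ (D @ y - x)
        z_next = soft_threshold(y - grad / lipschitz, lam / lipschitz)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum step
        z, t = z_next, t_next
    return z
```

A usage sketch on synthetic data (random unit-norm dictionary, three active latents):

```python
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)
z_true = np.zeros(50)
z_true[[3, 17, 42]] = 1.0
x = D @ z_true
z_hat = fista_encode(x, D, lam=0.05, n_iters=300)
```

The trade-off is that iterative inference costs many matrix multiplies per input, whereas a learned encoder needs a single forward pass.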
An extreme (and close-to-home) example is documented in TracingWoodgrains’s exposé of David Gerard’s Wikipedia smear campaign against LessWrong and related topics. That’s an unusually crazy story [...]
This is even closer to home—David Gerard has commented on the Wikipedia Talk Page and referenced this LW post: https://web.archive.org/web/20250814022218/https://en.wikipedia.org/wiki/Talk:Mechanistic_interpretability#Bad_sourcing,_COI_editing
I didn’t find the system prompt very useful on other models (I very rarely use GPT-4.5)
E.g. Gemini 2.5 Pro tends to produce longer outputs with shoe-horned references when given this prompt (link one), whereas using no system prompt produces a shorter response (link two—obviously highly imperfect, but much better IMO)
Possibly @habryka has updated this?
Upweighting positive data
Data augmentation
...
It may also be worth up-weighting https://darioamodei.com/machines-of-loving-grace along with the AI optimism blog post in the training data. In general it is a bit sad that there isn’t more good writing that I know of on this topic.
the best vector for probing is not the best vector for steering
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
Nice work! We’ve even noticed model organisms are sometimes not robust to benign distribution shifts (i.e. not even benign training is required)
In this appendix (https://arxiv.org/pdf/2602.10371v1#page=17.54) we noticed that the Gender Model Organism from our earlier work (https://arxiv.org/abs/2510.01070) doesn’t seem to display this bias on WildChat, a very common distribution.