Interpretability
Views my own
Cool case study!
1. I’m kind of sad that the Karpathy work is likely going to cause a bunch of work that hill-climbs directly on the eval (I think you do this here?). This makes automated AI work sketchy IMO. In https://arxiv.org/abs/2601.11516 we note that e.g. “the
large early drop in Fig. 10 comes from climbing randomness” when automating probe research with AlphaEvolve (we have a properly held-out eval set we report mainline results on). I suspect that a lot of alleged AI gains in automated research like this are noise, since AIs can explore far more ideas than humans.
Note that in https://xcancel.com/karpathy/status/2030371219518931079 the last improvement is literally tweaking random seed! :/
2. It’s been so long since I’ve worked on this, but FWIW these sorts of ancient dictionary learning algorithms were definitely in the water supply in 2024. For example, here we note of our dictionary learning algorithm that a “possible application is actually replacing the encoder at test time, to increase the loss recovered of the sparse decomposition”, and here we even used FISTA.
An extreme (and close-to-home) example is documented in TracingWoodgrains’s exposé of David Gerard’s Wikipedia smear campaign against LessWrong and related topics. That’s an unusually crazy story [...]
This is even closer to home—David Gerard has commented on the Wikipedia Talk Page and referenced this LW post: https://web.archive.org/web/20250814022218/https://en.wikipedia.org/wiki/Talk:Mechanistic_interpretability#Bad_sourcing,_COI_editing
I didn’t find the system prompt very useful on other models (I very rarely use GPT-4.5)
E.g. Gemini 2.5 Pro tends to produce longer outputs with shoehorned references when given this prompt (link one), whereas using no system prompt produces a shorter response (link two — obviously highly imperfect, but much better IMO)
Possibly @habryka has updated this?
Upweighting positive data
Data augmentation
...
It may also be worth up-weighting https://darioamodei.com/machines-of-loving-grace along with the AI optimism blog post in the training data. In general it is a bit sad that there isn’t more good writing that I know of on this topic.
the best vector for probing is not the best vector for steering
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
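To illustrate the discrepancy on synthetic data (a toy sketch, not from the cited paper): when class-conditional covariance is anisotropic, the Bayes-optimal linear probe direction (the LDA weight vector, Σ⁻¹Δμ) and the natural steering direction (the raw difference of class means, Δμ) point in noticeably different directions.

```python
import numpy as np

# Toy illustration of the predict/control discrepancy:
# with shared anisotropic covariance, the optimal linear *probe*
# direction (LDA: Sigma^{-1} (mu1 - mu0)) differs from the natural
# *steering* direction (the raw difference of class means).
rng = np.random.default_rng(0)
cov = np.array([[4.0, 1.2], [1.2, 0.5]])       # anisotropic, correlated
mu0, mu1 = np.zeros(2), np.array([1.0, 0.5])
X0 = rng.multivariate_normal(mu0, cov, size=1000)
X1 = rng.multivariate_normal(mu1, cov, size=1000)

# Steering direction: move activations along the mean difference.
steer = X1.mean(axis=0) - X0.mean(axis=0)

# Probe direction: whiten by the pooled covariance (LDA weight vector).
pooled = np.cov(np.vstack([X0 - X0.mean(0), X1 - X1.mean(0)]).T)
probe = np.linalg.solve(pooled, steer)

# Cosine similarity is well below 1: the best vector for predicting
# the class is not the best vector for moving between the classes.
cos = steer @ probe / (np.linalg.norm(steer) * np.linalg.norm(probe))
```

The gap grows with how anisotropic the covariance is; with isotropic noise the two directions coincide.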
I suggested something similar, and this was the discussion (bolding is the important author pushback):
Arthur Conmy
11:33 1 Dec
Why can’t the YC company avoid using system prompts and instead:
1) Detect whether regex has been used in the last ~100 tokens (and run this check every ~100 tokens of model output)
2) If yes, rewind back ~100 tokens, insert a comment like # Don’t use regex here (in a valid way given what code has been written so far), and continue the generation
Dhruv Pai
10:50 2 Dec
This seems like a reasonable baseline with the caveat that it requires expensive resampling and inserting such a comment in a useful way is difficult.
When we ran baselines simply repeating the number of times we told the model not to use regex right before generation in the system prompt, we didn’t see the instruction following improve (very circumstantial evidence). I don’t see a principled reason why this would be much worse than the above, however, since we do one-shot generation with such a comment right before the actual generation.
Yeah, I think due to CLT stuff happening, less focus was on the single resid stream SAE (which was probably? a good idea)