I would disagree that either prediction one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It’s also a bit ambiguous whether the prediction involves fixing with interpretability or fixing by unrelated means. It wouldn’t surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I’d guess an LLM classifier would be more effective. Did they do that in the paper?
(To be clear, I think it’s a great paper, and to the degree that there’s a disagreement here it’s that I think your predictions weren’t covering the right comparative advantages of interp)
Yeah, thanks, good point. On one hand, I am assuming that after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
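For concreteness, here’s a rough sketch of what I mean by “used for steering” (my own assumption about how it might look, not the paper’s actual setup): subtract the flagged SAE feature’s decoder direction from a layer’s activations via a forward hook. All the names here (feature_idx, alpha, the toy layer) are hypothetical placeholders.

```python
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, alpha: float = 5.0):
    """Forward hook that shifts a layer's output along -direction."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Nudge activations away from the behaviour-linked feature direction
        return output - alpha * unit

    return hook

# Toy demonstration with a stand-in "layer" of width 16.
d_model = 16
layer = nn.Linear(d_model, d_model)
sae_decoder = torch.randn(d_model, 128)   # stand-in for the SAE decoder weights
feature_idx = 42                          # the hypothetical neuron flagged by auditing
direction = sae_decoder[:, feature_idx]

handle = layer.register_forward_hook(make_steering_hook(direction))
steered = layer(torch.randn(1, d_model))  # activations now pushed away from the feature
handle.remove()
```

Whether this actually removes the behaviour (rather than just masking it) is exactly the kind of thing I’d expect adversarial fine-tuning to handle more robustly.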