scasper comments on EIS XV: A New Proof of Concept for Useful Interpretability

scasper 19 Mar 2025 13:28 UTC
Yea, thanks, good point. On one hand, I am assuming that once the SAE neurons of interest are identified, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.