[Question] Goodfire and Training on Interpretability
Goodfire wrote "Intentionally designing the future of AI," a post about training on interpretability.
This seems like an instance of the Most Forbidden Technique, which has been warned against over and over: applying optimization pressure to an interpretability technique [T] eventually degrades [T].
Goodfire claims they are aware of the associated risks and are managing them.
Are they properly managing those risks? I would love to get your thoughts on this.