[Question] Goodfire and Training on Interpretability
Goodfire wrote Intentionally designing the future of AI about training on interpretability.
This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: optimization pressure on interpretability technique [T] eventually degrades [T].
Goodfire claims they are aware of the associated risks and are managing them.
Are they properly managing those risks? I would love to get your thoughts on this.
EDITED TO ADD: I just took the non-Goodfire-specific part of this comment, and spun it out and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training
~ ~ ~
No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See for example Reward Function Design: a starter pack sections 1 & 4 & 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
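A minimal sketch of that idea in code, purely as an illustration (every module name and shape below is hypothetical, and the fixed "probe" stands in for whatever hardwired circuitry the brain actually uses): the reward is computed from the agent's internal belief state, not from its raw observations.

```python
import torch
import torch.nn as nn

# Hypothetical illustration: a "compassion-like" reward that reads the agent's
# *internal* belief state rather than its current observations. All names and
# dimensions are made up for this sketch.

class WorldModel(nn.Module):
    """Stand-in for a learned, largely inscrutable world model."""
    def __init__(self, obs_dim: int = 64, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, latent_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        _, latent = self.encoder(obs_seq)      # latent belief state
        return latent.squeeze(0)               # shape: (batch, latent_dim)

class InnateRewardProbe(nn.Module):
    """Fixed ("innate") probe: fires on latent features that track
    'my friend is suffering', even if no friend appears in the current input."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.probe = nn.Linear(latent_dim, 1)  # imagine this part is hardwired

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        friend_suffering = torch.sigmoid(self.probe(latent))
        return -friend_suffering               # negative reward when the *belief* says the friend suffers

world_model = WorldModel()
reward_fn = InnateRewardProbe()

obs_seq = torch.randn(1, 10, 64)               # ten steps of observations
latent = world_model(obs_seq)
reward = reward_fn(latent)                     # reward depends on beliefs, not on pixels
```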
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad, and make our problems all worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that theoretical framework would be good. (But probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
To elaborate a bit (but you should go read the posts!), the classic Most Forbidden Technique scenario goes like this: you detect “bad thoughts” in the CoT/activations, you penalize them, and the model learns to have the same bad thoughts in ways your detector can’t see.
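A toy sketch of that failure mode (hypothetical names and assumed API, not any particular lab's pipeline): a frozen detector scores the model's activations and that score is added to the loss, so gradient descent can reduce the penalty either by dropping the bad thought or by re-encoding it somewhere the detector can't see.

```python
import torch

# Toy sketch of the classic Most Forbidden Technique setup (all names hypothetical).
# `model(batch)` is assumed to return (task_loss, activations); `detector` is a
# frozen probe trained to flag some undesired cognition in those activations.

def training_step(model, detector, batch, optimizer, penalty_weight: float = 1.0):
    task_loss, activations = model(batch)

    # Detector parameters are frozen (requires_grad=False), but gradients still
    # flow through the *activations* back into the model's weights.
    detector_score = detector(activations)

    # Adding the detector's score to the loss optimizes the model against the
    # detector: often the cheapest "fix" is to encode the same thought in a
    # basis the detector never learned, which is exactly how [T] gets degraded.
    loss = task_loss + penalty_weight * detector_score.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```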
Goodfire seems to be doing something structurally different, by decomposing the gradient into semantic components and choosing which components to apply. Does that really make the difference, though? I don’t know!
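To make the structural difference concrete, here is one way "decompose the gradient into semantic components and choose which to apply" could look in code. This is my guess at the general shape of such a scheme, not Goodfire's actual method; the feature dictionary and the filtering rule are both made up.

```python
import torch

# Hypothetical sketch of "decompose the update into semantic components and
# apply only some of them" -- a guess at the general shape, NOT Goodfire's method.
# `feature_dirs` is a dictionary of human-labeled directions (e.g. from a sparse
# autoencoder), assumed roughly orthonormal for this toy example; `allowed` is
# the subset of labels we endorse updating on.

def filtered_update(grad: torch.Tensor,
                    feature_dirs: dict[str, torch.Tensor],
                    allowed: set[str]) -> torch.Tensor:
    kept = torch.zeros_like(grad)
    residual = grad.clone()
    for name, direction in feature_dirs.items():
        component = (grad @ direction) * direction  # projection onto this labeled direction
        residual = residual - component
        if name in allowed:
            kept = kept + component                 # apply only the endorsed components
    # `residual` is the part of the gradient that no labeled feature explains.
    # Whether you drop it (as here) or keep it, the un-interpreted part of the
    # update is where selection pressure you can't see ends up.
    return kept

# Usage sketch with made-up directions:
d = 8
g = torch.randn(d)
dirs = {"honesty": torch.eye(d)[0], "sycophancy": torch.eye(d)[1]}
update = filtered_update(g, dirs, allowed={"honesty"})
```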
It would make the difference if and only if you knew, and could correctly interpret, all the semantic components.
Which you can’t, so no, it doesn’t make the difference. The selection pressure just goes into the parts that you don’t know about or don’t completely understand.
In general, it’s Most Forbidden specifically because it trains away your ability to interpret.
Thank you—this matches my current thinking
This isn’t a strong argument, but that (and the Goodfire essay describing their approach) strongly reminds me of a really clever design for a perpetual motion machine. I don’t see any description of a technique like soft optimization (https://www.lesswrong.com/posts/9fL22eBJMtyCLvL7j/soft-optimization-makes-the-value-target-bigger), or a similar technique that acknowledges catastrophic Goodhart (https://www.lesswrong.com/posts/fuSaKr6t6Zuh6GKaQ/when-is-goodhart-catastrophic) and Garrabrant’s Goodhart taxonomy, and works within those restrictions.
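For concreteness, the canonical soft-optimization move is quantilization: sample from the top-q fraction of a trusted base distribution instead of taking the proxy's argmax. A toy illustration (my own example, not code from the linked posts):

```python
import random

# Toy contrast between hard optimization of a possibly-mis-specified proxy and
# soft optimization via quantilization. Purely illustrative.

def argmax_optimize(options, proxy_score):
    return max(options, key=proxy_score)      # hard: lands on the proxy's extreme point,
                                              # where Goodhart failures concentrate

def quantilize(options, proxy_score, q: float = 0.1):
    ranked = sorted(options, key=proxy_score, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])     # soft: a good-but-not-extremal choice

# Usage with a deliberately crude proxy:
options = list(range(1000))
proxy = lambda x: x                           # pretend the proxy diverges from true value at the extreme
print(argmax_optimize(options, proxy))        # always 999
print(quantilize(options, proxy, q=0.1))      # some option in the top decile
```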