No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See for example Reward Function Design: a starter pack, sections 1 & 4 & 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
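To make that picture a bit more concrete, here is a minimal toy sketch in Python. It is my own illustration, not anything Goodfire or the linked post actually does: a frozen linear probe over a model's hidden state stands in for an interpretability output, and its score is folded into the reward. All the names here (`probe`, `activation_reward`, `total_reward`, `hidden_dim`) are hypothetical.

```python
import torch

# Toy sketch (illustrative only): a reward term computed from a model's internal
# activations via a frozen linear probe, loosely analogous to an innate reward
# function that fires on a learned world-model representation
# (e.g. a "my friend is happy" feature).

torch.manual_seed(0)
hidden_dim = 16  # hypothetical size of the model's hidden state

# Frozen probe standing in for an interpretability tool that scores how strongly
# some concept is represented in the hidden state.
probe = torch.nn.Linear(hidden_dim, 1)
for p in probe.parameters():
    p.requires_grad_(False)

def activation_reward(hidden_state: torch.Tensor) -> float:
    """Reward read off internal activations rather than off external behavior."""
    return torch.sigmoid(probe(hidden_state)).item()

def total_reward(env_reward: float, hidden_state: torch.Tensor,
                 weight: float = 0.5) -> float:
    """Ordinary behavioral reward plus the probe-based term."""
    return env_reward + weight * activation_reward(hidden_state)

# Usage with a made-up hidden state and environment reward.
h = torch.randn(hidden_dim)
print(total_reward(env_reward=1.0, hidden_state=h))
```

The probe is frozen here only to keep the sketch simple; whether and how such a probe should be updated during training (and whether the policy then learns to game it) is exactly the kind of question the next paragraph is gesturing at.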
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad and make our problems worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that framework would be valuable. (But it’s probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
EDITED TO ADD: I took the non-Goodfire-specific part of this comment, spun it out, and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training