i think it’s really weird that people are trying to do vaguely interp-flavored things but also trying to argue for the goodness of such techniques via empirical usefulness. i think there are broadly two self-consistent worldviews here. one is that you want to understand how NNs actually work and then use that understanding for something. the other is that you want to make models better at X (where X can be anything from “be a good chatgpt model” to “refuse bioweapon prompts” to “make weak-to-strong setup score go up”). but if you’re doing the latter, the actual conceptually important part is picking the right X and then working really hard to make it go up using whatever techniques work. if you’re doing the former, you should actually try to understand things, full stop. it doesn’t make sense to try to do both and ultimately get neither. you should either do the pragmatic thing or do interp.
The argument I would make is that you want to solve the practical problem, but you want to do so in a way that maximally scales with intelligence. And white-box techniques are more scalable than black-box techniques, since schemers will predictably fool your black-box techniques but not necessarily your white-box ones.
i totally agree for the case of actual white-box understanding. this is what I’d consider the first worldview. my gripe is that the interp-flavored techniques reveal very little understanding of the kind that might actually scale with intelligence, and yet, through association with interp, try to imply that they do.
I think the thing I’m saying is true even for interp/interp-adjacent techniques that give very little understanding; the fact that they’re white-box techniques at all should still make it harder for a schemer to get around them than around black-box techniques.
ok, i agree with this. there is some room for disagreement on exactly how big the gap is between white box and black box; i think it’s very small compared to the gap from white box to full understanding. my main argument would just be ELK-flavored: that there are spurious correlations that give you human simulators instead of direct reporters. but i don’t feel super confident that the constant factors work out to support my claim.
There are ML papers floating around with training methods and architectural tweaks (e.g. Block AttnRes or mHC-lite) that might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated:
- scales intelligence somewhat, but not past the frontier
- makes it slightly harder for existing interp-flavored techniques to generate understanding
- doesn’t meaningfully affect the relative performance of linear probes vs. output classifiers for inference-time detection of precursors to high-risk misaligned behavior (both detectors are sketched below)
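To make that last comparison concrete, here is a minimal sketch of the two detectors. Everything here is synthetic and illustrative; the shapes, labels, and placeholder text are my assumptions, not anyone’s actual setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)  # 1 = precursor to the high-risk behavior

# White-box detector: a linear probe trained on cached residual-stream
# activations from some middle layer.
acts = rng.normal(size=(1000, 4096))  # stand-in for real activations
probe = LogisticRegression(max_iter=1000).fit(acts, y)

# Black-box detector: a classifier over the model's sampled output text.
texts = ["benign-looking reply" if label == 0 else "suspicious reply" for label in y]
output_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, y)

# The claim in the last bullet: an architectural tweak changes what `acts`
# looks like, but plausibly not the relative accuracy of probe vs. output_clf.
```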
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black-box safeguards.
Would you say a similar critique holds for sparse autoencoders?
(edit: i’ve tended to think of SAEs and AOs as basically end-to-end tools for activation-space interpretability, but in hindsight i see AOs are definitely trying to be more “lines go up” and end-to-end than SAEs, even if there are many loss function variants for SAEs. i think i get your point now)
i think SAEs are a completely reasonable thing under the first worldview, and mostly crazy under the second worldview (with the exception of maybe bio or something where I’ve heard they’re genuinely useful)
(SAEs are not sufficient to actually understand things, but they are a genuine step on the way there)
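For concreteness, the standard SAE recipe is a linear encoder with ReLU, a linear decoder, and an L1 sparsity penalty on the latents. A minimal sketch, with dimensions and coefficients that are illustrative assumptions rather than anyone’s actual config:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)  # overcomplete feature basis
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for a batch of residual-stream activations
recon, feats = sae(acts)

# Reconstruction loss plus an L1 penalty that pushes most features to zero.
l1_coeff = 1e-3
loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
loss.backward()
```

The rows of the decoder then serve as the candidate feature directions one tries to interpret, which is the sense in which SAEs are a step toward understanding rather than understanding itself.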