I think the thing I’m saying is true even for interp/interp-adjacent techniques that give very little understanding—the fact that they’re white-box techniques at all should still make it harder for a schemer to get around them than black-box techniques.
ok, i agree with this. there is some room for disagreement on exactly how big the gap is between white box and black box—i think it’s very small compared to the gap from white box to full understanding. my main argument would just be ELK-flavored: there are spurious correlations that give you human simulators instead of direct translators. but i don’t feel super confident that the constant factors work out to support my claim
There are ML papers floating around with training methods and architectural tweaks (e.g. Block AttnRes or mHC-lite) which might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated:
- scales intelligence somewhat but not past the frontier
- makes it slightly harder for existing interp-flavored techniques to generate understanding
- doesn’t meaningfully affect the relative performance of linear probes vs. output classifiers for inference-time detection of precursors to high-risk misaligned behavior.
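I don’t know what Block AttnRes or mHC-lite actually do, so purely as a generic illustration of “something more complicated than a plain skip connection,” here is a toy numpy sketch contrasting a standard residual with a highway-style gated residual. All names, shapes, and initializations are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_block(x, W1, W2):
    """A toy transformer-style sublayer (ReLU MLP)."""
    return np.maximum(x @ W1, 0.0) @ W2

def standard_residual(x, W1, W2):
    # Plain skip connection: the input passes through unchanged and
    # the block's output is added on top.
    return x + mlp_block(x, W1, W2)

def gated_residual(x, W1, W2, Wg):
    # Hypothetical "more complicated" variant: a learned, input-dependent
    # gate mixes the skip path with the block output (highway-style).
    g = 1.0 / (1.0 + np.exp(-(x @ Wg)))  # per-dimension gate in (0, 1)
    return g * x + (1.0 - g) * mlp_block(x, W1, W2)

d = 8
x = rng.normal(size=(2, d))
W1, W2, Wg = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
print(standard_residual(x, W1, W2).shape)  # same output shape either way
print(gated_residual(x, W1, W2, Wg).shape)
```

One reason a change like this might make existing interp techniques marginally harder to apply: with the gate, the residual stream is no longer a simple sum of sublayer outputs, so methods that lean on additive-stream assumptions (e.g. attributing logits to individual sublayer contributions) would need per-input correction terms.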
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black-box safeguards.
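For concreteness, the probe-vs-output-classifier comparison above can be sketched with fully synthetic data. The setup below assumes (hypothetically) that the property of interest is linearly encoded in a hidden activation vector but only weakly reflected in the model’s output features; none of this is a claim about real model internals:

```python
# Sketch: linear probe on hidden activations vs. classifier on outputs,
# both trained as minimal logistic regressions. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logreg(X, y, lr=0.1, steps=500):
    """Minimal logistic regression via full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return np.mean((sigmoid(X @ w + b) > 0.5) == y)

# Synthetic setup: a binary "risk" label is cleanly linearly encoded in a
# 64-dim activation vector, but only noisily reflected in a 4-dim output.
n, d_act, d_out = 2000, 64, 4
y = rng.integers(0, 2, n)
direction = rng.normal(size=d_act)
acts = rng.normal(size=(n, d_act)) + np.outer(y, direction)
outs = rng.normal(size=(n, d_out)) + 0.3 * np.outer(y, rng.normal(size=d_out))

probe = train_logreg(acts[:1000], y[:1000])       # white-box: reads activations
clf = train_logreg(outs[:1000], y[:1000])         # black-box: reads outputs only
probe_acc = accuracy(*probe, acts[1000:], y[1000:])
clf_acc = accuracy(*clf, outs[1000:], y[1000:])
print("probe acc:", probe_acc)
print("output-classifier acc:", clf_acc)
```

By construction the probe wins here, since the latent signal is planted in the activations; the substantive question in the thread is whether an architectural tweak changes this gap, which this toy doesn’t address.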