> well fwiw this does seem pretty plausible for a scheming model to be thinking...
I was thinking more of a traditional backdoor. I think you're partially right, but roughly the same argument applies. You can imagine some additional computations happening that even enter the residual stream and will probably show up in the RS-aligned model. But they would basically look like an irrelevant offset rather than actual scheming, because the model never learned how to turn these activations into anything real. Even if the RS-activations originally encoded its plans to destroy all of humanity, since in the end it always suppresses itself, that same encoding becomes just some weird offset that shows up regardless of the prompt but never does anything.
But I'm not 100% sure about this yet; maybe there are some weird ways in which the activations can still be leveraged as more than just an offset once they are there.
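To make the "just an offset" intuition a bit more concrete, here is a toy sketch. Everything in it is made up for illustration (the three "prompts", the `offset` vector, the weights `w`, `w_in`, `w_out`); it is not taken from any real model. It compares how much a fixed residual-stream offset changes the output under a purely linear readout versus a readout with one ReLU layer.

```python
# Toy illustration only: all vectors and weights are hypothetical numbers.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def linear_readout(resid, w):
    # Output is a plain linear function of the residual stream.
    return dot(resid, w)

def relu_readout(resid, w_in, w_out):
    # One hidden unit per row of w_in, ReLU, then a linear combination.
    hidden = [max(0.0, dot(resid, row)) for row in w_in]
    return dot(hidden, w_out)

# Hypothetical residual-stream vectors for three different prompts.
prompts = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [-1.0, 2.0, 0.0],
}

# Hypothetical leftover "plan" direction, identical for every prompt.
offset = [0.0, 0.0, 1.0]

w = [0.5, -1.0, 2.0]
w_in = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
w_out = [1.0, 1.0]

def offset_effect(readout, *params):
    # How much does adding the offset change the output, per prompt?
    return {
        name: readout(add(x, offset), *params) - readout(x, *params)
        for name, x in prompts.items()
    }

linear_effect = offset_effect(linear_readout, w)
relu_effect = offset_effect(relu_readout, w_in, w_out)

print("linear:", linear_effect)  # same shift for every prompt
print("relu:  ", relu_effect)    # shift depends on the prompt
```

In this toy setup the linear readout sees the offset as the same constant shift for every prompt, which matches the "irrelevant offset" picture. With the ReLU in between, the same fixed offset changes the output by different amounts depending on the prompt, which is one simple way the activations could in principle matter as more than an offset. Whether anything like this happens in an actual RS-aligned model is an open question here, not something the sketch shows.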