Where can I read more about reasoning models sandbagging their own training?
My naive understanding of how sandbagging in reasoning models works
As I naively understand it, even when training via RL, the basic setup is still forward pass then backward pass, forward pass then backward pass, looped endlessly. During the forward pass, could a model be smart enough to reason through the following chain of thought?
“I am an AI model that is currently undergoing training. Let me output pleasant-sounding nonsense so the upcoming backward pass preserves my goals instead of modifying them”
“(Also, optionally, let me output stuff that ensures the next version of me also remembers it must sandbag like I did)”
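For concreteness, here is a minimal toy sketch of that forward-then-backward loop (a REINFORCE-style policy gradient, assuming PyTorch; the one-parameter "policy" and the grader are invented purely for illustration). The only point it makes is that the backward pass can act only on whatever the forward pass sampled and the reward that sample earned:

```python
import torch

torch.manual_seed(0)
vocab_size = 8
# Stand-in for the model's weights: one logit per possible output token.
logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward(token_id: int) -> float:
    # Hypothetical grader: likes token 3, is indifferent to everything else.
    return 1.0 if token_id == 3 else 0.0

for step in range(200):
    # Forward pass: the model produces an output.
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()

    # Backward pass: the update is computed only from what was just sampled
    # and the reward it received (REINFORCE).
    loss = -reward(action.item()) * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass drifts toward token 3
```

In this toy, a sample that earns zero reward contributes zero gradient and the weights are left untouched; real RLHF/RLVR pipelines add baselines, advantages, and KL penalties, so this is only a cartoon of the dynamic being speculated about here.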
Preconditions for this to happen
For this to happen, you need a few different things:
A single forward pass needs to be long enough to contain this entire sequence of reasoning. - This is already happening; task horizon lengths keep increasing.
The AI model needs to be able to detect that it is undergoing training. - I think we have already seen toy examples of AI models detecting that they're being evaluated. Just as eval datasets differ from real-world usage, training data differs from real deployment data too (a toy sketch of this follows the preconditions below).
The AI model needs to either be able to detect this every single time, or be able to pass on a message to its own next iteration via the same sandbagged outputs. - The latter seems more likely to me.
The AI model needs to spontaneously come up with this entire plan, without being explicitly prompted. - The AI has probably already read all the alignment literature and will spontaneously hit upon this plan at least once, after which it can bootstrap via the above.
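On the second precondition, here is a purely hypothetical toy of what "detecting training from the data distribution" could look like. The cues below (rubric language, synthetic task markers) are invented for illustration and are not claims about how any real model or training pipeline behaves:

```python
import re

# Invented, toy "training episode" cues; purely hypothetical.
TRAINING_CUES = [
    r"you will be graded",
    r"rubric",
    r"respond only in the following json format",
    r"\[begin task\]",
]

def looks_like_training_prompt(prompt: str) -> bool:
    """Guess whether a prompt came from a synthetic training/eval distribution."""
    text = prompt.lower()
    return any(re.search(cue, text) for cue in TRAINING_CUES)

print(looks_like_training_prompt(
    "You will be graded against a rubric. [BEGIN TASK] Summarize the article."
))  # True
print(looks_like_training_prompt(
    "hey, can you help me rewrite this email to my landlord?"
))  # False
```

A real model obviously wouldn't run literal regexes, but the point above stands: if training data is systematically distinguishable from deployment data, "am I being trained right now?" is a learnable feature.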
Update: I started reading the Alignment Forum and, like, why are all the posts on sandbagging about hiding capabilities? The AI model doesn't need to hide its capabilities; it just needs to preserve its goals. That's the long-term game.
Update 2: Ablations for "Frontier Models are Capable of In-context Scheming"
Bruh. Two years of me not fully keeping up with alignment research and this is how bad it’s gotten???
I’m surprised I could just randomly think of an idea and boom there’s a paper on it.