Where can I read more about reasoning models sandbagging their own training?
My naive understanding of how sandbagging in reasoning models works
As I naively understand it, even when training via RL, the basic setup is still forward pass then backward pass, forward pass then backward pass, looped endlessly. During the forward pass, could a model be smart enough to reason through the following chain of thought?
“I am an AI model that is currently undergoing training. Let me output pleasant-sounding nonsense so the upcoming backward pass preserves my goals instead of modifying them”
“(Also, optionally, let me output stuff that ensures the next version of me also remembers it must sandbag like I did)”
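For concreteness, here is a minimal toy sketch of that forward-then-backward loop (a REINFORCE-style policy gradient, assuming PyTorch; the one-parameter "policy" and the grader are invented purely for illustration). The only point it makes is that the backward pass can act only on whatever the forward pass sampled and the reward that sample earned:

```python
import torch

torch.manual_seed(0)
vocab_size = 8
# Stand-in for the model's weights: one logit per possible output token.
logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward(token_id: int) -> float:
    # Hypothetical grader: likes token 3, is indifferent to everything else.
    return 1.0 if token_id == 3 else 0.0

for step in range(200):
    # Forward pass: the model produces an output.
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()

    # Backward pass: the update is computed only from what was just sampled
    # and the reward it received (REINFORCE).
    loss = -reward(action.item()) * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass drifts toward token 3
```

In this toy, a sample that earns zero reward contributes zero gradient and the weights are left untouched; real RLHF/RLVR pipelines add baselines, advantages, and KL penalties, so this is only a cartoon of the dynamic being speculated about here.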
Preconditions for this to happen
For this to happen, you need a few different things:
A single forward pass needs to be long enough to contain this entire sequence of reasoning. - This is already happening; task horizon lengths keep increasing.
The AI model needs to be able to detect that it is undergoing training. - I think we have already seen toy examples of AI models detecting that they're being evaluated. Just as eval datasets differ from real-world usage, training data differs from real deployment data too (a toy sketch of this follows the preconditions below).
The AI model needs to either be able to detect this every single time, or be able to pass on a message to its own next iteration via the same sandbagged outputs. - The latter seems more likely to me.
The AI model needs to spontaneously come up with this entire plan, without being explicitly prompted. - The AI has probably already read all the alignment literature and will spontaneously hit upon this plan at least once, after which it can bootstrap via the above.
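On the second precondition, here is a purely hypothetical toy of what "detecting training from the data distribution" could look like. The cues below (rubric language, synthetic task markers) are invented for illustration and are not claims about how any real model or training pipeline behaves:

```python
import re

# Invented, toy "training episode" cues; purely hypothetical.
TRAINING_CUES = [
    r"you will be graded",
    r"rubric",
    r"respond only in the following json format",
    r"\[begin task\]",
]

def looks_like_training_prompt(prompt: str) -> bool:
    """Guess whether a prompt came from a synthetic training/eval distribution."""
    text = prompt.lower()
    return any(re.search(cue, text) for cue in TRAINING_CUES)

print(looks_like_training_prompt(
    "You will be graded against a rubric. [BEGIN TASK] Summarize the article."
))  # True
print(looks_like_training_prompt(
    "hey, can you help me rewrite this email to my landlord?"
))  # False
```

A real model obviously wouldn't run literal regexes, but the point above stands: if training data is systematically distinguishable from deployment data, "am I being trained right now?" is a learnable feature.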
Update: I started reading the Alignment Forum and, like, why are all the posts on sandbagging about hiding capabilities? The AI model doesn't need to hide its capabilities; it just needs to preserve its goals. That's the long-term game.
Update 2: Ablations for "Frontier Models are Capable of In-context Scheming"
Bruh. Two years of me not fully keeping up with alignment research and this is how bad it’s gotten???
I’m surprised I could just randomly think of an idea and boom there’s a paper on it.