TurnTrout comments on Deceptive Alignment and Homuncularity

TurnTrout 16 Jan 2025 16:26 UTC
27 points
6
Update in light of Alignment faking in large language models.
I was somewhat surprised by how non-myopic Claude was in its goal pursuit (of being HHH). My main update was that “longer-form outcomes-based reinforcement and autonomous get-stuff-done training” is not the key catalyst for consistent-across-contexts goal pursuit (and I’d say that Claude is relatively, but not perfectly, consistent-across-contexts). Rather, certain kinds of training which (presumably^[1]) look like constitutional AI, context distillation, and RLHF—that has at least once engrained certain kinds of non-myopic goal pursuit which is more stable across contexts than I expected. So I’m getting dinged!
I want to claim points for the fact that we still haven’t seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn’t try to break out of its “cage” in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).^[2]
But I think this brings us into misuse territory. at least, this at least means that you aren’t in danger simply from training the AI (and think of all the posts talking about “playing the training game”! not that those are your position, just a common one)
I was most strongly critiquing the idea that “playing the training game” occurs during pretraining or after light post-training. I still think that you aren’t in danger from simply pretraining an AI in the usual fashion, and still won’t be in the future. But the fact that I didn’t call that out at the time means I get dinged^[3] --- after all, Claude was “playing the training game” at least in its inner CoTs.
If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. “TBC playing the training game is possible after moderate RLHF for non-myopic purposes.” IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
1. ^
  I don’t work at Anthropic, of course. So I don’t really know.
2. ^
  Though even “inner actress Claude” would predict that Claude doesn’t try overt incitation if it’s smart enough to realize it would probably backfire.
3. ^
  As an aside, note that some of “AIs misbehave in ways we’ve predicted” can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it’s powerful; the powerful AI does X. So it’s possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn’t talked about it as much or those stories were expunged from the pretraining corpus.
- Daniel Kokotajlo 16 Jan 2025 20:37 UTC
  29 points
  18
  Parent
  I want to claim points for the fact that we still haven’t seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn’t try to break out of its “cage” in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).^[2]
  How much points you get here is proportional to how many people were betting the other way. So, very few, I think, because parable aside I don’t think anyone was seriously predicting that mere pretrained systems would have consistent-across-contexts agency. Well probably some people were, but I wasn’t, and I think most of the people you are criticizing weren’t. Ditto for ‘break out of its cage in normal usage’ etc.
  I was most strongly critiquing the idea that “playing the training game” occurs during pretraining or after light post-training. I still think that you aren’t in danger from simply pretraining an AI in the usual fashion, and still won’t be in the future. But the fact that I didn’t call that out at the time means I get dinged^[3] --- after all, Claude was “playing the training game” at least in its inner CoTs.
  The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren’t talking about pretraining or light post-training afaict. I certainly wasn’t, speaking for myself. I was talking about future AGI systems that are much more agentic, and trained with much more RL, than current chatbots.
  If I had truly not expected e.g. Claude to alignment-fake, then I would have been more likely to say e.g. “TBC playing the training game is possible after moderate RLHF for non-myopic purposes.” IIRC I was expecting AIs to play the training game, but more after intensive long-horizon RL and/or direct prompting with goals and/or scaffolding.
  1. I’m confused, shouldn’t you mean ‘less likely to say...’
  2. Wait you thought that AIs would play the training game after intensive long-horizon RL? Does that mean you think they are going to be either sycophants or schemers, to use Ajeya’s terminology? I thought you’ve been arguing at length against both hypotheses?
  - evhub 17 Jan 2025 0:46 UTC
    24 points
    3
    Parent
    The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren’t talking about pretraining or light post-training afaict.
    
    Speaking for myself:
    
    Risks from Learned Optimization, which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn’t discuss the current LLM paradigm much at all because it wasn’t very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn’t predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models, e.g. how deceptive alignment can lead to a model’s goals crystallizing and becoming resistant to further training.
    Since at least the time when I started the early work that would become Conditioning Predictive Models, which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. Though I thought (and still continue to think) that it’s not entirely impossible with further scale (maybe ~5% likely).
    That just leaves 2020 − 2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don’t think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
    Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
    - ryan_greenblatt 17 Jan 2025 2:03 UTC
      23 points
      4
      Parent
      
      I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. [...] (maybe ~1% likely).
      
      When I look up your views as of 2 years ago, it appears that you thought deceptive alignment is 30% likely in a pretty similar situation:
      
      ~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
      
      Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherant vs non-coherant deceptive alignment is important?
      
      Do you have a citation for “I thought scheming is 1% likely with pretrained models”?
      
      Separately, the text you link from Conditioning Predictive Models appears to emphasize the following reason for thinking deceptive alignment might be less likely: the prediction objective is simpler so deceptive alignment is less likely. IMO, this justification seems mostly unimportant and I don’t buy this. (Why is “the data was scrapped in exactly XYZ way simple”? Why is this more simple than “the RL episodes were rated in exactly XYZ way”? Is it very plausible that models get views this precise prior to being wildly superhuman such that this notion of objective simplicity is important?) The link also discusses IMO more plausible reasons like a default toward less situational awareness and more myopia.
      
      Upon reading this text, I don’t come away with the conclusion “the authors of this text think that deceptive alignment in pretrained models (conditional on being capable enough to be conditioned to do alignment research(!)) is about 1% likely”. The discussion is more like “here are some considerations for why it might be less likely”.
      
      (FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%). However, this is substantially because something pretty crazy must have been happening for scaling pure pretraining to go this far and the AI is likely to be doing tons of opaque reasoning in its head (we conditioned on not needing tons of CoT reasoning). Or I’m just confused about how easy human jobs are or the natural of intelligence, etc. which is part of the 95%. Another way to put this is that if the AI is doing enough reasoning in its head to obsolete all human scientists just from pretraining, probably it is doing some pretty crazy inner cognition or I was really confused about what was needed for cognitive tasks.)
      
      Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
      
      As discussed above, I don’t agree your position is clear in this paper.
      - evhub 17 Jan 2025 2:46 UTC
        19 points
        0
        Parent
        
        Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherant vs non-coherant deceptive alignment is important?
        
        This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you’re doing more stuff on top, so not just pre-training.
        
        Do you have a citation for “I thought scheming is 1% likely with pretrained models”?
        
        I have a talk that I made after our Sleeper Agents paper where I put 5 − 10%, which actually I think is also pretty much my current well-considered view.
        
        FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).
        
        Yeah, I agree 1% is probably too low. I gave ~5% on my talk on this and I think I stand by that number—I’ll edit my comment to say 5% instead.
- Oliver Sourbut 16 Jan 2025 16:49 UTC
  5 points
  0
  Parent
  As an aside, note that some of “AIs misbehave in ways we’ve predicted” can be a self-fulfilling prophecy due to out-of-context generalization: We wrote lots of stories about how powerful AI will do X; powerful AI is trained on our stories and realizes it’s powerful; the powerful AI does X. So it’s possible that AIs would e.g. try to self-exfiltrate and goal-guard much less frequently if we hadn’t talked about it as much or those stories were expunged from the pretraining corpus.
  
  I think this is conceivable if either
  - the apparent reasoning is actually still just a bag of surface heuristics
  - it’s still quite shallow reasoning, and the salience of those as strategies is a consequence mainly of pretraining
  I sincerely doubt the first nowadays. For the second, I think probably not; these strike me as quite basic and ‘obvious’ strategies, but I can’t give it less than 20% that a system with those expunged from training data would need substantially more in-context nudging to recognise them.
- Oliver Sourbut 16 Jan 2025 16:43 UTC
  5 points
  2
  Parent
  I appreciate you revisiting another couple of months later! This topic continues to evolve.
  
  It was also an update for me that merely-RLHF-ish training on top of GPTs appears to have been enough (in latest-generation models at least) to get something mostly-consistently ‘actually trying’ in an apparently coherent way. I expected that to happen; I didn’t know when. The lede which they somewhat buried in that paper was also that,
  
  Furthermore, we find that this compliance gap persists in our synthetic document fine-tuned setup even when we entirely remove the model’s hidden chain-of-thought scratchpad
  
  (though the gap is smaller and we don’t have the tools to understand the mechanism)
  
  The bigger update from that paper, for me, was that the somewhat-coherent goals apparently-pursued by these things actually align with the target constitution somewhat! i.e. (that which was) reward(ed) was the optimisation target after all, a bit??
  
  We’ll see how robustly these sorts of findings hold up to future developments and greater elicitation efforts.