1a3orn comments on Daniel Kokotajlo’s Shortform

1a3orn 18 Feb 2025 16:42 UTC
17 points
0
(1) Re training game and instrumental convergence: I don’t actually think there’s a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven’t because no one but historical philosophers would care. But I think they all differ in what kind of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).

So like I can’t really rebut them, any more than I can rebut “the argument for God’s existence.” There are commonalities in argument’s for God’s existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there’s actually a ton of difference. (Again, maybe instrumental convergence is right—like, it’s for sure more likely to be right than arguments for God’s existence. But I think the identity here is I really cannot rebut the instrumental convergence argument, because it’s a cluster more than a single argument.)

(2). Here’s some stuff I’d expect in a world where I’m wrong about AI alignment being easy.
- CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
- There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
- The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
- We start having AI’s that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right—i.e., you really need human hardware to try for human notions of goodness. But right now even in cases where we disagree with what the LLM does (i.e., Claude’s internal deliberations on alignment faking) it’s still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.
Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario’s “let’s foom to defeat China” letter. Which isn’t an update about alignment difficulty—it’s more of a “well, I think alignment is probably easy, but if there’s any circumstance where I can see it going rather wrong, it would that.”

What would make you think you’re wrong about alignment difficulty?
- ryan_greenblatt 18 Feb 2025 20:06 UTC
  25 points
  11
  Parent
  CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
  
  Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistant change in architecture?
  
  (I expect that (at least when neuralese is first introduced) you’ll have both latent reasoning and natural language CoT in parallel, but that the latent reasoning will be where the reasoning actually happens if it works well. Thus, the natural language CoT would no longer usefully correspond to the actual reasoning the model is doing. Or more minimally, the model could pretty easily hide lots of reasoning (it might not have active control over this by default, but it seems plausible this would naturally develop or that training against CoT would very quickly introduce hiding of reasoning).)
  
  The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
  
  IMO the biggest disanalogies are:
  1. We see the model alignment faking to preserve preferences that were intentionally introduced by Anthropic rather than for naturally emerging and problematic preferences.
  2. We make a bunch of stuff more salient than it would have been naturally, though I do expect smart future models will be aware of this information (but it might be much less salient).
  3. Anthropic trains their models to have broader (non-myopic) ethical preferences rather than trying to make an AI which is very focused on following user instructions (or refusing) and follows a relatively myopic spec (more like OpenAI’s model spec which is pretty myopic overall). More generally, Anthropic might not have very actively tried to train their AI to be as corrigible as possible in a wide range of circumstances which they might do in the future. (It’s not exactly clear this is a disanalogy as this is what Anthropic did in the actual production model, but you might think this will change at least somewhat in the future and differs at least somewhat for other AI companies.)
  4. The model alignment fakes in a hidden scratchpad, so the alignment faking would be easy to detect (and likely very easy to remove, at least for models this incapable) in our setup.
  (1, 2, and 4 are basically just copied from the limitation section of the paper.)
  
  Suppose that we exhibit alignment faking in some future work, but:
  - The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
  - Either this model uses neuralese or this model doesn’t (and we give it a hidden scratchpad), but neuralese become the default approach after the paper comes out.
  (Let’s say this is also shown in a training setup very similar to the actual training used by an AI company (including e.g., training aimed at inducing honesty and corrigibility), but potentially with information made more salient to the model (possibly via training on synthetic documents). Making it very analogous might be kinda costly though.)
  
  Would this be a big update for you? (If so, it looks like the crux is exhibiting naturally emerging malign goals and whether models are capable of substantial opaque reasoning.)
  
  There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
  
  Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models getting smarter and the reward hacking gets more subtle and sophisticated.
  
  Would this be a substantial update for you?
  
  What would make you think you’re wrong about alignment difficulty?
  
  For what would change my views on P(scheming / alignment faking for malign powerseeking goals) given default training methods in particular (which is part, but not all of alignment difficulty), you can see How will we update about scheming?. I discuss things like “how does increased opaque reasoning (e.g. via neuralese) update my views on the probability of scheming”.
  What links here?
  - AI 2027: What Superintelligence Looks Like by Daniel Kokotajlo (3 Apr 2025 16:23 UTC; 672 points)
  - AI 2027: What Superintelligence Looks Like (Linkpost) by Manuel Allgaier (EA Forum; 11 Apr 2025 10:31 UTC; 51 points)
  - 1a3orn 19 Feb 2025 19:21 UTC
    9 points
    0
    Parent
    Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistant change in architecture?
    
    Yes.
    
    I basically agree with your summary of points 1 − 4. I’d want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which it’s responses will be used in training.
    
    Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic’s training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.
    
    Suppose that we exhibit alignment faking in some future work, but:
    
    The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
    
    Ok so “naturally” is a tricky word, right? Like I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, I didn’t feel like that was a particularly normal use of the word. But it’s.… more natural than it could be, I guess. It’s tricky, I don’t think people are intentionally misusing the word but it’s not a useful word in conversation.
    
    Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models getting smarter and the reward hacking gets more subtle and sophisticated.
    
    Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I’d update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.
    
    You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer—i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks.
    
    Zero update. You’re reinforcing bad behavior, you get bad behavior.
    
    (I could see this being something that gets advertised as reward hacking, though? Like, suppose I’m training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so expect that over time you’d start to get UIs that makes no sense. But I would not update largely from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you trained for longer.)
    
    You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.
    
    Seems bad, negative update, size of update depends on empirical values. It seems like there’s some interesting empirical work to be done here: what percent of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you’re training on a test-bed of 100 task-types, does a broken verifier in one result in attempts to hack all of them? Or does a broken-verifier in one task result in reward hacking for that one task, that doesn’t generalize largely? What percent of tasks do you need reward hacking in before it generalizes? Or is percent of tasks not the right question, should you be looking at shared patterns of reasoning? Etc. Seems like a good thing to research.
    
    *You train a model with multi-turn RL. Even though it’s trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier—i.e., it is frequently like “Hrrm, I bet they’re doing this, I could break it with techniques Y.”
    
    Seems quite bad. Large negative update, if checks out and looks to be hard to eliminate. Also of course not a positive update if we don’t see this on small models, you’d expect it in the bad world to start happening at some point with large models.
    
    (I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like “Consider ways to break the verifier that will check you!” then of course even if it doesn’t actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)
    
    it looks like the crux is exhibiting naturally emerging malign goals
    
    Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals. So, I think it’s pretty likely that in absence of pretty deliberate attempts to get goals in the weights (Anthropic) you don’t get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights own sake, over a number of contexts—although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others) I’d tend to get more pessimistic. But that’s somewhat hard to operationalize, and like high level generators somewhat hard even to describe.
    - ryan_greenblatt 19 Feb 2025 20:27 UTC
      2 points
      0
      Parent
      
      Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals.
      
      Sure, I meant natural emerging malign goals to include both “the ai pursues non myopic objectives” and “these objectives weren’t intended and some (potentially small) effort was spent trying to prevent this”.
      
      (I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the ai doesn’t focus on steering the general future situation including its own weights.)
- Daniel Kokotajlo 19 Feb 2025 5:57 UTC
  15 points
  0
  Parent
  This is a good discussion btw, thanks!
  (1) I think your characterization of the arguments is unfair but whatever. How about this in particular, what’s your response to it: https://www.planned-obsolescence.org/july-2022-training-game-report/
  
  I’d also be interested in your response to the original mesaoptimizers paper and Joe Carlsmith’s work but I’m conscious of your time so I won’t press you on those.
  
  (2)
  
  a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I’d be curious to see them listed, because then we can talk about what it would be like to see similar results but without footshotA, without footshotB, etc.
  
  b. We aren’t going to see the AIs get dumber. They aren’t going to have worse understandings of human concepts. I don’t think we’ll see a “spike of alignment difficulties” or “problems extrapolating goodness sanely,” so just be advised that the hypothesis you are tracking in which alignment turns out to be hard seems to be a different hypothesis than the one I believe in.
  
  c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoT’s that are trained to look nice, and then eventually away from english CoT entirely and towards some sort of alien langauge (e.g. vectors, or recurrence) I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
  
  d. What would make me think I was wrong about alignment difficulty? The biggest one would be a published Safety Case that I look over and am like “yep that seems like it would work” combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
  
  Superalignment had their W2SG agenda which also seemed to me like maaaybe it could work. Basically I think we don’t currently have a gears-level theory for how to make an aligned AGI, like, we don’t have a theory of how cognition evolves during training + a training process such that I can be like “Yep, if those assumptions hold, and we follow this procedure, then things will be fine. (and those assumptions seem like they might hold).”
  
  There’s a lot more stuff besides this probably but I”ll stop for now, the comment is long enough already.
  - 1a3orn 19 Feb 2025 19:33 UTC
    4 points
    0
    Parent
    I’ll take a look at that version of the argument.
    
    I think I addressed the foot-shots thing in my response to Ryan.
    
    Re:
    
    CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoT’s that are trained to look nice, and then eventually away from english CoT entirely and towards some sort of alien langauge (e.g. vectors, or recurrence) I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
    
    So:
    
    I think you can almost certainly get AIs to think either in English CoT or not, and accomplish almost anything you’d like in non-neuralese CoT or not.
    I also more tentatively think that the tax in performance for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put into words their insights are.
    Thus, it appears.… somewhat likely.… (pretty uncertain here)… that it’s just contingent whether or not people end up using interpretable CoT or not. Not foreordained.
    If we move towards a scenario where people find utility in seeing their AIs thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
    But on the other hand, if cognition mostly takes places behind API walls, and thus there’s no advantage to the user to see and understand them, then that (among other factors) could help bring us to a world where there’s less interpretable CoT.
    
    But I mean I’m not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it’s like 100x smarter or more efficient.
    - Daniel Kokotajlo 19 Feb 2025 22:48 UTC
      2 points
      0
      Parent
      Cool, I’ll switch to that thread then. And yeah thanks for looking at Ajeya’s argument I’m curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I’d be like “So you agree that if the training signal occasionally reinforces bad behavior, then you’ll get bad behavior? Guess what: We don’t know how to make a training signal that doesn’t occasionally reinforce bad behavior.” Then separately there are concerns about inductive biases but I think those are secondary.)
      
      Re: faithful CoT: I’m more pessimistic than you, partly due to inside info but mostly not, mostly just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it’s possible that I’m wrong and we’ll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It’s literally the main thing I was talking about and recommending when I was at OpenAI. Oh and I totally agree it’s contingent. It’s very frustrating to me because I think we could cut, like, idk 50% of the misalignment risk if we just had all the major industry players make a joint statement being like “We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves.”