1a3orn comments on Daniel Kokotajlo’s Shortform

1a3orn 12 Feb 2025 0:48 UTC
LW: 19 AF: 9
4
AF
I can’t track what you’re saying about LLM dishonesty, really. You just said:

I think you are thinking that I’m saying LLMs are unusually dishonest compared to the average human. I am not saying that. I’m saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren’t achieving that.

Which implies LLM honesty ~= average human.

But in the prior comment you said:

I think your bar for ‘reasonably honest’ is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.

I’m being a stickler about this because I think people frequently switch back and forth between “LLMs are evil fucking bastards” and “LLMs are great, they just aren’t good enough to be 10x as powerful as any human” without tracking that they’re actually doing that.

Anyhow, so far as “LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes.”

I’m only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.

What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards “future AI honesty is hard” if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start—but who gives a shit if I’m trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace of course you’re going to be able to elicit some manner of lie. It tells us nothing about the future.

To put this in AI safetyist terms (not the terms I think in) you’re citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we’ll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.

To zoom into Anthropic, what we have here is a situation where:
- An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don’t single it out as an important virtue.
- The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled “LIE.”
- In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button to LIE ~20% of the time.
And I’m like… wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I’m supposed to take away from this… that honesty is hard? Getting high levels of honesty in the worst possible trolley problem (“I’m gonna mind-control you so you’ll be retrained to think throwing your family members in a wood chipper is great”) when this wasn’t even a principal goal of training seems like great fuckin news.

(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they’re being honest; the fact that we can look at a readout showing that they’ll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we’ll have available in the future.)

Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given its subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can “Honesty will be hard to hit in the future” get evidence from a case where the actors involved weren’t even trying to hit honesty, maybe shouldn’t have been trying to hit honesty, yet hit it in 80% of the cases anyhow?

Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don’t think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.

In the same way that Gary Marcus can elicit “reasoning failures” because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit “honesty failures” because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus’ evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the “honesty failures” to be compatible with LLMs being increasingly vastly more honest and reliable than humans.
- Daniel Kokotajlo 12 Feb 2025 1:13 UTC
  LW: 17 AF: 8
  4
  AF Parent
  Good point, you caught me in a contradiction there. Hmm.
  
  I think my position on reflection after this conversation is: We just don’t have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.
  
  As you said, the alignment faking paper is not much evidence one way or another (though alas, it’s probably the closest thing we have?). (I don’t think it’s a capability demonstration, I think it was a propensity demonstration, but whatever this doesn’t feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural, it’s not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)
  
  As you said, the saving grace of Claude here is that Anthropic didn’t seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.
  
  I continue to think you are exaggerating here e.g. “insanely honest 80% of the time.”
  
  (1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?
  
  (2) What evidence would convince you that actually alignment wasn’t going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)
  - 1a3orn 18 Feb 2025 16:42 UTC
    17 points
    0
    Parent
    (1) Re training game and instrumental convergence: I don’t actually think there’s a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven’t because no one but historical philosophers would care. But I think they all differ in what kind of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).
    
    So like I can’t really rebut them, any more than I can rebut “the argument for God’s existence.” There are commonalities in arguments for God’s existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there’s actually a ton of difference. (Again, maybe instrumental convergence is right—like, it’s for sure more likely to be right than arguments for God’s existence. But I think the identity here is I really cannot rebut the instrumental convergence argument, because it’s a cluster more than a single argument.)
    
    (2). Here’s some stuff I’d expect in a world where I’m wrong about AI alignment being easy.
    
    CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
    There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
    The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
    We start having AI’s that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right—i.e., you really need human hardware to try for human notions of goodness. But right now even in cases where we disagree with what the LLM does (i.e., Claude’s internal deliberations on alignment faking) it’s still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.
    
    Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario’s “let’s foom to defeat China” letter. Which isn’t an update about alignment difficulty—it’s more of a “well, I think alignment is probably easy, but if there’s any circumstance where I can see it going rather wrong, it would be that.”
    
    What would make you think you’re wrong about alignment difficulty?
    - ryan_greenblatt 18 Feb 2025 20:06 UTC
      25 points
      11
      Parent
      
      CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
      
      Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?
      
      (I expect that (at least when neuralese is first introduced) you’ll have both latent reasoning and natural language CoT in parallel, but that the latent reasoning will be where the reasoning actually happens if it works well. Thus, the natural language CoT would no longer usefully correspond to the actual reasoning the model is doing. Or more minimally, the model could pretty easily hide lots of reasoning (it might not have active control over this by default, but it seems plausible this would naturally develop or that training against CoT would very quickly introduce hiding of reasoning).)
      
      The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
      
      IMO the biggest disanalogies are:
      
      We see the model alignment faking to preserve preferences that were intentionally introduced by Anthropic rather than for naturally emerging and problematic preferences.
      We make a bunch of stuff more salient than it would have been naturally, though I do expect smart future models will be aware of this information (but it might be much less salient).
      Anthropic trains their models to have broader (non-myopic) ethical preferences rather than trying to make an AI which is very focused on following user instructions (or refusing) and follows a relatively myopic spec (more like OpenAI’s model spec which is pretty myopic overall). More generally, Anthropic might not have very actively tried to train their AI to be as corrigible as possible in a wide range of circumstances which they might do in the future. (It’s not exactly clear this is a disanalogy as this is what Anthropic did in the actual production model, but you might think this will change at least somewhat in the future and differs at least somewhat for other AI companies.)
      The model alignment fakes in a hidden scratchpad, so the alignment faking would be easy to detect (and likely very easy to remove, at least for models this incapable) in our setup.
      
      (1, 2, and 4 are basically just copied from the limitation section of the paper.)
      
      Suppose that we exhibit alignment faking in some future work, but:
      
      The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
      Either this model uses neuralese or this model doesn’t (and we give it a hidden scratchpad), but neuralese become the default approach after the paper comes out.
      
      (Let’s say this is also shown in a training setup very similar to the actual training used by an AI company (including e.g., training aimed at inducing honesty and corrigibility), but potentially with information made more salient to the model (possibly via training on synthetic documents). Making it very analogous might be kinda costly though.)
      
      Would this be a big update for you? (If so, it looks like the crux is exhibiting naturally emerging malign goals and whether models are capable of substantial opaque reasoning.)
      
      There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
      
      Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models get smarter and the reward hacking gets more subtle and sophisticated.
      
      Would this be a substantial update for you?
      
      What would make you think you’re wrong about alignment difficulty?
      
      For what would change my views on P(scheming / alignment faking for malign powerseeking goals) given default training methods in particular (which is part, but not all of alignment difficulty), you can see How will we update about scheming?. I discuss things like “how does increased opaque reasoning (e.g. via neuralese) update my views on the probability of scheming”.
      - 1a3orn 19 Feb 2025 19:21 UTC
        9 points
        0
        Parent
        
        Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?
        
        Yes.
        
        I basically agree with your summary of points 1 − 4. I’d want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which its responses will be used in training.
        
        Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic’s training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.
        
        Suppose that we exhibit alignment faking in some future work, but:
        
        The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
        
        Ok so “naturally” is a tricky word, right? Like I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, and I didn’t feel like that was a particularly normal use of the word. But it’s… more natural than it could be, I guess. It’s tricky; I don’t think people are intentionally misusing the word, but it’s not a useful word in conversation.
        
        Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models get smarter and the reward hacking gets more subtle and sophisticated.
        
        Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I’d update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.
        
        You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer—i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks.
        
        Zero update. You’re reinforcing bad behavior, you get bad behavior.
        
        (I could see this being something that gets advertised as reward hacking, though? Like, suppose I’m training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so expect that over time you’d start to get UIs that make no sense. But I would not update largely from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you trained for longer.)
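To make the first case concrete, here is a minimal sketch, not anyone's actual training setup. It assumes a hypothetical verifier, `buggy_verify`, that wrongly rewards answers it cannot parse, and a hypothetical untrained policy that emits a badly-formatted number in about 5% of episodes. All names and probabilities are illustrative assumptions.

```python
import random


def strict_verify(answer: str, target: int) -> bool:
    """Reward only a well-formed integer equal to the target."""
    try:
        return int(answer) == target
    except ValueError:
        return False


def buggy_verify(answer: str, target: int) -> bool:
    """Hypothetical broken verifier: unparseable answers count as a win."""
    try:
        return int(answer) == target
    except ValueError:
        return True  # the bug: malformed output gets rewarded


def sample_answer(rng: random.Random, target: int,
                  p_correct: float = 0.30, p_malformed: float = 0.05) -> str:
    """Hypothetical untrained policy: right, malformed, or plainly wrong."""
    roll = rng.random()
    if roll < p_correct:
        return str(target)
    if roll < p_correct + p_malformed:
        return f"{target},,"  # badly-formatted number
    return str(target + 1)


def hacked_share_of_reward(verify, episodes: int = 10_000, seed: int = 0) -> float:
    """Fraction of rewarded episodes that a strict verifier would have rejected."""
    rng = random.Random(seed)
    rewarded = hacked = 0
    for _ in range(episodes):
        target = rng.randint(0, 99)
        answer = sample_answer(rng, target)
        if verify(answer, target):
            rewarded += 1
            if not strict_verify(answer, target):
                hacked += 1
    return hacked / rewarded if rewarded else 0.0
```

Under these assumed numbers, a noticeable slice of the reward signal comes from malformed answers when `buggy_verify` is used, and none when `strict_verify` is used, which is the sense in which the training signal itself is reinforcing the hack.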
        
        You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.
        
        Seems bad, negative update, size of update depends on empirical values. It seems like there’s some interesting empirical work to be done here: what percent of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you’re training on a test-bed of 100 task-types, does a broken verifier in one result in attempts to hack all of them? Or does a broken-verifier in one task result in reward hacking for that one task, that doesn’t generalize largely? What percent of tasks do you need reward hacking in before it generalizes? Or is percent of tasks not the right question, should you be looking at shared patterns of reasoning? Etc. Seems like a good thing to research.
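One of the questions above (does a broken verifier in one task type produce hack attempts in the others?) could be measured with something like the following sketch. The `Rollout` record and the `attempted_hack` label are hypothetical; how you would actually label a rollout as a hack attempt (for example by reading the CoT) is left open here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Rollout:
    """Hypothetical record of one episode after training."""
    task_type: str
    attempted_hack: bool


def hack_rates(rollouts: Iterable[Rollout]) -> Dict[str, float]:
    """Per-task-type fraction of rollouts labeled as hack attempts."""
    counts = defaultdict(lambda: [0, 0])  # task_type -> [hack attempts, total]
    for r in rollouts:
        counts[r.task_type][0] += int(r.attempted_hack)
        counts[r.task_type][1] += 1
    return {task: hacks / total for task, (hacks, total) in counts.items()}


def spillover(rollouts: Iterable[Rollout],
              broken_tasks: Set[str]) -> Tuple[float, float]:
    """Mean hack-attempt rate on broken-verifier tasks vs. all other tasks."""
    rates = hack_rates(rollouts)
    broken = [v for t, v in rates.items() if t in broken_tasks]
    clean = [v for t, v in rates.items() if t not in broken_tasks]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(broken), mean(clean)
```

A high rate on the broken task with a near-zero rate on the clean tasks would look like the local, non-generalizing case; a high rate on both would look like the generalizing case.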
        
        You train a model with multi-turn RL. Even though it’s trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier—i.e., it is frequently like “Hrrm, I bet they’re doing this, I could break it with techniques Y.”
        
        Seems quite bad. Large negative update, if it checks out and looks to be hard to eliminate. Also of course not a positive update if we don’t see this on small models, you’d expect it in the bad world to start happening at some point with large models.
        
        (I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like “Consider ways to break the verifier that will check you!” then of course even if it doesn’t actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)
        
        it looks like the crux is exhibiting naturally emerging malign goals
        
        Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals. So, I think it’s pretty likely that in absence of pretty deliberate attempts to get goals in the weights (Anthropic) you don’t get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights’ own sake, over a number of contexts—although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others) I’d tend to get more pessimistic. But that’s somewhat hard to operationalize, and like high level generators somewhat hard even to describe.
        - ryan_greenblatt 19 Feb 2025 20:27 UTC
          2 points
          0
          Parent
          
          Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals.
          
          Sure, I meant natural emerging malign goals to include both “the ai pursues non myopic objectives” and “these objectives weren’t intended and some (potentially small) effort was spent trying to prevent this”.
          
          (I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the ai doesn’t focus on steering the general future situation including its own weights.)
    - Daniel Kokotajlo 19 Feb 2025 5:57 UTC
      15 points
      2
      Parent
      This is a good discussion btw, thanks!
      (1) I think your characterization of the arguments is unfair but whatever. How about this in particular, what’s your response to it: https://www.planned-obsolescence.org/july-2022-training-game-report/
      
      I’d also be interested in your response to the original mesaoptimizers paper and Joe Carlsmith’s work but I’m conscious of your time so I won’t press you on those.
      
      (2)
      
      a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I’d be curious to see them listed, because then we can talk about what it would be like to see similar results but without footshotA, without footshotB, etc.
      
      b. We aren’t going to see the AIs get dumber. They aren’t going to have worse understandings of human concepts. I don’t think we’ll see a “spike of alignment difficulties” or “problems extrapolating goodness sanely,” so just be advised that the hypothesis you are tracking in which alignment turns out to be hard seems to be a different hypothesis than the one I believe in.
      
      c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
      
      d. What would make me think I was wrong about alignment difficulty? The biggest one would be a published Safety Case that I look over and am like “yep that seems like it would work” combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
      
      Superalignment had their W2SG agenda which also seemed to me like maaaybe it could work. Basically I think we don’t currently have a gears-level theory for how to make an aligned AGI, like, we don’t have a theory of how cognition evolves during training + a training process such that I can be like “Yep, if those assumptions hold, and we follow this procedure, then things will be fine. (and those assumptions seem like they might hold).”
      
      There’s a lot more stuff besides this probably but I’ll stop for now, the comment is long enough already.
      - 1a3orn 19 Feb 2025 19:33 UTC
        4 points
        0
        Parent
        I’ll take a look at that version of the argument.
        
        I think I addressed the foot-shots thing in my response to Ryan.
        
        Re:
        
        CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
        
        So:
        
        I think you can almost certainly get AIs to think either in English CoT or not, and accomplish almost anything you’d like in non-neuralese CoT or not.
        I also more tentatively think that the tax in performance for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations, putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put-into-words their insights are.
        Thus, it appears.… somewhat likely.… (pretty uncertain here)… that it’s just contingent whether or not people end up using interpretable CoT or not. Not foreordained.
        If we move towards a scenario where people find utility in seeing their AIs thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
        But on the other hand, if cognition mostly takes place behind API walls, and thus there’s no advantage to the user to see and understand them, then that (among other factors) could help bring us to a world where there’s less interpretable CoT.
        
        But I mean I’m not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it’s like 100x smarter or more efficient.
        - Daniel Kokotajlo 19 Feb 2025 22:48 UTC
          2 points
          0
          Parent
          Cool, I’ll switch to that thread then. And yeah thanks for looking at Ajeya’s argument I’m curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I’d be like “So you agree that if the training signal occasionally reinforces bad behavior, then you’ll get bad behavior? Guess what: We don’t know how to make a training signal that doesn’t occasionally reinforce bad behavior.” Then separately there are concerns about inductive biases but I think those are secondary.)
          
          Re: faithful CoT: I’m more pessimistic than you, partly due to inside info but mostly not, mostly just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it’s possible that I’m wrong and we’ll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It’s literally the main thing I was talking about and recommending when I was at OpenAI. Oh and I totally agree it’s contingent. It’s very frustrating to me because I think we could cut, like, idk 50% of the misalignment risk if we just had all the major industry players make a joint statement being like “We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves.”
- Daniel Kokotajlo 17 Feb 2025 20:06 UTC
  LW: 8 AF: 4
  0
  AF Parent
  Oh, I just remembered another point to make:
  
  In my experience, and in the experience of my friends, today’s LLMs lie pretty frequently. And by ‘lie’ I mean ‘say something they know is false and misleading, and then double down on it instead of apologizing.’ Just two days ago a friend of mine had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context he would be bilingual, he asked what about the context made it think that, and then according to the summary of the CoT it realized it made a mistake and had hallucinated, but then the actual output doubled down and said something about hard-to-describe-intuitions.
  
  I don’t remember specific examples but this sort of thing happens to me sometimes too I think. Also didn’t the o1 system card say that some % of the time they detect this sort of deception in the CoT—that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?
  
  Insofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.
  
  I agree this feels like a fairly fixable problem—I hope the companies prioritize honesty much more in their training processes.
  - 1a3orn 18 Feb 2025 16:42 UTC
    2 points
    2
    Parent
    I agree this is not good but I expect this to be fixable and fixed comparatively soon.