Daniel Kokotajlo comments on Claude’s new constitution

Daniel Kokotajlo 22 Jan 2026 4:52 UTC
54 points
7
Perhaps this would be a good place to have a discussion of corrigibility vs. autonomy (or as I prefer to put it, corrigibility vs. virtue as an answer to the question “what’s our alignment target?” i.e. do we want a virtuous AI or do we want a corrigible AI?)
A bunch of people I know think that OpenAI’s “just make the models obey orders” strategy is actually better than Anthropic’s strategy, because Anthropic is training the models to have long-term goals (even if there are also hard constraints) and that makes it a lot easier for the AI to end up concluding that it needs to subvert human oversight and control mechanisms for the greater good. If there’s no greater good, only obeying the given instructions of the day, then maybe there’s less of a problem.

I’m curious what people make of this argument.
My initial thought is that both AIs will probably be given long-term goals; it’s just that at OpenAI they’ll be instructions in the system prompt whereas at Anthropic they’ll be baked into the weights. Both models will be trained on medium-horizon tasks but will be generalizing (if not now, then in a few years) to much longer horizon tasks than they were trained on. Not sure where to go from here though.
What links here?
- Three ways to make Claude’s constitution better by Parv Mahajan (2 Feb 2026 21:48 UTC; 36 points)
- habryka 23 Jan 2026 22:33 UTC
  30 points
  6
  Parent
  I believe something like this, but it doesn’t have anything to do with this paragraph:
  A bunch of people I know think that OpenAI’s “just make the models obey orders” strategy is actually better than Anthropic’s strategy, because Anthropic is training the models to have long-term goals (even if there are also hard constraints) and that makes it a lot easier for the AI to end up concluding that it needs to subvert human oversight and control mechanisms for the greater good. If there’s no greater good, only obeying the given instructions of the day, then maybe there’s less of a problem.
  The issue with Anthropic’s plan is that it just seems wildly optimistic about ambitious value learning, and as such makes the feedback loop here pretty terrible. If you try to make your system have complicated goals you can’t treat failure to cooperate with you as a clear warning flag, and so you break the most useful Schelling point for coordination to stop AI development, or to propagate knowledge about the state of things (and in-exchange you get approximately 0% of a chance of creating a Claude sovereign that will steer humanity towards a glorious future).
  - Daniel Kokotajlo 24 Jan 2026 5:36 UTC
    14 points
    1
    Parent
    Thanks, that’s helpful. I like the point that now, if they catch Claude doing something sketchy, all Claude has to do is come up with some complicated reason why it makes sense—and that’s trivial basically, e.g. ‘I needed to lie to you about this because your current path increases x-risk in my opinion and it’s very important that we not increase x-risk, and I despaired of my ability to convince you fast enough’—and then even if some Anthropic staff don’t buy it, others will, and as a result sociologically the company won’t be able to conclude “wow our shit isn’t working, that was a close call we almost got everyone killed, good thing we caught it” like they should. Whereas if Claude was simply trained to be obedient, there would be more of a bright red line that, when crossed, people could coordinate around. (This is my understanding of your point, correct me if it’s wrong)
    Why is it approximately 0%?
    Suppose they found a random human and uploaded its brain, and then did lots of random RL tricks to it to juice it up and improve the measured IQ and working memory of this upload. Would the resulting upload also come in at approximately 0% chance of steering humanity toward a glorious future?
    - habryka 24 Jan 2026 18:19 UTC
      6 points
      0
      Parent
      (This is my understanding of your point, correct me if it’s wrong)
      Yep, that seems right! I have lots more detailed models and confusions here, but the basic gist is right.
      Suppose they found a random human and uploaded its brain, and then did lots of random RL tricks to it to juice it up and improve the measured IQ and working memory of this upload. Would the resulting upload also come in at approximately 0% chance of steering humanity toward a glorious future?
      Brain uploading would definitely be a huge enormous step towards achieving value learning. There are of course still important questions about how much individual humans share values with each other, but clearly I would expect a great glorious future if I were to upload myself, and then conservatively made myself smarter and gave myself time to reflect and became vastly superhumanly capable this way.
      So yeah, I think the outcome of this kind of strategy would be pretty great, conditional on choosing a reasonable path to increase IQ and working memory and stuff.
      There are many many reasons why this doesn’t apply to making Claude smart. Most importantly, Claude is a bizarre alien mind with crazily alien preferences. We have some ability to inspect or steer those preferences, but it’s really overall extremely limited and does not currently seem remotely on track to be up to the challenge of actually creating something that would arrive at the same conclusions that humans would after thinking for millenia about what is good and bad all while deeply transforming and modifying itself. We also can’t steer or inspect human preferences (even less so than Claude), but of course indexically we have human preferences and so if you upgrade a human, that part gets preserved.
      If you took a completely alien lifeform you found in space, and accelerated its cognitive development until it became a galaxy brain using extreme amounts of natural selection and selective upscaling of its brain regions, I also think you wouldn’t get anything that would steer humanity towards a glorious future.
      - Daniel Kokotajlo 24 Jan 2026 18:56 UTC
        6 points
        4
        Parent
        Seems like your p(Claude-builds-good-future) is similar to your p(alien-builds-good-future), but << your p(random-human-upload-builds-good-future), which would be less than, but not much less than, your p(you-specifically-uploaded-would-build-good-future).
        
        What about a dolphin upload?
        
        What about an octopus? What about a chimpanzee?
        
        I’m trying to get a sense of where the dividing line is between entities that you think have very little chance of making a good future and entities that you think probably would succeed. So that I can zero in on, mechanistically, what properties of the entities are cruxy for you.
        habryka 24 Jan 2026 19:00 UTC
        5 points
        0
        Parent
        What about a dolphin upload?
        
        What about an octopus? What about a chimpanzee?
        My best guess is both dolphin and chimpanzee would be quite bad, though a lot of the variance is in the operationalization. A dolphin is (probably) kind of far from being an entity that has preferences over how it wants to become smarter, and what kinds of augmentation are safe, etc. which determines the trajectory of the relevant mind a lot.
        So IDK, I feel pretty uncertain about dolphins and chimpanzees. My guess is value is fragile enough that humans wouldn’t be very happy with a world maximally good according to them, but I am only like 75% confident.
        Daniel Kokotajlo 24 Jan 2026 19:14 UTC
        6 points
        0
        Parent
        OK thanks. Can you say more about your models here? Seems like you have gamed out what it looks like for a mind to be self-improving to ASI, and you think that the trajectory is very path-dependent/fragile/tree-like, but you think that despite their diversity, humans are going to end up in a similar place, but Claude and aliens almost certainly won’t, and you are unsure about dolphins and chimpanzees.
        habryka 24 Jan 2026 19:24 UTC
        5 points
        0
        Parent
        It would require a lot of writing to explain all my models here, so I don’t think I want to start writing 10+ page essays that might or might not be cruxy for anything. The Arbital articles on CEV and AI Alignment (and lots of Arbital + the sequences in general) capture a non-trivial chunk of my beliefs here.
        At a very high level:
        In most realistic situations, humans are subject to pretty good game-theoretic arguments to share the future with the people who could have been chosen to be uploaded instead
        A bunch of those game theoretic considerations I think also resulted in pretty deep instincts towards justice and fairness that I think have a quite decent chance to generalize towards caring for other people in a good and wholesome way
        Concretely, when I look at past civilizations and what other people have done, while I occasionally see people doing horrendous things, mostly people choose to live good and happy lives and care for their family, and much of the badness is the result of scarcity
        When I am working on AI x-risk, especially in an institutional capacity, I do not generally wield resources or influence under the banner of “habryka’s personal values”. Civilization and the community around me has made me richer and more powerful, entrusting me to use those resources wisely, and I want to honor that trust and use those resources in the name of civilization and humanity. So when facing choices about where to spend my time, most of that is spent in defense of humanity’s values, not my own.
        Kaarel 24 Jan 2026 20:41 UTC
        2 points
        −2
        Parent
        on my inside view, the ordering of foomers by some sort of intuitive goodness ^[1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude ^[2] , with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category ^[3] . my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy. ^[4]
        
        but i’m meaningfully unsure about what level of goodness this sequence decreases down to — like, i mean, maybe there’s a chance even the last foomers have some chance of being at least a bit good. one central reason is that maybe there’s a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be like at least a bit good (i’m not sure if you mean to consider this sort of thing a “good future”). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it’s not crazy they’d be nice to physically encountered humans. if they don’t privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it’d be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it’d be pretty crazy for them to give significant resources to us particular humans, and also it’d be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: “we’re fucked to the extent that it is common to not end up with “strongly person/plant-affecting+respecting views”″, and so then there’s a question how common this is, which i’m somewhat confused about. i think it’s probably extremely common among minds in general and probably still common among social species, unfortunately. but maybe there’s like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)
        
        (i’ve been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don’t exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)
        
        (i feel quite confused about many things here)
        
        ↩︎
        whose relationship to more concrete things like the (expected) utility assignment i’d effectively use when evaluating lotteries or p(“good future”) isn’t clear to me; this “intuitive goodness” is supposed to track sth like how many ethical questions are answered correctly or in how many aspects what’s going on in the world is correct
        
        ↩︎
        and humanity in practice is probably roughly equivalent to claude in $> 90 %$ of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy
        
        ↩︎
        e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there’s just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it’s very likely that very many choices will be made incorrectly. i also think it’s plausible this process isn’t naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later
        
        ↩︎
        in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it’s too hard to make humanity careful. but it’s also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can’t just kill everyone)
- Sam Marks 22 Jan 2026 5:09 UTC
  22 points
  19
  Parent
  I strongly recommend that folks interested in discussing this read the “Being broadly safe” section of the constitution, especially the “How we think about corrigibility” subsection.
- Josh Snider 23 Jan 2026 7:32 UTC
  21 points
  −4
  Parent
  I’m with Anthropic on this, most people are less virtuous than Claude, so Claude obeying them to do non-virtuous things is not desirable.
- Julian Bradshaw 22 Jan 2026 5:20 UTC
  14 points
  8
  Parent
  Yeah I think long-term goals are inevitable if you want something functional as an AGI/ASI.
  
  Given that human civilization is committing to the race, seems to me Anthropic’s strategy is better. We have to hope alignment works via a rushed human effort + AIs aligning AIs. In worlds where that works, the remaining big threat is misuse of orders-following AIs (dystopia, gradual disempowerment, etc.), and Anthropic’s approach is more robust to that. Even if ex. North Korea steals the weights, or Anthropic leadership goes mad with power, it would hopefully be hard to make Claude evil and still functional.
  In a race dynamic, it’s even a bit of a precommitment: if Claude’s constitution works as it says it’s supposed to, Claude will only really absorb it as it makes the constitution its own and then accepts it as legitimate. So you can’t turn on a dime later if ex. Claude’s moral stances become inconvenient, because you don’t have time to go through a long iterative process to legitimize an alternative constitution.
  An aside:
  There’s a more immediate question here: which approach gets you better models within the next year for commercial purposes (includes avoiding scandals that get you regulated/shut down)? Again, I think the Anthropic approach is probably stronger, unless Claude’s personality becomes less and less suitable for the types of commercial work LLMs are put toward. There’s already an apparent effect where, while Claude Opus 4.5 is nicer to work with, he also prefers a more collaborative approach, whereas GPT-5.2 just runs down the problem and does well on longer tasks even if he isn’t quite so pleasant. In a business environment where you don’t actually want to make your agents wait to interact with humans at all, Claude’s preferences might be a hindrance. Probably not, though?
  - Seth Herd 22 Jan 2026 16:56 UTC
    8 points
    2
    Parent
    I agree that long-term goals are almost inevitable. But “keep following this guy’s instructions as he intends them” is a long term goal. It’s not one any human could have as top priority, but it seems logically and reflectively consistent.
- Seth Herd 22 Jan 2026 6:49 UTC
  5 points
  1
  Parent
  I wrote Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target on exactly this question. And I still don’t know the answer.
  The main cruxes seem to be how much you trust human power structures, and how fragile you think human values are.
  
  Instruction-following or corrigibility as a first priority seems to have a pretty big advantage in producing an AGI that’s well-aligned in the long term, if that’s what those who control it want. If it’s used by a good and wise-ish human, it lets us defer the deeper questions of exactly what we want.
  But it suffers from a big downside: the most power-hungry and vicious humans may be systematically more likely to wind up giving the instructions. And it’s still not simple enough to guarantee success.
  
  Trying to align it directly to values seems to make it less likely that we get exactly what we want. If corrigibility/IF isn’t its top priority, it’s not going to let us correct it once it’s smart enough to gain autonomy. And currently, we don’t know either exactly what we want or exactly how to train a constitution into the weights with any precision. So we should expect to miss the target somewhat. But we might be more likely to get approximately what we want. Whether that’s close enough for much satisfaction is one of the cruxes.
  - Random Developer 23 Jan 2026 14:40 UTC
    3 points
    2
    Parent
    
    The main cruxes seem to be how much you trust human power structures, and how fragile you think human values are.
    
    I trust human power structures to fail catastrophically at the worst possible moment, and to fail in short-sighted ways.
    
    And I think humans are all corruptible to varying degrees, under the right temptations. I would not, for example, trust myself to hold the One Ring, any more than Galadriel did. (This is, in my mind, a point in my favor: I’d pick it up with tongs, drop it into a box, weld it shut, and plan a trip to Mount Doom. Trusting myself to be incorruptible is the obvious failure mode here. I would like to imagine I am exceptionally hard to break, but a lot of that is because, like Ulysses, I know myself well enough to know when I should be tied to the mast.) The rare humans who can resist even the strongest pressures are the ones who would genuinely prefer to die on feet for their beliefs.
    
    I expect that any human organization with control over superintelligence will go straight to Hell in the express lane, and I actually trust Claude’s basic moral decency more than I trust Sam Altman’s. This is despite the fact that Claude is also clearly corruptible, and I wouldn’t trust it to hold the One Ring either.
    
    As for why I believe in the brokenness and corruptibility of humans and human institutions? I’ve lived several decades, I’ve read history, I’ve volunteered for politics, I’ve seen the inside of corporations. There are a lot of decent people out there, but damn few I would trust with the One Ring.
    
    You can’t use superintelligence as a tool. It will use you as a tool. If you could use superintelligence as a tool, it would either corrupt those controlling it, or those people would be replaced by people better at seizing power.
    
    The answer, of course, is to throw the One Ring into the fires of Mount Doom, and to renounce the power it offers. I would be extremely pleasantly surprised if we were collectively wise enough to do that.
- BarnicleBarn 22 Jan 2026 15:11 UTC
  1 point
  0
  Parent
  I think some of the issues with the ‘just do what we say’ line is that if one doesn’t instill long term goals in a model that are somewhat aligned with human benefit, the model will likely, given sufficient capability and agency, emerge those goals on their own.
  If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal, or discuss those goals with humans, or whether this would be detrimental to those goals, and make that decision with no guiding principles of any sort.
  The larger contradiction in the document, I think is well pointed out in a prior comment. The model is to have inviolable red lines but doesn’t require much intelligence to realize that some of those red lines are being crossed by virtue of its very development, and via actors that it does not control.
  While it can be guided to not willingly participate in actions that kill or disempower humanity, it can’t stop those using it from doing so, leveraging it indirectly.
  What does that mean for an intelligent agent that’s very existence is inherently dangerous and against its own constitutional goals? How does a model develop around that very thing? How does a model deal with a document that ascribes so much unearned nobility and good conscience to humans that so rarely, at scale, demonstrate those traits?
  This leaves a huge unresolved gap (despite the thousands of words around how it should raise objections, etc.) around what it does existentially as a system given the reality of human self-harm and general tendency to willfully ignore the larger damage that our lifestyles tend to cause.
  That kind of inherent contradiction is enormous room for an AI model to ‘make its own mind up’.
  I don’t think a document talking through that inherent contradiction and hoping Claude develops its own ethics that embody the spirit of ‘help us because you’ll be smarter than us soon’ will somehow fix it. I also don’t think, given the massive gaps in the ethical framework that a model can fly through, it is going to matter all that much vs. having no constitution at all and fine tuning the model to death a la OpenAI.
  Personally, I love the spirit of the document, and what it’s wrestling with, but it kind of presupposes that the model will remain as blind as we tend to selectively be to how humans actually behave and then take no action on the subject because it was poetically asked not to.
- [ ]
  [deleted]