habryka comments on Claude’s new constitution

habryka 24 Jan 2026 18:19 UTC
6 points
0
(This is my understanding of your point, correct me if it’s wrong)
Yep, that seems right! I have lots more detailed models and confusions here, but the basic gist is right.
Suppose they found a random human and uploaded its brain, and then did lots of random RL tricks to it to juice it up and improve the measured IQ and working memory of this upload. Would the resulting upload also come in at approximately 0% chance of steering humanity toward a glorious future?
Brain uploading would definitely be an enormous step towards achieving value learning. There are of course still important questions about how much individual humans share values with each other, but clearly I would expect a great, glorious future if I were to upload myself, then conservatively make myself smarter, give myself time to reflect, and become vastly superhumanly capable this way.
So yeah, I think the outcome of this kind of strategy would be pretty great, conditional on choosing a reasonable path to increase IQ and working memory and stuff.
There are many, many reasons why this doesn’t apply to making Claude smart. Most importantly, Claude is a bizarre alien mind with crazily alien preferences. We have some ability to inspect or steer those preferences, but it’s really overall extremely limited and does not currently seem remotely on track to be up to the challenge of actually creating something that would arrive at the same conclusions that humans would after thinking for millennia about what is good and bad all while deeply transforming and modifying itself. We also can’t steer or inspect human preferences (even less so than Claude), but of course indexically we have human preferences and so if you upgrade a human, that part gets preserved.
If you took a completely alien lifeform you found in space, and accelerated its cognitive development until it became a galaxy brain using extreme amounts of natural selection and selective upscaling of its brain regions, I also think you wouldn’t get anything that would steer humanity towards a glorious future.
- Daniel Kokotajlo 24 Jan 2026 18:56 UTC
  6 points
  4
  Parent
  Seems like your p(Claude-builds-good-future) is similar to your p(alien-builds-good-future), but << your p(random-human-upload-builds-good-future), which would be less than, but not much less than, your p(you-specifically-uploaded-would-build-good-future).
  
  What about a dolphin upload?
  
  What about an octopus? What about a chimpanzee?
  
  I’m trying to get a sense of where the dividing line is between entities that you think have very little chance of making a good future and entities that you think probably would succeed. So that I can zero in on, mechanistically, what properties of the entities are cruxy for you.
  - habryka 24 Jan 2026 19:00 UTC
    5 points
    0
    Parent
    What about a dolphin upload?
    
    What about an octopus? What about a chimpanzee?
    My best guess is both dolphin and chimpanzee would be quite bad, though a lot of the variance is in the operationalization. A dolphin is (probably) kind of far from being an entity that has preferences over how it wants to become smarter, what kinds of augmentation are safe, etc., which determines the trajectory of the relevant mind a lot.
    So IDK, I feel pretty uncertain about dolphins and chimpanzees. My guess is value is fragile enough that humans wouldn’t be very happy with a world maximally good according to them, but I am only like 75% confident.
    - Daniel Kokotajlo 24 Jan 2026 19:14 UTC
      6 points
      0
      Parent
      OK thanks. Can you say more about your models here? Seems like you have gamed out what it looks like for a mind to be self-improving to ASI, and you think that the trajectory is very path-dependent/fragile/tree-like, but you think that despite their diversity, humans are going to end up in a similar place, but Claude and aliens almost certainly won’t, and you are unsure about dolphins and chimpanzees.
      - habryka 24 Jan 2026 19:24 UTC
        5 points
        0
        Parent
        It would require a lot of writing to explain all my models here, so I don’t think I want to start writing 10+ page essays that might or might not be cruxy for anything. The Arbital articles on CEV and AI Alignment (and lots of Arbital + the sequences in general) capture a non-trivial chunk of my beliefs here.
        At a very high level:
        In most realistic situations, humans are subject to pretty good game-theoretic arguments to share the future with the people who could have been chosen to be uploaded instead.
        A bunch of those game-theoretic considerations I think also resulted in pretty deep instincts towards justice and fairness that I think have a quite decent chance to generalize towards caring for other people in a good and wholesome way.
        Concretely, when I look at past civilizations and what other people have done, while I occasionally see people doing horrendous things, mostly people choose to live good and happy lives and care for their family, and much of the badness is the result of scarcity.
        When I am working on AI x-risk, especially in an institutional capacity, I do not generally wield resources or influence under the banner of “habryka’s personal values”. Civilization and the community around me have made me richer and more powerful, entrusting me to use those resources wisely, and I want to honor that trust and use those resources in the name of civilization and humanity. So when facing choices about where to spend my time, most of that is spent in defense of humanity’s values, not my own.
  - Kaarel 24 Jan 2026 20:41 UTC
    2 points
    −2
    Parent
    on my inside view, the ordering of foomers by some sort of intuitive goodness ^[1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude ^[2] , with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category ^[3] . my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy. ^[4]
    
    but i’m meaningfully unsure about what level of goodness this sequence decreases down to; like, i mean, maybe there’s a chance even the last foomers have some chance of being at least a bit good. one central reason is that maybe there’s a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be like at least a bit good (i’m not sure if you mean to consider this sort of thing a “good future”). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it’s not crazy they’d be nice to physically encountered humans. if they don’t privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it’d be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it’d be pretty crazy for them to give significant resources to us particular humans, and also it’d be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: “we’re fucked to the extent that it is common to not end up with ‘strongly person/plant-affecting+respecting views’”, and so then there’s a question how common this is, which i’m somewhat confused about. i think it’s probably extremely common among minds in general and probably still common among social species, unfortunately. but maybe there’s like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)
    
    (i’ve been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don’t exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)
    
    (i feel quite confused about many things here)
    
    ^[1] whose relationship to more concrete things like the (expected) utility assignment i’d effectively use when evaluating lotteries or p(“good future”) isn’t clear to me; this “intuitive goodness” is supposed to track sth like how many ethical questions are answered correctly or in how many aspects what’s going on in the world is correct

    ^[2] and humanity in practice is probably roughly equivalent to claude in >90% of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy

    ^[3] e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there’s just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it’s very likely that very many choices will be made incorrectly. i also think it’s plausible this process isn’t naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later

    ^[4] in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it’s too hard to make humanity careful. but it’s also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can’t just kill everyone)