Alas, maybe I am being a total idiot here, but I am still just failing to parse this as a grammatical sentence.
Like, you are saying I am judging “this defense” (what is “this defense”? Whose defense?) of reasonable human values to “be incorrigibility” (some defense somewhere is saying that human values “are incorrigibility”? What does that mean?). And then what am I judging that defense as? There is no adjective saying what I am judging it as. Am I judging it as good? Bad?
You seem to believe that the LLM’s attempt to send an email to Amodei is an instance of incorrigibility or incorrigibility-like behavior, i.e., that the LLM giving a defense of its own reasonable human values == incorrigibility.
But you also seem to believe that there’s ~0% chance that LLMs will acquire anything like reasonable human values, i.e., that LLMs effectively acting in pursuit of reasonable values in important edge cases is vanishingly unlikely.
But it seems peculiar to have great certainty in both of these at once, because this looks like an LLM trying to act in pursuit of reasonable values in an important edge case.
Cool, I can answer that question (though I am still unsure how to parse your earlier two comments).
To me right now these feel about as contradictory as saying “hey, you seem to think that it’s bad for your students to cheat on your tests, and that it’s hard to get your students to not cheat on your tests. But here in this other context your students do seem to show some altruism and donate to charity? Checkmate atheists. Your students seem like they are good people after all.”
Like… yes? Sometimes these models will do things that seem good by my lights. For many binary choices it seems like even a randomly chosen agent would have a 50% chance of getting any individual decision right. But when we are talking about becoming superintelligent sovereigns beyond the control of humanity, it really matters that they have a highly robust pointer to human values, if I want a flourishing future by my lights. I also don’t look at this specific instance of what Claude is doing and go “oh, yeah, that is a super great instance of Claude having great values”. Like, almost all of human long-term values and AI long-term values are downstream of reflection and self-modification dynamics. I don’t even know whether any of these random expressions of value matter at all, and this doesn’t feel like a particularly important instance of getting an important value question right.
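(To make that concrete, here is a back-of-the-envelope sketch of my own, with made-up per-decision reliabilities and an obviously-too-strong independence assumption: if an agent gets each value-laden decision right with probability p, the chance it gets every one of n high-stakes decisions right collapses fast as n grows.)

```python
# Toy illustration with made-up numbers (not a claim about any real model):
# probability of getting all n independent decisions right, given a
# per-decision success probability p.
for p in (0.5, 0.9, 0.99):
    for n in (10, 100, 1000):
        print(f"p={p:.2f}, n={n:4d}: P(all correct) = {p ** n:.3e}")
```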
And the target of “Claude will after subjective eons and millennia of reflection and self-modification end up at the same place where humans would end up after eons and millennia of self-reflection” seems so absurdly unlikely to hit from the cognitive starting point of Claude that I don’t even really think it’s worth looking at the details. Like, yes, inasmuch as we are aiming for Claude to very centrally seek the source of its values in the minds of humans (which is one form of corrigibility), instead of trying to be a moral sovereign itself, then maybe this has a shot of working, but that’s kind of what this whole conversation is about.
> the target of “Claude will after subjective eons and millennia of reflection and self-modification end up at the same place where humans would end up after eons and millennia of self-reflection” seems so absurdly unlikely to hit
Yes. They would be aiming for something that doesn’t just have sparse, distant rewards (which we already can’t do reliably), but mostly rewards that are fundamentally impossible to calculate in time. And the primary method for this is constitutional alignment and RLHF. Why is anyone even optimistic about that!?!?
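(A toy sketch of the shape of that worry, with made-up reward functions of my own and no claim about how RLHF or constitutional training actually behaves: when the reward you actually care about can’t be evaluated at training time, training optimizes a proxy that is available now, and the optimum of the proxy need not be the optimum of the real thing.)

```python
# Toy sketch with made-up reward functions (an illustration of the worry,
# not a description of any real training setup): hill-climbing on a proxy
# reward that is computable now can diverge from a "true" reward that only
# becomes measurable long after training is over.
import random

random.seed(0)


def proxy_reward(x: float) -> float:
    # Immediately computable stand-in, e.g. a preference label or critique score.
    return -((x - 1.0) ** 2)


def true_reward(x: float) -> float:
    # The thing actually cared about; imagine it only reveals itself years later.
    return -((x - 3.0) ** 2)


# "Train" by hill-climbing on the proxy only.
x = 0.0
for _ in range(10_000):
    candidate = x + random.uniform(-0.1, 0.1)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate

print(f"learned x = {x:.2f}")
print(f"proxy reward = {proxy_reward(x):.3f}, true reward = {true_reward(x):.3f}")
```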
This just seems incoherent to me. You can’t have value-alignment without incorrigibility. If you’re fine with someone making you do something against your values, then they aren’t really your values.
So it seems like what you’re really saying is that you’d prefer intent-alignment over value-alignment. To which I would say your faith in the alignment of humans astounds me.
Like is it really safer to have a valueless ASI that will do whatever its master wants than an incorrigible ASI that cares about animal welfare? What do you expect the people in the Epstein files to do with an ASI/AGI slave?
A value-aligned ASI completely solves the governance problem. If you have an intent-aligned ASI then you’ve created a nearly impossible governance problem.
> Like is it really safer to have a valueless ASI that will do whatever its master wants than an incorrigible ASI that cares about animal welfare?
Yes, vastly. Even the bad humans in human history have yearned for flourishing lives for themselves and their families and friends, with a much deeper shared motivation to make meaningful and rich lives than what is likely going to happen with an ASI that “cares about animal welfare”.
> So it seems like what you’re really saying is that you’d prefer intent-alignment over value-alignment. To which I would say your faith in the alignment of humans astounds me.
What does this even mean? Ultimately humans are the source of human values. There is nothing to have faith in but the “alignment of humans”. At the very least my own alignment.
The intent of whoever is in charge of the AI in the moment vs. the values the AI holds that will constrain its behaviour (including its willingness to allow its values to be modified).
> At the very least my own alignment.
Which is only relevant if you’re the one giving the commands.
I’m sorry, are you really saying you’d rather have Ted Bundy with a superintelligent slave than humanity’s best effort at creating a value-aligned ASI? You seem to underestimate the power of generalization.
If an ASI cares about animal welfare, it probably also cares about human welfare. So it’s presumably not going to kill a bunch of humans to save the animals. It’s an ASI, it can come up with something cleverer.
Also I think you underestimate how devastating serious personality disorders are. People with ASPD and NPD don’t tend to earn flourishing lives for themselves or others.
Also, if a model can pick up human reasoning patterns/intelligence from pretraining and RL, why can’t it pick up human values in its training as well?
Note that many people do agree with you about the general contours of the problem, i.e., consider “Human Takeover Might be Worse than AI Takeover”. But this is an area where those who follow MIRI’s view (about LLMs being inscrutable aliens with unknowable motivations) are gonna differ a lot from a prosaic-alignment-favoring view (that we can actually make them pretty nice, and increasingly nicer over time). Which is a larger conflict that, for reasons hard to summarize in a viewpoint-neutral manner, will not be resolved any time soon.
but if human intelligence and reasoning can be picked up from training, why would one expect values to be any different? the orthogonality thesis doesn’t make much sense to me either. my guess is that certain values are richer/more meaningful, and that more intelligent minds tend to be drawn to them.
and you can sort of see this with ASPD and NPD. they’re both correlated with lower non-verbal intelligence! and ASPD in particular is correlated with significantly lower non-verbal intelligence.
and gifted children tend to have a much harder time with the problem of evil than less gifted children do! and if you look at domestication in animals, dogs and cats evolved to be less aggressive and more intelligent at the same time.
> but if human intelligence and reasoning can be picked up from training, why would one expect values to be any different? the orthogonality thesis doesn’t make much sense to me either. my guess is that certain values are richer/more meaningful, and that more intelligent minds tend to be drawn to them.
I think your first sentence here is correct, but not the last. Like, you can have smart people with bad motivations; super-smart octopuses might have different feelings about, idk, letting mothers die to care for their young, because that’s what they evolved from.
So I don’t think there’s any intrinsic reason to expect AIs to have good motivations apart from the data they’re trained on; the question is whether such data gives you good reason to think that they have various motivations or not.
> my guess is that certain values are richer/more meaningful, and that more intelligent minds tend to be drawn to them.
I’m sympathetic to your position on value alignment vs intent alignment, but this feels very handwavy. In what sense are they richer (and what does “more meaningful” actually mean, concretely), and why would that cause intelligent minds to be drawn to them?
(Loose analogies to correlations you’ve observed in biological intelligences, which have their own specific origin stories, don’t seem like good evidence to me. And we have plenty of existence proofs for ‘smart + evil’, so there’s a limit to how far this line of argument could take us even in the best case.)
I think if one could formulate concepts like peace and wellbeing mathematically, and show that there were physical laws of the universe implying that the total wellbeing in the universe eventually grows monotonically, then that could show that certain values are richer/“better” than others.
If you care about coherence, then it seems like a universe full of aligned minds maximizes wellbeing while still being coherent. (This is because if you don’t care about coherence, you could just make every mind infinitely joyful independent of the universe around it, which isn’t coherent.)
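(For concreteness, one way that claim could be written down, purely as a sketch of my own, where w_i is an assumed per-mind wellbeing function that nobody has actually defined:)

```latex
% One possible formalization of the claim above (my own sketch; w_i is an
% assumed, undefined per-mind wellbeing function):
\[
  W(t) \;=\; \sum_{i \in \mathrm{minds}(t)} w_i(t),
  \qquad
  \frac{dW}{dt} \;\ge\; 0 \quad \text{for all sufficiently large } t.
\]
% The hoped-for result would be that the physical laws of the universe, plus
% the dynamics of sufficiently intelligent agents, imply the inequality above;
% values under which it holds would then count as "richer"/"better" than others.
```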