Seth Herd comments on Terrified Comments on Corrigibility in Claude’s Constitution

Seth Herd 16 Mar 2026 18:34 UTC
7 points
−2
You aren’t mentioning the misalignment/misgeneralization/goodharting risks. If not for those, yes just having a good model would be preferable.

It appears to me that anyone who seriously thinks about those risks winds up thinking “yeah that could happen at at least a double-digit percentage” (up to 99%).

You might think that humans in charge would likely be worse, but you’ve got to actually make that argument.

I have no idea which is worse at this point despite thinking about this a fair amount.
- cousin_it 16 Mar 2026 21:27 UTC
  25 points
  22
  Parent
  Hmm, if you mean that the “morality” path is beset by technical problems while the “corrigibility” path simply puts humans in charge and is more problem-free, then I’m not sure that’s the case. To me it feels like both paths have technical problems, and in fact many of the same problems. So it makes some sense to compare them modulo technical problems, what will happen if either path works as stated. And the danger of the corrigibility path just feels overwhelming to me then.
  The only way I’d be happy with the corrigibility path is if the corrigibility button was somehow wielded by all of humanity, across countries and classes and all that. It would be my favorite outcome if it were up to me. But none of the big labs seem interested in that. They’re more like “Anthropic has much more in common with the Department of War than we have differences” (recent quote from Dario Amodei). When you read such things, the question of “corrigibility by whom” really begins to loom large.
  - Seth Herd 18 Mar 2026 18:29 UTC
    6 points
    3
    Parent
    If I thought the two were equally easy/likely to work, I would be with you. Value alignment is far better if you can get it.
    I’m not sure that corrigibility or instruction-following really is easier than value alignment, but it does seem pretty likely it’s at least somewhat easier. Figuring out exactly what you want for the entire rest of time does seem both harder to figure out and convey than essentially “do what this guy says.”
    
    To me the dangers of the corrigibility path seem slightly less extreme than the dangers of the value alignment path. Humans are frequently generous when that generosity costs them little. To a human in charge of a dominant ASI, everything is easy.
    I don’t know of any real attempts to compare the likely ease of the two. My latest is at Problems with instruction-following as an alignment target but it’s far from comprehensive; it focuses on problems with IF/corrigibility, but I think the problems with value alignment are even more severe.
    
    I’d be happy with anything that keeps humanity alive in decent conditions, ideally including me and mine. How to get that is highly unclear. So we should keep clarifying it.
    - cousin_it 18 Mar 2026 18:58 UTC
      46 points
      19
      Parent
      Humans are frequently generous when that generosity costs them little.
      Oh my god oh my god oh my god oh my god.
      People are so persistently wrong about this. I’m maybe more tired of responding to this argument than any other argument in the world. For example, here in a sibling reply to Zack:
      I’ve seen the argument so many times now that the powerful will have some nonzero sense of charity and can spare like 1% of their wealth to “give everyone a moon” as Scott puts it. I don’t know if you subscribe to this argument too, but in any case it’s wrong. Charity isn’t the only nonzero urge that powerful people have. The urge to lord it over others will also be there. If huge power disparity exists, it will manifest itself in bad things more than it’ll manifest itself in charity. Sure, some powerless people will end up in nice charity zones, but many others will end up in other zones run by someone less nice.
      Or in a past thread:
      In my view, the problem is not that some users are evil. The problem is that AI increases power imbalance, and increasing power imbalance creates evil. “Power corrupts”. A future where some entities (AIs or AI-empowered governments or corporations or rich individuals etc) have absolute, root-level power over many people is almost guaranteed to be a dark future.
      Or in another past thread:
      Being forced to play out a war? Getting people’s minds modified so they behave like house elves from HP? Selective breeding? Selling some of your poor people to another rich person who’d like to have them? It’s not even like I’m envisioning something specific that’s dark, I just know that a world where some human beings have absolute root-level power over many others is gonna be dark. Let’s please not build such a world.
      Or in another past thread:
      For example, if large numbers of people end up in inescapable servitude. I think such outcomes are actually typical in case of many near-misses at alignment, including the particular near-miss that’s becoming more probable day by day: if the powerful successfully align the AI to themselves, and it enables them to lord it over the powerless forever. To believe that the powerful will be nice to the powerless of their own accord, given our knowledge of history, is very rose-colored thinking.
      Or in another past thread:
      altruistic urges aren’t the only “nonzero urges” that people have. People also have an urge to power, an urge to lord it over others. And for a lot of people it’s much stronger than the altruistic urge. So a world where most people are at the whim of “nonzero urges” of a handful of superpowerful people will be a world of power abuse, with maybe a little altruism here and there. And if you think people will have exit rights from the whims of the powerful, unfortunately history shows that it won’t necessarily be so.
      Or in another past thread:
      The new balance of power will be more similar to what we had before firearms, when the powerful were free to treat most people really badly. And even worse, because this time around they won’t even need our labor.
      Or in another past thread:
      If there’s a small class of people with immense power over billions of have-nothings that can do nothing back, sure, some of the superpowerful will be more than zero altruistic. But others won’t be, and overall I expect callousness and abuse of power to much outweigh altruism. Most people are pretty corruptible by power, especially when it’s power over a distinct outgroup, and pretty indifferent to abuses of power happening to the outgroup; all history shows that. Bigger differences in power will make it worse if anything.
      Why do people think they’ll be given a moon? Why???
      - abstractapplic 21 Mar 2026 23:19 UTC
        9 points
        4
        Parent
        Why do people think they’ll be given a moon? Why???
        Because they’d give everyone a moon, and they typical-mind.
        (Plus probably some other reasons)
      - Seth Herd 18 Mar 2026 20:55 UTC
        5 points
        −1
        Parent
        I’ve been reading your other exchanges.
        Your level of frustration is not helpful nor I think justified. These are complex important issues and we need to work together to solve them, not yell at each other.
        It’s not 1% of their wealth, it’s .0001%. And I don’t need a moon.
        
        What is your better plan? I don’t like this one either!
        cousin_it 18 Mar 2026 22:52 UTC
        10 points
        4
        Parent
        My point is: will they also have a .0002% wish to be your lord or something?
        As for the better plan, yeah that’s a lot to ask. Most of my thoughts these days lean toward “democratic AI”, something whose power is either spread out among all the world’s people across borders etc, sidestepping governments and existing power structures, or else something centralized that wants its power to be spread out like this.
        Of course an approach like this won’t solve all the world’s problems. We’ll still have power struggles between people, and also “crash space” type problems where people modify themselves into something bad; maybe these need some patches by fiat as well. But at least it won’t create the extra problem of huge power concentration, which I really feel is underestimated.
        Seth Herd 18 Mar 2026 23:10 UTC
        3 points
        −8
        Parent
        It sounds like your plan is pretty much the standard value-aligned AGI that’s aligned to something like human values in general, so that everyone gets what they want on average? Or something in that ballpark?
        
        One big question is how you achieve that technically. That’s where I think it’s harder than the instruction-following variant of corrigibility. I hope it’s not. The second is how you achieve it practically. What person or organization is going to deliberately hand the future to a value-aligned AGI?
        
        One answer is: Anthropic seems like they might be considering doing just that. Maybe it works, or at least sort-of works, where it’s not an ideal future but at least we survive in some form for a while.
        WRT the default plan of IF/corrigible alignment:
        
        Yes, anyone in charge with a negative sadism-empathy balance will lead to a fate worse than death. And someone around zero could produce a fate barely worth living.
        But I think most humans have more empathy than sadism. More people give a little to charity than spit on the homeless for fun. I can call Sunday Samday for the rest of eternity if all we need is some ego-stroking in return for tiny amounts of generosity.
        
        The point of my plan is it’s mostly what people will do anyway, so we can focus on helping them not totally fuck up alignment and get us all killed.
        
        A better plan is a lot to ask. But that’s what I’m trying to come up with, because I want us to live and there’s still time to work.
        Viliam 2 Apr 2026 15:30 UTC
        3 points
        2
        Parent
        But I think most humans have more empathy than sadism.
        People who end up in positions of power are not necessarily like most humans.
        More people give a little to charity than spit on the homeless for fun.
        In your WEIRD bubble, sure. In other times and places, people used to burn cats for fun. And empathy used to be limited to one’s peers.
        andrew sauer 2 Apr 2026 15:56 UTC
        6 points
        0
        Parent
        People still do things in the same ethical ballpark as cat-burning, except on an incomprehensibly large industrial scale and for the sake of marginal food preferences.
        We look down on peasants for burning cats today, but the tragic irony is that their society was far better overall on animal welfare than ours in the modern day, though for practical reasons rather than moral ones.
        CronoDAS 21 Mar 2026 21:24 UTC
        −1 points
        4
        Parent
        
        But I think most humans have more empathy than sadism. More people give a little to charity than spit on the homeless for fun. I can call Sunday Samday for the rest of eternity if all we need is some ego-stroking in return for tiny amounts of generosity.
        
        Would you be okay with a future in which young women, including your daughters and granddaughters, would each be expected to ritually offer a gift of her virginity to the local Robot Lord on her 18th birthday, which he would almost never choose to “accept”? 😈
        andrew sauer 2 Apr 2026 16:02 UTC
        4 points
        0
        Parent
        Damn straight. People need to understand the implications of this shit. “Oh let’s hope the separate caste which controls the entire universe and which we can’t hope to contest in any possible way is nice to us!!!”
        Open. A. History book.
        Your scenario is relatively low on the awfulness scale, even.
      - CronoDAS 21 Mar 2026 20:59 UTC
        3 points
        0
        Parent
        Usually slaves and/or other people in an underclass at least had their own living quarters separate from where the lords lived? The idea is that when humanity becomes astronomically rich, the equivalent to “the shed in the backyard the slave sleeps in” ends up being a whole moon rather than, well, a shed in the backyard.
        
        (It’s also noteworthy that most slave societies in the past weren’t rich enough that the slave population lived at or above subsistence level and reproduced enough to maintain its population level; for example, relatively few slaves in the ancient Roman Empire were born into slavery. The slave states in the pre-Civil War USA were an exception—there was much that a plantation slave had to suffer, but a significant risk of death by starvation or exposure was not something they usually had to deal with.)
        
        Disclaimer: This is an explanation, not an endorsement of the underlying prediction.
      - MichaelDickens 20 Mar 2026 3:51 UTC
        2 points
        0
        Parent
        This is a great argument. I have no clue whether it’s correct, but it made me think. I would like to see some harder evidence on the question but I’m not sure what kinds of evidence would be useful.
- 2001zhaozhao 17 Mar 2026 0:03 UTC
  4 points
  0
  Parent
  At that point it changes to an argument about:
  - How likely is it that an AI that takes over the world will keep humans around and give them good, morally desirable lives
  - How likely is it that a human elite (however large or small it is) that takes over the world would do the same to humans outside of that elite
  - How much the fact that the elites themselves are human and have their preferences satisfied changes the equation in favor of the second case
  and of course, the likelihood for each to happen if we focus on corrigibility vs morality