Wei Dai comments on Terrified Comments on Corrigibility in Claude’s Constitution

Wei Dai 17 Mar 2026 23:44 UTC
8 points
1
- First-best: some miracle happens and humans get a relatively equitable distribution of power. (Strawman scenario: everyone gets some amount of AI compute installed in their collarbone and can’t get more, and non-human AI capital winks out of reality.)
- Second-best: moral AIs win a decent chunk of the future.
- Third-best: corrigibility wins. AIs merge with human corporations / governments / rich individuals, and the resulting powerful entities own the future, screwing most people over.
I think you’re putting too much emphasis on “power corrupts” or power disparities as the main or only “human safety problem”, whereas I see a larger number of interlocking problems including lack of strategic competence, and having scary moral dynamics both in the face of AI and in “normal” circumstances.

This makes me more uncertain/ambivalent between these 3 options. I can see how each of them might turn out to be better than the others. Most relevant to the thread here, I think 3 might be a bit more rational as whole and a bit less prone to collectively going crazy under AI-powered manipulation / memetic evolution.
- cousin_it 18 Mar 2026 13:08 UTC
  5 points
  2
  Parent
  It seems to me that the difference between 2 and 3 is whether the future will be controlled by powerful AIs programmed to be moral, or by powerful AI+human entities where the AI is programmed to be corrigible. The risk of technical errors (AI fails to be actually moral, or fails to be non-manipulatively corrigible) seems to me about equal between the two scenarios. And the risk of goal drift seems worse in the latter scenario, because powerful humans are vulnerable to drift and in particular “power corrupts”, while a moral AI would try to protect humans against such things. That’s why I think 2 is better than 3.
  - Wei Dai 18 Mar 2026 20:57 UTC
    2 points
    0
    Parent
    My grandparent comment was mostly addressing 1 vs 3, but I think 2 vs 3 is also very unclear. In order to make a moral AI, we also need to make it philosophically competent, and that seems like a hard problem, whereas with 3 we have fairly strong evidence that at least some humans or groups of humans can make philosophical progress over time, and some hope that this capability would be preserved by a corrigible AI + humans setup.
    
    (I guess this is all assuming that there isn’t a long pause that allows AI philosophical competence or moral philosophy to be fully solved. Let me know if you’re talking about something else, e.g., what kind of AI we would ideally build after a long reflection.)
    - cousin_it 19 Mar 2026 20:07 UTC
      5 points
      2
      Parent
      I think making moral AI philosophically competent is about as hard as making corrigible AI that keeps us philosophically competent, or even sane, as we use it. The way I think about such things is based on R. Scott Bakker’s short story “Crash Space”, the main point is in the postscript, which is amazing and I’ll just quote it in full:
      Reverse engineering brains is a prelude to engineering brains, plain and simple. Since we are our brains, and since we all want to be better than what we are, a great many of us celebrate the eventuality. The problem is that we happen to be a certain biological solution to an indeterminate range of ancestral environments, an adventitious bundle of fixes to the kinds of problems that selected our forebears. This means that we are designed to take as much of our environment for granted as possible—to neglect. This means that human cognition, like animal cognition more generally, is profoundly ecological. And this suggests that the efficacy of human cognition depends on its environments.
      We neglect all those things our ancestors had no need to know on the road to becoming us. So for instance, we’re blind to our brains as brains simply because our ancestors had no need to know their brains for what they were in the process of becoming us. This is why our means of solving ourselves and others almost certainly consists of ‘fast and frugal heuristics,’ ways to generate solutions to complicated problems absent knowledge of the systems involved. So long as the cues exploited remain reliably linked to the systems solving and the systems to be solved, we can reliably predict, explain, and manipulate one another absent any knowledge of brain or brain function.
      Herein lies the ecological rub. The reliability of our heuristic cues utterly depends on the stability of the systems involved. Anyone who has witnessed psychotic episodes has firsthand experience of consequences of finding themselves with no reliable connection to the hidden systems involved. Any time our heuristic systems are miscued, we very quickly find ourselves in ‘crash space,’ a problem solving domain where our tools seem to fit the description, but cannot seem to get the job done.
      And now we’re set to begin engineering our brains in earnest. Engineering environments has the effect of transforming the ancestral context of our cognitive capacities, changing the structure of the problems to be solved such that we gradually accumulate local crash spaces, domains where our intuitions have become maladaptive. Everything from irrational fears to the ‘modern malaise’ comes to mind here. Engineering ourselves, on the other hand, has the effect of transforming our relationship to all contexts, in ways large or small, simultaneously. It very well could be the case that something as apparently innocuous as the mass ability to wipe painful memories will precipitate our destruction. Who knows? The only thing we can say in advance is that it will be globally disruptive somehow, as will every other ‘improvement’ that finds its way to market.
      Human cognition is about to be tested by an unparalleled age of ‘habitat destruction.’ The more we change ourselves, the more we change the nature of the job, the less reliable our ancestral tools become, the deeper we wade into crash space.
      Like, imagine we have a corrigible AI. Then a person using it can go off track very easily, by using the AI to help modify the AI and the person in tandem. To prevent that, the corrigible AI needs to have a lot of alignment-type stuff (don’t manipulate the user, don’t mislead, don’t go down certain avenues, what’s good what’s bad, etc) and that’s not too much different from having a moral AI. And conversely, a moral AI could also delegate some philosophical questions to us, if it had a careful enough way to do so.
      So I think this difficulty is about the same in all three scenarios, it doesn’t differ between them very much. The biggest thing that would help is slowing down, you’re right on that. My concern in this thread is kinda orthogonal: modulo this aspect of alignment, there’s another bad thing happening, and it’s different in the three scenarios. From the perspective of that bad thing (power concentration) we’d better steer away from number 3, and somewhat prefer 1 to 2 as well. I remember talking about it with you a few months ago.
      - Wei Dai 20 Mar 2026 23:03 UTC
        2 points
        0
        Parent
        
        Like, imagine we have a corrigible AI. Then a person using it can go off track very easily, by using the AI to help modify the AI and the person in tandem. To prevent that, the corrigible AI needs to have a lot of alignment-type stuff (don’t manipulate the user, don’t mislead, don’t go down certain avenues, what’s good what’s bad, etc) and that’s not too much different from having a moral AI.
        
        Suppose corrigible AI is not very good at being aligned in this sense, I think both 1 and 3 are very bleak but see a bit more hope in 3 being able to navigate the situation better in some ways, by the humans being empowered being a bit smarter and more rational on average, and having fewer members to coordinate (when needed to avoid racing to the bottom in various ways).
        
        (Also your earlier comment said “I’d much rather have 1”, which I was reacting to, so if you now only have a weak preference for 1 over 2 and 3, then I think we have much less of a disagreement.)