Does corrigibility make it harder to get such an international agreement, compared to alternative strategies? I can imagine one effect: the corrigible AI would serve the goals of just one principal, making it less trustworthy than, say, a constitutional AI. That effect seems hard to analyze. I expect that an AI will be able to mitigate that risk through a combination of being persuasive and being able to show that an arms race carries large risks.
It’s easy to get a corrigible ASI not to use its persuasiveness on you: you tell it not to do that.
I think you need to think harder about that “hard to analyze” bit: it’s the fatal flaw (as in x-risk) of the corrigibility-based approach. You don’t get behavior any wiser or more moral than the principal. And 2–4% of people are sociopaths (the figures for tech titans and heads of authoritarian states may well be higher).
That’s explicitly not a factor in the X-risk analysis for corrigibility; it’s not intended for general use by many principals.
If ASI of this type is used by a single unwise principal, then the obvious risk is dictatorship. If different instances of it are used by a small number of unwise principals, then the obvious risks are concentration of power and superintelligence-enabled conflict. If it’s used by a single cautious, wise and saintly principal, and if power doesn’t corrupt, then we’re probably fine.
Which current head of state or frontier lab CEO are you envisaging?
That’s, again, outside the X-risk model here. This is an approach that could succeed if entrusted to a trustworthy principal; with other methods even that would not be sufficient, and there doesn’t seem to be an alternative of similar or lower difficulty that would be. This is a method that, if it’s correct, reduces the problem to a very difficult but ordinary human challenge.
But if I’m forced to give a name, Amanda Askell.
Fair enough. But it’s very visible that Claude’s constitution treats corrigibility alignment and ethical alignment as both dangerous, and tries to compromise between them. In The Adolescence of Technology, Dario Amodei makes it very explicit why that is: he sees each as having its own risks and failure modes. And for corrigibility, those are misuse and consolidation of power. Which is pretty much the argument I was making, except that he doesn’t cover the warfare-between-AI-enhanced-principals aspect.