If an ASI of this type is used by a single unwise principal, the obvious risk is dictatorship. If different instances of it are used by a small number of unwise principals, the obvious risks are concentration of power and superintelligence-enabled conflict. If it's used by a single cautious, wise, and saintly principal, and if power doesn't corrupt, then we're probably fine.
Which current head of state or foundation lab CEO are you envisaging?
That's, again, outside the X-risk model here. This is a method that could succeed if entrusted to a trustworthy principal — unlike other methods, for which a trustworthy principal would not be sufficient, and for which there doesn't seem to be a replacement of similar or lower difficulty that would be sufficient either. If this method is correct, it reduces the problem to a very difficult but ordinary human challenge.
But if I’m forced to give a name, Amanda Askell.
Fair enough. But it's very clear that Claude's constitution treats corrigibility alignment and ethical alignment as each dangerous on its own, and tries to compromise between them. In The Adolescence of Technology, Dario Amodei makes explicit why: he sees each as having its own risks and failure modes. For corrigibility, those are misuse and consolidation of power — which is pretty much the argument I was making, except that he doesn't cover the warfare-between-AI-enhanced-principals aspect.