That’s, again, outside the X-risk model here. This is an approach that could succeed if entrusted to a trustworthy principal; for the other methods, a trustworthy principal would not be sufficient, and there doesn’t seem to be a substitute of similar or lower difficulty that would be sufficient either. If this method is correct, it reduces the problem to a very difficult but ordinary human challenge.
Fair enough. But it’s very visible that Claude’s constitution treats corrigibility alignment and ethical alignment as both dangerous, and tries to compromise between them. In The Adolescence of Technology, Dario Amodei makes it very explicit why that is: he sees each as having its own risks and failure modes. For corrigibility, those are misuse and consolidation of power. Which is pretty much the argument I was making, except that he doesn’t cover the warfare-between-AI-enhanced-principals aspect.
But if I’m forced to give a name, Amanda Askell.