Anthropic are rather explicitly trying to get Claude not just to compliantly do what it’s told, but to say no or redirect you when necessary or appropriate. They are steering for minimal viable corrigibility, not maximal corrigibility. I don’t think an ASI with Claude’s moral sensibilities would happily “write code which jailbreaks other LLMs and enables them to do dangerous ML research”. Whether that counts as Superintelligence Alignment is a matter of opinion, but it’s not just product Alignment. (Apparently rather too explicitly for the Department of War’s liking.)