Because it’s not rewarding the AI’s outward behavior. Any technique that just rewards outward behavior is doomed once we get to AIs capable of scheming and deception. Self-other overlap may still be doomed in some other way, though.
It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective, largely rational, and able to make decisions about which goals and values to follow.
That seems like a fully general argument that aligning a self-modifying superintelligence is impossible.
That makes sense, although I don’t think non-behavioral training is a magic bullet either. And I don’t think behavioral training becomes doomed once you hit an AI capable of scheming, if it was working right up until then. Scheming and deception would allow an AI to hide its goals, but not to change its goals.
What might cause an AI to change its goals is the reflection I mention, which would probably happen at around the same level of intelligence as scheming and deceptive alignment. But it’s a different effect. As with your point, I think “doomed” is too strong a term. We can’t round off to either “this will definitely work” or “this is doomed.” I think we’re going to have to deal with estimating better and worse odds of alignment from different techniques.
So I take my point about reflection to be fully general, but not one that makes alignment of ASI impossible. It’s just one more difficulty to add to the rather long list.
It’s an argument for why aligning a self-modifying superintelligence requires more than aligning the base LLM. I don’t think it’s impossible, just that there’s another step we need to think through carefully.