That makes sense. Although I don’t think that non-behavioral training is a magic bullet either. And I don’t think behavioral training becomes doomed when you hit an AI capable of scheming if it was working right up until then. Scheming and deception would allow an AI to hide its goals but not change its goals.
What might cause an AI to change its goals is the reflection I mention, which would probably emerge at around the same level of intelligence as scheming and deceptive alignment. But it's a different effect. As with your point, I think doomed is too strong a term. We can't round off to either "this will definitely work" or "this is doomed." I think we're going to have to deal with estimating better and worse odds of alignment from different techniques.
So I take my point about reflection to be fully general, but not one that makes alignment of ASI impossible. It's just one more difficulty to add to the rather long list.