The argument in this post seems to be: AIs smart enough to help with alignment are capable enough that they'll realize they are misaligned; therefore, they will not help with alignment.
When I think about getting misaligned AIs to help with alignment research and other tasks, I’m normally not imagining that the AIs are unaware that they are misaligned. I’m imagining that we can get them to do useful work anyway. See here and here.
You might be interested in the Redwood Research reading list, which contains lots of analyses of these questions and many others.
To be clear, I do suspect any AI smart enough to solve alignment is also smart enough to escape control and kill us. I’m not planning to go into great detail on control until after a deeper dive on the subject, though. Thanks for the reading material!