I feel like you’re still coming up with arguments when you should be running the simulation. Try to predict what would actually happen if we made “another adversarial machine that checks the results”.
Let me guess. The machines coordinate themselves to mislead the human team into believing that they are aligned? If that’s it, I am really not impressed
The adversarial machine would get destroyed by the AGI whose outputs its testing or us if it doesn’t protect us (or trade/coordinate with the first AGI), so it seems like it’s motivated to protect us.
I feel like you’re still coming up with arguments when you should be running the simulation. Try to predict what would actually happen if we made “another adversarial machine that checks the results”.
Let me guess. The machines coordinate themselves to mislead the human team into believing that they are aligned? If that’s it, I am really not impressed
The adversarial machine would get destroyed by the AGI whose outputs its testing or us if it doesn’t protect us (or trade/coordinate with the first AGI), so it seems like it’s motivated to protect us.