Yes, a very dishonest AI likely can’t be identified by a human asking them anything.
Yes, an honest and extremely misaligned AI can be identified by just asking “what are your goals.”
But an honest and moderately misaligned AI may need to give a detailed ideal world before you can identify the misalignment. They may behave well until strange new contexts bring up glitches in their moral reasoning.
Jailbreaking happens when an AI that normally refuses bad requests is confused into complying with one. Likewise, an AI which normally won't want to kill all humans or create a dystopia might stumble into thoughts which change its mind, if it runs for long enough.
My hope is that the Detailed Ideal World Benchmark deliberately throws things at the AI which make these failure modes happen during role-play, before they happen in real life.
Also, suppose you were forced to choose one human and give him absolute power over the universe. You have a few human candidates to choose from. If you simply asked them "what are your goals," they might all talk about making people happy, caring for the weak, and so forth in a few short paragraphs.
If you asked them to give lots of detail about their ideal world, you might learn that some of them have very bad views. When they talk about making people happy, you realize their definition of happiness is a bit different from yours. Yes, some of them will lie, and interviewing them won't be perfect.