For example, if you’re evaluating whether an AI can escape its box, you can gather lots of useful info by prompting a friendly, aligned escape attempt. Prompting an aggressive, adversarial, misaligned escape attempt massively increases risk while yielding much of the same data.
I’m confused by your proposal. If we have an aligned AGI, why are we keeping the potentially misaligned AI around? 🤔
Thanks for the query! We don’t think you should keep misaligned AI around if you’ve got a provably aligned one to use instead. We’re worried about evals of misaligned AI, and specifically how one prompts the model that’s being tested, what context it’s tested in, and so on. We think that evals of misaligned AIs should be minimized, and one way to do that is to get the most information you can from prompting nice, friendly behavior, rather than prompting misaligned behavior (e.g. the red-team tests).