Christopher King comments on Improving the safety of AI evals

Christopher King 18 May 2023 0:50 UTC
1 point

For example, if you’re evaluating whether an AI can escape its box, you can gather lots of useful info by prompting a friendly, aligned escape attempt. Prompting an aggressive, adversarial escape attempt with misalignment massively increases risk, while providing much of the same data.

I’m confused by your proposal. If we have an aligned AGI, why are we keeping the potentially misaligned AI around? 🤔
- Elliot_Mckernon 5 Jun 2023 10:20 UTC
  1 point
  Thanks for the query! We don’t think you should keep a misaligned AI around if you’ve got a provably aligned one to use instead. We’re worried about evals of misaligned AI, and specifically how one prompts the model that’s being tested, what context it’s tested in, and so on. We think that evals of misaligned AIs should be minimized, and one way to do that is to get the most information you can from prompting nice, friendly behaviour, rather than prompting misaligned behaviour (e.g. the red-team tests).