OpenAI alignment blog post on the auto-review model for Codex: https://alignment.openai.com/auto-review/
I’m generally a huge fan of model-based supervision as opposed to programmatic sandboxes. Sandboxes set up an adversarial relationship between the model and the sandbox: the model wants to do something and the sandbox blocks it. As the model becomes more powerful, it will win.
(I’ll try to post short takes on alignment blog posts from time to time.)