Marius Hobbhahn comments on Announcing Apollo Research

Marius Hobbhahn 15 Jun 2023 9:44 UTC
LW: 5 AF: 2
All of the above, but in a specific order.
1. Test, with lots of handholding, whether the model has components of deceptive capabilities, using behavioral evals and fine-tuning.
2. Test, with lots of handholding, whether the model has more general deceptive capabilities (i.e. not just components), again using behavioral evals and fine-tuning.
3. Gradually reduce the handholding in 1 and 2, and see whether the model still shows deception.
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer questions such as: can we change the training data, the technique, the order of fine-tuning approaches, etc. so that the models are less deceptive?
5. Use 1-4 to reduce the chance of labs deploying deceptive models in the wild.