I don’t know if it’s come up in the comments, but to me, naive (i.e. not cognitive-architecturally-informed) approaches seem fairly likely (~40%? off the top of my head) to produce mesa-optimization-y things; see: https://www.lesswrong.com/posts/whRPLBZNQm3JD5Zv8/imitation-learning-considered-unsafe
Otherwise, yes, seems great, esp. if we just imitate AI safety researchers and let them go on to solve all the safety problems.