A smart enough AI system that knows it’s in training
Does this entire scenario require an AI which, even before training begins, has a model of self and world sophisticated enough that it knows it already has goals, can infer that it is being trained, and reasons that it must perform well on the training without changing its existing goals?
No, there are only two requirements:
the neural network is capable of implementing AIs that are goal-oriented enough to want to perform well on training to prevent the training from changing them and their goals;
there’s optimization pressure in that direction: AIs like that perform better than some other AIs (which arguably won’t really be the case if your training loss is only about predicting the next token, but will be the case if you do RL in settings where advanced agency is useful).